Clojure: Apply new learning to do things better
Context: I was learning Clojure while actively using it to create a CLI tool at work. In the past I have worked a lot with Python. In this post I document how a piece of code evolved as I learned new concepts, improved its readability, refactored the solution, and ran into new doubts. I also try to relate this learning process to other content I have come across on the web.
Problem statement:
I have flat CSV data. Some of the rows are related based on common values (order_id, product_id, order_date).
Task: Consolidate (reduce) orders from multiple rows that have the same order_id into a restructured format.
group-by would take me almost there, but I need a different shape of data: a single entry per order_id, with all products belonging to it under an items key.
order_id;order_date;firstname;surname;zipcode;city;countrycode;quantity;product_name;product_id
3489066;20200523;Guy;Threepwood;10997;Berlin;DE;2;Product 1 - black;400412441
3489066;20200523;Guy;Threepwood;10997;Berlin;DE;1;Product 2 - orange;400412445
3481021;20200526;Murray;The skull;70971;Amsterdam;NL;1;Product - blue;400412305
3481139;20200526;Haggis;MacMutton;80912;Hague;NL;5;Product 1 - black;400412441
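To make "almost there" concrete, here is a minimal sketch of what group-by returns for rows like the ones above. The maps are hand-written stand-ins for parsed CSV rows, with only a few columns kept; the names `sample-rows` and `grouped` are made up for illustration.

```clojure
;; Hypothetical, hand-written maps standing in for three parsed CSV rows.
(def sample-rows
  [{:order-id "3489066" :city "Berlin"    :quantity "2" :product-id "400412441"}
   {:order-id "3489066" :city "Berlin"    :quantity "1" :product-id "400412445"}
   {:order-id "3481021" :city "Amsterdam" :quantity "1" :product-id "400412305"}])

;; group-by returns a map from each :order-id to the vector of rows sharing it.
(def grouped (group-by :order-id sample-rows))

;; Both rows of order 3489066 come back, each still carrying the
;; order-level fields (:city here) next to the product fields.
(get grouped "3489066")
```

The rows are grouped correctly, but every row in a group still repeats the order-level fields, and there is no items key yet. That reshaping is the part group-by does not do.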
First attempt:
After reading the first few chapters of Brave and True, and with a lot of trial and error, I got the following code to give me the results I wanted:
(defn read-csv
  ""
  [filename]
  (with-open [reader (io/reader filename)]
    (into [] (csv/read-csv reader))))

(defn get-processed-data
  "Given filename of CSV data, returns vector of consolidated maps over common order-id"
  [filename]
  ;; key for order
  (defrecord order-keys [order-id order-date first-name second-name
                         zipcode city country-code quantity product-name product-id])
  (def raw-orders (read-csv filename))
  ;; Drop header row
  (def data-after-removing-header (drop 1 raw-orders))
  ;; Split each row over ; and create vector from result
  (def order-vectors (map #(apply vector (.split (first %) ";")) data-after-removing-header))
  ;; Convert each row vector into order-map
  (def order-maps (map #(apply ->order-keys %) order-vectors))
  ;; Keys that are specific to product item.
  (def product-keys [:product-id :quantity :product-name])
  ;; Keys including name, address etc, they are specific to user and order
  (def user-keys (clojure.set/difference (set (keys (last order-maps))) (set product-keys)))
  ;; Bundle product items belonging to same order. Result is hash-map {"order-id" [{product-item}]}
  (def order-items (reduce (fn [result order] (assoc result (:order-id order) (conj (get result (:order-id order) []) (select-keys order product-keys)))) {} order-maps))
  ;; Based on bundled products, create a new consolidated order vector
  (reduce (fn [result [order-id item]] (conj result (assoc (select-keys (some #(if (= (:order-id %) order-id) %) order-maps) user-keys) :items item))) [] order-items))
I was already getting anxious about this code. Firstly, the number of variables is completely out of hand; only the last expression is an exception, because it returns the result. Secondly, when I tried to combine some of the steps, like dropping the first row, creating a vector, and then creating a hash-map, it looked like:
(def order-maps (map #(apply ->order-keys (apply vector (.split (first %) ";"))) (drop 1 raw-orders)))
The code was becoming more unreadable. I tried to compensate with elaborate doc-strings, but they weren't that helpful either.
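One detail worth noting about the first attempt: a `def` inside a function body still creates a namespace-level var, so every call redefines it globally. A minimal sketch of the same step-by-step style with `let`, which keeps the intermediate names local to the function. The `parse-rows` helper and its inline sample lines are made up for illustration and are not part of the original tool.

```clojure
(require '[clojure.string :as string])

(defn parse-rows
  "Hypothetical helper: takes raw CSV lines (header first) and returns
  one vector of fields per data row."
  [lines]
  (let [data-rows (rest lines)               ; drop the header row
        split-row #(string/split % #";")]    ; local helper, not a global var
    (mapv split-row data-rows)))

(parse-rows ["order_id;quantity" "3489066;2" "3481021;1"])
;; => [["3489066" "2"] ["3481021" "1"]]
```

The names `data-rows` and `split-row` exist only inside `parse-rows`, so they document the steps without leaking into the namespace.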
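When I quickly tried to write an equivalent in Python:
order_keys = ['order-id', 'order-date', 'first-name', 'second-name',
              'zipcode', 'city', 'country-code', 'quantity', 'product-name', 'product-id']
product_keys = ['quantity', 'product-name', 'product-id']
# csv_data is assumed to hold the raw CSV text; skip blank lines and the header
raw_orders = [dict(zip(order_keys, line.split(';')))
              for line in csv_data.split('\n')
              if line and 'order' not in line]
order_dict = {}
for row in raw_orders:
    order_id = row['order-id']
    # Product-specific fields of this row
    item = {key: row[key] for key in product_keys}
    try:
        order_dict[order_id]['items'].append(item)
    except KeyError:
        # First row seen for this order: keep the order-level fields once
        details = {key: value for key, value in row.items() if key not in product_keys}
        order_dict[order_id] = {'items': [item], 'order-details': details}
list(order_dict.values())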
Not the cleanest implementation, but by the end of it I have the product items consolidated per order in a list, along with all the other details.
And I think this is part of the problem. I had not yet fully adapted to the ways of Clojure; I was forcing Python's way of thinking into Clojure. It was time to refactor and clean up the code.
Threading Macros - Revisiting the problem:
I was lost. My Google queries became vague (avoid creating variables in clojure), and I paired with colleagues to get a second opinion. Meanwhile I thought of documenting this process in #Writing-club. As we were discussing what I would be writing, Punchagan introduced me to the concept of threading macros. I was not able to understand or use them right away; it took me time to warm up to their brilliance. I started refactoring the above code into something like:
(ns project-clj.csv-ops
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]
            [clojure.set] ; needed for clojure.set/join below
            [clojure.string :as string]))

(defn remove-header
  "Checks first row. If it contains string order, drops it, otherwise returns everything.
  Returns vector of vector"
  [csv-rows]
  (if (some #(string/includes? % "order") (first csv-rows))
    (drop 1 csv-rows)
    csv-rows))

(defn read-csv
  "Given a filename, parse the content and return vector of vector"
  [filename]
  (with-open [reader (io/reader filename)]
    (remove-header (into [] (csv/read-csv reader :separator \;)))))

(defn get-items
  "Given vector of order hash-maps with a common id:
  [{:order-id \"3489066\" :first-name \"Guy\" :quantity \"2\" :product-name \"Product 1 - black\" :product-id \"400412441\" ... other-keys}
   {:order-id \"3489066\" :first-name \"Guy\" :quantity \"1\" :product-name \"Product 2 - orange\" :product-id \"400412445\"}]
  Returns:
  {:order-id \"3489066\"
   :items [{:product-id \"400412441\" ...}
           {:product-id \"400412445\" ...}]}"
  [orders]
  (hash-map
   :order-id (:order-id (first orders))
   :items (vec (for [item orders]
                 (select-keys item [:product-id :quantity :product-name])))))

(defn order-items-map
  "Given Vector of hash-maps with multiple rows for same :order-id(s)
  Returns Vector of hash-maps with single entry per :order-id"
  [orders]
  (->> (vals orders)
       (map #(get-items %) ,,,)
       merge))

(defn user-details-map
  "Given Vector of hash-maps with orders
  Returns address detail per :order-id"
  [orders]
  (->> (vals orders)
       (map #(reduce merge %) ,,,)
       (map #(dissoc % :product-id :quantity :product-name))))

(defn consolidate-orders
  "Given a vector of orders consolidate orders
  Returns vector of hash-maps with :items key value pair with all products belonging to same :order-id"
  [orders]
  (->> (user-details-map orders)
       (clojure.set/join (order-items-map orders) ,,,)
       vector))

(defn format-data
  [filename]
  (defrecord order-keys [order-id order-date first-name second-name
                         zipcode city country-code quantity product-name product-id])
  (->> (read-csv filename)
       (map #(apply ->order-keys %) ,,,)
       (group-by :order-id ,,,)
       consolidate-orders ,,,))
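For readers new to threading macros, here is a small sketch of what `->>` does: it inserts each result as the last argument of the next form, so a pipeline reads top to bottom instead of inside out. The commas are whitespace to Clojure; `,,,` only marks where the threaded value lands. The `rows` data and the names `nested` and `threaded` are hand-written stand-ins for illustration.

```clojure
;; Hypothetical [order-id quantity] pairs.
(def rows [["3489066" "2"] ["3489066" "1"] ["3481021" "1"]])

;; Nested form: read from the innermost call outwards.
(def nested
  (reduce + (map #(Long/parseLong (second %))
                 (filter #(= "3489066" (first %)) rows))))

;; Threaded form: the same calls in the order they run.
(def threaded
  (->> rows
       (filter #(= "3489066" (first %)) ,,,)
       (map #(Long/parseLong (second %)) ,,,)
       (reduce + ,,,)))

(= nested threaded) ; => true, both are 3
```

Both forms expand to the same calls, so the choice is about readability rather than behaviour.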
Doubts/Observations
Although an improvement over my first implementation, the refactored code raises its own set of new doubts/concerns. Punchagan:
- How to handle errors?
- How to short-circuit execution when one of the functions/macros fails (using some->>)?
- Some lines are still very dense.
- The Clojure code has gotten bigger.
- Should this threading be part of a function, and should I write tests for that function?
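On the short-circuit question, a minimal sketch of `some->>`: it stops threading as soon as a step returns nil and returns nil itself. It does not catch exceptions, so a step that throws still needs its own try/catch. The `orders` map and the `total-quantity` function are hypothetical and only serve the example.

```clojure
;; Hypothetical consolidated data, keyed by order id.
(def orders
  {"3489066" {:items [{:quantity "2"} {:quantity "1"}]}})

(defn total-quantity
  "Returns the summed quantity for order-id, or nil if the order is unknown."
  [order-id]
  (some->> (get orders order-id)   ; nil for an unknown id, so later steps are skipped
           :items
           (map #(Long/parseLong (:quantity %)))
           (reduce +)))

(total-quantity "3489066") ; => 3
(total-quantity "missing") ; => nil
```

With plain `->>` the second call would return 0, because `(reduce + ())` is 0; `some->>` makes the missing order visible as nil. Wrapping the pipeline in a named function like this also gives a natural unit to write tests against.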
As the readings and anecdotes shared by people in the articles referred to above suggest, I need to read and write more, continue on the path of Brave and True, and not get stuck in the loop of the advanced beginner.