Clojure: Apply new learning to do things better

Context: I was learning Clojure while actively using it to build a CLI tool at work. In the past I have worked a lot with Python. In this post I document how a piece of code evolved as I learned new concepts, improved its readability, refactored the solution, and ran into new doubts. I also try to relate this learning process to other content I have come across on the web.

Problem statement:

I have flat CSV data. Some of the rows are related through common values (order_id, product_id, order_date).

Task: Consolidate (reduce) the rows that share the same order_id into a single, restructured entry per order.

group-by would take me almost there, but I need the data in a different format: a single entry per order_id, with all the products belonging to it under an items key.

order_id;order_date;firstname;surname;zipcode;city;countrycode;quantity;product_name;product_id
3489066;20200523;Guy;Threepwood;10997;Berlin;DE;2;Product 1 - black;400412441
3489066;20200523;Guy;Threepwood;10997;Berlin;DE;1;Product 2 - orange;400412445
3481021;20200526;Murray;The skull;70971;Amsterdam;NL;1;Product - blue;400412305
3481139;20200526;Haggis;MacMutton;80912;Hague;NL;5;Product 1 - black;400412441
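
To see why group-by only gets almost there, here is a minimal sketch. The sample-orders vector is a hypothetical stand-in for a few of the rows above, already converted to maps and trimmed to a handful of keys for brevity.

```clojure
;; Hypothetical sample: three of the rows above as maps, with only a few keys.
(def sample-orders
  [{:order-id "3489066" :city "Berlin"    :quantity "2" :product-id "400412441"}
   {:order-id "3489066" :city "Berlin"    :quantity "1" :product-id "400412445"}
   {:order-id "3481021" :city "Amsterdam" :quantity "1" :product-id "400412305"}])

;; group-by returns a map from each order id to the vector of its full rows.
(def grouped (group-by :order-id sample-orders))

(keys grouped)                    ; one key per distinct order id
(count (get grouped "3489066"))   ; => 2, both rows are kept untouched
```

The rows are bundled per order, but the customer details are still repeated in every row and the product fields are not yet collected under an items key. That remaining restructuring is what the rest of this post is about.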

First attempt:

After reading the first few chapters of Brave and True, and with a lot of trial and error, I got the following code to give me the results I wanted:

(defn read-csv
  ""
  [filename]
  (with-open [reader (io/reader filename)]
    (into [] (csv/read-csv reader))))

(defn get-processed-data
  "Given filename of CSV data, returns vector of consolidated maps over common order-id"
  [filename]
  ;; key for order
  (defrecord order-keys [order-id order-date first-name second-name
			 zipcode city country-code quantity product-name product-id])
  (def raw-orders (read-csv filename))
  ;; Drop header row
  (def data-after-removing-header (drop 1 raw-orders))
  ;; Split each row over ; and create vector from result
  (def order-vectors (map #(apply vector (.split (first %) ";")) data-after-removing-header))
  ;; Convert each row vector into order-map
  (def order-maps (map #(apply ->order-keys %) order-vectors))
  ;; Keys that are specific to product item.
  (def product-keys [:product-id :quantity :product-name])
  ;; Keys including name, address etc, they are specific to user and order
  (def user-keys (clojure.set/difference (set (keys (last order-maps))) (set product-keys)))
  ;; Bundle product items belonging to same order. Result is hash-map {"order-id" [{product-item}]
  (def order-items (reduce (fn [result order] (assoc result (:order-id order) (conj (get result (:order-id order) []) (select-keys order product-keys)))) {} order-maps))
  ;; Based on bundled products, create a new consolidated order vector
  (reduce (fn [result [order-id item]] (conj result (assoc (select-keys (some #(if (= (:order-id %) order-id) %) order-maps) user-keys) :items item))) [] order-items))

I am already getting anxious from this code. Firstly, the number of variables is completely out of hand; only the last expression is an exception, because it returns the result. Secondly, when I tried to combine some of the steps, such as dropping the first row, creating a vector, and then creating a hash map, it looked like this:

(def order-maps (map #(apply ->order-keys (apply vector (.split (first %) ";"))) (drop 1 raw-orders)))

The code was becoming even more unreadable. I tried to compensate with elaborate docstrings, but they weren't much help either.

In Python, when I quickly tried to write an equivalent:

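# csv_data is assumed to hold the CSV text shown above as a single string
order_keys = ['order-id', 'order-date', 'first-name', 'second-name', 'zipcode',
              'city', 'country-code', 'quantity', 'product-name', 'product-id']
# Skip blank lines and the header row, then turn every line into a dict
raw_orders = [dict(zip(order_keys, line.split(';')))
              for line in csv_data.split('\n')
              if line and not line.startswith('order')]
order_dict = {}
for row in raw_orders:
    order_id = row['order-id']
    # The last three keys describe the product item
    item = {key: row[key] for key in order_keys[-3:]}
    try:
        order_dict[order_id]['items'].append(item)
    except KeyError:
        # First row for this order: keep the user and order details once
        details = {key: row[key] for key in order_keys[1:-3]}
        order_dict[order_id] = {'items': [item], 'order-details': details}
list(order_dict.values())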

Not the cleanest implementation, but by the end of it I will have consolidated product-items per order in a list with all other details.

And I think this is part of the problem. I had not yet fully adapted to the ways of Clojure; I was forcing Python's way of thinking into Clojure. It was time to refactor, learn more and clean up the code.

Threading Macros - Revisiting the problem:

I was lost. My Google queries became vague ("avoid creating variables in clojure"), and I paired with colleagues to get a second opinion. Meanwhile I thought of documenting this process in #Writing-club. As we were discussing what I would be writing, Punchagan introduced me to the concept of threading macros. I was not able to understand or use them right away; it took me time to warm up to their brilliance. I started refactoring the above code into something like:

(ns project-clj.csv-ops
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]
            [clojure.set] ; needed for clojure.set/join in consolidate-orders
            [clojure.string :as string]))

(defn remove-header
  "Checks first row. If it contains string order, drops it, otherwise returns everything.
  Returns vector of vector"
  [csv-rows]
  (if (some #(string/includes? % "order") (first csv-rows))
    (drop 1 csv-rows)
    csv-rows))

(defn read-csv
  "Given a filename, parse the content and return vector of vector"
  [filename]
  (with-open [reader (io/reader filename)]
    (remove-header (into [] (csv/read-csv reader :separator \;)))))

(defn get-items
  "Given a vector of order hash-maps with a common :order-id:
   [{:order-id \"3489066\" :first-name \"Guy\" :quantity \"2\" :product-name \"Product 1 - black\" :product-id \"400412441\" ... other-keys}
    {:order-id \"3489066\" :first-name \"Guy\" :quantity \"1\" :product-name \"Product 2 - orange\" :product-id \"400412445\"}]

   Returns:
   {:order-id \"3489066\"
    :items [{:product-id \"400412441\" ...}
            {:product-id \"400412445\" ...}]}"
  [orders]
  (hash-map
   :order-id (:order-id (first orders))
   :items (vec (for [item orders]
                 (select-keys item [:product-id :quantity :product-name])))))

(defn order-items-map
  "Given a map of :order-id to its rows (the output of group-by)
   Returns a sequence of hash-maps, one per :order-id, each with an :items key"
  [orders]
  ;; The trailing merge was dropped: with a single argument it returned the sequence unchanged
  (->> (vals orders)
       (map #(get-items %) ,,,)))

(defn user-details-map
  "Given a map of :order-id to its rows (the output of group-by)
   Returns the user and address details, one hash-map per :order-id"
  [orders]
  (->> (vals orders)
       (map #(reduce merge %) ,,,)
       (map #(dissoc % :product-id :quantity :product-name))))

(defn consolidate-orders
  "Given orders grouped by :order-id, consolidates them
  Returns vector of hash-maps with :items key value pair with all products belonging to same :order-id"
  [orders]
  (->> (user-details-map orders)
       (clojure.set/join (order-items-map orders) ,,,)
       ;; vec turns the joined set into a vector; vector would wrap the whole set in one
       vec))

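;; Defined at the top level, since defrecord creates a global type anyway
(defrecord order-keys [order-id order-date first-name second-name
                       zipcode city country-code quantity product-name product-id])

(defn format-data
  "Given filename of CSV data, returns consolidated orders, one entry per :order-id"
  [filename]
  (->> (read-csv filename)
       (map #(apply ->order-keys %) ,,,)
       (group-by :order-id ,,,)
       consolidate-orders))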
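
For readers who are as new to threading macros as I was, here is a minimal sketch of what ->> does: it inserts the result of each form as the last argument of the next one, which is exactly the spot the ,,, placeholders mark in the code above. The raw-rows data below is a hypothetical stand-in for what the CSV reader returned in my first attempt.

```clojure
(require '[clojure.string :as string])

;; Hypothetical stand-in for parsed rows: header first, one string per row.
(def raw-rows [["order_id;quantity"] ["3489066;2"] ["3481021;1"]])

;; Nested form: has to be read from the inside out.
(def nested
  (map #(string/split (first %) #";") (drop 1 raw-rows)))

;; Threaded form: the same steps, read from top to bottom.
(def threaded
  (->> raw-rows
       (drop 1)
       (map first)
       (map #(string/split % #";"))))

(= nested threaded) ; => true
```

Both forms compute the same thing; the threaded one simply lets each step sit on its own line without an intermediate def.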

Doubts/Observations

Although it is an improvement over my first implementation, the refactored code comes with its own set of new doubts and concerns. Punchagan:

  • How do I handle errors?
  • How do I short-circuit execution when one of the functions/macros fails (using some->>)?
  • Some lines are still very dense.
  • The Clojure code has gotten bigger.
  • Should this threading be part of a function, and should I write tests for that function?
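
On the short-circuit question, a minimal sketch of how some->> behaves, as far as I understand it: it threads like ->> but stops and returns nil as soon as any step yields nil. The non-empty-rows and count-orders functions are hypothetical helpers written only for this illustration.

```clojure
;; Hypothetical helper: returns nil when there is nothing to process.
(defn non-empty-rows [rows]
  (when (seq rows) rows))

(defn count-orders
  "Counts distinct order ids, or returns nil if rows is nil or empty."
  [rows]
  (some->> rows
           non-empty-rows
           (map :order-id)
           distinct
           count))

(count-orders [{:order-id "1"} {:order-id "1"} {:order-id "2"}]) ; => 2
(count-orders [])                                                ; => nil
(count-orders nil)                                               ; => nil
```

Note that some->> only reacts to nil. A step that throws an exception still needs try/catch or some other form of error handling, so this covers only part of the first two questions.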

As the readings and the anecdotes shared by people in the articles referred to above suggest, I need to read and write more, continue on the path of Brave and True, and not get stuck in the loop of the advanced beginner.