
Clojure: Apply new learning to do things better

Context: I was learning Clojure while actively using it to create a CLI tool at work. In the past I have worked a lot with Python. In this post I am documenting how a piece of code evolved as I learned new concepts, improved its readability, refactored the solution and ran into new doubts. I am also trying to relate this learning process to other content I have come across on the web.

Problem statement:

I have flat CSV data. Some of the rows are related through common values (order_id, product_id, order_date).

Task: Consolidate (reduce) the rows that share an order_id into a single, restructured entry per order.

group-by would take me almost there, but I need a different format: a single entry per order_id, with all products belonging to it under an items key.

order_id;order_date;firstname;surname;zipcode;city;countrycode;quantity;product_name;product_id
3489066;20200523;Guy;Threepwood;10997;Berlin;DE;2;Product 1 - black;400412441
3489066;20200523;Guy;Threepwood;10997;Berlin;DE;1;Product 2 - orange;400412445
3481021;20200526;Murray;The skull;70971;Amsterdam;NL;1;Product - blue;400412305
3481139;20200526;Haggis;MacMutton;80912;Hague;NL;5;Product 1 - black;400412441
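To make the gap concrete, here is a minimal sketch of what group-by alone returns. The sample-rows value is a hypothetical, trimmed-down stand-in for the parsed CSV rows, keeping only two of the columns.

```clojure
;; Hypothetical, trimmed-down rows standing in for the parsed CSV.
(def sample-rows
  [{:order-id "3489066" :product-id "400412441"}
   {:order-id "3489066" :product-id "400412445"}
   {:order-id "3481021" :product-id "400412305"}])

;; group-by returns a map from each order-id to the vector of rows sharing it.
(def grouped (group-by :order-id sample-rows))

(get grouped "3489066")
;; => [{:order-id "3489066", :product-id "400412441"}
;;     {:order-id "3489066", :product-id "400412445"}]
```

Every row in a group still repeats the order-level fields. What remains is collapsing each group into one map and moving the product fields under an items key.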

First attempt:

After reading the first few chapters of Brave and True, and with a lot of trial and error, I got the following code to give me the results I wanted:

(defn read-csv
  ""
  [filename]
  (with-open [reader (io/reader filename)]
    (into [] (csv/read-csv reader))))

(defn get-processed-data
  "Given filename of CSV data, returns vector of consolidated maps over common order-id"
  [filename]
  ;; key for order
  (defrecord order-keys [order-id order-date first-name second-name
                         zipcode city country-code quantity product-name product-id])
  (def raw-orders (read-csv filename))
  ;; Drop header row
  (def data-after-removing-header (drop 1 raw-orders))
  ;; Split each row over ; and create vector from result
  (def order-vectors (map #(apply vector (.split (first %) ";")) data-after-removing-header))
  ;; Convert each row vector into order-map
  (def order-maps (map #(apply ->order-keys %) order-vectors))
  ;; Keys that are specific to product item.
  (def product-keys [:product-id :quantity :product-name])
  ;; Keys including name, address etc, they are specific to user and order
  (def user-keys (clojure.set/difference (set (keys (last order-maps))) (set product-keys)))
  ;; Bundle product items belonging to same order. Result is hash-map {"order-id" [{product-item}]
  (def order-items (reduce (fn [result order] (assoc result (:order-id order) (conj (get result (:order-id order) []) (select-keys order product-keys)))) {} order-maps))
  ;; Based on bundled products, create a new consolidated order vector
  (reduce (fn [result [order-id item]] (conj result (assoc (select-keys (some #(if (= (:order-id %) order-id) %) order-maps) user-keys) :items item))) [] order-items))

I was already getting anxious about this code. Firstly, the number of variables is completely out of hand; only the last expression is an exception, because it returns the result. Secondly, when I tried to combine some of the steps, such as dropping the first row, creating a vector and then creating a hash-map, it looked like this:

(def order-maps (map #(apply ->order-keys (apply vector (.split (first %) ";"))) (drop 1 raw-orders)))

The code was becoming even more unreadable. I tried to compensate with elaborate comments and doc-strings, but they weren't that helpful either.

In Python, when I quickly tried to write an equivalent:

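# csv_data is assumed to hold the raw CSV text shown above.
order_keys = ['order-id', 'order-date', 'first-name', 'second-name',
              'zipcode', 'city', 'country-code', 'quantity', 'product-name', 'product-id']
product_keys = ['quantity', 'product-name', 'product-id']
raw_orders = [dict(zip(order_keys, line.split(';')))
              for line in csv_data.split('\n')
              if line and 'order' not in line]  # skip blank lines and the header
order_dict = {}
for row in raw_orders:
    order_id = row['order-id']
    item = {key: row[key] for key in product_keys}
    try:
        order_dict[order_id]['items'].append(item)
    except KeyError:
        # First row seen for this order: store the order-level details once.
        order_dict[order_id] = {'items': [item],
                                'order-details': {key: row[key] for key in order_keys[1:-3]}}
list(order_dict.values())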

Not the cleanest implementation, but by the end of it I have the product items consolidated per order in a list, along with all the other details.

And I think this is part of the problem. I had not yet adapted to the ways of Clojure; I was forcing Python's way of thinking onto it. It was time to refactor, learn more and clean up the code.

Threading Macros - Revisiting the problem:

I was lost. My Google queries became vague ("avoid creating variables in clojure"), and I paired with colleagues to get a second opinion. Meanwhile I thought of documenting this process in #Writing-club. As we were discussing what I would write, Punchagan introduced me to the concept of threading macros. I was not able to understand or use them right away; it took me time to warm up to their brilliance. I started refactoring the above code into something like:

(ns project-clj.csv-ops
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]
            ;; needed for clojure.set/join in consolidate-orders
            [clojure.set :as set]
            [clojure.string :as string]))

(defn remove-header
  "Checks first row. If it contains string order, drops it, otherwise returns everything.
  Returns vector of vector"
  [csv-rows]
  (if (some #(string/includes? % "order") (first csv-rows))
    (drop 1 csv-rows)
    csv-rows))

(defn read-csv
  "Given a filename, parse the content and return vector of vector"
  [filename]
  (with-open [reader (io/reader filename)]
    (remove-header (into [] (csv/read-csv reader :separator \;)))))

(defn get-items
  "Given a vector of order hash-maps with a common id:
   [{:order-id \"3489066\" :first-name \"Guy\" :quantity \"2\" :product-name \"Product 1 - black\"  :product-id \"400412441\" ... other-keys}
    {:order-id \"3489066\" :first-name \"Guy\" :quantity \"1\" :product-name \"Product 2 - orange\" :product-id \"400412445\"}]

   Returns:
   {:order-id \"3489066\"
   :items [{:product-id \"400412441\" ...}
           {:product-id \"400412445\" ...}]}"
  [orders]
  (hash-map
     :order-id (:order-id (first orders))
     :items (vec (for [item orders]
                          (select-keys item [:product-id :quantity :product-name])))))

(defn order-items-map
  "Given Vector of hash-maps with multiple rows for same :order-id(s)
   Returns Vector of hash-maps with single entry per :order-id"
  [orders]
  (->> (vals orders)
       (map #(get-items %) ,,,)
       merge))

(defn user-details-map
  "Given Vector of hash-maps with orders
   Returns address detail per :order-id"
  [orders]
  (->> (vals orders)
       (map #(reduce merge %) ,,,)
       (map #(dissoc % :product-id :quantity :product-name))))

(defn consolidate-orders
  "Given orders grouped by :order-id, consolidates them.
  Returns a vector of hash-maps, each with an :items key holding all products belonging to the same :order-id"
  [orders]
  (->> (user-details-map orders)
       (clojure.set/join (order-items-map orders) ,,,)
       ;; vec turns the joined set into a vector; vector would wrap the whole set in a one-element vector
       vec))

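;; Defined at the top level: a defrecord inside a defn is re-evaluated on every call.
(defrecord order-keys [order-id order-date first-name second-name
                       zipcode city country-code quantity product-name product-id])

(defn format-data
  "Given the filename of the CSV data, returns the consolidated orders"
  [filename]
  (->> (read-csv filename)
       (map #(apply ->order-keys %) ,,,)
       (group-by :order-id ,,,)
       consolidate-orders ,,,))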
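For readers new to threading macros, a minimal sketch of what they do may help. The rows value below is a hypothetical two-column stand-in (order id, quantity) and not the real data; the point is only that the nested and threaded forms are the same computation written in a different order.

```clojure
;; Hypothetical rows: [order-id quantity]
(def rows [["3489066" "2"] ["3489066" "1"] ["3481021" "1"]])

;; Nested form: has to be read inside-out.
(def nested
  (reduce + (map #(Long/parseLong %) (map second rows))))

;; Thread-last (->>): reads top to bottom; each result is passed
;; as the LAST argument of the next form.
(def threaded
  (->> rows
       (map second)
       (map #(Long/parseLong %))
       (reduce +)))

;; Thread-first (->): each result is passed as the FIRST argument,
;; which suits functions that take the collection first, such as assoc.
(def order
  (-> {:order-id "3489066"}
      (assoc :city "Berlin")
      (assoc :items [])))
```

Sequence functions such as map, filter and reduce take the collection last, so they pair with ->>; map-updating functions such as assoc and update take it first, so they pair with ->.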

Doubts/Observations

Although it is an improvement over my first implementation, the refactored code comes with its own set of doubts and concerns. Punchagan:

  • How should errors be handled?
  • How can execution be short-circuited when one of the functions/macros fails (using some->>)?
  • Some lines are still very dense.
  • The Clojure code has gotten bigger.
  • Should this threading be part of a function, and should I write tests for that function?
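On the short-circuiting question, here is a small sketch of how some->> behaves. It stops and returns nil as soon as any step yields nil. The parse-quantity and double-quantity helpers are hypothetical and exist only for this illustration.

```clojure
(defn parse-quantity
  "Hypothetical helper: returns the quantity as a number,
  or nil when s is not made of digits only."
  [s]
  (when (and s (re-matches #"\d+" s))
    (Long/parseLong s)))

(defn double-quantity
  "Doubles the :quantity of a row, or returns nil if it is missing or malformed."
  [row]
  (some->> (:quantity row)
           parse-quantity
           (* 2)))

(double-quantity {:quantity "2"})   ;; => 4
(double-quantity {:quantity "two"}) ;; => nil, parse-quantity returned nil
(double-quantity {})                ;; => nil, there was nothing to thread
```

With a plain ->> the last two calls would reach (* 2 nil) and throw. Note that some->> only guards against nil; a step that throws an exception still needs try/catch.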

As the readings and anecdotes shared in the articles referred to above suggest, I need to read and write more, continue on the path of Brave and True, and not get stuck in the loop of the advanced beginner.

Second Brain - Archiving: Keeping resources handy

Problem Statement

We are suffering from information overload, especially from the content behind the walled gardens of Social Media Platforms. The interfaces are designed to keep us engaged by presenting us with the latest, most popular and most riveting content. It is almost impossible to revisit the source or refer to such a piece of content later. Many products offer exactly that, “Read it later”: they move content off these feeds and provide a focused interface to absorb things. I think the following are some crucial flaws in the model of learning and consuming content from Social Media Platforms:

  1. Consent: Non-consensual customization, aka optimization, of the feed.
  2. Access: There is no offline mode; content is locked in.
  3. Intent: Designed to trigger a response (like, share, comment) from the user.

In the rest of the post I would like to make the case that solving for “Access” would also solve the remaining two problems. When we take the content out of the platform, we get the scope to rewrite the rules of engagement.

Knowledge management

As a user I am stuck in a catch-22 situation. Traditional media channels are still catching up; for any developing story their content is outdated. Social media is non-stop, buzzing 24x7. How do we get just the right amount of exposure and not get burned? How do we regain the right of choice? How do we return to what we have read and established as fact in our heads? Can we revisit our truths that are rooted in these posts?

These feeds are infinite. They are the only source of eclectic content, voices, stories, opinions and hot takes. As long as the service is up, the content exists. We won’t get an offline experience from these services. Memory is getting cheaper every day, but so is the Internet. Social media companies won’t bother with an offline service because they are in complete control of the online experience; they have us hooked. Most importantly, an offline experience doesn’t align with their business goals.

I try to keep a local reference of links and quotes from the content I read on the internet in org files. It takes quite an effort to manage that and to maintain the habit. I have tried to automate the process by downloading, or making a note of, the links I share with other people or come across (1, 2). I will take another shot at it, and I am thinking more about the problem to narrow down the scope of the project. There are many tools, products and practices for organizing knowledge in digital form. They have varying interfaces: annotating web pages, papers and books, storing notes, wiki pages, correlating with tags, etc. I strongly feel that there is a need not just for annotating and organizing, but also for archiving. Archives are essential for organizing anything. And specifically: archive all your Social Media platforms. Get your own copy of the data: posts, pictures, videos, links. Just have the dump. That way:

  1. No Big Brother watching over your shoulder when you access the content. Index it, make it searchable. Tag items, highlight them, add notes, and index those too, so they can be searched as well.
  2. No censorship: Even if an account you follow gets blocked or deleted, you don’t lose the content.
  3. No link rot: If a link shared in a post is taken down, becomes private or gets blocked, you will have your own copy of it.

This tool, the Archives, should be personal. Store it locally or on your own VPS; just enable users to archive the content in the first place. How we process the content is a different problem. It is related to, and part of, the bigger problem of how we consume content. An ecosystem for plugging the archives into existing products can and will evolve.

Features:

The P.A.R.A method, a system to organize all your digital information, talks about Archives: a passive collection of all the information linked to a project. In our scenario, the archive is a collection of all the information from your social media. In that sense, I think this Archive tool should have the following features:

  • Local archive of all your social media feeds. From everyone you follow, archive what they share:
    • Web-pages, blogs, articles.
    • Images.
    • Audios, podcasts.
    • Videos.
  • Complete social media timelines from all your connections are accessible and available locally. Customize, prioritize, categorize, do whatever you would like to do. Take back control.
  • Indexed and searchable.

Existing products/methods/projects:

The list of products is ever growing. Here are a few references that I found most relevant:

Thank you punchagan for your feedback and review of the post.

Striking a balance between Clobbering and Learning

Getting stuck as an "Advanced Beginner" happens, especially when we use a new tool or language to deliver a product or project. I have noticed that I approach things with a narrow mindset: I use the tool or language to deliver what is desired. The result has the expected features, but its implementation isn't ideal. The process of unlearning these habits is long, and often, with a deadline looming, I end up collecting tech debt. Recently I came across some links that talk about this phenomenon:

Related Conversations on Internet

There was a big thread on HackerNews about a better way to learn CSS (https://news.ycombinator.com/item?id=23868355), and I found this comment relevant to my experience:

They always assume every one learned like them, by trying stuff out all of the time, until they got something working. Then they iterate from project to project, until they sorted out the bad ideas and kept the good ones. With that approach, learning CSS would probably have taken me 10 times as long.

Sure this doesn't teach you everything or makes you a pro in a week, but I always have the feeling people just cobble around for too long and should instead take at least a few days for a more structured learning approach.

The last statement of the comment struck a chord: clobbering has its limitations, and it needs to be followed up by reading the fundamental concepts from a book, a manual or the docs.

Another post that was shared on HackerNews talks about the Expert Beginner paradox: https://daedtech.com/how-developers-stop-learning-rise-of-the-expert-beginner/

There’s nothing you can do to improve as long as you keep bowling like that. You’ve maxed out. If you want to get better, you’re going to have to learn to bowl properly. You need a different ball, a different style of throwing it, and you need to put your fingers in it like a big boy. And the worst part is that you’re going to get way worse before you get better, and it will be a good bit of time before you get back to and surpass your current average.

Practices that can help with the process of clobbering and learning:

  1. Tests: Unit tests give code a structure. They set basic expectations on how the code should and should not behave. If we maintain uniform expectations throughout the code base, unit tests help maintain a certain uniformity and quality.
  2. Writing documentation: For me this is like rubber duck debugging. It gives active feedback on what the deliverables, supported features, limitations and upcoming features are.
  3. Pairing with colleagues on the concepts and the implementation. Walking through the code and explaining it to colleagues helps me identify the sections of code that make me uncomfortable: where I am weak and where I should focus to improve.
  4. Though similar to pairing, code reviews have their own importance and value.
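As a small illustration of the first practice, here is a sketch of a unit test written with clojure.test. The item-count function is a hypothetical example, not code from any of my projects; the test spells out what the function should do and how it should behave on an edge case.

```clojure
(require '[clojure.test :refer [deftest is run-tests]])

(defn item-count
  "Hypothetical function under test: number of items in a consolidated order."
  [order]
  (count (:items order)))

(deftest item-count-test
  ;; What the code should do.
  (is (= 2 (item-count {:items [{:product-id "400412441"}
                                {:product-id "400412445"}]})))
  ;; How it should behave when an order has no :items key at all.
  (is (= 0 (item-count {}))))

;; Runs every test defined in the current namespace and prints a summary.
(run-tests)
```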

These practices won't replace the need to read the docs or a book, but they will certainly improve the quality of your code and keep your tech debt in check.

Clojure, hash-map, keys, keyword

tl;dr: Plain strings can be used as keys in a hash-map. Either look them up with get, or convert them into keywords with the keyword function.

Hash-maps are one of Clojure's essential data structures. They support keywords as keys, which makes lookups in a hash-map really pleasant.

;; A map with keyword keys
user=> (def languages {:python "Everything is an Object."
                       :clojure "Everything is a Function."
                       :javascript "Whatever you would like it to be."})
;; To lookup in map
user=> (:python languages)
"Everything is an Object."
user=> (get languages :ruby)
nil
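A short sketch of the lookup forms that work with keyword keys, including default values for missing keys. The languages map repeats two entries from the example above, and "no opinion" is just an illustrative default.

```clojure
(def languages {:python "Everything is an Object."
                :clojure "Everything is a Function."})

;; Three equivalent ways to look up a keyword key.
(def via-keyword (:python languages))     ; the keyword acts as a function
(def via-map     (languages :python))     ; the map acts as a function
(def via-get     (get languages :python)) ; explicit get

;; A missing key returns nil unless a default is supplied.
(def missing      (:ruby languages))
(def with-default (:ruby languages "no opinion"))
(def get-default  (get languages :ruby "no opinion"))
```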

The syntax is easy to understand and easy to follow. So far so good. I started using it here and there. At some point I ran into a situation where I had to do a lookup in a map using a variable:

user=> (def brands {:nike "running shoes"
  #_=> :spalding "basketball"
  #_=> :yonex "badminton"
  #_=> :wilson "tennis racquet"
  #_=> :kookaburra "cricket ball"})

(def brand-name "yonex")

Because we used keywords in the brands map, we can't use the value stored in the variable brand-name directly to do a lookup. I tried silly things like :str(brand-name) (results in an execution error) or :brand-name (returns nil), and got confused about how to do this. Almost all the examples in the docs used keywords. After trying a few things I understood that we can indeed use strings as keys, and fetch the value with the get function:

user=> (def brands {"nike" "running shoes"
  #_=> "spalding" "basketball"
  #_=> "yonex" "badminton"
  #_=> "wilson" "tennis racquet"
  #_=> "kookaburra" "cricket ball"})
#'user/brands
user=> (get brands brand-name)
"badminton"

While keywords have a simpler syntax, when I am working with external APIs it is often easier to work with strings, or to look up a key in a hash-map using a variable. In Python I do it all the time. Though I am not sure whether using strings as keys is the recommended way.

Update <2020-07-08 Wed>

punch and I were discussing this post and he mentioned that in Lisp we can use the keyword function to convert a string into a keyword. After a quick search, TIL, we can indeed call keyword on a variable. The function converts a string into the equivalent keyword, which can then be used for a lookup in a map with keyword keys:

user=> (def brands {:nike "running shoes"
  #_=> :spalding "basketball"
  #_=> :yonex "badminton"
  #_=> :wilson "tennis racquet"
  #_=> :kookaburra "cricket ball"})
#'user/brands
user=> (keyword brand-name)
:yonex
user=> ((keyword brand-name) brands)
"badminton"
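To tie the two approaches together, here is a sketch of converting in both directions. The key type used for the lookup has to match the key type stored in the map, so a keyword never finds a string key and vice versa. The brands-by-string map and the api-response value are hypothetical examples.

```clojure
(require '[clojure.walk :as walk])

(def brand-name "yonex")

;; Map with keyword keys: convert the string before looking it up.
(def brands {:yonex "badminton" :wilson "tennis racquet"})
(def found-by-keyword (get brands (keyword brand-name)))
(def not-found        (get brands brand-name)) ; a string key is not present

;; Map with string keys: go the other way with name.
(def brands-by-string {"yonex" "badminton"})
(def found-by-string (get brands-by-string (name :yonex)))

;; Data that arrives with string keys (for example from an API)
;; can be converted in one go with clojure.walk/keywordize-keys.
(def api-response {"brand" "yonex" "sport" "badminton"})
(def keywordized (walk/keywordize-keys api-response))
```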

Clojure Command Line Arguments II

I wrote a short blog post on parsing command-line arguments earlier this week. The only comment I got on it from punch was:

I'd like to learn from this post why command-line-args didn't work. What is it, actually, the command-line-args thingy, etc.?

Those are good questions, and I didn't know the answers. As I mentioned in that post, I wanted a simple solution for parsing args, and another confusing experience brought me back to the same questions: why do args behave the way they do, and what is *command-line-args*? I still don't have the answers. In this post I am documenting two things for my own understanding: first, reproducing the issue of a jar not working with *command-line-args*; second, the limited set of sequence features that args supports.

Reproducing behaviour of jar (non)handling of *command-line-args*

We create a new project using lein:

$ lein new app cli-args
Generating a project called cli-args based on the 'app' template.
$ cd cli-args/
$ lein run
Hello, World!

We edit src/cli_args/core.clj to print args and *command-line-args*:

cat <<EOF > src/cli_args/core.clj
(ns cli-args.core
  (:gen-class))

(defn -main
  "I don't do a whole lot ... yet."
  [& args]
  (println "Printing args.." args)
  (println "Printing *command-line-args*" *command-line-args*))
EOF

$ lein run optional arguments

Now we create the jar using lein uberjar:

$ lein uberjar
Compiling cli-args.core
Created target/uberjar/cli-args-0.1.0-SNAPSHOT.jar
Created target/uberjar/cli-args-0.1.0-SNAPSHOT-standalone.jar
$ cd target/uberjar/
$ java -jar cli-args-0.1.0-SNAPSHOT-standalone.jar testing more optional arguments
Printing args.. (testing more optional arguments)
Printing *command-line-args* nil

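The same code sees *command-line-args* when run through lein, but gets nil when the jar is launched directly with java -jar. That narrows down the problem. A likely explanation is that the var is bound by the launcher rather than by the language itself: clojure.main documents that it binds *command-line-args*, whereas java -jar calls the generated -main directly, so nothing ever binds it. A fuller explanation may need another post.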
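If other code relies on *command-line-args*, one possible workaround is to bind it yourself inside -main, since args is always populated. This is only a sketch: show-args is a hypothetical helper standing in for any code that reads the var, and I have not tried this in the CLI tool itself.

```clojure
(defn show-args
  "Hypothetical helper that reads the dynamic var instead of taking a parameter."
  []
  (str "seen: " (pr-str *command-line-args*)))

(defn -main
  [& args]
  ;; *command-line-args* is a dynamic var, so it can be rebound for the
  ;; duration of the body no matter how the program was launched.
  (binding [*command-line-args* args]
    (show-args)))

(-main "testing" "more")
;; => "seen: (\"testing\" \"more\")"
```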

Sequence features supported by args

I noticed another anomaly with args. I was passing a couple of arguments and noticed that args doesn't support the get function.

cat <<EOF > src/cli_args/core.clj
(ns cli-args.core
  (:gen-class))

(defn -main
  "I don't do a whole lot ... yet."
  [& args]
  (println "first argument" (first args))
  (println "second argument" (second args))
  (println "third argument" (get args 2)))
EOF

This is what I noticed as I tried different inputs:

$ lein run
first argument nil
second argument nil
third argument nil
$ lein run hello world
first argument hello
second argument world
third argument nil
$ lein run hello world 3rd argument
first argument hello
second argument world
third argument nil

The get function doesn't work here. I printed the type of args and it is clojure.lang.ArraySeq. For my case, I "managed" by using last, and that gave me what I wanted. Still, I am running out of options, and I will have to either dig deeper to understand args or fall back to using a library (tools.cli).
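The behaviour makes sense once you know that get is meant for associative collections such as maps and vectors, and simply returns nil for a plain sequence instead of failing. Positional access on a sequence is done with nth, or by converting the sequence to a vector first. A minimal sketch, with third-argument-demo as a hypothetical stand-in for -main that collects its arguments the same way:

```clojure
(defn third-argument-demo
  "Hypothetical stand-in for -main: collects its arguments with & args."
  [& args]
  {:with-get (get args 2)        ; args is a seq, so get finds nothing
   :with-nth (nth args 2 nil)    ; nth walks the seq; nil is the not-found value
   :with-vec (get (vec args) 2)}) ; a vector is indexed, so get works

(third-argument-demo "hello" "world" "3rd")
;; => {:with-get nil, :with-nth "3rd", :with-vec "3rd"}

(third-argument-demo "hello")
;; => {:with-get nil, :with-nth nil, :with-vec nil}
```

Passing a not-found value to nth matters: without it, nth throws when the index is out of range.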