
Resolution / सलटारा

Continuing from the last post. The fact that the Python interpreter didn't catch the regex pattern in tests but threw a compile error on the staging environment was very unsettling. I knew I was missing something on my part and was very reluctant to blame the language itself. Turns out, I was right 🙈

I was looking for early feedback and comments on the post, and Punch was totally miffed by this inconsistency, and especially by the conclusion I was coming to:

Umm, I'm not sure I'm happy with the suggestion that I should check every regex I write with an external website tool to make sure the regex itself is a valid one. "Python interpreter ka kaam main karoon abhi?" (Should I be doing the Python interpreter's job now?) :frown:

He quickly compared the behaviour between JS (the compiler complained) and Python (it did nothing). Now that I had his full attention, there was no more revisiting this at some later stage. We started digging. He confirmed that the regex failed with Python 2 but not with Python 3.11, using a simple command:

docker run -it python:3.11 python -c 'import re; p = r"\s*+"; re.compile(p); print("yo")'

I followed up on this, and there it was, the mistake on my part. My local Python setup was 3.11 and I was using it to run my tests, while the staging environment was using 3.10. When I ran my tests within a containerized setup, similar to what we used on staging, the Python interpreter rightly caught the faulty regex:

re.error: multiple repeat at position 11

Something changed between the Python 3.10 and 3.11 releases. I looked at the release notes and noticed a new feature: atomic grouping and possessive quantifiers. I am not sure yet how to use it or what it does. But with this feature, the regex pattern r"\s*+" is valid in Python 3.11, which is what I was using to run tests locally. On staging we had Python 3.10, and there the interpreter threw an error.
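A minimal way to see the difference from Python itself, using the same pattern:

import re
import sys

# "\s*+" uses a possessive quantifier: accepted by re from Python 3.11, rejected before that.
pattern = r"\s*+"
try:
    re.compile(pattern)
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: pattern compiled fine")
except re.error as exc:
    # On Python 3.10 and earlier this raises re.error: multiple repeat
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {exc}")

Run under 3.11 it reports that the pattern compiled; under 3.10 it prints the multiple repeat error.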

Lesson learned:

I was testing things by running them locally. Don't do that. Set up a stack, get it as close to staging and production as possible, and always, always run tests within it.

You pull an end of a mingled yarn,

to sort it out,

looking for a resolution but there is none,

it is just layers and layers.

Assumptions / धारणा

In the team I started working with recently, I am advocating for code reviews, best practices, tests and CI/CD. I feel that in the Python ecosystem it is hard to ship reliable code without these practices.

I was assigned an issue to find features in text, and had to extract money figures. I started searching for existing libraries (humanize, spacy, numerize, advertools, price-parser), but these libraries always hit an edge case with my requirements. I drafted an OpenAI prompt and got a decent regex pattern that covered most of my requirements. I made a few improvements to the pattern and wrote unit and integration tests to confirm that the logic covered everything I wanted. So far so good. I got the PR approved, merged and deployed, only to find that the code didn't work and was breaking on the staging environment.

As the prevailing wisdom around regular expressions goes, based on the 1997 chestnut:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

I have avoided regular expressions, and here I am.

I was getting the following stack trace:

File "/lib/number_parser.py", line 19, in extract_numbers
pattern = re.compile("|".join(monetary_patterns))
File "/usr/local/lib/python3.10/re.py", line 251, in compile
return _compile(pattern, flags)
File "/usr/local/lib/python3.10/re.py", line 303, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/local/lib/python3.10/sre_compile.py", line 788, in compile
p = sre_parse.parse(p, flags)
File "/usr/local/lib/python3.10/sre_parse.py", line 955, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
File "/usr/local/lib/python3.10/sre_parse.py", line 444, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "/usr/local/lib/python3.10/sre_parse.py", line 672, in _parse
raise source.error("multiple repeat",	
re.error: multiple repeat at position 11

Very confusing. It seemed there was an issue with my regex pattern. But the logic worked, I had tested it. The pattern would fail for a certain type of input and work for others. What gives? I shared the regex pattern with a colleague and he promptly identified the issue: there was a redundant + in my pattern. I wanted to match whitespace and had used the wrong pattern, r'\s*+'.
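To make the fix concrete, here is a tiny sketch; the monetary patterns below are made up for illustration, only the whitespace part mirrors my bug:

import re

# Faulty: r"\s*+" stacks a repeat on a repeat, which re rejects before Python 3.11.
# Intended: r"\s*" -- zero or more whitespace characters.
monetary_patterns = [r"\$\s*\d+", r"\d+\s*dollars"]   # hypothetical patterns
pattern = re.compile("|".join(monetary_patterns))
print(pattern.findall("Paid $ 20 and 15 dollars"))    # ['$ 20', '15 dollars']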

I understand that Python is interpreted and dynamically typed, and that's why I wrote those tests (unit AND integration) and containerized the application, to avoid the wat. And here I was, despite all the measures, preaching best practices and still facing such a bug for the first time. I had assumed that the interpreter would do its job and complain about the buggy regex pattern, and that my tests would fail. Thanks to Punch, we dug further into this behaviour here.

A friend of mine, Tejaa, had shared a regex resource: https://regex-vis.com/. It gives a visual representation (a state diagram of sorts) of the grammar. I tested my faulty pattern \s*+ and the site reported: Error: nothing to repeat. This is better; the error is similar to what I was seeing in my stack trace. I also tested the fixed pattern and the site showed a correct representation of what I wanted.

Always confirm your assumptions.

assumption is the mother of all mistakes (fuckups)

Pragmatic Zen of Python

I gave a talk at PyDelhi 2023 and got good feedback from the participants. It was a nice opportunity to meet lots of community members after a long time. These in-person exchanges of ideas and conversations are essential for personal growth and learning. I will continue to do more in 2024 🤞🏾

2022....

Here is a wishlist for 2022. I have made similar lists in the past, but privately, and I didn't do well at following up on them. I want to change that. I will keep the list small and try to make the items specific.

  • Read More

    I didn't read much in 2021. I have three books with me right now: The Selfish Gene by Richard Dawkins, Snow Crash by Neal Stephenson, and Release It! by Michael T. Nygard. I want to finish them, and possibly add a couple more to this list. Last year I got good feedback on how to improve and grow as a programmer. Keeping that in mind, I will pick up a technical book, something I have never tried before. Let's see how that goes.

  • Monitoring air quality data of Bikaner

    When I was in Delhi back in 2014, I was really miffed by the deteriorating air quality there. There was a number attached to it, the AQI, registered by sensors at different locations. I felt it didn't reflect the real air quality for commuters like me. I would commute from the office on foot every day and see lots of other folks stuck in traffic, breathing in the same pollution. I tried to do something back then, didn't get much momentum, and stopped.

    Recently, with the help of Shiv, I am mentally getting ready to pick it up again. In Bikaner we don't have an official, public AQI number (1, 2, 3). I would like that to change. I want to start small: a couple of boxes that could start collecting data. Then add more features and grow the fleet to cover more areas of the city.

  • Write More:

    I was really bad at writing things in 2021. Fix it. I want to write 4 to 5 good, thought-out, well-written posts. I also have a lot of drafts I want to finish. I think being more consistent with the writing club on Wednesdays would help.

  • Personal Archiving:

    Again, a long-simmering project. I want to put together scaffolding that can hold this. I recently read this post, where the author talks about the different services he runs on his server. I am getting a bit familiar with containers, docker-compose and ansible, and this post has given me some new inspiration for taking small steps. I think the targets for this project are still vague and not specific. I want some room here to experiment, get comfortable, and try existing tools.

Review of AI-ML Workshop

In this post I am reflecting on an Artificial Intelligence and Machine Learning workshop we conducted at NMD: what worked, what didn't, and how to prepare better. We (Nandeep and I) have been visiting NID for the past few years. We are called in when students from NMD are doing their diploma projects and need technical guidance with their ideas. We noticed that, because of time constraints, students often didn't understand the core concepts of the tools (algorithm, library, software) they would use. Many students didn't have a programming background, but their projects needed basic, if not advanced, skill sets to get a simple demo working. As we wrapped up we would always reflect on the work done by the students, how they fared, and wish we had more time to dig deeper. Eventually that happened: we got a chance to conduct a two-week workshop on Artificial Intelligence and Machine Learning in 2021.

What did we plan:

We knew that students would come from a broad spectrum of backgrounds: media (journalism, digital, print), industrial design, architecture, fashion and engineering. All of them had their own laptops, running Windows (7 or 10) or macOS. We were initially told it would be a five-day workshop; we managed to spread it across two weeks with half a day of workshop every day. We planned the following rough structure for the workshop:

  1. Brief introduction to the subject and tools we would like to use.
  2. Basic concepts of programming, Jupyter Notebooks, their features and limitations.🔖
  3. Handling Data: reading, changing, visualizing, standard operations.🔖
  4. Introduction to concepts of Machine learning. Algorithms, data-sets, models, training, identifying features.
  5. Working with different examples and applications, face recognition, speech to text, etc.

How did the workshop go, observations:

After I reached campus I got more information on the logistics of the workshop. We were to run the workshop for two weeks, for complete days. We scrambled to make adjustments. We were happy to have more time in hand, but with that time came more expectations, and we were underprepared. We decided to add assignments, readings (papers, articles) and possibly a small project from every student to the workshop schedule.

On the first day, after introductions, Nandeep started with Blockly and concepts of programming in Python. In the second half of the day we did a session around students' expectations from the workshop. We ended the day with a small introduction to data visualization <link to gurman's post on Indian census and observation on spikes on 5 and 10 year age groups>. For the assignment we asked everyone to document a good information visualization they had liked and how it helped improve their understanding of the problem.

On the second day we covered the basics of Python programming. I was hosting a JupyterHub for the class on my system; the session was hands-on, and all students were asked to follow what we were doing, experiment, and ask questions. It was a slow start; it is hard to introduce abstract concepts of programming and relate them to applications in the AI/ML domain. In the second half we did a reading of a chapter from Society of Mind <add-link-here>, followed by a group discussion. We didn't follow up on the first day's assignment, which we should have done.

On the third day we tried to pick up the pace to get students closer to applications of AI/ML. We started with the concepts of lists, arrays and slicing of arrays, leading up to how an image is represented as an array. By lunch time we were able to walk everyone through the process of cropping a face in an image using array slicing. In every photo editing app this is a basic feature, but we felt it was a nice journey for the students in making sense of how it is done behind the scenes. In the afternoon session we continued with more image filters and the algorithms behind them. PROBLEM: We had a hard time explaining why imshow by default would show gray images as colored. We finished the day by assigning each student a random image processing algorithm from scikit-learn; the task was to try it and understand how it worked. By this time we had started setting up things on the students' computers so that they could experiment on their own. Nandeep had a busy afternoon setting up the development environment on everyone's laptop.
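For the curious, a minimal sketch of the slicing idea we walked through (the sample image and crop coordinates here are just for illustration):

import matplotlib.pyplot as plt
from skimage import data

image = data.astronaut()        # a sample RGB image: a (height, width, 3) NumPy array
face = image[0:200, 150:350]    # cropping is just slicing rows and columns
gray = image[..., 0]            # a single channel is a plain 2D array

# A 2D array carries no colour information, so imshow maps its values through a
# colormap (viridis by default), which is why "gray" images show up coloured
# unless you pass cmap="gray".
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(face)
ax2.imshow(gray, cmap="gray")
plt.show()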

The next day, the fourth, for the first half, students were asked to talk about the algorithm they were assigned, demo it, and explain it to the other students. The idea was to get everyone comfortable with reading code and understanding what is needed for more complex tasks like face detection, recognition, object detection, etc. In the afternoon session we picked up audio. Nandeep introduced them to LibROSA. He walked them through playing a beat <monotone?> on their systems, how they could load any audio file, mix files up, create effects, etc. At this point some students were still finishing the third day's assignment while others were experimenting with the audio library. Things got fragmented. Meanwhile, in parallel, we kept answering questions from students and resolving dependencies for their development setups. For the assignment we gave every student a list of musical instruments and asked them to analyse them, identify their unique features, and think about how to compare these audio clips to each other.
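A small sketch of the kind of loading and mixing that was demoed (the file names are placeholders; the soundfile package, which librosa depends on, is used for writing the result):

import librosa
import soundfile as sf

# Load two clips at the same sampling rate (file names are placeholders)
drums, sr = librosa.load("drums.wav", sr=22050)
flute, _ = librosa.load("flute.wav", sr=22050)

# Trim to the shorter clip and mix by simple addition
n = min(len(drums), len(flute))
mix = 0.5 * drums[:n] + 0.5 * flute[:n]

sf.write("mix.wav", mix, sr)   # write the mixed clip back to disk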

On the fifth day we picked up text and making the computer understand it. We introduced concepts like features and classification. We used the NLTK library and showed them how to create a simple Naive Bayes text classifier. We created a small dataset, labelled it, and built a data pipeline to process the data, clean it up, extract features and "train" the classifier. We were covering the fundamentals of machine learning. For the weekend we gave them an assignment on text summarization. We gave them pointers to existing libraries and how they work; there are different algorithms. The task was to experiment with these algorithms and find their limitations. Could they think of something that could improve them? Could they try to implement their ideas?
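Roughly the shape of the classifier we built with NLTK; the toy dataset below is made up for illustration:

from nltk.classify import NaiveBayesClassifier

def features(sentence):
    # The simplest possible feature extractor: which words are present
    return {word.lower(): True for word in sentence.split()}

# Tiny, made-up labelled dataset
train = [
    (features("the food was great"), "positive"),
    (features("loved the quick delivery"), "positive"),
    (features("the food was awful"), "negative"),
    (features("terrible and slow service"), "negative"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(features("quick and great service")))   # most likely "positive"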

WEEK 1 ENDED HERE

We were not keen on mandatory student attendance and participation. This open structure didn't give us good control. Students would be discussing things, sharing their ideas, helping each other with debugging. We wanted that to happen, but we were not able to balance student collaboration and peer-to-peer learning with introducing new and more complicated concepts and examples.

Over the weekend I chose a library that could help us introduce basic concepts of computer vision, face detection and face recognition. BUT I didn't factor in how to set it up on Windows systems. The library depended on dlib. In the morning session we introduced the concept of Haar cascades (I wanted to include a reading session around the paper). We showed them a demo of how it worked. In the afternoon students were given time to try things themselves and ask questions. Nandeep had a particularly hard time setting up the library on the students' systems. A couple of students had followed up on the weekend project; they had fixed a bug in a library to make it work with Hindi.
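For reference, the usual OpenCV route to a Haar cascade face detector looks roughly like this (a sketch, not necessarily the exact library or files we used in class; the image name is a placeholder):

import cv2

# OpenCV ships pre-trained Haar cascade XML files with the package
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("class_photo.jpg")             # placeholder file name
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # the detector works on grayscale
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:                        # draw a box around each detection
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", image)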

On Tuesday we introduced them to speech recognition and explained some core concepts. We set up a demo around Mozilla DeepSpeech. The web microphone demo doesn't work that well in an open conversation scenario: there was a lot of cross-talking, and my accent was not helpful either. The example we showed was web based, so we also talked about web applications, cloud computing and the client-server model. The afternoon was again an open conversation on the topic, and students were encouraged to try things by themselves.

On Wednesday we covered the different AI/ML components that power modern smart home devices like Alexa, Google Home and Siri. We broke down what it takes for Alexa to tell a joke when asked to: what the onboard systems are, and what the cloud components of such a device are. The cycle starts with mics on the device that are always listening for voice activity detection. Once activated, they record audio and stream it to the cloud to get text from the speech. Then intent classification is done using NLU, an answer is searched for, and finally we, the consumers, get the result. We showed them different libraries, programs and third-party solutions that can be used to create something similar on their own.

We continued the discussion the next day on how to run these programs on their own. We stepped away from Jupyter and showed how to run Python scripts. Based on the earlier lesson around face recognition, some students were trying to understand how to detect emotions from a face. This was a nice project. We walked the students through how to search for existing projects and papers on the topic. We found a well maintained GitHub project and followed its README; the maintainer already had a trained model, so we were able to move quickly and get it working. I felt this was a great exercise: we were able to move quickly and build on top of existing examples. In the afternoon we did a reading of the Language and Machines section of this blog:

Let's not forget that what has allowed us to create the simultaneously beloved and hated artificial intelligence systems during the last decade, have been advancements in computing, programming languages and mathematics, all of those branches of thought derived from conscious manipulation of language. Even if they may present scary and dystopic consequences, intelligent artificial systems will very rapidly make the quality of our lives better, just as the invention of the wheel, iron tools, written language, antibiotics, factories, vehicles, airplanes and computers. As technology evolves, our conceptions of languages and their meanings should evolve with it.

On the last day we reviewed some of the things we had covered and got feedback from the students. We talked about how we had improvised the workshop based on inputs from the students and Jignesh. We needed to prepare better. Students wished they had been more regular and had more time to learn. I don't think we would ever have enough time; this will always be hard. Some students still wanted us to cover more things. Someone asked us to follow up on handling data and information visualization. We had talked briefly about it on day one, so we resumed with that and walked them through an exercise of fetching raw data, cleaning it, plotting it and finding the stories hidden in it.

Resolve inconsistent behaviour of a Relay with an ESP32

I have worked with different popular IoT boards: Arduino, ESP32, Edison <links>, Raspberry Pi. Sometimes trying things myself, other times helping others. I can figure things out on the code side of a project, but often I get stuck debugging the hardware. This has especially blocked me from doing anything beyond basic hello world examples.

A couple of months ago, I picked up an ESP32 again. I was able to source some components from a local shop: an ESP32, a battery, a charging IC, a relay, LEDs, different resistors and jumper cables. I started off with the simple LED blinking example and got that working fairly quickly. Using examples, I was able to connect the ESP32 to WiFi and control the LED via a static web page <link-to-example>. Everything was working: documentation, hardware, examples. This was promising and I was getting excited.

Next, a friend of mine, Shiv, who is good at putting together electrical components, brought an LED light bulb, and we thought: let's control it remotely with the ESP32. We referred to relay switch connections, connected jumper cables, and confirmed that with the ESP32 the relay would flip the LED light bulb when we controlled the light over WiFi. It was not working consistently, but it was working to a certain level. Shiv quickly bundled everything inside the bulb. He connected the power supply to the charging IC and powered the ESP32. We connected the relay properly. Everything was neat, clean, packed and ready to go. We plugged in the bulb and waited for the ESP32 to connect to WiFi. It was on, and I was able to refresh the webpage that controlled the light/LED. So far so good. We tried switching on the LED bulb: nothing. We tried a couple of times: nothing. On the webpage I could see the state of the light toggling. I didn't have access to the serial monitor, so I could not tell whether everything on the ESP32 was working. And I thought to myself: sigh, here we go again.

We disassembled everything and laid all the components flat on the table. I connected the ESP32 to my system with a USB cable. Shiv got a multimeter. We confirmed that the pins we were connecting to were going HIGH and LOW. There was an LED on the relay, and it was flipping correctly. We also heard a click in the relay when we toggled the state. And still the behaviour of the LED light was not consistent. Either it wouldn't turn on, or if it turned on it wouldn't turn off. Rebooting the ESP32 would fix things briefly, and after a couple of iterations it would be completely bricked. In the logs everything was good, consistent, the way it should be. But it was not. I gave up and left.

Shiv, on the other hand, kept trying different things. He noticed that although the pin we connected to would, in theory, go HIGH and LOW, LOW didn't mean 0. He was still getting some reading even when the pin was LOW. He added resistors between the ESP32 pin and the relay input. It still didn't bring LOW to zero. He read a bit more. *AND he added an LED between the ESP32 and the relay: Voilà*.

The LED was perfect. It was behaving as a switch: it takes that 3.3V and uses it; anything less, which is what we had when we put the ESP32 pin to LOW, the LED would eat up and not let anything pass through. And connected on the other end, the relay started blinking and clicking happily. What a nice find. Shiv again packed everything together. When we met again the next day he showed me the bulb and said, "want to try?". I was skeptical. I opened the webpage, clicked "ON", and the bulb turned on. Off, and the bulb was off. I clicked hurriedly to break it. It didn't break. It kept working, consistently, every single time.

Challenges involved in archiving a webpage

Last year, as I was picking up ideas around building a personal archival system, I put together a small utility that would download and archive a webpage. As I kept thinking on the subject I realized it has some very significant shortcomings:

  1. In the utility I parse the content of the page, find the different kinds of URLs (img, css, js) and recursively fetch the static resources; in the end I have an archive of the page (see the sketch after this list). But there is more to how a page gets rendered: browsers parse the HTML and all the resources to finally render the page. The program we write has to be equally capable, or else the archive won't be complete.
  2. Whenever I have been behind a proxy on college campuses, I have noticed reCAPTCHA popping up, saying something along the lines of suspicious activity being noticed from my connection. With an automated archival system, how do I avoid that? I have a feeling that if the system triggers reCAPTCHA during automated browsing of a page, it will be locked out and won't get any relevant content from the page. My concern is that I don't know enough about how and when captchas trigger, how to avoid or handle them, and how to guarantee access to the content of the page.
  3. Handling paywalls, limits on the number of articles in a certain time window, or banners that obfuscate the content with a login screen.
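For context, the core of that utility looks roughly like the sketch below, using requests and BeautifulSoup (simplified, not the actual code):

import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def archive_page(url, out_dir="archive"):
    """Download a page and the static resources (img, css, js) it references."""
    os.makedirs(out_dir, exist_ok=True)
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Collect resource URLs from img/script "src" and link "href" attributes
    resources = [tag.get("src") or tag.get("href")
                 for tag in soup.find_all(["img", "script", "link"])]
    for resource in filter(None, resources):
        resource_url = urljoin(url, resource)
        filename = os.path.basename(urlparse(resource_url).path) or "index"
        with open(os.path.join(out_dir, filename), "wb") as fh:
            fh.write(requests.get(resource_url, timeout=30).content)

    with open(os.path.join(out_dir, "page.html"), "w", encoding="utf-8") as fh:
        fh.write(html)

This only covers the happy path: anything fetched or rendered by JavaScript is missed, which is the first shortcoming above.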

I am really slow at executing and experimenting around these questions. And I feel that unless I start working on it, I will keep collecting these questions and adding more inertia to taking a crack at the problem.

Teaching programming to kids

I am a programmer, and in the past I have tried to engage kids and grownups with programming. I know for a fact that it is difficult. I also understand that I lack the professional skill set needed to teach someone. But recently I came across ad campaigns claiming they could turn children into programming wizards; I found their ads amusing at best.

The programming community has a strong culture of documenting its journey: the journey of becoming a programmer, solving a problem, finding a novel way of doing something. And this documentation is very accessible. By accessible I mean written in a way that is conducive to learning. This in turn creates more contributors: people write follow-up posts, adapt existing posts to do something else, share their code, document how to use it. And we have this constant supply of content that ranges from newcomer-friendly tutorials to deep-dive, highly technical, good quality blogs. In my personal experience of learning to program, this is the version of the community I relate to the most.

All this content is accessible to everyone. Anyone can read these posts, learn from them, and become a programmer. That's all it takes to be a programmer: to navigate these resources, understand them, use them, improvise, adapt, improve, fix. There is no expectation to be qualified in any respect. Just your computer, access to the internet, and a commitment to your pursuit.

The Indian IT services sector has thrived on this low-hanging fruit. They have successfully proven that anyone can be trained to become a programmer to some level of proficiency. They hire students from any stream of engineering and train them, and at some point even an engineering degree was not required. Here is a quote from an article from 2013:

Companies such as Wipro, Infosys, TCS, Cognizant, ITC Infotech and KPIT Cummins among others are hiring more of science (BSc), computer science (BCA) graduates and even diploma holders for testing, support services and managing IT infrastructure of clients at lower wages.

Today everyone knows what it takes to be considered a programmer. With projects like GitHub and Discord, and outreach programs by big companies, the whole ecosystem is shifting to be more beginner friendly. If we go by Dollar Dreams (a movie by Sekhar Kammula), all it takes to be hired by a company as a software developer is confidence. My friend Sai Teja summed it up eloquently:

So in general there is an awareness now more than ever that programming is skill that doesn't require a degree and easier to teach relatively and has significant impact on future prospects of students who understand programming.

It is safe to say that programming is a skill, just like music, painting or woodwork. They can all be taught. And that's where the ad campaigns from companies that teach programming to kids step in. They are aggressive, and they use hyperbole excessively to amp up their appeal. There is no other company running campaigns pitching to teach children other skills. Why do we have such campaigns only for programming? Firstly, it can be attributed to the domain itself: with tech, things can scale to the masses, and there is high scope for optimizing for profits later. Secondly, I think the analogy between cricket and other sports fits here. Just like Kohli is a more relatable figure than Messi, we now have local figures who have thrived in the technology domain: Pichai (Google), Nadella (Microsoft), Bansal (Flipkart) et al. People can aspire for their kids to grow into such a role or position. And yes, it is not easy and it takes a lot of hard work to reach those top spots. Not every programmer gets hired by Google or Facebook, or creates a mobile app worth a million dollars. But even an average programmer has a shot at being hired by some IT company and can earn a decent living from this skill. Another friend of mine, Puneeth, summed it up like this:

Most things are a skill that can be taught… like playing Cricket to singing. Some people have some quirks that make them special at it, but you can be an average programmer or an average musician or cricketer with training. The lure is much bigger with programming than the other things by the fact that there's a lot of scope for making a living out of it.

The ad campaigns we mentioned earlier appeal, first of all, to the parents. Their message is simple: "Teach your kids programming, it is as essential as learning English; if you don't know it, you will be left out of a lot of opportunities, if not all." And parents buy into it. They are inherently concerned about the future of their kids. Automation in its new form is eating everyone's lunch, and school curricula move at a glacial pace to adapt to the ever-changing demands of the market. It makes sense to think along these lines: advancements in automation are largely driven by advancements in machine learning and artificial intelligence, so let's be part of the technology industry. But the tech industry itself is seeing huge changes because of this shift, and we are not immune to the threat of becoming obsolete. We will have to adjust, pivot and learn new things to stay relevant, often on our own, without a course or a personal instructor, relying just on the online resources I mentioned earlier. Here is the thing: the courses that teach programming are just one of the gateways into this world, where you will have to constantly keep learning and unlearning. This would be the start of the journey. The point I am trying to make is that it will become harder and harder to survive as an average programmer, or rather as average anything. Any course (paid or free) that triggers genuine curiosity in kids about the subject, helps them understand the domain and leaves them feeling confident solving problems: go for it. When I look at ads from WhiteHatJr, I don't feel that. The promises they are selling have nothing to do with programming as a skill.

I am repeating myself, but I want to stress the point that there are many, many resources and courses available for free already. I think they are the best starting point to gauge a kid's interest. Scratch offers one such approach to programming, so does Blockly, and there are many such projects. They have intuitive user interfaces that ease the steep learning curve of understanding the core concepts of programming. And don't underestimate kids: they can navigate complex situations on their own and learn something from them. Just look at the contemporary popular games. They are hard and complex, they are multiplayer, and kids have to work as a team to get through them. Yet with a playful attitude and the absence of a fear of losing, they thrive, one stage, one boss, one challenge at a time.

I am personally biased against these specific ad campaigns. They are misleading and dubious, and I am reminded of the IIPM ad campaign, "Dare to think beyond IIMs", and that didn't end well. They are tapping into the insecurities and doubts of parents. I have serious doubts about how much they will be able to deliver on the dreams they are selling; their methods, their talent pool of housewives turned programming instructors, and their claims of fictitious alumni don't matter.

Clojure: Apply new learning to do things better

Context: I was learning Clojure while actively using it to create a CLI tool at work. In the past I have worked a lot with Python. In this post I am documenting the evolution of a certain piece of code as I learn new concepts, improve readability, refactor the solution and run into new doubts. I am also trying to relate this learning process to other content I have come across on the web.

Problem statement:

I have flat CSV data. Some of the rows are related, based on common values (order_id, product_id, order_date).

Task: Consolidate (reduce) orders from multiple rows that have the same order_id into a different, restructured form.

group-by would take me almost there, but I need a different format of data: a single entry per order_id, with all the products belonging to it under an items key.

order_id;order_date;firstname;surname;zipcode;city;countrycode;quantity;product_name;product_id
3489066;20200523;Guy;Threepwood;10997;Berlin;DE;2;Product 1 - black;400412441
3489066;20200523;Guy;Threepwood;10997;Berlin;DE;1;Product 2 - orange;400412445
3481021;20200526;Murray;The skull;70971;Amsterdam;NL;1;Product - blue;400412305
3481139;20200526;Haggis;MacMutton;80912;Hague;NL;5;Product 1 - black;400412441

First attempt:

After reading the first few chapters of Brave and True, and with a lot of trial and error, I got the following code to give me the results I wanted:

(defn read-csv
  ""
  [filename]
  (with-open [reader (io/reader filename)]
    (into [] (csv/read-csv reader))))

(defn get-processed-data
  "Given filename of CSV data, returns vector of consolidated maps over common order-id"
  [filename]
  ;; key for order
  (defrecord order-keys [order-id order-date first-name second-name
			 zipcode city country-code quantity product-name product-id])
  (def raw-orders (read-csv filename))
  ;; Drop header row
  (def data-after-removing-header (drop 1 raw-orders))
  ;; Split each row over ; and create vector from result
  (def order-vectors (map #(apply vector (.split (first %) ";")) data-after-removing-header))
  ;; Convert each row vector into order-map
  (def order-maps (map #(apply ->order-keys %) order-vectors))
  ;; Keys that are specific to product item.
  (def product-keys [:product-id :quantity :product-name])
  ;; Keys including name, address etc, they are specific to user and order
  (def user-keys (clojure.set/difference (set (keys (last order-maps))) (set product-keys)))
  ;; Bundle product items belonging to same order. Result is hash-map {"order-id" [{product-item}]
  (def order-items (reduce (fn [result order] (assoc result (:order-id order) (conj (get result (:order-id order) []) (select-keys order product-keys)))) {} order-maps))
  ;; Based on bundled products, create a new consolidated order vector
  (reduce (fn [result [order-id item]] (conj result (assoc (select-keys (some #(if (= (:order-id %) order-id) %) order-maps) user-keys) :items item))) [] order-items))

I am already getting anxious looking at this code. Firstly, the number of variables is completely out of hand; only the last expression is an exception, because it returns the result. Secondly, if I tried to club some of the steps together, like dropping the first row, creating a vector and then creating the hash map, it looked like:

(def order-maps (map #(apply ->order-keys (apply vector (.split (first %) ";"))) (drop 1 raw-orders)))

The code was becoming more unreadable. I tried to compensate with elaborate doc-strings, but they aren't that helpful either.

In Python, when I quickly tried to write an equivalent:

order_keys = ['order-id', 'order-date', 'first-name', 'second-name',
              'zipcode', 'city', 'country-code', 'quantity', 'product-name', 'product-id']
product_keys = ['quantity', 'product-name', 'product-id']
raw_orders = [dict(zip(order_keys, line.split(';')))
              for line in csv_data.split('\n') if line and 'order' not in line]
order_dict = {}
for row in raw_orders:
    order_id = row['order-id']
    item = {key: row[key] for key in product_keys}
    try:
        order_dict[order_id]['items'].append(item)
    except KeyError:
        order_dict[order_id] = {'items': [item],
                                'order-details': {key: row[key] for key in order_keys
                                                  if key not in product_keys}}
order_dict.values()

Not the cleanest implementation, but by the end of it I have the product items consolidated per order, along with all the other details.

And I think this is part of the problem. I was still not fully adapted to the ways of Clojure; I was forcing Python's way of thinking onto Clojure. It was time to refactor, learn more and clean up the code.

Threading Macros - Revisiting the problem:

I was lost, my Google queries became vague ("avoid creating variables in clojure"), and I paired with colleagues to get a second opinion. Meanwhile I thought of documenting this process in #Writing-club. As we were discussing what I would be writing, Punchagan introduced me to the concept of threading macros. I was not able to understand or use them right away; it took me time to warm up to their brilliance. I started refactoring the above code into something like:

(ns project-clj.csv-ops
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]
            [clojure.set]
            [clojure.string :as string]))

(defn remove-header
  "Checks first row. If it contains string order, drops it, otherwise returns everything.
  Returns vector of vector"
  [csv-rows]
  (if (some #(string/includes? % "order") (first csv-rows))
    (drop 1 csv-rows)
    csv-rows))

(defn read-csv
  "Given a filename, parse the content and return vector of vector"
  [filename]
  (with-open [reader (io/reader filename)]
    (remove-header (into [] (csv/read-csv reader :separator \;)))))

(defn get-items
  "Given vector of vector of order hash-maps with common id:
   [{:order-id \"3489066\" :first-name \"Guy\":quantity \"2\" :product-name \"Product 1 - black\"  :product-id \"400412441\" ... other-keys}
    {:order-id \"3489066\" :first-name \"Guy\" :quantity \"1\" :product-name \"Product 2 - orange\" :product-id \"400412445\"}]

   Returns:
   {:order-id \"3489066\"
   :items [{:product-id \"400412441\" ...}
	   {:product-id \"400412445\" ...}]}"
  [orders]
  (hash-map
     :order-id (:order-id (first orders))
     :items (vec (for [item orders]
			  (select-keys item [:product-id :quantity :product-name])))))

(defn order-items-map
  "Given Vector of hash-maps with multiple rows for same :order-id(s)
   Returns Vector of hash-maps with single entry per :order-id"
  [orders]
  (->> (vals orders)
       (map #(get-items %) ,,,)
       merge))

(defn user-details-map
  "Given Vector of hash-maps with orders
   Returns address detail per :order-id"
  [orders]
  (->> (vals orders)
       (map #(reduce merge %) ,,,)
       (map #(dissoc % :product-id :quantity :product-name))))

(defn consolidate-orders
  "Given a vector of orders consolidate orders
  Returns vector of hash-maps with :items key value pair with all products belonging to same :order-id"
  [orders]
  (->> (user-details-map orders)
       (clojure.set/join (order-items-map orders) ,,,)
       vector))

(defn format-data
  [filename]
  (defrecord order-keys [order-id order-date first-name second-name
			 zipcode city country-code quantity product-name product-id])
  (->> (read-csv filename)
       (map #(apply ->order-keys %) ,,,)
       (group-by :order-id ,,,)
       consolidate-orders))

Doubts/Observations

Although an improvement over my first implementation, the refactored code has its own set of new doubts and concerns, some of which Punchagan raised:

  • How to handle errors?
  • How to short-circuit execution when one of the functions/macros fails (using some->>)?
  • Some lines are still very dense.
  • The Clojure code has gotten bigger.
  • Should this threading be part of a function, and should I write tests for that function?

As the readings and anecdotes shared by people in the articles referred to above suggest, I need to read and write more, continue on the path of Brave and True, and not get stuck in the loop of the advanced beginner.

Second Brain - Archiving: Keeping resources handy

Problem Statement

We are suffering from information overload, especially from content behind the walled gardens of social media platforms. The interfaces are designed to keep us engaged by presenting us with the latest, most popular and most riveting content. It is almost impossible to revisit the source or refer to such a piece of content some time later. There are many products offering exactly that: "Read it later", moving content off these feeds and providing a focused interface to absorb it. I think the following are some crucial flaws in the model of learning and consuming content from social media platforms:

  1. Consent: non-consensual customization, aka optimization, of the feed.
  2. Access: there is no offline mode; the content is locked in.
  3. Intent: designed to trigger a response (like, share, comment) from the user.

In the rest of the post I would like to make the case that solving for "Access" would solve the remaining two problems as well. When we take the content out of the platform, we have the scope to rewrite the rules of engagement.

Knowledge management

As a user I am stuck in a catch-22 situation. Traditional media channels are still catching up; for any developing story their content is outdated. Social media is non-stop, buzzing 24x7. How do we get just the right amount of exposure and not get burned? How do we regain the right of choice? How do we return to what we have read and established as fact in our heads? Can we reminisce about our truths that are rooted in these posts?

These feeds are infinite. They are the only source of eclectic content, voices, stories, opinions, hot takes. As long as the service is up, the content will exist. We won't get an offline experience from these services. Memory is getting cheaper every day, but so is Internet access. Social media companies won't bother with an offline service because they are in complete control of the online experience; they have us hooked. Most importantly, an offline experience doesn't align with their business goals.

I try to keep a local reference of links and quotes from the content I read on the internet, in org files. It is quite an effort to manage that and maintain the habit. I have tried to automate the process by downloading or making a note of the links I share with other people or come across (1, 2). I will take another shot at it, and I am thinking more about the problem to narrow down the scope of the project. There are many tools, products and practices for organizing knowledge in digital form. They have varying interfaces: annotating web pages, papers and books, storing notes, wiki pages, correlating using tags, etc. I strongly feel that there is a need not just for annotating and organizing but also for archiving. Archives are essential for organizing anything. And specifically: archive all your social media platforms. Get your own copy of the data: posts, pictures, videos, links. Just have the dump. That way:

  1. No Big Brother watching over your shoulder when you access the content. Index it, make it searchable. Tag the items, highlight them, add notes; index those too, so they can be searched as well.
  2. No censorship: even if an account you follow gets blocked or deleted, you don't lose the content.
  3. No link rot: if a link shared in a post is taken down, becomes private or gets blocked, you will have your own copy of it.

This tool, the Archive, should be personal. Store it locally or on your own VPS; the point is to enable users to archive the content in the first place. How we process the content is a different problem. It is related, and part of the bigger problem of how we consume content. An ecosystem for plugging the archives into existing products can and will evolve.

Features:

The P.A.R.A method, a system for organizing all your digital information, talks about Archives: a passive collection of all the information linked to a project. In our scenario, the archive is a collection of all the information from your social media. In that sense, I think this Archive tool should have the following features:

  • Local archive of all your social media feeds. From everyone you follow, archive what they share:
    • Web-pages, blogs, articles.
    • Images.
    • Audios, podcasts.
    • Videos.
  • Complete social media timelines from all your connections are accessible and available locally. Customize, prioritize, categorize, do whatever you would like to do. Take back control.
  • Indexed and searchable.

Existing products/methods/projects:

The list of products is ever growing. Here are a few references that I found most relevant:

Thank you punchagan for your feedback and review of the post.