
Command line arguments with Clojure

I am new to Clojure land and I am working on a command line tool using it. I found the tools.cli library for processing command line arguments. From the documentation it looked like it has a lot of features, but I got overwhelmed by it. I wanted something simpler; for me, simpler meant something that would let me get more comfortable with Clojure syntax. My requirements were straightforward: the first argument would be the name of the task and the rest of the arguments would be associated with that task.

While searching, clojuredocs showed *command-line-args*:

A sequence of the supplied command line arguments, or nil if none were supplied

It looked good and the syntax was easy. I could do (first *command-line-args*) to get the first argument, (second ...) would give me the second, and so on and so forth. I tested it with lein run arg1 arg2 and it worked as expected. Fine, done.

Later, I created a standalone jar of the tool. Strangely, with the jar (built using lein uberjar), when I passed arguments as java -jar cli-tool.jar arg1 arg2, my command line arguments didn't get identified. It seems *command-line-args* didn't work with java (?). I checked that the main function takes & args as its argument and that it is a sequence. From the book CLOJURE for the BRAVE and TRUE:

The term sequence here refers to a collection of elements organized in linear order

And

Lists, vectors, sets, and maps all implement the sequence abstraction.

So ideally I should be able to do (first args) and that should work just like it did for *command-line-args*. I quickly tried that: I replaced all *command-line-args* with args. lein run worked as expected, and when I created a standalone jar, even that was able to process my command line arguments. Cheers for the abstraction :)

Engaging with "I have an idea" pitch

Some Background

I am a software engineer, working from home, in a neighborhood where the idea is still catching on. Often folks ask me what exactly I do, and at times the conversation segues into them running their product idea by me. The questions would be: can it be done (yes), what would it take (an app, servers? aka nuts and bolts), how much would it cost, and in the end an assertion, "It can work, right?". There is this particular fella with whom I used to play gully cricket and stuff when we were younger. His elder brother is also a software engineer, working at a bank. He has run his app ideas by me a couple of times already. Sharing one such conversation:

Phone call

him> Yaar/Dude, how much would it cost to make an Android App?

me> Umm… <confused>. It depends on what app you want to make, bhai.

him> Let us say I want to make an app around delivering grocery items. This lockdown is really hard. There is no business, and no timeline on when it will resume. I have been thinking lately of doing something in the meanwhile. You know, to keep the cash flowing. Something.

me> I can understand, bhai. But this domain is tough. There is a lot of competition; many people are doing this. Margins are narrow, and often there are none.

him> We will start with this colony. Just make them buy from us, give them a discount. You just tell me, how much will it cost? Can I get good folks to develop it for me?

me> It will cost yaar. It won't be cheap.

him> I want something simple. A friend of mine recently released a similar grocery app. It was bad, really bad. There was no way to search for items. He hired someone who charged him almost 2 lakh INR, took four months, and made this unusable, crappy app. I can do something better.

me> That is cheap re.

him> Cheap? 2 Lakh? That is not cheap. In these tough times, he won't be able to get his investment back from this App.

me> Well, that's the problem, right? In such a business, with money constraints, you invest to save money; you can't invest expecting it (the app) to make money.

him> What?

me> In your case, software or an "app" is either expanding your customer base or automating something. It would be reducing manual work and saving you time in some way. Let us assume you have an existing business: you have your customers, your inventory, and folks who work in the shop. Now in the lockdown you want to start home delivery. Okay, that makes sense. People can call or text their items to your WhatsApp number and you have them delivered.

him> Yes. But then, what does the app do?

me> Getting there. You have someone manning your WhatsApp and your phone, noting down orders and making sure they get delivered. You have an existing supply chain. Now as you grow, it will be hard to manage all this: taking calls, reading texts, coordinating deliveries. In that particular scenario you want to optimize your distribution, ordering and inventory system. There, investing in the app makes sense. You have a fixed inventory, people scroll and place an order, and on your side of the app you have a way to make sure that all orders are fulfilled.

me> Again, without such an operation, you can't invest in an app expecting it will start making you money. Does that make sense? Also, trust me, hiring a good developer to deliver a working app would cost good money. Start small: start taking orders over phone and text. Reconsider the app when you have things working and have some surplus to improve the process.

He got similar advice from other folks (including his elder brother) too. After a few days, he called again saying he had started taking orders, and that if I need anything, I should give him a call or text him.

Selenium based frontend tests using Python and Docker

The goal of the post is:

Using Python, write frontend tests for your website that can run on a remote server, to make sure the site is operational.

Let's elaborate on some of these terms: frontend tests, remote server, and writing tests in Python.

  1. Frontend tests, or more like browser-based tests. They can be used to instruct a browser to:

    1. Open a given URL.
    2. Make sure that the webpage has certain key elements, like signup, login, etc.
    3. Complete a new signup process.
    4. Log in using the newly created account.
    5. Confirm that, once logged in, certain attributes are specific to the newly created user.

    These steps should be automated. A popular tool of choice for doing this is Selenium. The project provides driver binaries for different browsers and API bindings for all popular programming languages. Given that our requirement is to write tests in Python, we will use the Python bindings.

  2. Getting tests to run on a remote server. Or more like, setting up the tests in such a manner that they can be run in any environment. Let's break down what we mean by this:
    1. Download binary drivers for all the browsers you would test with.
    2. Get the Python bindings and dependencies installed.
    3. Run the tests in headless browsers. Generally, when you run Selenium-based tests, you will notice a browser open up and perform all the steps you have in your tests. That requires a working display environment in which browsers can open and render the site. On remote servers, we don't have a traditional user interface running, so we run UI-based browser tests in a browser without its graphical interface. This is known as a headless browser (a minimal local sketch follows below).

A headless browser is a great tool for automated testing and server environments where you don't need a visible UI shell.
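
As a rough illustration of points 1 and 2.3, here is a minimal, local sketch (not the setup we will end up with) that drives a headless Firefox against python.org. It assumes the geckodriver binary is already on PATH:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Run the browser without its graphical interface.
options = Options()
options.add_argument("--headless")

# Assumes the geckodriver binary is available on PATH.
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://www.python.org")
    # A very small "key element" check: the page title mentions Python.
    assert "Python" in driver.title
finally:
    driver.quit()

The rest of the post replaces this local driver with a remote one so the same tests can run anywhere.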

Okay, how do we do this?

I will use docker and docker-compose for this, because:

  1. The solution can run in any user environment (operating system) that supports docker-based development.
  2. docker-compose can take care of our requirement of providing multiple headless browsers.

From the official documentation:

Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.

The Selenium project provides docker images for different browsers. It has the concept of a Grid that:

is a smart proxy server that allows Selenium tests to route commands to remote web browser instances. Its aim is to provide an easy way to run tests in parallel on multiple machines.

In such a grid, we can create a selenium-hub that routes tests to different nodes, each running a different browser, that are registered with the hub. We will see how to use docker-compose to set up such a grid, register nodes running specific browsers, and finally write tests that use this grid to run frontend tests.

This is the directory structure:

$ ls
docker-compose.yml  Dockerfile  entrypoint.sh  requirements.txt  webtests.py

Selenium's docker documentation talks about an example compose configuration:

version: "3"
services:
  tests:
    build:
      context: .
      dockerfile: Dockerfile
    depends_on:
      - firefox
  selenium-hub:
    image: selenium/hub:3.141.59-20200409
    container_name: selenium-hub
    ports:
      - "4444:4444"
  firefox:
    image: selenium/node-firefox:3.141.59-20200409
    volumes:
      - /dev/shm:/dev/shm
    depends_on:
      - selenium-hub
    environment:
      - HUB_HOST=selenium-hub
      - HUB_PORT=4444

We have three services in this configuration. The first one is tests; this is our test setup, and we will look at it in a moment. The second one is selenium-hub, which creates a Hub that can "expose" access to different kinds of browsers. And lastly, the third service is the firefox service, which registers itself with selenium-hub. It will be responsible for running tests in the Firefox browser.

Given that this is docker land, we will write a Dockerfile that creates our tests service:

Dockerfile:

FROM python:3.7

WORKDIR /opt

ADD webtests.py /opt
ADD entrypoint.sh /opt
ADD requirements.txt /opt

RUN python -m pip install --upgrade pip
RUN pip3 install -r requirements.txt

ENTRYPOINT ["/bin/bash", "-c", "/opt/entrypoint.sh"]

Our image is based on the official python:3.7 Docker image. We add three additional files to that image. requirements.txt is needed to install the packages, on top of vanilla Python 3.7, that our tests need.

requirements.txt:

selenium==3.141.0

entrypoint.sh is a bash script that runs the tests:

#!/bin/bash
python -m unittest -v webtests.py

We will use the example test from the Selenium documentation. In our setUp we add busy-wait logic that waits for selenium-hub to become available. The Firefox driver is running inside a separate service, so we use the concept of a Remote WebDriver and connect to it via http://selenium-hub:4444/wd/hub:

import time
import unittest

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from urllib3.exceptions import MaxRetryError

class TestPythonOrgSearch(unittest.TestCase):
    def setUp(self):
        while True:
            try:
                self.driver = webdriver.Remote(
                    command_executor='http://selenium-hub:4444/wd/hub',
                    desired_capabilities=DesiredCapabilities.FIREFOX
                )
            except (WebDriverException, MaxRetryError):
                print('Waiting for selenium hub to become available...')
                time.sleep(0.2)
            else:
                print('Connected to the selenium hub.....')
                break

    def test_search_in_python_org(self):
        driver = self.driver
        driver.get("http://www.python.org")
        self.assertIn("Python", driver.title)
        elem = driver.find_element_by_name("q")
        elem.send_keys("pycon")
        elem.send_keys(Keys.RETURN)
        assert "No results found." not in driver.page_source

    def tearDown(self):
        self.driver.close()

if __name__ == "__main__":
    unittest.main()

That's it. To run this test do:

docker-compose build; docker-compose run --rm tests; docker-compose down
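
If the run seems stuck printing "Waiting for selenium hub to become available...", you can check from the host whether the hub is ready yet; the grid exposes a status endpoint on the port we mapped in the compose file:

$ curl http://localhost:4444/wd/hub/status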

How do we run the same tests with different browsers?

Let us add a chrome service that registers with our selenium-hub:

docker-compose.yml:

version: "3"
services:
  tests:
    build:
      context: .
      dockerfile: Dockerfile
    depends_on:
      - firefox
      - chrome
  selenium-hub:
    image: selenium/hub:3.141.59-20200409
    container_name: selenium-hub
    ports:
      - "4444:4444"
  chrome:
    image: selenium/node-chrome:3.141.59-20200409
    volumes:
      - /dev/shm:/dev/shm
    depends_on:
      - selenium-hub
    environment:
      - HUB_HOST=selenium-hub
      - HUB_PORT=4444
  firefox:
    image: selenium/node-firefox:3.141.59-20200409
    volumes:
      - /dev/shm:/dev/shm
    depends_on:
      - selenium-hub
    environment:
      - HUB_HOST=selenium-hub
      - HUB_PORT=4444

Now we have to configure our tests so that they can run with both browsers. Based on a stackoverflow conversation, I used an environment variable to do that. Change the entrypoint.sh bash script to:

#!/bin/bash

echo 'Running tests with firefox'
BROWSER=firefox python -m unittest -v webtests.py
echo 'Running tests with chrome'
BROWSER=chrome python -m unittest -v webtests.py

With this change, BROWSER is set as an environment variable; we can access it in our tests and switch browsers accordingly:

#!/usr/bin/env python3
import os
import time
import unittest
import warnings

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from urllib3.exceptions import MaxRetryError

class TestPythonOrgSearch(unittest.TestCase):
    def setUp(self):
        warnings.simplefilter("ignore", ResourceWarning)
        if os.environ.get('BROWSER') == 'chrome':
            browser = DesiredCapabilities.CHROME
        else:
            browser = DesiredCapabilities.FIREFOX
        while True:
            try:
                self.driver = webdriver.Remote(
                    command_executor='http://selenium-hub:4444/wd/hub',
                    desired_capabilities=browser
                )
            except (WebDriverException, MaxRetryError):
                print('Waiting for selenium hub to become available...')
                time.sleep(0.2)
            else:
                print('Connected to the selenium hub.....')
                break

    def test_search_in_python_org(self):
        driver = self.driver
        driver.get("http://www.python.org")
        self.assertIn("Python", driver.title)
        elem = driver.find_element_by_name("q")
        elem.send_keys("pycon")
        elem.send_keys(Keys.RETURN)
        assert "No results found." not in driver.page_source

    def tearDown(self):
        self.driver.close()

if __name__ == "__main__":
    unittest.main()

And as you run the tests again with the same command:

$ docker-compose build; docker-compose run --rm tests; docker-compose down
[...]
Running tests with firefox
test_search_in_python_org (webtests.PythonOrgSearch) ... Waiting for selenium hub to become available...
Waiting for selenium hub to become available...
Waiting for selenium hub to become available...
Connected to the selenium hub.....
ok

----------------------------------------------------------------------
Ran 1 test in 14.396s

OK
Running tests with chrome
test_search_in_python_org (webtests.PythonOrgSearch) ... Connected to the selenium hub.....
ok

----------------------------------------------------------------------
Ran 1 test in 4.637s

OK

Comparing URLs for similarity

URLs, links and webpages are archived in the WARC format. I had experimented with creating WARC files when I was working on SoFee. Back then I tried using a library from the Internet Archive, but it was not maintained, not compatible with Python 3, and didn't work.

I came across another Python archiving library, warcio. It has a simple API that can create and read archive files, and it got me excited to resume working on SoFee 2.0.

warcio monkey-patches requests and captures all GET requests to create a single WARC file. This WARC file can be stored and accessed anytime and ideally should render just like the original webpage, even if the original is removed, deleted or no longer exists. To make the archive as close to the original as possible, we need to fetch all the static content (images, JavaScript, CSS, icons) embedded or used in the webpage (HTML). I use an HTML parsing library to find links to such resources, then I repeatedly fetch these resources using the requests library, and meanwhile warcio neatly tucks all of them into a single WARC file. It just works.
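
A minimal sketch of this capture flow, based on warcio's documented capture_http helper (the output filename is just an example):

from warcio.capture_http import capture_http
import requests  # requests must be imported after capture_http

# Everything fetched inside this block gets written into the WARC file.
with capture_http('archive.warc.gz'):
    requests.get('https://www.python.org/')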

One step in optimizing the process of fetching these resources is to avoid redundant fetches of the same URL when its string representations don't appear similar. Such URLs are in principle the same, but because of some caveats their string representations differ. For example:

  1. https://www.mygov.in/covid-19/ and https://www.mygov.in/covid-19 differ by a trailing /, but both are the same.
  2. https://mygov.in/covid-19 and https://www.mygov.in/covid-19 differ in the www. subdomain, but are the same.
  3. http://mygov.in/covid-19 and https://www.mygov.in/covid-19 differ in their protocol, http vs https, but are the same.

So I put together a small function that compares two URLs and checks whether they are the same or not:

from urllib.parse import urlparse

def check_url_similarity(url_1, url_2):
    '''Compare two URLs to identify if they are the same or not.
    Returns bool: True/False based on the comparison'''
    def check_path(path_1, path_2):
        # handles cases where paths are similar and just have a trailing /
        if path_1 == path_2:
            return True
        if path_1 == path_2+'/' or \
           path_1+'/' == path_2:
            return True
        else:
            return False

    if len(url_2) == len(url_1):
        if url_1 == url_2:
            return True
    else:
        url_1_struct = urlparse(url_1)
        url_2_struct = urlparse(url_2)
        if url_1_struct.netloc == url_2_struct.netloc:
            if check_path(url_1_struct.path, url_2_struct.path):
                return True
        if url_1_struct.netloc == 'www.'+url_2_struct.netloc or \
           'www.'+url_1_struct.netloc == url_2_struct.netloc:
            if check_path(url_1_struct.path, url_2_struct.path):
                return True
    return False

And I wrote these tests to make sure that this function is doing what I expect it to do:

import unittest

class TestUrlSimilarity(unittest.TestCase):
    def test_trailing_slash(self):
        url_1 = "https://www.mygov.in/covid-19/"
        url_2 = "https://www.mygov.in/covid-19"
        self.assertTrue(check_url_similarity(url_1, url_2))

    def test_missing_www_subdomain(self):
        url_1 = "https://mygov.in/covid-19"
        url_2 = "https://www.mygov.in/covid-19"
        self.assertTrue(check_url_similarity(url_1, url_2))

    def test_missing_www_subdomain_and_trailing_slash(self):
        url_1 = "https://mygov.in/covid-19/"
        url_2 = "https://www.mygov.in/covid-19"
        self.assertTrue(check_url_similarity(url_1, url_2))

        url_1 = "https://mygov.in/covid-19"
        url_2 = "https://www.mygov.in/covid-19/"
        self.assertTrue(check_url_similarity(url_1, url_2))

    def test_http_difference(self):
        url_1 = "https://mygov.in/covid-19"
        url_2 = "http://www.mygov.in/covid-19"
        self.assertTrue(check_url_similarity(url_1, url_2))

    def test_different_url(self):
        url_1 = "https://mygov.in/covid-19"
        url_2 = "https://www.india.gov.in/"
        self.assertFalse(check_url_similarity(url_1, url_2))

Information Visualization: Interpretations and Stories around them.

Nine shared this great presentation from Gurman titled:

When Statistics become stories

It was part of her talk given at DesignUp 2019. In one slide she talked about the irregular age spikes we see around multiples of 10.

I am thinking of creating an exercise around this for the AI-ML workshop to be conducted later this month at NID Gandhinagar for New Media Design Students.

At this stage of the workshop, we would have covered basic concepts around programming and Jupyter notebooks.

Section One - Introducing Pandas

I got French population and age distribution data from here, and we have cleaned it into the following structure:

Out[115]: 
   year   males  females   total  age
0  2018  364155   347749  711904    0
1  2017  370453   355472  725925    1
2  2016  378518   363162  741680    2
3  2015  387906   372402  760308    3
4  2014  399232   387042  786274    4

We would start with loading this data and introduce the concepts of:

  1. Reading the data (in this case from a CSV file, using read_csv).
  2. Exploring the structure of the data (a DataFrame), and accessing it using rows and columns.
  3. Trying basic operations over the data to answer some questions, like: for which ages is the male population larger than the female population, and vice versa (see the sketch after this list).
  4. Exploring the concept of using ? to get access to the documentation of a method/attribute.
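
A minimal sketch of what this could look like in the notebook, assuming the cleaned data is saved as france_population.csv (a hypothetical filename):

import pandas as pd

# Load the cleaned age-distribution data (hypothetical filename).
df = pd.read_csv("france_population.csv")

# Explore the structure: columns and the first few rows.
print(df.columns)
print(df.head())

# Ages where the male population exceeds the female population, and vice versa.
print(df[df["males"] > df["females"]]["age"].tolist())
print(df[df["females"] > df["males"]]["age"].tolist())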

Section Two - Plotting the data

After having played around with the data and different methods, we would shift to plotting it and see if we can answer the questions we explored in the previous section using the plots.

I am thinking of introducing them to plotting pie charts, bar graphs and lines. The age distribution of a country is generally represented as a population pyramid; here we would try to plot the same pyramid for the French population.
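
A rough sketch of one way to draw such a pyramid with matplotlib, continuing with the df loaded above (males drawn to the left as negative values, females to the right):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 10))
# Males to the left (negative counts), females to the right.
ax.barh(df["age"], -df["males"], color="steelblue", label="Males")
ax.barh(df["age"], df["females"], color="salmon", label="Females")
ax.set_xlabel("Population")
ax.set_ylabel("Age")
ax.legend()
plt.show()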

Section three - Exercise for students.

A similar age distribution of the UK population is available here. We would apply what we have learned in the above two sections and ask the students to plot a population pyramid for the UK.

Section four - Census and Age distribution of Indian population:

Akash Gutha has a repository and an IPython notebook that:

  1. Fetches the relevant data (an Excel sheet) from the Indian Census site.
  2. Cleans up the data, assigns names to the columns, and creates the related plots.

We would work on top of those steps to:

  1. Cover how the Census releases data and the accompanying guide that helps people make sense of it.
  2. Plot a population pyramid for India.
  3. Observe the differences between the population distributions of India and the UK/France.
  4. Have an open discussion around the spikes at certain ages.
  5. Share the screenshots from Gurman's presentation that explain the spikes.

At this point we conclude the session on handling data and information visualization. Possibly we will follow it with more hands-on exercises for the students.

Setting up an environment for a workshop based on Python.

I distinctly remember, while working at FOSSEE back in 2009-10, when we would conduct hands-on workshops in the labs of various institutes, we would factor in significant time to reach early and set up all the dependencies on the lab computers. Back then we would use Enthought's binaries to install everything on Windows systems. If we were lucky we would also find Linux machines in the lab, which helped a lot as we were really comfortable installing the requirements using a CLI.

Recently we scheduled an AI/ML workshop for New Media Design students at NID Gandhinagar. While preparing for it I was looking for resources. I knew about Project Jupyter and IPython notebooks but my understanding of them was very limited.

I found that JupyterHub is a brilliant project in terms of setting up the complete environment and sharing the resources with all the students. Their offering of the-littlest-jupyterhub, which is targeted at 1-100 users hosted on a single server, is perfect. However, it does need sudo and root privileges to segregate user environments. If we get access to a server at the NID campus, I will try and see if I can set it up.

Otherwise, I also came across Colab from Google, which comes with all dependencies and libraries installed, ready to be used and shared with the students. It looks really promising. I will try to put together some notebooks and exercises around the concepts we would be covering and see how both these solutions fare.

But compared to the manual setup we used to do back then, this looks like a cakewalk.

Communications

I recently got into a tense conversation with a friend. We were talking about education and I was briefing him about some popular steps a particular government was taking. During that conversation, I think my friend was trying to make the case that the things I was mentioning weren't directly related to improving the quality of education or to helping the students and teachers. He was right. But at that time, I didn't realize that and got defensive in a way that derailed the whole conversation.

Lately, I have noticed that many times I don't completely understand what's being said and I end up interrupting the conversation. Things escalate from there. It is uncomfortable, tense, exhausting, tiresome and, worst of all, the topic of conversation gets sidelined. Furthermore, even from my side, when I am trying to express myself, I often use the wrong word. I think my communication skills need a lot more work and practice, and I have to be more mindful about it.

This is one reason I like these writing-club sessions. Writing is a good exercise; it clears out the noise and makes you more focused. I have been slacking on these sessions lately, but I will try to improve on that front too.

SystemD Dependency Tree

At Senic, we have shifted to systemd for managing the many independent applications we have running on the Hub. Earlier we were using supervisord, and we switched for a bunch of reasons (fewer dependencies, a system-supported solution, etc.). systemd provides many strong features; it:

uses socket and D-Bus activation for starting services, offers on-demand starting of daemons, keeps track of processes using Linux control groups, maintains mount and automount points, and implements an elaborate transactional dependency-based service control logic

We have put together different service files that start applications as the Hub boots. Some of these services have hard dependencies on others, meaning that if the parent service is not running, the child service won't start or run. For example, if we have an application making network requests, in some scenarios it helps if that service depends on the NetworkManager service, which manages the network interfaces (or another native service which handles network connections).

This dependency tree has both benefits and issues. For us, some of the services (the parent services) initialize D-Bus objects, and child services connect or subscribe to these objects; that enables D-Bus communication between separate applications. Now if a parent service dies (SIGTERM), the child service can't continue and needs to stop. Here the systemd dependency tree takes care of this for us: it stops all dependent services if the parent stops.

But in the situation where the parent service restarts, I would say my understanding of systemd fails me. systemd correctly stops all the child services, but it doesn't restart them once the parent service starts again. I am not sure which dependency construct (Before, After, etc.) to use to make sure that once the parent service restarts, all child services also restart.

All the services have a Restart clause to make sure that the service restarts. But a restart only happens in certain scenarios. If a service is stopped using the command systemctl stop service-name.service, systemd won't start the service again. And I think this is how a child service gets stopped when the parent service restarts, and hence why it doesn't restart. Maybe.
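
One construct that might be worth trying here is PartOf=, which, per the systemd.unit documentation, propagates stops and restarts of the listed unit to the dependent unit. A rough sketch of a child unit, with purely hypothetical unit names and paths:

# child.service (hypothetical names and paths)
[Unit]
Description=Child application that subscribes to the parent's D-Bus objects
# Start after the parent, and require it to be running.
Requires=parent.service
After=parent.service
# Propagate stop and restart of parent.service to this unit.
PartOf=parent.service

[Service]
ExecStart=/usr/bin/child-app
Restart=on-failure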

Working in someone else's kitchen

Yesterday I was pairing remotely with one of my colleagues. He hosted a tmate session for me on his system. His editor of choice is Vim and I use Emacs. We were discussing some ideas about functions and what they would do, and taking turns writing the code. I know a little bit of Vim, but my muscle memory is not tuned for Vim as much as it is for Emacs. So it took a while for me; I asked some silly questions about how he was doing certain things, and it was nice to see how comfortably he was using the interface.

This morning, as I was preparing breakfast and looking for the tools in the kitchen, it reminded me of yesterday's pairing session. In the kitchen it's the food; at work it's the code. Just that the tools are placed in different locations and there are other ways of preparing things.

Both these exercises bring you out of your comfort zone. The keybindings for saving, editing and navigating are different in the editor. In the kitchen, the spices are in a different box, the box itself is placed in a different location, and they grate the ginger instead of crushing it. It makes you more alert and self-aware.

SoFee 2.0

For the past few months I have not published publicly, but in drafts I was writing a small story. It took long to put everything together, and I now have a rough draft of the story in place. But it is still not finished finished. I think it is the same as personal software projects and experiments: never ending. There is always something to improve, fix, refactor or rewrite.

On that note of ongoing projects, I will be picking up SoFee again, with the same features I was aiming for with the first version. I want to make them modular, so they can work with each other on an as-needed basis.

I was also thinking of using Clojure this time. On that, Punchagan correctly reminded me how fixing on that could be counterproductive to the project. In my first attempt at SoFee, I had decided to use Python 3, while many of the web-page parsing libraries were still using Python 2. I spent a long time porting an abandoned WARC archiving library to Python 3, and in the end that feature was not even shipped. So despite the temptation, as Punchagan suggested, it is best to look for the best library available irrespective of language and put together a minimum-feature, BUT complete, module which can:

  1. Archive a link locally.
  2. Revisit those archives without an internet connection.
  3. Index the archives, make them searchable.
  4. Possibly a command line utility which can be extended with REST endpoints.

After that, I will pick up the remaining features and try to build this, block by block.

One block at a time.