Second Brain - Archiving: Keeping resources handy


Problem Statement

We are suffering from information overload, especially from the content behind the walled gardens of Social Media Platforms. The interfaces are designed to keep us engaged by presenting the latest, most popular, and most riveting content. It is almost impossible to revisit the source of such a piece of content or refer to it later. There are many products offering exactly that, "Read it later": moving content off of these feeds and providing a focused interface to absorb things. I think the following are some crucial flaws with the model of learning and consuming content from Social Media Platforms:

  1. Consent: Non-consensual customization, aka optimization, of the feed.
  2. Access: There is no offline mode; the content is locked in.
  3. Intent: Designed to trigger a response (like, share, comment) from the user.

In the rest of the post I would like to make the case that solving for "Access" would also solve the remaining two problems. When we take the content out of the platform, we have the scope to rewrite the rules of engagement.

Knowledge management

As a user I am stuck in a catch-22 situation. Traditional media channels are still catching up; for any developing story their content is outdated. Social media is non-stop, buzzing 24x7. How to get just the right amount of exposure and not get burned? How to regain the right of choice? How to return to what we have read and established as fact in our heads? Can we reminisce over our truths that are rooted in these posts?

These feeds are infinite. They are the only source of eclectic content, voices, stories, opinions, hot takes. As long as the service is up, the content will exist. We won't get an offline experience from these services. Memory is getting cheaper every day, but so is the Internet. Social media companies won't bother with an offline service because they are in complete control of the online experience; they have us hooked. Most importantly, an offline experience doesn't align with their business goals.

I try to keep a local reference of links and quotes from the content I read on the internet in org files. It is quite an effort to manage that and maintain the habit. I have tried to automate the process by downloading or making a note of the links I share with other people or come across(1, 2). I will take another shot at it, and I am thinking more about the problem to narrow down the scope of the project. There are many tools, products and practices to organize knowledge in digital format. They have varying interfaces: annotating web pages, papers and books, storing notes and wiki pages, correlating using tags, etc. I strongly feel that there is a need not just for annotating and organizing, but also for archiving. Archives are essential for organizing anything. And specifically: archive all your Social Media platforms. Get your own copy of the data: posts, pictures, videos, links. Just have the dump. That way:

  1. No Big Brother watching over your shoulder when you access the content. Index it, make it searchable. Tag posts, highlight them, add notes; index those too, so they can also be searched.
  2. No censorship: Even if an account you follow gets blocked or deleted, you don't lose the content.
  3. No link rot: If a link shared in a post is taken down, becomes private or gets blocked, you will have your own copy of it.

This tool, the Archives, should be personal. Store it locally or on your own VPS; just enable users to archive the content in the first place. How we process the content is a different problem. It is related to, and part of, the bigger problem of how we consume content. An ecosystem of plugging the archives into existing products can and will evolve.

Features:

The P.A.R.A. method, a system to organize all your digital information, talks about Archives: a passive collection of all the information linked to a project. In our scenario, the archive is a collection of all the information from your social media. In that sense, I think this Archive tool should have the following features:

  • Local archive of all your social media feeds. From everyone you follow, archive what they share:
    • Web-pages, blogs, articles.
    • Images.
    • Audios, podcasts.
    • Videos.
  • Complete social media timelines from all your connections are accessible and available locally. Customize, prioritize, categorize, do whatever you would like to do. Take back the control.
  • Indexed and searchable.

Existing products/methods/projects:

The list of products is ever-growing. Here are a few references that I found most relevant:

Thank you punchagan for your feedback and review of the post.

Striking a balance between Clobbering and Learning


Getting stuck as an "Advanced Beginner" happens, especially when we use a new tool or language to deliver a product/project. I have noticed that I approach things with a narrow mindset: I use the tool or language to deliver what is desired. The result has the expected features, but its implementation won't be ideal. The process of unlearning these habits is long, and often, with a deadline, I end up collecting tech debt. Recently I came across some links that talk about this phenomenon:

Related Conversations on the Internet

There was a big thread on HackerNews around a better way to learn CSS(https://news.ycombinator.com/item?id=23868355) and I found this comment relevant to my experience:

They always assume every one learned like them, by trying stuff out all of the time, until they got something working. Then they iterate from project to project, until they sorted out the bad ideas and kept the good ones. With that approach, learning CSS would probably have taken me 10 times as long.

Sure this doesn't teach you everything or makes you a pro in a week, but I always have the feeling people just cobble around for too long and should instead take at least a few days for a more structured learning approach.

The last statement of the comment struck a chord: clobbering has its limitations, and it needs to be followed up with reading of fundamental concepts from a book, manual or docs.

Another post shared on HackerNews talks about the Expert Beginner paradox: https://daedtech.com/how-developers-stop-learning-rise-of-the-expert-beginner/

There’s nothing you can do to improve as long as you keep bowling like that. You’ve maxed out. If you want to get better, you’re going to have to learn to bowl properly. You need a different ball, a different style of throwing it, and you need to put your fingers in it like a big boy. And the worst part is that you’re going to get way worse before you get better, and it will be a good bit of time before you get back to and surpass your current average.

Practices that can help with the process of clobbering and learning:

  1. Tests: unit tests give code structure. They set basic expectations on how the code should and should not behave. If we maintain uniform expectations throughout the code base, unit tests help maintain a certain uniformity and quality.
  2. Writing documentation: For me this is like rubber duck debugging. It gives active feedback on what the deliverables, supported features, limitations, and upcoming features are.
  3. Pairing with colleagues over the concepts and implementation. Walking through the code and explaining it to colleagues helps me identify sections of code that make me uncomfortable: where I am weak and where I should focus to improve.
  4. Though similar to pairing, Code Reviews have their own importance and value.

These practices won't replace the need to read docs or books, but they will certainly give you good quality code and keep your tech debt in check.

Clojure, hash-map, keys, keyword


tl;dr: Plain strings can be used as keys in a hash-map. Either use get to look them up, or convert them into keywords using the keyword function.

hash-maps are an essential data structure in Clojure. They support an interesting feature, keywords, that can really enhance the lookup experience.

;; Define a map, with keywords as keys
user=> (def languages {:python "Everything is an Object."
		       :clojure "Everything is a Function."
		       :javascript "Whatever you would like it to be."})
;; To lookup in map
user=> (:python languages)
"Everything is an Object."
user=> (get languages :ruby)
nil

The syntax is easy to understand and easy to follow. So far so good. I started using it here and there. At one point I ran into a situation where I had to do a lookup in a map using a variable:

user=> (def brands {:nike "running shoes"
  #_=> :spalding "basketball"
  #_=> :yonex "badminton"
  #_=> :wilson "tennis racquet"
  #_=> :kookaburra "cricket ball"})

(def brand-name "yonex")

Because we have used keywords in the map brands, we can't use the value stored in the variable brand-name directly to do a lookup in the map. I tried silly things like :str(brand-name) (results in an Execution error) or :brand-name (returns nil). I got confused about how to do this. Almost all the examples in the docs were using keywords. I tried a few things and understood that we can indeed use strings as keys, and fetch the value using the get function:

user=> (def brands {"nike" "runnin shoes"
  #_=> "spalding" "basketball"
  #_=> "yonex" "badminton"
  #_=> "wilson" "tennis racquet"
  #_=> "kookaburra" "cricket ball"})
#'user/brands
user=> (get brands brand-name)
"badminton"

While using keywords gives simpler syntax, at times, like when consuming external APIs, it is easier to work with strings or to look up a key in a hash-map using a variable. In Python I do it all the time. Though I am not sure if using strings as keys is the recommended way.
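
For comparison, here is a minimal Python sketch (with made-up values mirroring the example above) of the lookup I am used to, where a plain string, or a variable holding one, works directly as a dict key:

brands = {"nike": "running shoes", "yonex": "badminton"}
brand_name = "yonex"
print(brands[brand_name])     # badminton
print(brands.get("wilson"))   # None, Python's counterpart of Clojure's nil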

Update <2020-07-08 Wed>

punch and I were discussing this post and he mentioned that in Lisp we can use keyword as a function to convert a string into a keyword. After a quick search, TIL, indeed we can keyword a variable. The function converts a string into the equivalent keyword:

user=> (def brands {"nike" "runnin shoes"
  #_=> "spalding" "basketball"
  #_=> "yonex" "badminton"
  #_=> "wilson" "tennis racquet"
  #_=> "kookaburra" "cricket ball"})
#'user/brands
user=> (keyword brand-name)
:yonex
user=> ((keyword brand-name) brands)
"badminton"

Clojure Command Line Arguments II


I wrote a small blog post on parsing command line arguments earlier this week. The only comment I got on it, from punch, was:

I'd like to learn from this post why command-line-args didn't work. What is it, actually, the command-line-args thingy, etc.?

Those are good questions. I didn't know the answers to them. As I mention in the post, I wanted a simple solution for parsing args, and another confusing experience brought me back to the same questions: why do args behave the way they do, and what is *command-line-args*? I still don't have the answers. In this post I am documenting two things, for my own better understanding: one, reproducing the issue of the jar not working with *command-line-args*; and two, the limited sequence features supported by args.

Reproducing the jar's (non)handling of *command-line-args*

We create a new project using lein:

$ lein new app cli-args
Generating a project called cli-args based on the 'app' template.
$ cd cli-args/
$ lein run
Hello, World!

We edit src/cli_args/core.clj to print args and *command-line-args*:

cat <<EOF > src/cli_args/core.clj
(ns cli-args.core
  (:gen-class))

(defn -main
  "I don't do a whole lot ... yet."
  [& args]
  (println "Printing args.." args)
  (println "Printing *command-line-args*" *command-line-args*))
EOF

$ lein run optional arguments
Printing args.. (optional arguments)
Printing *command-line-args* (optional arguments)

Now we create a jar using lein uberjar:

$ lein uberjar
Compiling cli-args.core
Created target/uberjar/cli-args-0.1.0-SNAPSHOT.jar
Created target/uberjar/cli-args-0.1.0-SNAPSHOT-standalone.jar
$ cd target/uberjar/
$ java -jar cli-args-0.1.0-SNAPSHOT-standalone.jar testing more optional arguments
Printing args.. (testing more optional arguments)
Printing *command-line-args* nil

So lein run populates *command-line-args*, but running the jar directly with java does not. That narrows down the problem and can possibly lead to an explanation of why it is happening (maybe in another post). A likely lead: lein run goes through clojure.main, which binds *command-line-args*, while the uberjar's entry point is our AOT-compiled class, bypassing clojure.main entirely.

Sequence features supported by args

I noticed another anomaly with args. I was passing a couple of arguments and noticed that it doesn't support the get function.

cat <<EOF > src/cli_args/core.clj
(ns cli-args.core
  (:gen-class))

(defn -main
  "I don't do a whole lot ... yet."
  [& args]
  (println "first argument" (first args))
  (println "second argument" (second args))
  (println "third argument" (get args 3)))
EOF

This is what I noticed as I tried different inputs:

$ lein run
first argument nil
second argument nil
third argument nil
$ lein run hello world
first argument hello
second argument world
third argument nil
$ lein run hello world 3rd argument
first argument hello
second argument world
third argument nil

The get function doesn't work. I printed the type of args and it is clojure.lang.ArraySeq. That explains it: get works on associative or indexed collections like maps and vectors, while a plain seq like ArraySeq doesn't support keyed lookup, so get quietly returns nil (nth would have worked here). For my case, I "managed" by using last and that gave me what I wanted. Still, I am running out of options, and I would have to either dig deeper to understand args or fall back to using a library (tools.cli).

Command line arguments with Clojure


I am new to Clojure land and I am working on a command line tool using it. I found the tools.cli library for processing command line arguments. From the documentation it looked like it has a lot of features, but I got overwhelmed by it. I wanted something simpler. For me, simpler meant something that would let me get more comfortable with Clojure syntax. My requirements were straightforward: the first argument would be the name of the task, and the rest of the arguments would be associated with that task.

While searching, clojuredocs showed *command-line-args*:

A sequence of the supplied command line arguments, or nil if none were supplied

It looked good. The syntax was easy. I could do (first *command-line-args*) to get the first argument, (second ..) would give me the second, and so on and so forth. I tested it with lein run arg1 arg2 and it worked as expected. Fine, done.

Later, I created a standalone jar of the tool (using lein uberjar). Strangely, as I passed arguments to java -jar cli-tool.jar arg1 arg2, my command line arguments didn't get identified. It seems *command-line-args* didn't work with java(?). I checked that the main function takes & args as its argument, and that args is a sequence. From the book CLOJURE for the BRAVE and TRUE:

The term sequence here refers to a collection of elements organized in linear order

And

Lists, vectors, sets, and maps all implement the sequence abstraction.

So ideally I should be able to do (first args), and that should work just like it did with *command-line-args*. I quickly tried that: I replaced all *command-line-args* with args. lein run worked as expected, and when I created a standalone jar, even that was able to process my command line arguments. Cheers for the abstraction :)

Engaging with "I have an idea" pitch


Some Background

I am a software engineer, working from home, in a neighborhood where the idea is still catching up. Often folks ask me what exactly I do, and at times the conversation segues into them running their product idea by me. The questions would be: can it be done (yes), what would it take (app, servers? aka nuts and bolts), how much would it cost, and in the end an assertion, "It can work, right?". There is this particular fella with whom I used to play gully cricket and stuff when we were younger. His elder brother is also a software engineer, working in a bank. He has run app ideas by me a couple of times already. Sharing one such conversation:

Phone call

him> Yaar/Dude, how much would it cost to make an Android App?

me> Umm…. <confused>. It depends on what App you want to make bhai.

him> Let us say I want to make an app for delivering grocery items. This lockdown is really hard. There is no business. No timeline on when it will resume. I have been thinking lately of doing something in the meanwhile. You know, to keep the cash flowing. Something.

me> I can understand bhai. But this domain is tough. There is a lot of competition. Many people are doing this. Narrow margins and often there are none.

him> We will start with this colony. Just make them buy from us, give them discounts. You just tell me, how much will it cost? Can I get good folks to develop it for me?

me> It will cost yaar. It won't be cheap.

him> I want something simple. A friend of mine recently released a similar grocery App. It was bad, really bad. There was no way to search for items. He hired someone who charged him almost 2 lakh INR, took four months, and made this unusable, crappy App. I can do something better.

me> That is cheap re.

him> Cheap? 2 Lakh? That is not cheap. In these tough times, he won't be able to get his investment back from this App.

me> Well, that's the problem, right? In such a business, with money constraints, you invest to save money; you can't invest expecting it (the App) to make money.

him> What?

me> In your case, software, or the "App", is either expanding your customer base or automating something. It would be reducing manual work and saving you time in some way. Let us assume you have an existing business: you have your customers, inventory and folks that work in the shop. Now in lockdown you want to start home delivery. Okay, it makes sense. People can call or text their items to your WhatsApp number and you have them delivered.

him> Yes. But then, what does the app do?

me> Getting there. You have some person manning your WhatsApp and your phone, noting down orders and making sure they are getting delivered through your existing supply chain. Now as you grow, it will be hard to manage all this: taking calls, reading texts, coordinating delivery. In that particular scenario you want to optimize your distribution, ordering and inventory system. There, investing in the App makes sense. You have a fixed inventory; people scroll, place an order, and on your side of the App, you have a way to make sure that all orders are fulfilled.

me> Again, without such an operation, you can't invest in an App expecting it will start making you money. Does that make sense? Also, trust me, hiring a good developer to deliver a working App would cost good money. Start small, start taking orders by phone and text. Reconsider the App when you have things working and have some surplus to improve the process.

He was advised similarly by other folks (including his elder brother) too. After a few days, he called again saying he had started taking orders, and that if I need anything, I should give him a call or text him.

Selenium based frontend tests using Python and Docker


The goal of the post is:

Using Python, write frontend tests for your website that can run on a remote server, to make sure the site is operational.

Let's elaborate on some of these terms: frontend tests, remote server, and writing tests in Python.

  1. Frontend tests, or more like browser-based tests. They can be used to instruct a browser to:

    1. Open a given URL.
    2. Make sure that webpage has certain key elements, like signup, login etc.
    3. Complete a new signup process.
    4. Login using newly created account.
    5. Confirm that once logged in, certain attributes specific to the newly created user are visible.

    These steps should be automated. A popular tool of choice for this is Selenium. The project provides driver binaries for different browsers and API bindings for all popular programming languages. Given that our requirement is to write tests in Python, we will use the Python bindings.

  2. Getting tests to run on a remote server. Or more precisely, setting up the tests in such a manner that they can run in any environment. Let's break down what we mean by this:
    1. Download binary drivers for all the browsers you would test with.
    2. Get python bindings, dependencies installed.
    3. And run the tests in headless browsers. Generally when you run Selenium-based tests, you will notice a browser open up and perform all the steps in your tests. That requires a working display environment in which browsers can open and render the site. On remote servers, we don't have a traditional user interface running, so we run UI-based browser tests on a browser without its graphical interface. This is known as a headless browser (a local example follows the quote below).

A headless browser is a great tool for automated testing and server environments where you don't need a visible UI shell.
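
As an aside, when running locally rather than on a server, Selenium's Python bindings can start a headless browser directly. A minimal sketch, assuming a Selenium 3.x install with Firefox and geckodriver available on the machine:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # run Firefox without opening a visible window
driver = webdriver.Firefox(options=options)
driver.get("https://www.python.org")
print(driver.title)  # the page still loads and renders, just invisibly
driver.quit()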

Okay, how do we do this?

I will use docker and docker-compose for this, because:

  1. The solution can run on any user environment (operating system) that supports docker-based development.
  2. docker-compose can take care of our requirements of providing multiple headless browsers.

From the official documentation:

Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.

The Selenium project provides docker images for different browsers. It has a concept of a Grid, which:

is a smart proxy server that allows Selenium tests to route commands to remote web browser instances. Its aim is to provide an easy way to run tests in parallel on multiple machines.

In such a grid, we can create a selenium-hub that routes tests to different nodes, each running a different browser, that are registered with the hub. We will see how to use docker-compose to set up such a grid, register nodes running specific browsers, and finally write tests that use this grid to run frontend tests.

This is the directory structure:

$ ls
docker-compose.yml  Dockerfile  entrypoint.sh  requirements.txt  webtests.py

Selenium's docker documentation talks about an example compose configuration:

version: "3"
services:
  tests:
    build:
      context: .
      dockerfile: Dockerfile
    depends_on:
      - firefox
  selenium-hub:
    image: selenium/hub:3.141.59-20200409
    container_name: selenium-hub
    ports:
      - "4444:4444"
  firefox:
    image: selenium/node-firefox:3.141.59-20200409
    volumes:
      - /dev/shm:/dev/shm
    depends_on:
      - selenium-hub
    environment:
      - HUB_HOST=selenium-hub
      - HUB_PORT=4444

We have three services in this configuration. The first one is tests; this is our test setup, and we will look at it in a moment. The second one is selenium-hub, which creates a Hub that can "expose" access to different kinds of browsers. Lastly, the third service, firefox, registers itself with selenium-hub. It is responsible for running tests in the Firefox browser.

Given that this is docker land, we will write a Dockerfile that creates our tests service:

Dockerfile:

FROM python:3.7

WORKDIR /opt

ADD webtests.py /opt
ADD entrypoint.sh /opt
ADD requirements.txt /opt

RUN python -m pip install --upgrade pip
RUN pip3 install -r requirements.txt

ENTRYPOINT ["/bin/bash", "-c", "/opt/entrypoint.sh"]

Our image is based on the official python3.7 Docker image. We add three additional files to the image. requirements.txt lists the packages that need to be installed on vanilla python3.7 to run our tests:

selenium==3.141.0

entrypoint.sh is a bash script that runs the tests:

#!/bin/bash
python -m unittest -v webtests.py

We will use the example test from the Selenium documentation. In our setUp we add busy-wait logic that waits for selenium-hub to become available. The firefox driver runs inside a separate service, so we use Selenium's remote WebDriver and connect to it via the URL http://selenium-hub:4444/wd/hub:

import time
import unittest

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from urllib3.exceptions import MaxRetryError

class TestPythonOrgSearch(unittest.TestCase):
    def setUp(self):
	while True:
	    try:
		self.driver = webdriver.Remote(
		    command_executor='http://selenium-hub:4444/wd/hub',
		    desired_capabilities=DesiredCapabilities.FIREFOX
		)
	    except (WebDriverException, MaxRetryError):
		print('Waiting for selenium hub to become available...')
		time.sleep(0.2)
	    else:
		print('Connected to the selenium hub.....')
		break

    def test_search_in_python_org(self):
	driver = self.driver
	driver.get("http://www.python.org")
	self.assertIn("Python", driver.title)
	elem = driver.find_element_by_name("q")
	elem.send_keys("pycon")
	elem.send_keys(Keys.RETURN)
	assert "No results found." not in driver.page_source

    def tearDown(self):
	self.driver.close()

if __name__ == "__main__":
    unittest.main()

That's it. To run this test do:

docker-compose build; docker-compose run --rm tests; docker-compose down

How to run the same tests with different browsers?

Let us add a Chrome service to our selenium-hub:

docker-compose.yml:

version: "3"
services:
  tests:
    build:
      context: .
      dockerfile: Dockerfile
    depends_on:
      - firefox
      - chrome
  selenium-hub:
    image: selenium/hub:3.141.59-20200409
    container_name: selenium-hub
    ports:
      - "4444:4444"
  chrome:
    image: selenium/node-chrome:3.141.59-20200409
    volumes:
      - /dev/shm:/dev/shm
    depends_on:
      - selenium-hub
    environment:
      - HUB_HOST=selenium-hub
      - HUB_PORT=4444
  firefox:
    image: selenium/node-firefox:3.141.59-20200409
    volumes:
      - /dev/shm:/dev/shm
    depends_on:
      - selenium-hub
    environment:
      - HUB_HOST=selenium-hub
      - HUB_PORT=4444

Now we have to configure our tests so that they can run with both browsers. Based on a stackoverflow conversation, I used an environment variable to do that. Change the entrypoint.sh bash script to:

#!/bin/bash

echo 'Running tests with firefox'
BROWSER=firefox python -m unittest -v webtests.py
echo 'Running tests with chrome'
BROWSER=chrome python -m unittest -v webtests.py

With this change, BROWSER is set as an environment variable; we can access it in our tests and switch browsers:

#!/usr/bin/env python3
import os
import time
import unittest
import warnings

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from urllib3.exceptions import MaxRetryError

class TestPythonOrgSearch(unittest.TestCase):
    def setUp(self):
	warnings.simplefilter("ignore", ResourceWarning)
	if os.environ.get('BROWSER') == 'chrome':
	    browser = DesiredCapabilities.CHROME
	else:
	    browser = DesiredCapabilities.FIREFOX
	while True:
	    try:
		self.driver = webdriver.Remote(
		    command_executor='http://selenium-hub:4444/wd/hub',
		    desired_capabilities=browser
		)
	    except (WebDriverException, MaxRetryError):
		print('Waiting for selenium hub to become available...')
		time.sleep(0.2)
	    else:
		print('Connected to the selenium hub.....')
		break
    def test_search_in_python_org(self):
	driver = self.driver
	driver.get("http://www.python.org")
	self.assertIn("Python", driver.title)
	elem = driver.find_element_by_name("q")
	elem.send_keys("pycon")
	elem.send_keys(Keys.RETURN)
	assert "No results found." not in driver.page_source

    def tearDown(self):
	self.driver.close()

if __name__ == "__main__":
    unittest.main()

And as you run the tests again with the same command:

$ docker-compose build; docker-compose run --rm tests; docker-compose down
[...]
Running tests with firefox
test_search_in_python_org (webtests.TestPythonOrgSearch) ... Waiting for selenium hub to become available...
Waiting for selenium hub to become available...
Waiting for selenium hub to become available...
Connected to the selenium hub.....
ok

----------------------------------------------------------------------
Ran 1 test in 14.396s

OK
Running tests with chrome
test_search_in_python_org (webtests.TestPythonOrgSearch) ... Connected to the selenium hub.....
ok

----------------------------------------------------------------------
Ran 1 test in 4.637s

OK

Comparing URLs for similarity


URLs, links, webpages are archived in the WARC format. I had experimented with creating WARC files when I was working on SoFee. Back then I tried using the library from the Internet Archive, but it was not maintained, not compatible with python3, and didn't work.

I came across another Python archiving library, warcio. It has a simple API to create and read archive files. It got me excited to resume working on SoFee 2.0.

warcio monkey-patches requests and captures all GET requests to create a single WARC file. This WARC file can be stored and accessed anytime, and ideally should render just like the original webpage, even if the original is removed, deleted or no longer exists. To make the archive as close to the original as possible, we need to fetch all the static content (images, javascript, css, icons) embedded or used in a webpage (HTML). I use an HTML parsing library to find links to such resources, then repeatedly fetch these resources using the requests library; meanwhile warcio neatly tucks all of them into a single WARC file, and it just works.
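
The capture part is pleasantly small. A minimal sketch based on warcio's documented capture_http helper (the file name and URL here are placeholders):

from warcio.capture_http import capture_http
import requests  # note: requests must be imported after capture_http for the patching to work

# Every GET made inside this block is written into the same WARC file.
with capture_http('example.warc.gz'):
    requests.get('https://example.com/')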

One step in optimizing the process of fetching these resources is to avoid redundant fetches of URLs that are effectively the same but don't look it. Such URLs are in principle the same, but because of some caveats differ in their string representation. For example:

  1. https://www.mygov.in/covid-19/ and https://www.mygov.in/covid-19 differ by a trailing /, but both are the same.
  2. https://mygov.in/covid-19 and https://www.mygov.in/covid-19 differ in the subdomain www., but are the same.
  3. http://mygov.in/covid-19 and https://www.mygov.in/covid-19 differ in their protocol, http vs https, but are the same.
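
urllib's urlparse makes these differences explicit by splitting a URL into its components, which is what the comparison below builds on:

from urllib.parse import urlparse

parts = urlparse("https://www.mygov.in/covid-19/")
print(parts.scheme)  # https
print(parts.netloc)  # www.mygov.in
print(parts.path)    # /covid-19/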

So I put together a small function that compares two URLs and tells whether they are the same or not:

from urllib.parse import urlparse
def check_url_similarity(url_1, url_2):
    '''Compare two URLs to identify if they are the same or not.
    Returns bool: True/False based on the comparison'''
    def check_path(path_1, path_2):
        # Handles paths that are identical or differ only by a trailing /
        if path_1 == path_2:
            return True
        if path_1 == path_2 + '/' or path_1 + '/' == path_2:
            return True
        return False

    if url_1 == url_2:
        return True
    url_1_struct = urlparse(url_1)
    url_2_struct = urlparse(url_2)
    if url_1_struct.netloc == url_2_struct.netloc:
        if check_path(url_1_struct.path, url_2_struct.path):
            return True
    # Treat a missing/extra www. prefix as the same host
    if url_1_struct.netloc == 'www.' + url_2_struct.netloc or \
       'www.' + url_1_struct.netloc == url_2_struct.netloc:
        if check_path(url_1_struct.path, url_2_struct.path):
            return True
    return False

And I wrote these tests to make sure that this function is doing what I expect it to do:

class TestUrlSimilarity(unittest.TestCase):
    def test_trailing_slash(self):
	url_1 = "https://www.mygov.in/covid-19/"
	url_2 = "https://www.mygov.in/covid-19"
	self.assertTrue(check_url_similarity(url_1, url_2))

    def test_missing_www_subdomain(self):
	url_1 = "https://mygov.in/covid-19"
	url_2 = "https://www.mygov.in/covid-19"
	self.assertTrue(check_url_similarity(url_1, url_2))

    def test_missing_www_subdomain_and_trailing_slash(self):
	url_1 = "https://mygov.in/covid-19/"
	url_2 = "https://www.mygov.in/covid-19"
	self.assertTrue(check_url_similarity(url_1, url_2))

	url_1 = "https://mygov.in/covid-19"
	url_2 = "https://www.mygov.in/covid-19/"
	self.assertTrue(check_url_similarity(url_1, url_2))

    def test_http_difference(self):
	url_1 = "https://mygov.in/covid-19"
	url_2 = "http://www.mygov.in/covid-19"
	self.assertTrue(check_url_similarity(url_1, url_2))

    def test_different_url(self):
	url_1 = "https://mygov.in/covid-19"
	url_2 = "https://www.india.gov.in/"
	self.assertFalse(check_url_similarity(url_1, url_2))

Information Visualization: Interpretations and Stories around them.


Nine shared this great presentation from Gurman titled:

When Statistics become stories

It was part of her talk at DesignUp 2019. In one slide she talks about the irregular age spikes we see around multiples of 10.

I am thinking of creating an exercise around this for the AI/ML workshop to be conducted later this month at NID Gandhinagar for New Media Design students.

At this stage of the workshop, we would have covered basic concepts around programming and Jupyter notebooks.

Section One - Introducing Pandas

I got the French population and age distribution data from here, and we have cleaned it into the following structure:

Out[115]: 
   year   males  females   total  age
0  2018  364155   347749  711904    0
1  2017  370453   355472  725925    1
2  2016  378518   363162  741680    2
3  2015  387906   372402  760308    3
4  2014  399232   387042  786274    4

We would start with loading this data and introduce concepts of:

  1. Reading the data (in this case from a csv file, using read_csv).
  2. Exploring the structure of the data (DataFrame) and accessing it using rows and columns.
  3. Trying basic operations over the data to answer some questions, like: for which age spectrum is the male population larger than the female, and vice versa (see the sketch after this list).
  4. Exploring the concept of using ? to get access to the documentation of a method/attribute.
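
A rough sketch of how that first session could go, assuming the cleaned data is saved as population.csv (a name made up here) with the columns shown above:

import pandas as pd

# 1. Read the data. 'population.csv' is a placeholder for the cleaned file.
df = pd.read_csv("population.csv")

# 2. Explore the structure: columns and the first few rows.
print(df.columns)
print(df.head())

# 3. Basic operations: ages where males outnumber females, and the reverse.
print(df[df["males"] > df["females"]]["age"])
print(df[df["females"] > df["males"]]["age"])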

Section Two - Plotting the data

After having played around with the data and different methods, we would shift to plotting it and try to see if we can answer the questions we explored in the previous section using plots.

I am thinking of introducing them to plotting pie charts, bar graphs and lines. The age distribution of a country is generally represented as a Population Pyramid; here we would try to plot the same pyramid for the French population.
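
A population pyramid is essentially two horizontal bar charts sharing the age axis, with one side negated. A minimal matplotlib sketch, reusing the df loaded in the previous sketch:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 10))
# Males go to the left (negated counts), females to the right.
ax.barh(df["age"], -df["males"], color="steelblue", label="Males")
ax.barh(df["age"], df["females"], color="salmon", label="Females")
ax.set_xlabel("Population")
ax.set_ylabel("Age")
ax.set_title("Population Pyramid: France")
ax.legend()
# (Tick labels on the left will show negative numbers; fine for a first pass.)
plt.show()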

Section three - Exercise for students.

A similar age distribution for the UK population is available here. We would apply the things we have learned in the above two sections and ask the students to plot the Population Pyramid for the UK.

Section four - Census and Age distribution of Indian population:

Akash Gutha has a repository and an IPython notebook that:

  1. Fetches the relevant data (excel sheet) from the Indian Census site.
  2. Cleans up the data, assigns names to the columns, and makes related plots.

We would work on top of those steps to:

  1. Cover how the Census releases data, and the accompanying guide that helps people make sense of it.
  2. Plot Population Pyramid graph for India.
  3. Observe the difference between population distribution for India and UK/France.
  4. Also have an open discussion around the spikes at certain ages.
  5. Share the screenshots from Gurman's presentation that explain the spikes.

At this point we conclude the session around handling data and information visualization. Possibly we will follow it with more hands-on exercises for the students.

Setting up an environment for a workshop based on Python.


I distinctly remember, while working at FOSSEE back in 2009-10, when we would conduct hands-on workshops in the labs of various institutes, we would factor in significant time to reach early and set up all the dependencies on the lab computers. Back then we would use Enthought's binaries to install everything on Windows systems. If we were lucky we would also find Linux machines in the lab, and that would help a lot as we were really comfortable installing the requirements using a CLI.

Recently we scheduled an AI/ML workshop for New Media Design students at NID Gandhinagar. While preparing for it I was looking for resources. I knew about Project Jupyter and IPython notebooks, but my understanding of them was very limited.

I found that JupyterHub is a brilliant project for setting up the complete environment and sharing resources with all the students. Their offering, the-littlest-jupyterhub, targeted at 1-100 users hosted on a single server, is perfect. However, it does need sudo and root privileges to segregate user environments. If we get access to a server at the NID campus, I will try and see if I can set it up.

Otherwise, I also came across Colab from Google, which comes with all dependencies and libraries installed, ready to be used and shared with the students. It looks really promising. I will try to put together some notebooks and exercises around the concepts we will be covering and see how both these solutions fare.

But compared to the manual setup we used to do back then, this looks like a cakewalk.