Getting the act together, personally.

I started working on the SocialMedia Feed project last August. Time after time I have found myself in a slump, looking for motivation, a head rush to finish a task. I try to watch a movie, find some song of the day, something, anything, and before I realize it, it's the next day already. They say, “Show up, show up, show up, and after a while the muse shows up, too”, but personally, while trying it, self-motivation drags, in a really bad way.

Last week I was trying to put together a POC for possible paid work. The task was to create a chat bot around the different columns of an Excel sheet, so that users could chat with it and get analytical results in a conversational manner. Instead of a technical person running a query raised by the sales team, a user can chat with the bot directly and get the answer to a question like "How many tasks finished successfully today"; tasks could be something like an email campaign, a nightly aggregation of data, or results from a long-running algorithm. Artificial Intelligence Markup Language (AIML) and its concepts are fairly popular for putting together such a chat bot. Many popular platforms, like Pandorabots, help create such an interface to "train" a bot. In the message above, "How many tasks finished successfully today", the words tasks, successfully and today give us the context: which column to run the query on, what value to look for, and the duration over which we want the count.
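To make the idea of "context" concrete, here is a minimal sketch in plain Python. It is not AIML and not the API of any bot platform; it is simple keyword matching, and the keyword tables and the extract_context name are made up for illustration.

```python
import re

# Hypothetical keyword tables; a real bot platform would be trained or configured instead.
ENTITY_WORDS = {"task": "task", "tasks": "task", "campaign": "campaign", "campaigns": "campaign"}
STATUS_WORDS = {"successfully": "success", "successful": "success", "failed": "failure"}
DURATION_WORDS = {"today", "yesterday"}


def extract_context(message):
    """Pick out entity, status and duration keywords from a chat message."""
    context = {"entity": None, "status": None, "duration": None}
    for word in re.findall(r"[a-z]+", message.lower()):
        if context["entity"] is None and word in ENTITY_WORDS:
            context["entity"] = ENTITY_WORDS[word]
        elif context["status"] is None and word in STATUS_WORDS:
            context["status"] = STATUS_WORDS[word]
        elif context["duration"] is None and word in DURATION_WORDS:
            context["duration"] = word
    return context


print(extract_context("How many tasks finished successfully today"))
```

The three values map to the column to query, the value to look for and the time window, which is all a query over the sheet would need.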

I didn't have prior experience with the platform, so I had planned to read up on some docs and train the system to identify the context, maybe a couple of them: about 2 hours of effort. I ended up starting on it only past 12 pm, browsing through random doc links, exploring the SDK for examples, and trying to find my way through setting up the pipeline. By 6 in the evening I had clocked around 3 hours and 30 minutes on this task. I got the context in place. The platform provided a webhook to call a third-party API, so I used their GitHub demo code and got results from the Excel sheet which was shared. It was very minimal work, but still, I was able to sort out the first level of unknowns.

It was frustrating and tiring, but I felt this pressure to finish it. Though the POC worked, I didn't get the work. What is really sad is how I was able to stick to a deadline only when the pressure of proving myself, of convincing someone else, was there. For the SocialMedia Feed project I have this task: on top of basic gensim topic words, implement algorithms from the Termite and LDAvis papers to identify better topic representations and get the demo in place. I have reference code available (the work from the papers is on GitHub) and I know what I have to do, but through the last 3 days I haven't clocked a single minute on this task. It will happen eventually, but I think the idea is to get fired up personally, take on things, spend time on them and then mark them done; to be professional, not just for others.

Shoutout to punchagan for his inputs on initial draft.

Service worker adventures

With SoFee, the major work is done in the background using Celery: polling Twitter for the latest statuses, extracting the links and fetching their content. Eventually the segregation of content would also be done this way. I was looking for a way to keep things updated on the user's side, and the concepts of Progressive Web Apps were really appealing.

What does it do?

Browsers (Google Chrome, Firefox et al.) are becoming more capable as new web standards roll out, like offline cache, push notifications and access to hardware (physical web). With these features, HTML-based websites can now also work like a native app on your phone (Android, iPhone) or desktop.

Stumbling Block #1: Scope and caching

I am using Django, and with it all static content (CSS, JS, fonts) gets served from /static. If we serve the service worker script from there too, its scope gets limited to /static, that is, it can only handle requests served under /static. This cuts off the API calls I am making. I looked around and indeed there was a Stack Overflow discussion around the same issue. It's a hacky solution, and I added on to it by passing some GET params which I can use in template rendering for caching user-specific URLs.

Beyond this I had a few head-scratchers while getting the cache to work. I struggled quite a bit to short-circuit the fetch request and return a cached response, but it just wouldn't work. I kept tweaking the code and experimenting until I used Jake's trained-to-thrill demo as a base to set things up from scratch and then built on top.

Stumbling Block #2: Push Notifications

Service workers provide access to background push notifications. In earlier releases, browsers would register for this service and return a unique endpoint for the subscription, a unique capability URL which the server uses to push notifications to. While the endpoint provided by Firefox works out of the box, Chromium and Google Chrome still returned an obsolete GCM-based URL. Google has now started using the Firebase SDK and GCM is no longer supported. Beyond this, on the server side the PyFCM library worked just fine to push notifications, and it works with Firefox too.

Quest to find k for KMeans

With the SoFee project, at the moment I am working on a feature where the system can identify the content of a link as belonging to one of the wider categories, say technology, science, politics etc. I am exploring topic modelling, word2vec, clustering and their possible combinations to ship the first version of the feature.

Earlier I tried to extract a certain number of topics from the content of all links. I got some results, but across the mix of links these topics were not coherent. Next I tried to cluster similar articles using the KMeans algorithm and then extract topics from these grouped articles. KMeans requires one input parameter from the user, k, the number of clusters to be returned. In the application flow, asking the user for such an input won't be intuitive, so I tried to make it hands-free. I tried two approaches:

  • Run the KMeans algorithm with K varying from 2 to 25. I then tried to plot and observe the average Silhouette score for all Ks, but the results weren't good enough for me to adopt this method.
  • There are two known methods, the gap statistic method and another, potentially superior, method which can directly return the optimum value of K for a given dataset. I tried to reproduce the results of the second method, but they weren't convincing.
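The first approach can be sketched roughly as below. This is not the code I ran on the SoFee links; it uses synthetic blobs from scikit-learn's make_blobs, a smaller K range (2 to 10) to keep it quick, and the helper name silhouette_by_k is made up for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, seed=0):
    """Fit KMeans for every k and return {k: average silhouette score}."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Synthetic stand-in data: 300 two-dimensional points around 4 centers.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=42)
scores = silhouette_by_k(X, range(2, 11))
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On clean synthetic blobs the highest score usually lines up with the number of blobs; on real article vectors the curve tends to be much flatter, which is where I got stuck.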

Initially I got decent results: ideal_clusters.png

as I tried more iterations I got some mixed results: not_ideal_results.png

and further I got some crazy results: reverse_results.png

Following is the code I used for these results:

# Loosely adopted from
import matplotlib.pylab as plt
import numpy as np
from sklearn.cluster import KMeans
import random

def init_board_gauss(N, k):
    n = float(N)/k
    X = []
    for i in range(k):
        # Random center and spread for each Gaussian blob
        c = (random.uniform(-5, 5), random.uniform(-5, 5))
        s = random.uniform(0.05, 0.5)
        x = []
        while len(x) < n:
            a, b = np.array([np.random.normal(c[0], s), np.random.normal(c[1], s)])
            # Keep only the points that fall inside the (-5, 5) board
            if abs(a) < 5 and abs(b) < 5:
                x.append([a, b])
        X.extend(x)
    X = np.array(X)[:N]
    return X

# This can be played around to confirm performance of f(k)
X = init_board_gauss(1200, 6)
fig = plt.figure(figsize=(18,5))
ax1 = fig.add_subplot(121)
ax1.set_xlim(-5, 5)
ax1.set_ylim(-5, 5)
ax1.plot(X[:,0], X[:, 1], '.', alpha=0.5)
tit1 = 'N=%s' % (str(len(X)))
ax1.set_title(tit1, fontsize=16)
ax2 = fig.add_subplot(122)
ax2.set_ylim(0, 1.25)
sk = 0
fs = []
centers = []
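for true_k in range(1, 10):
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
    km.fit(X)
    Nd = len(X[0])
    # Weight factor alpha_K used by f(K); 3.0 avoids integer division
    a = lambda k, Nd: 1 - 3.0/(4*Nd) if k == 2 else a(k-1, Nd) + (1-a(k-1, Nd))/6
    if true_k == 1:
        # f(K) is 1 for a single cluster
        fs.append(1)
    elif sk == 0:
        # Previous distortion was zero, so f(K) is 1 again
        fs.append(1)
    else:
        fs.append(km.inertia_/(a(true_k, Nd)*sk))
    sk = km.inertia_
    centers.append(km.cluster_centers_)

# fs is a plain list, so take the index of its minimum with argmin
foundfK = int(np.argmin(fs)) + 1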
ax1.plot(centers[foundfK-1][:,0], centers[foundfK-1][:, 1], 'ro', markersize=10)
ax2.plot(range(1, len(fs)+1), fs, 'ro-', alpha=0.6)
ax2.set_xlabel('Number of clusters K', fontsize=16)
ax2.set_ylabel('f(K)', fontsize=16) 
tit2 = 'f(K) finds %s clusters' % (foundfK)
ax2.set_title(tit2, fontsize=16)
plt.savefig('detK_N%s.png' % (str(len(X))), \
	     bbox_inches='tight', dpi=100)

I had many bugs in my initial version of the above code, and while trying to cross-check results I kept fixing them. Eventually I read the paper and noticed plots similar to mine in the results section (Figs 6n and o), with a further explanation which went:

However, no simple explanation could be given for the cases shown in Figs 6n and o. This highlights the fact that f(K) should only be used to suggest a guide value for the number of clusters and the final decision as to which value to adopt has to be left at the discretion of the user.

In the comments section of the DataScienceLab blog someone mentioned that we won't get this kind of result with real data. If the proposed solution fails with artificial data itself, I think it can hardly help me get to an appropriate k directly. Maybe what I am seeking is itself not correct; let's see what we are able to put together.

Nayee Taleem

Krunal was visiting Nayee Taleem/नई तालीम from Dec 12th to 21st and asked me to join for a visit and understand the school model. After the visit to Bhasha Institute, it was one more opportunity to understand how alternative schools are working. The philosophy behind the school was formulated long ago by Gandhi, around 1937, and goes something like: learning by doing, and using the real world as a classroom to understand different subjects. In 2005 the school was reopened in Anand Niketan, and since then they have been trying many things to improve the learning environment and working really hard to keep it relevant in current times.

The school compound itself is really beautiful. campus.jpg

We reached Sevagram on the 11th, and from the next day Krunal and Preeti facilitated a three-day workshop for teachers to explore different ideologies for making the school environment more "children centered". Many models of existing schools from around the world and India were presented and discussed, and participants were encouraged to reflect on whether there was a similar environment in Nayee Taleem and, if not, what the reasons behind it were, whether these models can be adopted for this school and, if yes, how.

On the school campus, many methods, activities and a tailored curriculum are used to teach science, maths and other subjects. They have farms, art installations, and craft and sewing classes. In the case of farming, each class is allotted a piece of farm and is responsible for growing, maintaining and selling the produce. produce.jpg

A show and tell session was going on: one of the teachers had caught a cobra/naag on the campus. He showed kids and teachers how it looks and how to identify one, and also mentioned how its bite could be lethal. Later the same teacher released the snake in the remote outskirts of the village. show_and_tell.jpg

For lunch, classes get their turn, and students along with their teacher prepare lunch for the whole school. Basically, in Krunal's words, Gandhi had an idea that kids should be made familiar with tools and how to use them, and that's what the school seemed to be doing.

Despite all this, where they (school admin, teachers, students) fall short is that with current technological developments these tools have changed drastically and many have been rendered obsolete. For computer education they use a standard curriculum which introduces kids to Paint, Word and other basic computing skills, and that too very superficially. Computers, mobiles and "smart" devices are everywhere, but without knowing how to use these tools, people in the locality become mere consumers of things which are actively developed somewhere else and might not be meant to be used in the village's local context and needs. I personally think diversity is good: different solutions to a problem/challenge bring better insight into the problem and even better solutions.

We were thinking about how to introduce computers with a DIY approach, so that children can learn these modern tools just like they are learning other tools. While returning, Krunal mentioned conducting some engaging and fun sessions where we explore different themes (games, a makers' lab kind of setup). I have tried to do this before, but one thing I have realized is that the process of making is slow, and I don't exactly know how to take sound bites out of it and conduct sessions around them. Also, my own knowledge of making is limited. I was thinking of doing a gaming session leading to designing a simple game level and then playing it, or doing some more engaging sessions using Arduino/mobiles. I have exchanged models/references with Krunal which can be used (1, 2, 3, 4, 5, 6). I will try to explore more on this and see what can be used or developed for Nayee Taleem.

Personally I am confused and find myself severely incompetent for this particular task. On one side I can totally relate to the attempts being made to improve AI and to projects like home automation. I was reading an HN thread, and there are situations where people depend on Alexa or Google Home to know about the weather outside their homes. Somehow I can't relate to that vision, where people are so unaware of their surroundings. On the other hand, kids at Nayee Taleem, Bhasha Institute and many other such places are very aware of their environment; they care for it and nurture it, but I am not confident that this narrative will hold up against this widespread and blind adoption of technology. Only time will tell how all this pans out, but try we must.

Bhasha Institute Tejgarh

Thanks to Ruby's relentless efforts (an email chain spread over two months) we finally managed dates with Bhasha Institute.

What is this Institute?

Ganesh N Devy set up the institute with the idea of providing tribal communities a space where their art, culture, learning and knowledge of their environment could be celebrated, curated and preserved. It is located near the village of Tejgarh, Vadodara, Gujarat. At the moment the administration and all operations are carried out by local people, for the local people.

What's the need?

Read and watch this piece on PARI about "tribal girls sing English songs in a village that doesn't speak English, in honor of the potato that they don't eat.". It very accurately depicts the broken education system. Founders and supporters of Bhasha recognized this issue from the start and focused on introducing formal education while keeping it relevant to the local context: enabling locals to take things forward, and finding people who understand the local issues and are motivated to take charge of finding a possible solution.

How are they doing it?

Bhasha has different "verticals", and they regularly organize workshops for all of them to keep evolving and adapting through the exchange of skills between locals and visitors. There is a tribal museum for curating local arts; a library/publications unit to document, preserve and publish local knowledge; a small medical team with both allopathy and homeopathy treatments available for locals; and the Shaala, aka school, which works as a support system to get kids ready for mainstream schooling. In the Shaala they take local students (aged 8 to 12) belonging to a mix of tribes speaking different dialects/languages/bhasha. They have a multilingual teaching system to put students at ease with different dialects/languages, and in the process also introduce formal Gujarati to enable kids to read and write. Eventually, after 2 years, they are admitted to schools with help from the institute. Apart from language, they are also taught local skills related to farming, folk songs and their own culture.

What was I doing there?

Ruby, Praful and Sanket had first-hand experience with tribal education at the school run in Nelgunda by Hemalkasa, so we had a lot of questions on how things were managed and the core idea behind the institute. There were reflections/discussions on what is different between tribals of different regions and how Bhasha as an institute is trying to stay relevant. As for me, it was mostly observing: the institute, the activities going on in the campus, how Ruby, Sanket and Praful were interacting with local kids (they taught the kids two lines from a Madia folk song). I wasn't able to contribute back to the local community during this stay, but next time I surely will.

Using setTimeout in JavaScript

tldr: ALWAYS take a brief look at the official developer documentation of functions.

I was trying to rush a release over the weekend, and I had a requirement where I had to make repeated API calls to track the progress of a task. Without that, any new user would see a "dead page" without any info on what is going on and how he/she should proceed. Pseudo code would be something like:

function get_task_status(task_id) {
  $.get("/get_task_status/", {'task_id':task_id})
    .done( function(data) {
      // Update status div
      // Wait for x seconds and repeat this function
    });
}

As usual I hurried to Google for template/pointer code. StackOverflow didn't disappoint and I landed on this discussion. It had decent insight on using a callback function with setTimeout, and I cooked up my own version of it:

function get_task_status(task_id) {
  $.get("/get_task_status/", {'task_id':task_id})
    .done( function(data) {
      if(data['task'] === 'PROGRESS') {
        Materialize.toast(data['info'], 1000);
        setTimeout(get_task_status(task_id), 2000);
      }
    });
}

Looks innocent, right? Well, that's what got me stumped for almost 3-4 hours. I tried this, and my JavaScript happily ignored setTimeout and the delay and kept making continuous GET requests. I tried some variants of the above code but nothing worked. Eventually I came across this post on SO, tried the code and it worked! I was convinced that there was some issue with how older versions handled setTimeout, and that the 2016 update was what I needed.

Today, as I sat down to put together a brief note of this experience, I was testing the setTimeout code in node, the browser console and inside an HTML template, and somehow each time the delay was working just fine:

function get_task_status() {
  setTimeout(get_task_status, 2000);
}

> function get_task_status(task_id) {
...   console.log(Date());
...   // Recursive call to this function itself after 2 seconds of delay
...   setTimeout(get_task_status, 2000);
... }
> get_task_status('something');
Tue Oct 25 2016 15:55:59 GMT+0530 (IST)
> Tue Oct 25 2016 15:56:01 GMT+0530 (IST)
Tue Oct 25 2016 15:56:03 GMT+0530 (IST)
Tue Oct 25 2016 15:56:05 GMT+0530 (IST)
Tue Oct 25 2016 15:56:07 GMT+0530 (IST)
Tue Oct 25 2016 15:56:09 GMT+0530 (IST)
(To exit, press ^C again or type .exit)

Again, this bummed me out. I thought I had "established" that setTimeout was broken and that promises were what I should be looking at and understanding better. While trying to work out what was wrong, I checked the MDN documentation of the function and finally realized my real bug. The syntax of the function is var timeoutID = window.setTimeout(func[, delay, param1, param2, ...]);

And this is what I was doing: setTimeout(get_task_status(task_id), 2000);

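Notice that in the syntax the params come after the delay argument, while I put them directly in a call, and this was the small gotcha. Writing get_task_status(task_id) calls the function right away and hands its return value to setTimeout, instead of handing over the function itself; setTimeout(get_task_status, 2000, task_id) is what I needed. I was talking to Syed ji about this experience and he pointed me to the You Don't Know JS series for a better understanding of JavaScript concepts and nuances. I learned my lesson: properly RTFM. As for promises, I will return to learn more about them later; at the moment my code is working.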


Yesterday punchagan introduced me to PEBKAC: Problem Exists Between Keyboard And Chair. It was a Things I Learned (TIL). I had experienced this many times before, only I didn't know there was such a term.

Starting in April, at TaxSpanner I was given the task of integrating ITD Webservices APIs for return filing and other features with our existing stack. The procedure included quite a few things alien to me. Sample code provided by ITD was in Java, it used something called the Spring framework, our requests had to be routed via a specific proxy approved by ITD, and furthermore we physically needed a USB DSC key registered with ITD to encrypt the communication.

As I was trying to get the first successful API run working from my system, I was particularly stuck with accessing the DSC key from my Java code. It needed drivers available here, and the Java security file under /etc/java-7-openjdk/security/ had to be edited properly to use the correct drivers. After doing these things, the first thing I tried was to list the certificates on the USB token using keytool. On the first run it worked fine. I was ecstatic: one fewer unknown from the pile of unknowns, right? Wrong. As soon as I tried to run Java programs using the DSC, it threw up lines and lines of errors which went something like:

org.springframework.beans.factory.parsing.BeanDefinitionParsingException: Configuration problem: Unable to locate Spring NamespaceHandler for XML schema namespace []
Offending resource: class path resource [ClientConfig.xml]

	at org.springframework.beans.factory.parsing.FailFastProblemReporter.error(
	at org.springframework.beans.factory.parsing.ReaderContext.error(
	at org.springframework.beans.factory.parsing.ReaderContext.error(
	at org.springframework.beans.factory.xml.BeanDefinitionParserDelegate.error(
	at org.springframework.beans.factory.xml.BeanDefinitionParserDelegate.parseCustomElement(
	at org.springframework.beans.factory.xml.BeanDefinitionParserDelegate.parseCustomElement(
	at itd_webs.core.main(Unknown Source)
2016-05-11 12:03:48,114 [main] WARN  org.apache.cxf.bus.spring.SpringBusFactory -  Failed to create application context.
org.springframework.beans.factory.parsing.BeanDefinitionParsingException: Configuration problem: Unable to locate Spring NamespaceHandler for XML schema namespace []
Offending resource: class path resource [ClientConfig.xml]

Getting the configs in place so that the Spring framework could load proper credentials from the DSC was another task where inputs from Nandeep proved crucial. Thankfully we had one more system where this setup with the DSC worked, so it was clear that the particular error related to DSC recognition was just on my system. I did a lot of head scratching, compared the two systems to identify if something was amiss, and tried strace; nothing helped. After scrolling through a lot of Java-related Stack Overflow conversations, I was playing around with keytool and jdb. As I tried jdb, I noticed the DSC blinking, and then I thought that maybe the default java was using different configs. I checked my /usr/lib/jvm and indeed there were 4 different versions of Java. I checked `java -version` and it pointed to java version "1.8.065", so instead I tried to compile and run using /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java and the USB blinked happily ever after. While we kept developing and making the system stable on the computer where things were working, it took almost two weeks to narrow down the exact issue on my system. And now I have a name for all the time spent: PEBKAC. Thank you, punchagan.

Cosine Similarity : Simple yet Powerful Algorithm

I came across Cosine Similarity when I was exploring document clustering (1, 2). It is a very simple concept for measuring similarity between two vectors, and these vectors can be n-dimensional.
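For reference, cosine similarity is the dot product of the two vectors divided by the product of their lengths. A minimal NumPy sketch (the function name is my own, not from any library):

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two n-dimensional vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(u, v) / norm)


print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction, close to 1
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors give 0
```

Since only the direction matters, scaling a vector does not change the score, which is handy when two samples differ mostly in loudness or length.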

Some time back Nandeep took me along to NID to help out Masters students with their projects. Kalyani, one of the students, was making a haptic device to help students with hearing disabilities practice and learn the alphabet on their own. From her field visits she got to know that at the moment most sessions are done in person with a trainer. In these sessions kids feel the vibrations of the trainer's throat to identify how to speak. Her project was built around replicating these vibrations on a physical device, along with an app which can "listen" to what students are saying and compare it against standard audio samples of characters.

Nandeep quickly found this script which used Librosa to compare two audio samples. It was a good start; we were able to get some idea of how we could use these tools for our sample set. When we looked at the values returned by MFCC, we realized it was a vector. With DTW we tried a few custom distance calculations, like average, difference between max values etc, but cosine similarity gave us the best results. We put together this script, tested it with different samples and checked the performance, and Kalyani was quite satisfied with it for her prototype.
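The script itself is not reproduced here, but the idea of DTW with a cosine-based distance can be sketched in plain NumPy. This is an assumption-laden toy: it takes two sequences of feature vectors (for example MFCC frames, one row per frame), the names cosine_distance and dtw_cost are mine, and it does not use Librosa's own DTW.

```python
import numpy as np


def cosine_distance(u, v):
    """1 - cosine similarity; treated as 1.0 when either vector is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    if norm == 0.0:
        return 1.0
    return 1.0 - float(np.dot(u, v) / norm)


def dtw_cost(a, b, dist=cosine_distance):
    """Total alignment cost between two sequences of frames using classic DTW."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            # Extend the cheapest of: insertion, deletion, match
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


reference = np.array([[1.0, 0.0], [0.0, 1.0]])
stretched = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # same sound, spoken slower
swapped = np.array([[0.0, 1.0], [1.0, 0.0]])
print(dtw_cost(reference, stretched), dtw_cost(reference, swapped))
```

A lower cost means the two samples are closer; the stretched sample aligns for free, while the swapped one does not.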

We got quite lucky with this algorithm fitting our problem scenario, though a simple online search might not have led us in this direction.

Using Hekad to parse logs for relevant parts

We at Taxspanner have been looking at different options for an analytics pipeline and setting things up to capture relevant information. Vivek had noticed Hekad and was wondering if we could use the syslogs which are already being generated by the app. The idea was to look for specific logs in a format which could contain information like app name, model, UUID and the operation being done.

We followed the basic setup guide to get a feel for it, and it was processing nginx logs to the tune of millions very quickly. Apart from the official documentation, this post talks about how to set up a quick filter around the Hekad processing pipeline. We experimented with a client-server setup where a client running on the app server tails the Django log file, filters relevant log messages and pushes them to a server hekad instance aggregating logs from all app servers.

This was the client's side hekad config toml file:

# Section names below are inferred from the message_matcher and encoder values
[hekad]
maxprocs = 1
base_dir = "."
share_dir = "hekad-location/share/heka/"

# Input is django log file
[django_logs]
type = "LogstreamerInput"
splitter = "TokenSplitter"
log_directory = "logs/"
file_match = 'django\.log'

# Filter to parse logs and extract the relevant ones
[DjangoLogDecoder]
type = "SandboxFilter"
message_matcher = "Logger == 'django_logs'"
filename = "lua_decoders/django_logs.lua"

# Encoder for output streams
[PayloadEncoder]

# We channel output generated from DjangoLogDecoder to a certain UDP port
[UdpOutput]
message_matcher = "Logger == 'DjangoLogDecoder'"
address = ":34567"
encoder = "PayloadEncoder"

The Lua script to filter relevant logs is pretty small:

local string = require "string"
local table = require "table"

-- This structure could be used in better way
local msg = {
Timestamp   = nil,
Type        = msg_type,
Payload     = nil,
Fields      = nil
}

function process_message ()
    local log = read_message("Payload")
    if log == nil then
      return 0
    end
    -- Split the log line into whitespace separated blocks
    local log_blocks = {}
    for i in string.gmatch(log, "%S+") do
      table.insert(log_blocks, i)
    end
    if table.getn(log_blocks) >= 4 then
      -- Third block carries the log level
      if log_blocks[3] == "CRITICAL" then
        msg.Payload = log
        -- Pass the matching log on to the rest of the pipeline
        inject_message(msg)
      end
    end
    return 0
end

With the client instance in place, now we get our listener config sorted out.

# Section names below are inferred from the message_matcher and encoder values
[hekad]
maxprocs = 4
base_dir = "."
share_dir = "hekad-location/share/heka/"

# Input listening to port
[app_logs]
type = "UdpInput"
address = ":34567"

[PayloadEncoder]

# Output channels message received and just prints them
[LogOutput]
message_matcher = "Logger == 'app_logs'"
encoder = "PayloadEncoder"

And that's it; this puts a basic hekad-based pipeline in place which can simply pick information from Django logs.
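The check done by the Lua filter is easy to try out offline before touching the pipeline. Here is the same rule as a small Python sketch; the is_relevant name and the sample log lines are made up, and it assumes, like the Lua script, that the log level is the third whitespace-separated block.

```python
def is_relevant(log_line):
    """Mirror of the Lua filter: at least 4 blocks and the third one is CRITICAL."""
    if log_line is None:
        return False
    blocks = log_line.split()
    return len(blocks) >= 4 and blocks[2] == "CRITICAL"


# Hypothetical log lines in a "date time LEVEL message" layout
samples = [
    "2016-02-10 15:45:00 CRITICAL payment model=Return uuid=abc operation=create",
    "2016-02-10 15:45:01 INFO request served",
    "2016-02-10 15:45:02 CRITICAL",
]
for line in samples:
    print(is_relevant(line), line)
```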

Issues with Indexing while using Cassandra.

We have a single-machine Cassandra setup on which we are trying different things for analytics. One of the column families we have goes with this description:

CREATE TABLE playground.event_user_table (
    event_date date,
    event_time timestamp,
    author text,
    content_id text,
    content_model text,
    event_id text,
    event_type text,
    PRIMARY KEY (event_date, event_time)
) WITH CLUSTERING ORDER BY (event_time ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': ''}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
CREATE INDEX author ON playground.event_user_table (author);

With this table, we populated data for different apps/models. Now when we query the system with something like:

cqlsh:playground> select * from event_user_table where event_date = '2013-06-02' ;
 event_date | event_time               | author                                 | content_id                                      | content_model | event_id | event_type
 2013-06-02 | 2013-06-02 00:00:00+0000 |        |        |           A   |          |  submitted
 2013-06-02 | 2013-06-02 01:28:13+0000 |      |                                      1000910424 |           B   |          |     closed
 2013-06-02 | 2013-06-02 01:59:31+0000 |         |         |           A   |          |    created
 2013-06-02 | 2013-06-02 02:00:44+0000 |            |            |           A   |          |    created
 2013-06-02 | 2013-06-02 02:02:16+0000 |       |       |           A   |          |    created

The result looks good and as expected. Now I query the system on the secondary index on author and I get empty or partial results:

cqlsh:playground> select * from event_user_table where author = '' ;
 event_date | event_time | author | content_id | content_model | event_id | event_type

(0 rows)

cqlsh:playground> select * from event_user_table where author = '' ;
 event_date | event_time               | author                             | content_id | content_model | event_id | event_type
 2014-01-18 | 2014-01-18 09:01:52+0000 | | 1001068325 |           SRF |          |     closed

(1 rows)

I have tried this combination for the PRIMARY KEY too, ((event_date, event_time), author), but with the same results. There are known issues with secondary indexes and scaling [1], but do they affect single-node systems too? I am not sure about it. Time to confirm things.

Update1 <2016-02-10 Wed 15:45>: As mentioned here [2], Cassandra has "'lazy' updating to secondary indexes. When you change an indexed value, you need to remove the old value from the index." Could that be the reason?