SoFee
After finishing my work with TaxSpanner, I worked on a personal project, SoFee, for around six months. In this post I am documenting the idea behind it, what I had expected it to become, and where I left it off (till now).
Features
RSS Feed
I have realized that in many of my personal projects, I work broadly around archiving the content I am reading and sharing online, be it news articles, blogs, tweet threads or videos. This form of data feels like sand that keeps slipping away as I try to hold it, to keep it fresh, accessible and indexed for reference. This was triggered by the sunset of Google Reader. Punchagan had introduced me to Google Reader, and I soon started following a lot of people there and used its browser extension to archive the content I was reading. In some way, with SoFee, I was trying to recreate the Google Reader experience with the people I was following on Twitter. The first iteration of the project was just that: it would give an OPML file which could be added to any feed-reader, and I would get separate feeds for all the people I was following.
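To make the OPML idea concrete, here is a minimal sketch of how such a file could be generated with Python's standard library. It is not the code SoFee used. The function name `build_opml`, the one-outline-per-person layout and the example feed URLs are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def build_opml(title, feeds):
    """Return an OPML document as a string.

    `feeds` is a list of (name, feed_url) pairs, one per followed person.
    This shape is hypothetical; it is not taken from SoFee itself.
    """
    opml = ET.Element("opml", {"version": "2.0"})
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for name, url in feeds:
        # Feed readers look for type="rss" and the xmlUrl attribute.
        ET.SubElement(body, "outline", {"type": "rss", "text": name, "xmlUrl": url})
    return ET.tostring(opml, encoding="unicode")


if __name__ == "__main__":
    # Placeholder names and URLs, for illustration only.
    print(build_opml("People I follow", [
        ("alice", "https://example.invalid/feeds/alice.xml"),
        ("bob", "https://example.invalid/feeds/bob.xml"),
    ]))
```

Importing the resulting file into a feed reader should then subscribe to every listed feed in one go.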
Archiving the content of links
While taking out my feed and data from Google Reader, I also noticed that it had preserved the content of some of the links. When I tried to access them again, some links had been made private and some were no longer available (404). While working on SoFee, I came across the term link-rot, and I thought this aspect of preserving the content was crucial: I wanted to archive the content, index it and make it accessible. Often I learn or establish some facts while reading this content, and I wanted it to be referable so that I can revisit it and confirm the origins. I noticed Firefox's reader-mode and used its JavaScript library, Readability, to extract cleaned-up content from the links and add it to the RSS feed I was generating. I also came across Archive.org's Web ARChive (WARC) format for storing or archiving web pages, and projects using it. I wanted to add this feature to SoFee, so that pages no longer become unavailable and there is individual, archived, intact content to go back to. In the end, after trying out different libraries and tools, I wasn't able to finish and integrate it.
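As a rough illustration of the "add it to the RSS feed" step, the sketch below embeds already-extracted article HTML into RSS 2.0 items using only the standard library. It assumes the extraction itself (done with Readability in SoFee) has already happened. The function name `build_rss` and the item dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(channel_title, channel_link, items):
    """Return an RSS 2.0 feed as a string.

    Each item is a dict with the hypothetical keys "title", "link",
    "published" (a timezone-aware datetime) and "content_html"
    (the cleaned-up article body produced by the extraction step).
    """
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Links with archived content"
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # RSS expects RFC 822 style dates.
        ET.SubElement(node, "pubDate").text = format_datetime(item["published"])
        # ElementTree escapes the HTML, so it travels as text inside <description>.
        ET.SubElement(node, "description").text = item["content_html"]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(build_rss("My timeline", "https://example.invalid/", [{
        "title": "An article",
        "link": "https://example.invalid/article",
        "published": datetime(2018, 1, 1, tzinfo=timezone.utc),
        "content_html": "<p>Extracted text of the article.</p>",
    }]))
```

Carrying the extracted body inside the feed means a reader still has the text even if the original link later goes private or returns a 404.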
Personally Trained Model
The last feature I had thought of including was a personally trained model which could segregate these links into separate groups or categories. Both Facebook and Twitter were messing around with the timeline, and I didn't want that to happen to mine. I wanted a way to control it myself, in a way which suited me. As a first step, I separated my Twitter timeline into tweets which had links and others which were just tweets or updates. Secondly, I listed all these tweets in chronological order. With content extracted using Readability, I experimented with unsupervised learning (KMeans, LDA, visualization of results) to create dynamic groups, but the results weren't satisfying enough to be included as a feature. For supervised learning, I was thinking of having a default model based on Reddit categories or the Wikipedia API, which could create a generic, simple data set, and then allowing the user to reinforce and steer the grouping to their liking. Eventually, this would allow users to have a private, personal model which could be applied to any article, news site or source and give them the clustering they want. Again, I failed to put this feature together.
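The first two steps described above (splitting the timeline into link tweets and plain updates, then ordering them chronologically) can be sketched roughly as follows. The tweet dictionary shape, the URL regex and the function name `split_timeline` are assumptions for illustration, not SoFee's actual code.

```python
import re
from datetime import datetime

# Deliberately simple URL matcher; real tweets may need the API's URL entities instead.
URL_RE = re.compile(r"https?://\S+")


def split_timeline(tweets):
    """Split tweets into (with_links, updates), each oldest first.

    Each tweet is a dict with the hypothetical keys "text" and
    "created_at" (a datetime).
    """
    ordered = sorted(tweets, key=lambda t: t["created_at"])
    with_links = [t for t in ordered if URL_RE.search(t["text"])]
    updates = [t for t in ordered if not URL_RE.search(t["text"])]
    return with_links, updates


if __name__ == "__main__":
    timeline = [
        {"text": "just an update", "created_at": datetime(2018, 1, 2)},
        {"text": "read this https://example.invalid/a", "created_at": datetime(2018, 1, 3)},
        {"text": "older link http://example.invalid/b", "created_at": datetime(2018, 1, 1)},
    ]
    links, updates = split_timeline(timeline)
    for tweet in links:
        print(tweet["created_at"].date(), tweet["text"])
```

Only the tweets in the first list would go on to content extraction and, later, to any grouping experiments.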
What I ended up with and future plans
Initially, I didn't want to get into UI and UX, and wanted to leave that part to popular and established feed-readers. But that slowed down user onboarding and feedback. I eventually ended up with a small web interface where the links and their content were listed, and the timeline was updated every three hours or so. I stopped working on this project when I started working with Senic, and the project kept running for well over a year. Now it's non-functional, but I learned a lot while putting together what was working. It is a pretty simple project, where we can simply charge the user a fee for hosting their content on their own small VPS instance or for running a lambda service (to fetch the updated timeline and apply their model to cluster data), and allow them full control of their data (usage, deletion, updates). I will for sure use my learnings to put together a more complete version of the project with SoFee2.0; let's see when that happens (2019 resolution?).