Clustering text using MajorClust

TL;DR: here is the public gist.

Introduction

I had a conversation with someone about clustering links based on their content, and he pointed me to a StackOverflow (SO) thread about getting started with it. It is a very simple concept, but I had to stare at the pseudocode from the paper, the StackOverflow answer, and the results on different sample sets for quite some time to understand what is actually happening.

n = 0, t = false
∀ v ∈ V do n = n + 1, c(v) = n end
while t = false do
  t = true
  ∀ v ∈ V do
    c* = i if Σ {u,v}∈E, c(u)=i w(u, v) is max.
    if c(v) ≠ c* then c(v) = c*, t = false
  end
end
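
Reading the pseudocode as Python helped me; here is a minimal sketch of a direct translation (my own, not from the paper or the SO thread), assuming W is a symmetric edge-weight matrix with a zero diagonal:

import numpy as np

def majorclust(W):
  # W: symmetric (n, n) edge-weight matrix with a zero diagonal (assumption)
  n = W.shape[0]
  c = np.arange(n)                  # c(v) = n: every node starts as its own cluster
  changed = True
  while changed:                    # keep going as long as some label changed
    changed = False
    for v in range(n):
      # accumulate v's edge weights per current cluster label
      weight_per_cluster = np.bincount(c, weights=W[v], minlength=n)
      c_star = np.argmax(weight_per_cluster)
      if c[v] != c_star:            # move v to the strongest cluster
        c[v] = c_star
        changed = True
  return c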

For example, take the dataset mentioned in the StackOverflow thread:

texts = [
  "foo blub baz",
  "foo bar baz",
  "asdf bsdf csdf",
  "foo bab blub",
  "csdf hddf kjtz",
  "123 456 890",
  "321 890 456 foo",
  "123 890 uiop",
]

If we plot the initial graph based on the cosine distance between all the nodes (each element of the texts list), we get something like this: original_graph.png. Once we cluster the graph using the MajorClust algorithm, the graph gets restructured to: clustered_graph.png

Notice how edges get aggregated, new weights are assigned to edges based on clusters, and not-so-important edges are dropped altogether.
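
As an aside, a graph like original_graph.png can be reproduced with networkx. This is just a sketch (not the code used for the images above); it assumes the cosine_distances matrix computed in the Code section below, and the 0.01 threshold for dropping weak edges is arbitrary.

import networkx as nx
import matplotlib.pyplot as plt

# build a weighted graph from the cosine_distances matrix (computed below),
# dropping very weak edges so the plot stays readable
graph = nx.Graph()
graph.add_nodes_from(range(num_of_samples))
for i in range(num_of_samples):
  for j in range(i + 1, num_of_samples):
    if cosine_distances[i, j] > 0.01:  # arbitrary threshold
      graph.add_edge(i, j, weight=cosine_distances[i, j])

nx.draw_networkx(graph, with_labels=True)
plt.show()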

Code

The code in the StackOverflow thread is complete and has no external dependencies. I used TfidfVectorizer from sklearn to create a vector space model from the documents (a different and bigger dataset).

from sklearn.feature_extraction.text import TfidfVectorizer

# corpus is a list of all the documents
vectorizer = TfidfVectorizer(stop_words="english",
                             ngram_range=(2, 3),
                             max_features=100000)
corpus_mat = vectorizer.fit_transform(corpus)
num_of_samples, num_of_features = corpus_mat.shape

Then I calculate the cosine distance of each vector (document) from the other vectors.

import numpy as np
from itertools import combinations

# magnitude (L2 norm) of each document vector
magnitudes = np.zeros((num_of_samples))
# this loop can be removed?
for doc_id in range(num_of_samples):
  magnitudes[doc_id] = np.sqrt(corpus_mat[doc_id].dot(corpus_mat[doc_id].T).sum())

# symmetric matrix of pairwise cosine distances (diagonal stays zero)
cosine_distances = np.zeros((num_of_samples, num_of_samples))
# this loop can be improved
for doc_id, other_id in combinations(range(num_of_samples), 2):
  distance = (corpus_mat[doc_id].dot(corpus_mat[other_id].T).sum())/(magnitudes[doc_id]*magnitudes[other_id])
  cosine_distances[doc_id, other_id] = cosine_distances[other_id, doc_id] = distance
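
As the comments hint, both loops can be replaced with a single vectorized call. A minimal sketch (not from the original post) using sklearn's cosine_similarity, which gives the same matrix once the diagonal is zeroed out:

from sklearn.metrics.pairwise import cosine_similarity

# loop-free equivalent of the two loops above
cosine_distances = cosine_similarity(corpus_mat)
np.fill_diagonal(cosine_distances, 0)  # no self-edges, as in the loop version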

This cosine_distances matrix is the adjacency-matrix representation of the graph, where each node is one document and the edge weight (cosine distance) is the measure of similarity between two documents. The MajorClust algorithm then clusters by aggregating/accumulating edge weights and assigning each node to the cluster with the maximum accumulated edge weight.

t = False
indices = np.arange(num_of_samples)
while not t:
  t = True
  for index in np.arange(num_of_samples):
    # aggregate this node's edge weights per current cluster label,
    # then pick the cluster with the maximum accumulated weight
    new_index = np.argmax(np.bincount(indices,
                                      weights=cosine_distances[index]))
    if indices[new_index] != indices[index]:
      indices[index] = indices[new_index]
      t = False

# group documents by their final cluster label;
# links holds the scraped Facebook links, the last 50 of which form this corpus
clusters = {}
for item, index in enumerate(indices):
  clusters.setdefault(index, []).append(links[-50+item]["url"])

This bit might be confusing; I was stuck here for quite some time. Maybe the graph images above will help.
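
A tiny, made-up example of the np.bincount trick that does the aggregation (the labels and weights here are hypothetical):

import numpy as np

labels = np.array([0, 0, 2, 2])                  # current cluster labels of 4 nodes
weights = np.array([0.0, 0.9, 0.1, 0.3])         # edge weights from one node to those 4 nodes
print(np.bincount(labels, weights=weights))      # [0.9  0.   0.4] -> total weight per cluster
print(np.argmax(np.bincount(labels, weights=weights)))  # 0 -> the strongest cluster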

Experimentation and results

For testing the performance of the algorithm, I took the Facebook pages of three websites, techcrunch, arstechnica, and kafila (two of them cover articles related to tech and science, the third is about Indian politics and media). Using fbconsole I got the last 50 links (only English stories) that had been shared, scraped the content of those links, and ran MajorClust over that data. I had to play around with Tf-Idf parameters, like using stop words and ngrams (2 and 3), to get decent results. Once I had that working, I ran it on a single group (kafila) to create more clusters out of the posted stories.

Next..

  • Try to profile an FB friend circle based on the links they are sharing.
  • Try other clustering algorithms, k-means etc.
  • Run this algorithm in different situations (like streaming data).