TL;DR here is public gist

## Introduction

I had a conversation with 9 about clustering links based on content and he pointed me to a StackOverflow(SO) conversation about getting started with it. It is very simple concept, but I had to stare at the pseudo code of paper, StackOverflow, results of different sample-sets, for quite some time to understand what is actually happening.

```
n = 0, t = false
v V do n = n + 1, c(v) = n end
while t = false do
t = true
v V do
c = i if (u, v) is max.
if c(v) != c then c(v) = c , t = false
end
end
```

For example dataset mentioned in StackOverflow

```
texts = [
"foo blub baz",
"foo bar baz",
"asdf bsdf csdf",
"foo bab blub",
"csdf hddf kjtz",
"123 456 890",
"321 890 456 foo",
"123 890 uiop",
]
```

If we try to plot initial graph based on cosine distance among all the nodes(each element of texts list) we get something like this: And once we cluster the graph using MajorClust algorithm, the graph get restructured to:

*Notice how edges get aggregated and new weights are assigned to edges based on clusters and not-so-important edges are dropped
altogether.*

## Code

The code on StackOverflow thread is complete without any external
dependencies. I used *TfidfVectorizer* from sklearn to create
*vector space model* from the documents(different and bigger dataset).

```
# corpus is list of all the documents
vectorizer = TfidfVectorizer(stop_words = "english",
ngram_range = (2, 3),
max_features = 100000)
corpus_mat = vectorizer.fit_transform(corpus)
num_of_samples, num_of_features = corpus_mat.shape
```

Then, I calculate cosine distance of each vector(document) from other vectors.

```
magnitudes = np.zeros((num_of_samples))
# this loop can be removed?
for doc_id in range(num_of_samples):
magnitudes[doc_id] = np.sqrt(corpus_mat[doc_id].dot(corpus_mat[doc_id].T).sum())
cosine_distances = np.zeros((num_of_samples, num_of_samples))
# this loop can be improved
for doc_id, other_id in combinations(range(num_of_samples), 2):
distance = (corpus_mat[doc_id].dot(corpus_mat[other_id].T).sum())/(magnitudes[doc_id]*magnitudes[other_id])
cosine_distances[doc_id, other_id] = cosine_distances[other_id, doc_id] = distance
```

This *cosine _{distances}* matrix is adjacency matrix representation of
the graph, where each node is one document and edge weight(cosine
distance) being the measure of similarity among two documents. Now
MajorClust algorithm is clustering based on aggregating/accumulating
edge weights and assigning each node to cluster which has maximum
accumulated edge weight.

```
t = False
indices = np.arange(num_of_samples)
while not t:
t = True
for index in np.arange(num_of_samples):
# aggregating edge weights
new_index = np.argmax(np.bincount(indices,
weights=cosine_distances[index]))
if indices[new_index] != indices[index]:
indices[index] = indices[new_index]
t = False
clusters = {}
for item, index in enumerate(indices):
clusters.setdefault(index, []).append(links[-50+item]["url"])
```

This bit might be confusing, I was stuck here for quite some time, maybe images of graph above might help.

## Experimentation and results

For testing the performance of algorithm I took facebook pages of three websites techcrunch, arstechnica, kafila(two of them cover articles related to tech, science and third one is about Indian politics and media). Using fbconsole I got last 50 links(only english stories) which have been shared, scraped content of those links and ran MajorClust over this data. I had to play around with Tf-Idf parameters like using stop words, ngrams(2 and 3) to get decent results. Once I had it, I ran it on single group(kafila) to create more clusters out of posted stories.

## Next..

- Try to profile FB friend circle based on links they are sharing.
- Other clustering algorithms, k-means etc.
- Running this algorithm with different situations(like streaming data).