Multilingual Document Clustering

After my last post, I followed its TODO block to see whether I could get simple multilingual document clustering (MDC) working. The first step was to check one thing: if Hindi documents are translated into English word by word and a clustering algorithm is then run over the result, do I get the same clusters as when clustering the documents in their original language?

I used the translate Python package to convert all the documents to English, using the following code:

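from translate import Translator
from nltk.corpus import stopwords
from os import path
import pickle
import string

# adding Hindi punctuation (danda and double danda) to regular English punctuation
unicode_punctuation = string.punctuation + u"\u0964\u0965"
translate_table = dict((ord(char), u" ") for char in unicode_punctuation)
# assumes a "hindi" word list is available in the NLTK stopwords corpus
stop_words = [word.decode("utf-8") for word in stopwords.words("hindi")]
translator = Translator(to_lang="en", from_lang="hi")
# reuse translations cached by an earlier run, if there are any
if path.isfile("hi_dict"):
    hi_dict = pickle.load(open("hi_dict", "rb"))
else:
    hi_dict = {}
for link in links:
    # replace punctuation with spaces and collapse runs of whitespace
    content = link["content"].translate(translate_table)
    content = " ".join(content.split())
    tr_words = []
    for word in content.split():
        # assumption: stop words are dropped
        if word in stop_words:
            continue
        if word in hi_dict:
            # assumption: reuse the cached translation
            tr_words.append(hi_dict[word])
        elif word.isdigit() or not word.isalpha():
            # assumption: such tokens are kept unchanged
            tr_words.append(word)
        elif word.istitle() and word.lower() not in hi_dict:
            # assumption: title-cased words are kept unchanged
            tr_words.append(word)
        else:
            print word
            eng_word = translator.translate(word.encode("utf-8"))
            hi_dict[word] = eng_word
            # assumption: the translation is appended to the output
            tr_words.append(eng_word)
            print word, eng_word
    link["content"] = " ".join(tr_words)
pickle.dump(hi_dict, open("hi_dict", "wb"))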

Here, links is a simple data structure that stores the URL and text content of each web page. At this stage, I ran the clustering over the translated text, and the results were in line with my previous results.
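
To make the shape of links and the word-by-word step a bit more concrete, here is a minimal sketch written for Python 3 with only the standard library. It is not the code I ran: TOY_DICTIONARY, STOP_WORDS, translate_words and the example URL are hypothetical stand-ins for the real translator, the NLTK stop words and the crawled pages, and unknown words are simply kept as they are.

```python
import string

# Map ASCII punctuation plus the Hindi danda and double danda to spaces.
PUNCTUATION = string.punctuation + "\u0964\u0965"
TRANSLATE_TABLE = {ord(char): " " for char in PUNCTUATION}

# Hypothetical stand-ins for the NLTK stop words and a translation service.
STOP_WORDS = {"है", "का"}
TOY_DICTIONARY = {"बकरी": "goat", "घास": "grass", "खाती": "eats"}


def translate_words(text, lookup, cache):
    """Translate text word by word, caching every word that is looked up."""
    words = text.translate(TRANSLATE_TABLE).split()
    translated = []
    for word in words:
        if word in STOP_WORDS:
            continue
        if word not in cache:
            # Words missing from the lookup are kept unchanged in this sketch.
            cache[word] = lookup.get(word, word)
        translated.append(cache[word])
    return " ".join(translated)


# Each entry of links holds a page's URL and its text content.
links = [{"url": "http://example.com/post", "content": "बकरी घास खाती है।"}]
cache = {}
for link in links:
    link["content"] = translate_words(link["content"], TOY_DICTIONARY, cache)
```

After the loop, the content of the single toy page is "goat grass eats": the danda is stripped, the stop word is dropped, and each remaining word is looked up once and cached.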

With this hypothesis holding up, I mixed the English data from my first post with this translated data. After shuffling the data and playing around with the parameters, I used the following code to run the clustering algorithm:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from random import shuffle
from sklearn.metrics.pairwise import linear_kernel

# shuffle so that English and translated articles are interleaved
shuffle(links)
blobs = [link["content"] for link in links]

# TF-IDF over word bigrams and trigrams
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(2, 3), max_features=100000)
corpus_mat = vectorizer.fit_transform(blobs)

# TF-IDF rows are L2-normalised by default, so the linear kernel is the cosine similarity
cosine_similarities = np.zeros((len(blobs), len(blobs)))
for i in range(len(blobs)):
    cosine_similarities[i] = linear_kernel(corpus_mat[i:i+1], corpus_mat).flatten()
    # ignore the similarity of a document with itself
    cosine_similarities[i, i] = 0

# Majorclust algorithm
t = False
indices = np.arange(len(links))
while not t:
    t = True
    for index in np.arange(len(links)):
        # aggregating edge weights
        # assumption: the weights are this document's row of similarities
        new_index = np.argmax(np.bincount(indices, weights=cosine_similarities[index]))
        if indices[new_index] != indices[index]:
            indices[index] = indices[new_index]
            t = False

# Organizing the end results
clusters = {}
for item, index in enumerate(indices):
    if "title" in links[item]:
        title = links[item]["title"].encode("utf-8")
        clusters.setdefault(index, []).append(title)
    else:
        # assumption: fall back to the URL when a page has no title
        clusters.setdefault(index, []).append(links[item]["url"].encode("utf-8"))
op_file = open("op-mixed", "wb")
# assumption: one entry per line, with a blank line between clusters
for item in clusters:
    op_file.write("\n".join(clusters[item]) + "\n\n")
op_file.close()

In this code, links contains all the translated articles along with the articles written in English, and this is the complete result.
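
The MajorClust step is easier to follow on a tiny example, so here is a rough sketch of the idea in plain Python 3, without NumPy. It is an illustration and not the code I ran: the function name majorclust and the toy similarity matrix are made up, and it differs from the snippet above in two ways that are my own choices: it skips a node's own entry instead of zeroing the diagonal, and it only relabels a node on a strict improvement so that ties cannot flip back and forth. It also assumes the similarity matrix is symmetric.

```python
def majorclust(similarity):
    """Cluster nodes given a symmetric similarity matrix (list of lists).

    Every node starts in its own cluster and repeatedly adopts the label
    whose members have the largest summed similarity to it.
    """
    n = len(similarity)
    labels = list(range(n))
    changed = True
    while changed:
        changed = False
        for node in range(n):
            # Sum edge weights per candidate label, ignoring the node itself.
            weight_per_label = {}
            for other in range(n):
                if other != node:
                    label = labels[other]
                    weight_per_label[label] = (
                        weight_per_label.get(label, 0.0) + similarity[node][other]
                    )
            if not weight_per_label:
                continue  # a single node has no neighbours to follow
            best_label = max(weight_per_label, key=weight_per_label.get)
            # Switch only on a strict improvement over the current label.
            current_weight = weight_per_label.get(labels[node], 0.0)
            if weight_per_label[best_label] > current_weight:
                labels[node] = best_label
                changed = True
    return labels


# Toy data: documents 0-2 are similar to each other, as are documents 3-4.
similarity = [
    [0.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 0.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 0.0],
]
labels = majorclust(similarity)
```

On this toy matrix the first three documents end up sharing one label and the last two share another. The label values themselves carry no meaning; only which documents share a label matters.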


There are false positives in the results (such as kafila posts ending up in a cluster related to technology), but having results like

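गाय माता नहीं नागरिक है (The cow is a citizen, not a mother)
नरेंद्र सत्यवादी मोदी, राहुल सत्यवादी गांधी (Narendra "the truthful" Modi, Rahul "the truthful" Gandhi)
मोदी के भाषणों का संकलन (A compilation of Modi's speeches)
मत बनना उम्मीदवार भाई (Don't become a candidate, brother)
तो मेरे गाँव में बिजली आ रही है ! (So electricity is coming to my village!)
बकरी घास खाती है (The goat eats grass)
क़िस्सा ए शपथ ग्रहण (The tale of the oath-taking)

अब स्काइप, व्हाट्सएप और वाइबर पर चली तलवार (Now the sword falls on Skype, WhatsApp and Viber)
सावधान!! आपकी हर बातचीत और मैसेज का रिकॉर्ड है!! (Beware!! Every conversation and message of yours is on record!!)
Google Play: गूगल का हैरान कर देने वाला फीचर (Google Play: Google's astonishing feature)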
gives some hope that, with more work, refined data and better translations, we might end up with something more robust.


  • Work out a better way to test the results, one that gives a clearer idea of false positives and performance.
  • Try other algorithms on the same dataset and compare the results.
  • Extend the idea to include other languages too.
  • If possible, create a small proof-of-concept app around it.
  • Release the dataset in a format that others can use to get started.
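
For the first item, one possible starting point is cluster purity: the share of documents whose topic matches the majority topic of the cluster they landed in. The sketch below (Python 3, standard library only) assumes every document has a hand-assigned topic label, which my dataset does not have yet; cluster_purity and the example labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_labels, true_labels):
    """Return the fraction of documents that match their cluster's majority topic."""
    if not true_labels or len(cluster_labels) != len(true_labels):
        raise ValueError("need two non-empty label lists of equal length")
    members = defaultdict(list)
    for cluster, truth in zip(cluster_labels, true_labels):
        members[cluster].append(truth)
    # Count, per cluster, how many documents carry its most common topic.
    majority_total = sum(
        Counter(truths).most_common(1)[0][1] for truths in members.values()
    )
    return majority_total / len(true_labels)


# Hypothetical example: five documents with hand-assigned topics.
cluster_labels = [1, 1, 1, 4, 4]
true_labels = ["politics", "politics", "tech", "tech", "tech"]
purity = cluster_purity(cluster_labels, true_labels)
```

Here one tech document sits in a mostly political cluster, so four of the five documents agree with their cluster's majority topic and the purity is 0.8. Purity rewards many tiny clusters, so it would only be a first rough check and not a full evaluation.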


Thanks to punchagan and 9 for all the conversations, ideations, suggestions, critiques etc :)