Multilingual Document Clustering
After my last post I followed up on the TODO list to see if I could get simple multilingual document clustering (MDC) working. The first step was to check whether translating the Hindi documents into English word by word, and then running the clustering algorithm over the translated text, gives the same results as clustering the documents in their original language.
I used the translate Python package to convert all the documents to English with the following code:
from translate import Translator
from nltk.corpus import stopwords
from os import path
import pickle
import string

# adding Hindi punctuation (danda and double danda) to regular English punctuation
unicode_punctuation = string.punctuation + u"\u0964\u0965"
translate_table = dict((ord(char), u" ") for char in unicode_punctuation)

stop_words = [word.decode("utf-8") for word in stopwords.words("hindi")]
translator = Translator(to_lang="en", from_lang="hi")

# load the cached word translations, if any, so we don't hit the
# translation service again for words we have already seen
if path.isfile("hi_dict"):
    hi_dict = pickle.load(open("hi_dict", "rb"))
else:
    hi_dict = {}

for link in links:
    # strip punctuation and collapse whitespace
    content = link["content"].translate(translate_table)
    content = " ".join(content.split())
    tr_words = []
    for word in content.split():
        if word in stop_words:
            continue
        if word in hi_dict:
            # already translated on an earlier pass
            tr_words.append(hi_dict[word])
        elif word.isdigit() or not word.isalpha():
            # numbers and other non-alphabetic tokens pass through unchanged
            tr_words.append(word)
        elif word.istitle() and word.lower() not in hi_dict:
            # title-cased Latin-script tokens are most likely proper nouns;
            # keep them as they are
            print word
            tr_words.append(word)
        else:
            eng_word = translator.translate(word.encode("utf-8"))
            hi_dict[word] = eng_word
            print word, eng_word
            tr_words.append(eng_word)
    link["content"] = " ".join(tr_words)

# persist the translation cache for the next run
pickle.dump(hi_dict, open("hi_dict", "wb"))
Here links is a simple data structure which stores the URLs and text content of those web pages. At this stage I performed clustering over the translated text, and the results I got were in sync with those from my previous post.
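For reference, links is just a list of dictionaries; a minimal sketch of its shape is below. Only the keys ("url", "title", "content") are taken from the code above, the values here are placeholders:

# hypothetical example of the links structure: "content" holds the page
# text, "url" the source address, and "title" is present for some pages
links = [
    {
        "url": u"http://example.com/some-article",
        "title": u"Some article title",
        "content": u"Full text of the article ...",
    },
    # ... one dictionary per crawled page
]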
With this hypothesis working, I mixed the English data which I used in my first post with this translated data. After shuffling the data and playing around with the parameters, I used the following code to run the clustering algorithm:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from random import shuffle
from sklearn.metrics.pairwise import linear_kernel

shuffle(links)
blobs = [link["content"] for link in links]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(2, 3),
                             max_features=100000)
corpus_mat = vectorizer.fit_transform(blobs)

# pairwise cosine similarities; TfidfVectorizer L2-normalizes the rows,
# so the linear kernel gives the cosine similarity
cosine_similarities = np.zeros((len(blobs), len(blobs)))
for i in range(len(blobs)):
    cosine_similarities[i] = linear_kernel(corpus_mat[i:i+1], corpus_mat).flatten()
    cosine_similarities[i, i] = 0  # ignore self-similarity

# MajorClust algorithm: every document starts in its own cluster and is
# repeatedly moved to the cluster with the highest aggregated edge weight,
# until no document changes its cluster
t = False
indices = np.arange(len(links))
while not t:
    t = True
    for index in np.arange(len(links)):
        # aggregating edge weights per cluster, then picking the heaviest
        new_index = np.argmax(np.bincount(indices,
                                          weights=cosine_similarities[index]))
        if indices[new_index] != indices[index]:
            indices[index] = indices[new_index]
            t = False

# organizing the end results: group titles (or URLs, when a page has
# no title) by cluster label
clusters = {}
for item, index in enumerate(indices):
    if "title" in links[item]:
        title = links[item]["title"].encode("utf-8")
        clusters.setdefault(index, []).append(title)
    else:
        clusters.setdefault(index, []).append(links[item]["url"].encode("utf-8"))

op_file = open("op-mixed", "wb")
for item in clusters:
    op_file.write("\n" + 80 * "=" + "\n")
    op_file.write("\n".join(clusters[item]))
op_file.close()
For this run, links contains all the translated articles together with the articles originally written in English.
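As a side note on the similarity loop above: since TfidfVectorizer L2-normalizes its rows by default, the whole cosine similarity matrix can also be computed in a single call. An equivalent sketch:

# equivalent, vectorized form of the similarity computation:
# with L2-normalized TF-IDF rows the linear kernel is the cosine similarity
cosine_similarities = linear_kernel(corpus_mat, corpus_mat)
np.fill_diagonal(cosine_similarities, 0)  # ignore self-similarity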
Inferences
There are false positives in the results (like kafila posts being part of the cluster related to technology), but having results like

http://kafila.org/2013/12/08/aap-halts-bjp-advance-in-delhi/
http://kafila.org/2013/11/15/first-terrorist-of-independent-india/

and

http://arstechnica.com/tech-policy/2013/12/nsa-collects-nearly-5-billion-cellphone-location-records-per-day/
http://arstechnica.com/tech-policy/2013/12/microsoft-google-apple-call-for-end-to-nsas-bulk-data-collection/
http://techcrunch.com/2013/12/11/google-android-device-manager-play-store/?utm_campaign=fb&%3Fncid=fb

ending up in the right clusters gives some hope that with some more work, refined data, and better translations we might have something more robust.
TODO
- Work on a better way to evaluate the results, one which gives a clearer picture of false positives and overall performance (see the sketch after this list).
- Try other algorithms on the same dataset and compare the results.
- Extend the idea to include other languages too.
- If possible, create a small proof-of-concept app around it.
- Release the dataset in a format which others can use to get started.
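On the first point, a simple first cut could be to hand-label each article with a topic and score the clustering with the adjusted Rand index from scikit-learn. A minimal sketch, assuming a hypothetical hand-assigned "topic" key on each entry in links:

from sklearn.metrics import adjusted_rand_score

# hypothetical ground truth: one hand-assigned topic label per article,
# in the same order as links (e.g. "politics", "technology", ...)
true_topics = [link["topic"] for link in links]

# indices holds the final cluster label MajorClust assigned to each article;
# adjusted_rand_score is 1.0 for a perfect match, close to 0.0 for random
print adjusted_rand_score(true_topics, indices)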