Document clustering in Hindi

As it turns out, most clustering algorithms don't depend on grammar or language-specific constructs, so if we remove stopwords and apply proper stemming, we should be able to use the same MajorClust algorithm for Hindi blogs too. I looked around for active Hindi blogs and, via Indian Bloggers, found the कस्‍बा/qasba blog maintained by रवीश कुमार जी (Ravish Kumar). But the topics covered in that blog were mostly related to politics, so I needed a more diverse dataset to test the effectiveness of the clustering algorithm. I came across the Jagran Junction feed and was able to get feeds related to sports and technology in Hindi.

After some cleanup (removing duplicate articles, stripping punctuation, enforcing a minimum article length) and using almost the same code as we used for clustering English articles, I got decent, acceptable results for Hindi documents too.
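The cleanup steps could look roughly like the following. This is only a minimal sketch in Python 3 using the standard library, not the exact code I ran; the function names and the min_tokens threshold are assumptions made for illustration.

```python
import unicodedata


def strip_punctuation(text):
    """Replace every Unicode punctuation character with a space.

    Unicode categories starting with "P" cover ASCII punctuation as well
    as the Devanagari danda.
    """
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )


def clean_articles(articles, min_tokens=50):
    """Drop short and duplicate articles; min_tokens is a hypothetical cutoff."""
    seen = set()
    cleaned = []
    for text in articles:
        # Remove punctuation, then collapse runs of whitespace.
        normalized = " ".join(strip_punctuation(text).split())
        if len(normalized.split()) < min_tokens:
            continue  # too short to be useful for clustering
        if normalized in seen:
            continue  # exact duplicate after normalization
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned
```

Duplicates are detected only after punctuation and whitespace are normalized, so two copies of an article that differ in trailing punctuation still count as the same text.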

One thing I had to take care of while tokenizing Hindi text with the default TfidfVectorizer: the default re module in Python 2.7 was failing in my case on Unicode characters (1, 2), and it seems this is fixed in the regex module. So I used this custom code to get it working.

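import regex
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Compile the pattern once; regex is used here because re failed on Hindi text.
TOKEN_PATTERN = regex.compile(r"(?u)\b\w\w+\b")

def regex_tokenizer(doc):
    """Split a unicode string into a list of tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(doc)

# Python 2.7: the stopwords come back as byte strings, so decode them to unicode.
# This assumes a Hindi stopword list is available in the NLTK stopwords corpus.
stop_words = [word.decode("utf-8") for word in stopwords.words("hindi")]
vectorizer = TfidfVectorizer(stop_words=stop_words,
                             tokenizer=regex_tokenizer)
# blobs is the list of cleaned article texts (unicode strings).
tfidf_matrix = vectorizer.fit_transform(blobs)
features = vectorizer.get_feature_names()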
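If installing regex is not an option, a tokenizer can also be written with only the standard library. The sketch below is a different technique from the regex-based one: it walks the string and treats letters, combining marks (which include the Devanagari vowel signs), numbers, the underscore, and the zero-width joiners as word characters. The names is_word_char and simple_tokenizer are made up for this example, it targets Python 3, and it is not a drop-in match for the \w definition of either re or regex.

```python
import unicodedata

# Zero-width non-joiner and joiner, which can appear inside Hindi words.
JOINERS = {"\u200c", "\u200d"}


def is_word_char(ch):
    """True for letters (L*), combining marks (M*), numbers (N*), '_' and joiners."""
    return (
        ch == "_"
        or ch in JOINERS
        or unicodedata.category(ch)[0] in ("L", "M", "N")
    )


def simple_tokenizer(doc, min_len=2):
    """Split doc into runs of word characters at least min_len code points long."""
    tokens = []
    current = []
    for ch in doc:
        if is_word_char(ch):
            current.append(ch)
        else:
            if len(current) >= min_len:
                tokens.append("".join(current))
            current = []
    # Flush the final run if the text does not end with a separator.
    if len(current) >= min_len:
        tokens.append("".join(current))
    return tokens
```

Like the pattern above, it drops tokens shorter than two code points, and it can be passed to TfidfVectorizer through the tokenizer argument in the same way.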

Some additional points

TODO

  • Clean up the data further, apply stemming, and check the results.
  • Translate the text to English (word by word) and check whether clustering the translated text gives the same results.
  • If the previous experiment works out, move on to testing clustering of documents in multiple languages.
  • Analyze the important words returned for a document using tf-idf and LSA.
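
For the last item, a first look at the important words of a document could start from something like the sketch below. It is a standard-library approximation and not the output of TfidfVectorizer itself: it uses the smoothed idf formula ln((1 + n) / (1 + df)) + 1 that scikit-learn documents as its default, but skips the L2 normalization, which does not change the ranking of terms within a single document. The function name top_tfidf_terms is hypothetical.

```python
import math
from collections import Counter


def top_tfidf_terms(tokenized_docs, doc_index, k=3):
    """Return the k highest-scoring tf-idf terms of one tokenized document."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents each term occurs in.
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    # Raw term counts for the document of interest.
    tf = Counter(tokenized_docs[doc_index])
    scores = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]
```

The input is a list of token lists, for example the result of running the tokenizer over each cleaned article after stopword removal.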