I needed a library that could stem Hindi words written in Roman script (transliterated), but could not find one. My search took me to Lucene’s HindiStemmer, which in turn led me to the paper by Ananthakrishnan Ramanathan and Durgesh D Rao: A Lightweight Stemmer for Hindi [PDF]. It was a good introduction to how a few simple rules can stem most Hindi words. The problem was that it targets words written in Devanagari script, not Roman.
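To give a feel for what such simple rules might look like for Roman script, here is a minimal sketch of longest-match suffix stripping. The suffix list, the `stem` function, and the `min_stem_len` parameter are all assumptions made up for this illustration; they are not the rule set from the paper or from Lucene, and a usable list would have to be built from real transliterated text.

```python
# Hypothetical suffix list for romanized Hindi. These entries are
# illustrative assumptions, not the rules from the paper or from Lucene.
SUFFIXES = ["iyon", "iyan", "on", "en", "an", "e", "i", "a", "o"]


def stem(word, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    # Try longer suffixes first so "iyon" wins over "on".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["ladka", "ladke", "ladkiyon", "kitabon", "kitab"]:
        print(w, "->", stem(w))
```

With this toy list, "ladka", "ladke" and "ladkiyon" all reduce to the same stem. Transliteration is not standardized (the same word can be spelled several ways in Roman script), so a real stemmer would probably need a normalization step before any suffix rules apply.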
For the last two weeks, I have been researching options for processing free text. I think I have explored the entire spectrum of possibilities. Below are some notes that I can revisit in a few months without spending the same effort again.

Background

I was looking for a way to process auto-generated tweets, like the ones on http://twitter.com/moneyvidya_com. Some sample tweets:

#moneyvidya arunthestocksguru (5 Star rated) says Buy Vijay Shanthi Builders - 6m (Monday 29 March 2010 @ 09:55 … http://bit.
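Because these tweets are auto-generated, the simplest end of the spectrum is a regular expression that matches the template directly. The sketch below is based only on the single sample tweet above; the field names (`user`, `rating`, `action`, `stock`, `horizon`) and the reading of "6m" as a time horizon are my assumptions, and other tweets from the same account may not fit this pattern.

```python
import re

# Pattern inferred from one sample tweet; the group names are hypothetical labels.
TWEET_PATTERN = re.compile(
    r"#moneyvidya\s+(?P<user>\S+)\s+"
    r"\((?P<rating>\d+) Star rated\)\s+says\s+"
    r"(?P<action>\w+)\s+(?P<stock>.+?)\s+-\s+(?P<horizon>\S+)"
)


def parse_tweet(text):
    """Return the extracted fields as a dict, or None if the tweet does not match."""
    match = TWEET_PATTERN.search(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "#moneyvidya arunthestocksguru (5 Star rated) says "
        "Buy Vijay Shanthi Builders - 6m (Monday 29 March 2010 @ 09:55"
    )
    print(parse_tweet(sample))
```

A fixed pattern like this is brittle: it returns None as soon as the wording changes, which is a large part of why more general free-text approaches are worth looking at.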