I needed a library which could stem Hindi words written in roman script (transliterated), but could not find one.
My search took me to Lucene’s HindiStemmer, which in turn led me to the paper by Ananthakrishnan Ramanathan and Durgesh D Rao: A Lightweight Stemmer for Hindi [PDF]. It was a good initiation to how some simple rules could stem most Hindi words.
The last two weeks, I have been researching options for processing free text. I think I have explored the entire spectrum of possibilities. Below are some notes that I can revisit in a few months and not spend the same effort again.
I was looking at a way to process auto-generated tweets, like the ones on http://twitter.com/moneyvidya_com. Some sample tweets:
- #moneyvidya arunthestocksguru (5 Star rated) says Buy Vijay Shanthi Builders – 6m (Monday 29 March 2010 @ 09:55 … http://bit.ly/bd5JgC
- #moneyvidya arunthestocksguru (5 Star rated) says Buy Bhagwati Banquets & Hotels – 6m (Monday 29 March 2010 @ 09… http://bit.ly/9MzRDG
- #moneyvidya NSV (5 Star rated) says Buy ACC – 6m (Wednesday 24 March 2010 @ 09:55 AM): http://bit.ly/b5xTrN
- #moneyvidya justsurjit (5 Star rated) says Sell Sesa Goa – Intraday (Monday 22 March 2010 @ 10:31 AM): http://bit.ly/9lLo8U
As it is clear, the text follows a specific format, but has its own little variations. I intended to process the ‘insights’ and see each expert’s success rate. Although I never got around actually completing the task, I did learn a lot about text processing.
The apprentice – Regular Expressions
The first approach was the most obvious one – regular expressions. I am sure RegEx would have addressed the particular task at hand. But the parsing expression would become a convoluted mess very soon. So I started looking for better alternatives.
The strict teacher – Lexical Analysis
Lexical analysis starts where regular expression give up. This also needs pretty strict rules on the allowed input text, but the rules could be a lot more flexible and easy to comprehend.
I especially enjoyed using Irony, which makes it trivial to convert BNF formed rules to C# code. There is a good gentle introduction to lexical analysis using Irony on code project.
The guru – Natural Language Processing
Processing test using tools like NLTK, allows you to parse and understand any unstructured text and understand it. Although this gives you maximum freedom, it also needs a lot of work to get right. For this to produce good results, be sure you have lots of data to be able to tweak and test your implementation. I guess this is the reason Google and co., can do so much better at translation, since they have huge data available for improving.
I don’t have one :). I guess, there are several ways to solve a problem, and half the solution is to identify the best way to solve the given problem. As for me, it was a good learning exercise and may come in handy if I ever write a DSL.