My search took me to Lucene’s HindiStemmer, which in turn led me to the paper by Ananthakrishnan Ramanathan and Durgesh D Rao: A Lightweight Stemmer for Hindi [PDF]. It was a good initiation to how some simple rules could stem most Hindi words.
The last two weeks, I have been researching options for processing free text. I think I have explored the entire spectrum of possibilities. Below are some notes that I can revisit in a few months and not spend the same effort again.
I was looking at a way to process auto-generated tweets, like the ones on http://twitter.com/moneyvidya_com. Some sample tweets:
- #moneyvidya arunthestocksguru (5 Star rated) says Buy Vijay Shanthi Builders – 6m (Monday 29 March 2010 @ 09:55 … http://bit.ly/bd5JgC
- #moneyvidya arunthestocksguru (5 Star rated) says Buy Bhagwati Banquets & Hotels – 6m (Monday 29 March 2010 @ 09… http://bit.ly/9MzRDG
- #moneyvidya NSV (5 Star rated) says Buy ACC – 6m (Wednesday 24 March 2010 @ 09:55 AM): http://bit.ly/b5xTrN
- #moneyvidya justsurjit (5 Star rated) says Sell Sesa Goa – Intraday (Monday 22 March 2010 @ 10:31 AM): http://bit.ly/9lLo8U
As it is clear, the text follows a specific format, but has its own little variations. I intended to process the ‘insights’ and see each expert’s success rate. Although I never got around actually completing the task, I did learn a lot about text processing.
The apprentice – Regular Expressions
The first approach was the most obvious one – regular expressions. I am sure RegEx would have addressed the particular task at hand. But the parsing expression would become a convoluted mess very soon. So I started looking for better alternatives.
The strict teacher – Lexical Analysis
Lexical analysis starts where regular expression give up. This also needs pretty strict rules on the allowed input text, but the rules could be a lot more flexible and easy to comprehend.
The guru – Natural Language Processing
Processing test using tools like NLTK, allows you to parse and understand any unstructured text and understand it. Although this gives you maximum freedom, it also needs a lot of work to get right. For this to produce good results, be sure you have lots of data to be able to tweak and test your implementation. I guess this is the reason Google and co., can do so much better at translation, since they have huge data available for improving.
I don’t have one :). I guess, there are several ways to solve a problem, and half the solution is to identify the best way to solve the given problem. As for me, it was a good learning exercise and may come in handy if I ever write a DSL.