I needed a library that could stem Hindi words written in Roman script (transliterated), but could not find one.
My search took me to Lucene’s HindiStemmer, which in turn led me to the paper by Ananthakrishnan Ramanathan and Durgesh D Rao: A Lightweight Stemmer for Hindi [PDF]. It was a good introduction to how a few simple rules can stem most Hindi words. The problem was that it handles words written in Devanagari script, not Roman. So I decided to implement the logic for transliterated Hindi. After some refinement, I ended up with a large subset of what the paper does, because I wanted to keep the implementation simple.
After a few hours' work spread over a few days, I ended up with one line of code. (I would be paid almost nothing if I were paid per KLOC written.)
re.sub(r'(.{2,}?)([aeiougyn]+$)',r'\1', word)
For people who are regex-challenged: the regex above deletes the trailing run of vowels and the letters g, y, and n from the end of the word, but always leaves a stem at least 2 characters long, so that words like ‘aayenga’ do not completely vanish.
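To make the two parts of the pattern easier to see, here is a minimal sketch that spells the same regex out with `re.VERBOSE` comments. The names `SUFFIX_PATTERN` and `stem` are made up for illustration and are not from any library; the matching behaviour is meant to be the same as the one-liner above.

```python
import re

# Hypothetical names; the pattern mirrors the one-liner above.
SUFFIX_PATTERN = re.compile(
    r"""
    (.{2,}?)        # stem: at least two characters, matched lazily
    [aeiougyn]+     # suffix: trailing run of vowels and the letters g, y, n
    $               # the suffix must reach the end of the word
    """,
    re.VERBOSE,
)


def stem(word):
    """Strip the trailing suffix run, keeping a stem of at least two characters."""
    return SUFFIX_PATTERN.sub(r"\1", word)


print(stem("aayenga"))  # every letter is in the suffix class, yet 'aa' survives
print(stem("ja"))       # too short to match, so it is returned unchanged
```

Because the stem group is lazy, it stops growing as soon as everything after it is a suffix character, which is why ‘aayenga’ is cut down to exactly two characters rather than disappearing.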
The above regex will stem the words as below:
Input word | Stemmed word |
---|---|
Dost_i_ | Dost |
Dost_on_ | Dost |
Bol_iye_ | Bol |
Bol_ungi_ | Bol |
Bol_a_ | Bol |
Ja_na_ | Ja |
Ja_enge_ | Ja |
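The rows above can be checked mechanically. The following is a small sketch that wraps the one-liner in a function and asserts each input/output pair from the table; the function name `stem` and the `EXPECTED` dictionary are illustrative only.

```python
import re


def stem(word):
    # Same substitution as the one-liner in this post.
    return re.sub(r'(.{2,}?)([aeiougyn]+$)', r'\1', word)


# Input word -> stemmed word, copied from the table above.
EXPECTED = {
    "Dosti": "Dost",
    "Doston": "Dost",
    "Boliye": "Bol",
    "Bolungi": "Bol",
    "Bola": "Bol",
    "Jana": "Ja",
    "Jaenge": "Ja",
}

for word, expected in EXPECTED.items():
    result = stem(word)
    assert result == expected, (word, result)
    print(f"{word} -> {result}")
```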
What do you think? How could this be improved? What edge cases are not considered?