Stemming transliterated Hindi

I needed a library that could stem Hindi words written in Roman script (transliterated Hindi), but could not find one.

My search took me to Lucene’s HindiStemmer, which in turn led me to the paper by Ananthakrishnan Ramanathan and Durgesh D Rao: A Lightweight Stemmer for Hindi [PDF]. It was a good introduction to how a few simple rules can stem most Hindi words. The problem was that it works on words written in Devanagari script, not Roman. So I decided to implement the logic for transliterated Hindi. After some refinement, I ended up with a large subset of what the paper does, because I wanted to keep the implementation simple.

After a few hours’ work spread over a few days, I ended up with one line of code. (I would earn almost nothing if I were paid by KLOC written.)

re.sub(r'(.{2,}?)([aeiougyn]+$)',r'\1', word)

For people who are regex challenged: the above regex deletes the trailing run of vowels and the letters g, y and n from the end of the word, but always leaves a stem at least 2 characters long, so that words like ‘aayenga’ do not vanish completely.
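
To make the two parts of the pattern easier to see, here is a minimal sketch that wraps the same regex in a helper. The names `SUFFIX_RE` and `stem` and the verbose-mode layout are my own choices for illustration; the pattern itself is unchanged from the one-liner.

```python
import re

# Same pattern as the one-liner, written in verbose mode so each part can
# carry a comment. SUFFIX_RE is an illustrative name, not from the post.
SUFFIX_RE = re.compile(
    r"""
    (.{2,}?)          # group 1: the stem, at least 2 characters, as short as possible
    ([aeiougyn]+$)    # group 2: trailing run of vowels, g, y, n up to the end of the word
    """,
    re.VERBOSE,
)


def stem(word):
    """Return `word` with its trailing vowel/g/y/n run removed (hypothetical helper)."""
    # Replace the whole match with group 1 only, i.e. keep the stem.
    return SUFFIX_RE.sub(r"\1", word)


print(stem("aayenga"))  # aa
print(stem("Bolungi"))  # Bol
```

The lazy `{2,}?` is what keeps the stem short: group 1 grows one character at a time until everything after it is made up of suffix letters only.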

The above regex will stem the words as below:

Input word    Stemmed word
Dost_i_       Dost
Dost_on_      Dost
Bol_iye_      Bol
Bol_ungi_     Bol
Bol_a_        Bol
Ja_na_        Ja
Ja_enge_      Ja
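
The table can be checked mechanically. The sketch below reruns the same regex over the rows above. The extra words at the end are inputs I picked for illustration (they are not from the table) and show a few behaviours worth knowing about: the character class is lowercase-only, very short words are left alone, and short words that consist mostly of suffix letters get cut down to two characters.

```python
import re


def stem(word):
    # The one-line stemmer from the post, unchanged.
    return re.sub(r'(.{2,}?)([aeiougyn]+$)', r'\1', word)


# (input word, expected stem) pairs taken from the table above.
TABLE = [
    ("Dosti", "Dost"),
    ("Doston", "Dost"),
    ("Boliye", "Bol"),
    ("Bolungi", "Bol"),
    ("Bola", "Bol"),
    ("Jana", "Ja"),
    ("Jaenge", "Ja"),
]

for word, expected in TABLE:
    assert stem(word) == expected, (word, stem(word))

# Illustrative edge cases (assumed inputs, not from the post).
print(stem("DOSTI"))  # DOSTI (uppercase letters are not in the character class)
print(stem("ka"))     # ka    (too short to leave a 2-character stem)
print(stem("main"))   # ma    (the trailing "in" is treated as a suffix)
```

Lowercasing the input before stemming would handle the first case; the last one is the kind of over-stemming that a stop-word list or a longer minimum stem length might address.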

What do you think? How could this be improved? What edge cases are not considered?