Stemming transliterated Hindi

I needed a library which could stem Hindi words written in roman script (transliterated), but could not find one.

My search took me to Lucene’s HindiStemmer, which in turn led me to the paper by Ananthakrishnan Ramanathan and Durgesh D Rao: A Lightweight Stemmer for Hindi [PDF]. It was a good introduction to how a few simple rules can stem most Hindi words.
The problem was that it handles words written in Devanagari script, not Roman. So I decided to implement the logic for transliterated Hindi. After some refinement, I ended up with a large subset of what the paper does, because I wanted to keep the implementation simple.

After a few hours' work spread over a few days, I ended up with one line of code. (I would be paid almost nothing if I were paid by KLOC written.)

re.sub(r'(.{2,}?)([aeiougyn]+$)',r'\1', word)

For those who are regex-challenged: the regex above deletes the trailing run of vowels, along with g, y and n, from the end of the word, but leaves a stem at least 2 characters long, so that words like ‘aayenga’ do not vanish completely.
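To make the two parts of the pattern easier to see, here is a minimal sketch that spells out the same regex in verbose mode. The names `STEM_PATTERN` and `stem` are only illustrative and are not part of the original one-liner.

```python
import re

# Same pattern as the one-liner, written with re.VERBOSE so each part
# can carry a comment. STEM_PATTERN and stem are illustrative names.
STEM_PATTERN = re.compile(
    r"""
    (.{2,}?)         # group 1: the stem, at least 2 characters, as short as possible
    ([aeiougyn]+$)   # group 2: trailing run of vowels, g, y, n up to the end of the word
    """,
    re.VERBOSE,
)


def stem(word: str) -> str:
    """Keep group 1 (the stem) and drop the trailing run matched by group 2."""
    return STEM_PATTERN.sub(r"\1", word)


print(stem("aayenga"))  # every letter is in the suffix class, but 2 characters survive
print(stem("na"))       # too short to match, so it is returned unchanged
```

Because the first group is lazy, it stops at the shortest stem of 2 or more characters whose remainder consists only of suffix characters; a word with no such remainder is left as it is.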

The above regex will stem the words as below:

Input word    Stemmed word
Dosti         Dost
Doston        Dost
Boliye        Bol
Bolungi       Bol
Bola          Bol
Jana          Ja
Jaenge        Ja
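The table above can be checked with a small script. This is only a sketch: the `stem` wrapper and the `TABLE` list are illustrative names, and the regex itself is the one-liner unchanged.

```python
import re


def stem(word: str) -> str:
    # The one-line regex from the post, unchanged.
    return re.sub(r'(.{2,}?)([aeiougyn]+$)', r'\1', word)


# (input word, expected stem) pairs copied from the table above.
TABLE = [
    ("Dosti", "Dost"),
    ("Doston", "Dost"),
    ("Boliye", "Bol"),
    ("Bolungi", "Bol"),
    ("Bola", "Bol"),
    ("Jana", "Ja"),
    ("Jaenge", "Ja"),
]

for word, expected in TABLE:
    got = stem(word)
    status = "ok" if got == expected else "MISMATCH"
    print(f"{word:10} {got:10} {status}")
```

One thing this makes easy to try: the character class is lowercase only, so an all-caps input such as ‘DOSTI’ comes back unchanged. Lowercasing the word before stemming is one possible workaround, though that is an assumption on top of the original code.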

What do you think? How could this be improved? What edge cases are not considered?

1 comment

  1. It was nice to see that you have used exactly the same logic I had in mind for Hindi stemming: first transliterate, then stem. Do go ahead; I am thinking in the same direction. We will definitely come up with some nice solutions.
