Stemming transliterated Hindi

I needed a library that could stem Hindi words written in Roman script (transliterated), but could not find one.

My search took me to Lucene’s HindiStemmer, which in turn led me to the paper by Ananthakrishnan Ramanathan and Durgesh D Rao: A Lightweight Stemmer for Hindi [PDF]. It was a good introduction to how a few simple rules can stem most Hindi words.
The problem was that it works on words written in Devanagari script, not Roman. So I decided to implement the logic for transliterated Hindi. After some refinement, I ended up with a large subset of what the paper does, because I wanted to keep the implementation simple.

After a few hours' work spread over a few days, I ended up with one line of code. (I would be paid almost nothing if I were paid by KLOC written.)

re.sub(r'(.{2,}?)([aeiougyn]+$)',r'\1', word)

For people who are regex-challenged: the above regex deletes the trailing run of vowels, along with g, y and n, from the end of the word, but leaves a stem of at least 2 characters, so that words like ‘aayenga’ do not vanish completely.
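To make the two parts of the pattern easier to see, here is a minimal sketch that wraps the same regex in a function and spells it out with re.VERBOSE. The names SUFFIX_PATTERN and stem are illustrative choices only; the pattern itself is unchanged from the one-liner above.

```python
import re

# Same pattern as the one-liner, written with re.VERBOSE so each part
# can carry a comment. SUFFIX_PATTERN and stem() are illustrative names.
SUFFIX_PATTERN = re.compile(
    r"""
    (.{2,}?)          # group 1: the stem, at least 2 characters, as short as possible
    ([aeiougyn]+$)    # group 2: trailing run of vowels and g, y, n, up to end of word
    """,
    re.VERBOSE,
)


def stem(word):
    """Drop the trailing run of vowels/g/y/n, keeping a stem of at least 2 characters."""
    return SUFFIX_PATTERN.sub(r"\1", word)


if __name__ == "__main__":
    for w in ["aayenga", "bola", "ki"]:
        print(w, "->", stem(w))
```

Because group 1 is lazy but must be at least 2 characters long, ‘aayenga’ comes out as ‘aa’ rather than an empty string, and a 2-letter word such as ‘ki’ cannot match at all and is returned unchanged.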

The above regex will stem the words as below:

Input word    Stemmed word
Dosti         Dost
Doston        Dost
Boliye        Bol
Bolungi       Bol
Bola          Bol
Jana          Ja
Jaenge        Ja
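The table can be reproduced with a short script. This is only a sketch: the helper name stem and the list name EXAMPLES are illustrative, and the word pairs are copied from the table above.

```python
import re


def stem(word):
    # The one-line regex from the post, unchanged.
    return re.sub(r'(.{2,}?)([aeiougyn]+$)', r'\1', word)


# (input word, expected stem) pairs copied from the table above.
EXAMPLES = [
    ("Dosti", "Dost"),
    ("Doston", "Dost"),
    ("Boliye", "Bol"),
    ("Bolungi", "Bol"),
    ("Bola", "Bol"),
    ("Jana", "Ja"),
    ("Jaenge", "Ja"),
]

for word, expected in EXAMPLES:
    result = stem(word)
    status = "ok" if result == expected else "MISMATCH"
    print(f"{word:<10}{result:<10}{status}")
```

One thing this makes visible: the character class lists only lower-case letters, so an all-caps word such as ‘DOSTI’ passes through unchanged. Lower-casing the input first would be one way to handle that, if it matters for your data.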

What do you think? How could this be improved? What edge cases are not considered?

By Hitesh


3 replies on “Stemming transliterated Hindi”

It was nice to see that you have used exactly the same logic I had thought of for Hindi stemming: first transliterating, then stemming. Do go ahead; I am thinking in the same direction too. We will definitely come up with some nice solutions.

Kudos for your work. Coming up with a solution is great. However, this stemmer is not really a useful one. I ran your stemmer on 50 words, and it does not give the best results:
sun -> su
main -> ma
hoon -> ho
thoda -> thod
sanki -> sank
karun -> kar
mann -> ma
ki -> ki
baby -> bab
gaana -> ga
lagade -> lagad
funky -> funk
nahi -> nah
dhan -> dh
ye -> ye
baat -> baat
hai -> ha
tere -> ter
tann -> ta
paagal -> paagal
ho -> ho
jaaun -> ja
jab -> jab
tu -> tu
rubaru -> rubar
na -> na
lamba -> lamb
ocha -> och
gora -> gor
chitta -> chitt
phir -> phir
bhi -> bh
dil -> dil
mein -> me
ishq -> ishq
ne -> ne
kiya -> ki
bekaboo -> bekab
jaisa -> jais
waisa -> wais
hi -> hi
pasand -> pasand
mujhko -> mujhk
jaanu -> ja
i -> i
just -> just
wanna -> wa
feel -> feel
your -> your
body -> bod
