Snowball Stemmer – NLP
  • Last Updated : 14 Oct, 2020

Snowball Stemmer: It is a stemming algorithm also known as the Porter2 stemming algorithm. It is an improved version of the Porter Stemmer that fixes some of the original algorithm’s issues.

First, let’s look at what stemming is.

Stemming: It is the process of reducing a word to its stem by removing affixes such as suffixes and prefixes. The stem is not necessarily a valid dictionary word (a lemma); ‘university’, for instance, is stemmed to ‘univers’. In simple words, stemming reduces a word to a base form so that related words fall under a common stem. For example, the words care, cared and caring all share the stem ‘care’. Stemming is a common preprocessing step in natural language processing (NLP).

A few common rules of Snowball stemming are:

ILY  -----> ILI
LY   -----> Nil
SS   -----> SS
S    -----> Nil
ED   -----> E, Nil
  • Nil means the suffix is simply removed and replaced with nothing.
  • How a rule applies can vary with the word. Take the suffix ‘ed’: ‘cared’ and ‘bumped’ are stemmed to ‘care’ and ‘bump’, so in ‘cared’ only the ‘d’ is effectively removed, not the whole ‘ed’. Likewise, ‘stemmed’ is reduced to ‘stem’ and not ‘stemm’, because the doubled consonant left after removing ‘ed’ is trimmed as well. The suffix that gets removed therefore depends on the word.
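The rules listed above can be sketched as a toy suffix replacer. This is only an illustration and not the real Snowball algorithm: the names TOY_RULES and toy_stem are hypothetical, and the sketch ignores the word-dependent conditions that the real stemmer checks.

```python
# Toy suffix rules copied from the list above; NOT the real Snowball algorithm.
# Each pair is (suffix, replacement). Longer suffixes come first so that
# 'ily' is tried before 'ly' and 'ss' before 's'.
TOY_RULES = [
    ("ily", "ili"),
    ("ly", ""),
    ("ss", "ss"),
    ("ed", ""),
    ("s", ""),
]


def toy_stem(word):
    """Apply the first matching toy rule and return the result."""
    word = word.lower()
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["easily", "fairly", "glass", "sings", "bumped", "cared"]:
    print(w, "---->", toy_stem(w))
```

With these naive rules ‘cared’ comes out as ‘car’ rather than ‘care’, which shows why the real stemmer has to look at the rest of the word before deciding what to remove.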

Let’s see a few examples:



Word           Stem
cared          care
university     univers
fairly         fair
easily         easili
singing        sing
sings          sing
sung           sung
singer         singer
sportingly     sport

Code: Python implementation of the Snowball Stemmer using the NLTK library




from nltk.stem.snowball import SnowballStemmer

# The stemmer requires a language parameter
snow_stemmer = SnowballStemmer(language='english')

# List of tokenized words
words = ['cared', 'university', 'fairly', 'easily', 'singing',
         'sings', 'sung', 'singer', 'sportingly']

# Stem of each word
stem_words = [snow_stemmer.stem(w) for w in words]

# Print the stemming results
for word, stem in zip(words, stem_words):
    print(word + ' ----> ' + stem)

Output:

cared ----> care
university ----> univers
fairly ----> fair
easily ----> easili
singing ----> sing
sings ----> sing
sung ----> sung
singer ----> singer
sportingly ----> sport
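Since the stemmer requires a language parameter, it can help to check which values are accepted. The following is a minimal sketch, assuming NLTK is installed; the name ‘klingon’ is just a made-up example of an unsupported language.

```python
from nltk.stem.snowball import SnowballStemmer

# Languages accepted by the `language` parameter in the installed NLTK version
supported = SnowballStemmer.languages
print(supported)

# An unsupported language name is rejected with a ValueError
rejected = False
try:
    SnowballStemmer(language='klingon')  # hypothetical, unsupported name
except ValueError as err:
    rejected = True
    print('Rejected:', err)
```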

You can also quickly check which stem is returned for a given word using the Snowball site. Its demo section shows what the algorithm does for any words you enter.

Other Stemming Algorithms:

  • Porter Stemmer: This is an older stemming algorithm, developed by Martin Porter in 1980. Compared with the other algorithms mentioned here, it is a very gentle stemmer.
  • Lancaster Stemmer: It is the most aggressive of these stemming algorithms. When using its NLTK implementation, we can also add our own custom rules. Because it is so aggressive, it can sometimes produce strange stems.

There are other stemming algorithms as well.
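To get a feel for how gentle or aggressive these stemmers are, their output can be printed side by side. This is a rough sketch, assuming NLTK is installed; the word list is reused from the earlier example, and the output is not shown because it can vary between NLTK versions.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ['cared', 'university', 'fairly', 'easily', 'singing']

# Collect (word, Porter stem, Lancaster stem) rows for a side-by-side view
rows = [(w, porter.stem(w), lancaster.stem(w)) for w in words]

for word, p_stem, l_stem in rows:
    print(f'{word:<12} porter={p_stem:<12} lancaster={l_stem}')
```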

Difference Between Porter Stemmer and Snowball Stemmer:

  • Snowball Stemmer is more aggressive than Porter Stemmer.
  • Some issues in Porter Stemmer were fixed in Snowball Stemmer.
  • The two algorithms work in much the same way and differ only in a few rules.
  • Words like ‘fairly’ and ‘sportingly’ are stemmed to ‘fair’ and ‘sport’ by the Snowball Stemmer, but the Porter Stemmer stems them to ‘fairli’ and ‘sportingli’.
  • The word ‘sportingly’ shows the difference clearly: Snowball’s ‘sport’ is a recognizable base form, while Porter’s ‘sportingli’ only swaps the final ‘y’ for an ‘i’.
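The comparison above can be reproduced with a short script. This is a minimal sketch, assuming NLTK is installed and that PorterStemmer is used with its default settings; according to the points above, Porter should return ‘fairli’ and ‘sportingli’ while Snowball returns ‘fair’ and ‘sport’.

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

words = ['fairly', 'sportingly']

# Map each word to its (Porter stem, Snowball stem) pair
comparison = {w: (porter.stem(w), snowball.stem(w)) for w in words}

for word, (p_stem, s_stem) in comparison.items():
    print(f'{word}: porter={p_stem}, snowball={s_stem}')
```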

Drawbacks of Stemming:

  • Over-stemming (unrelated words reduced to the same stem) and under-stemming (related words left with different stems) can produce stems that are meaningless or inappropriate.
  • Stemming does not consider how a word is used. For example, ‘saw’ is stemmed to ‘saw’ regardless of whether it is used as a noun or a verb. Lemmatization addresses this by taking the part of speech into account: it returns ‘see’ when ‘saw’ is used as a verb and ‘saw’ when it is used as a noun.
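These drawbacks can be observed with the Snowball Stemmer itself. The following is a small sketch, assuming NLTK is installed; the pair ‘university’ and ‘universe’ is chosen here only as an illustration of over-stemming, while ‘sing’ and ‘sung’ show under-stemming.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

# Over-stemming: two different words end up with the same stem
over = {w: stemmer.stem(w) for w in ['university', 'universe']}

# Under-stemming: two forms of the same verb keep different stems
under = {w: stemmer.stem(w) for w in ['sing', 'sung']}

# No context: the stemmer only sees the string, not the part of speech
saw_stem = stemmer.stem('saw')

print(over)
print(under)
print(saw_stem)
```

A lemmatizer would need the part of speech as an extra input to tell the two uses of ‘saw’ apart, which a stemmer never receives.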
