Snowball Stemmer – NLP

Last Updated : 27 Dec, 2022

Snowball Stemmer: It is a stemming algorithm also known as the Porter2 stemming algorithm. It is an improved version of the Porter Stemmer that fixes several issues found in the original.

First, let’s look at what stemming is.

Stemming: It is the process of reducing a word to its word stem by stripping affixes such as suffixes and prefixes. The stem need not be a valid dictionary word; reducing a word to its dictionary form, known as a lemma, is the job of lemmatization. In simple words, stemming reduces a word to its base word or stem in such a way that related words fall under a common stem. For example, the words care, cared and caring share the stem ‘care’. Stemming is an important preprocessing step in natural language processing (NLP).
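
To make the idea of a common stem concrete, here is a minimal sketch that groups words by the stem returned for them. It assumes the nltk package is installed; the variable names are illustrative only. If the stemmer behaves as described above, all three words should end up under the single stem ‘care’.

```python
from collections import defaultdict

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

# Group each word under the stem the stemmer returns for it
groups = defaultdict(list)
for word in ['care', 'cared', 'caring']:
    groups[stemmer.stem(word)].append(word)

# Print every stem together with the words that map to it
for stem, members in groups.items():
    print(stem + ' <---- ' + ', '.join(members))
```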

A few common rules of Snowball stemming are:

ILY  -----> ILI
LY   -----> Nil
SS   -----> SS
S    -----> Nil
ED   -----> E,Nil
  • Nil means the suffix is replaced with nothing, i.e. it is simply removed.
  • These rules can vary depending on the word. For the suffix ‘ed’, the words ‘cared’ and ‘bumped’ are stemmed to ‘care’ and ‘bump’, so in ‘cared’ the net effect is that only ‘d’ is removed and not ‘ed’. Similarly, ‘stemmed’ is reduced to ‘stem’ and not ‘stemm’, because the doubled consonant left after removing ‘ed’ is trimmed as well. Therefore, how a suffix is handled depends on the word.
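
The rule table above can be sketched as a toy suffix replacer in plain Python. This is not the real Snowball algorithm, only an illustration under simplifying assumptions: the names TOY_RULES, DOUBLES and toy_stem are made up for this example, rules are tried longest suffix first, and the only word-dependent behaviour modelled is the trimming of a doubled consonant after ‘ed’ is removed. It does not reproduce cases such as ‘cared’ giving ‘care’.

```python
# Toy suffix rules taken from the table above, longest suffix first.
# This is NOT the real Snowball algorithm, only an illustration.
TOY_RULES = [
    ("ily", "ili"),
    ("ly", ""),
    ("ss", "ss"),
    ("s", ""),
    ("ed", ""),
]

# Doubled consonants reduced to a single letter after 'ed' is removed
DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")


def toy_stem(word):
    """Apply the first matching toy rule to a lowercase word."""
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            # Word-dependent step: 'stemmed' -> 'stemm' -> 'stem'
            if suffix == "ed" and stem.endswith(DOUBLES):
                stem = stem[:-1]
            return stem
    return word


for w in ["easily", "fairly", "caress", "cats", "bumped", "stemmed"]:
    print(w + " ----> " + toy_stem(w))
```

Trying the longest suffix first is what keeps ‘easily’ from being caught by the shorter ‘ly’ rule, and ‘caress’ from losing its final ‘s’.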

Let’s see a few examples:

Word           Stem
cared          care
university     univers
fairly         fair
easily         easili
singing        sing
sings          sing
sung           sung
singer         singer
sportingly     sport

Code: Python implementation of the Snowball Stemmer using the NLTK library

from nltk.stem.snowball import SnowballStemmer

# The stemmer requires a language parameter
snow_stemmer = SnowballStemmer(language='english')

# List of tokenized words
words = ['cared', 'university', 'fairly', 'easily', 'singing',
         'sings', 'sung', 'singer', 'sportingly']

# Stem of each word
stem_words = [snow_stemmer.stem(w) for w in words]

# Print the stemming results
for word, stem in zip(words, stem_words):
    print(word + ' ----> ' + stem)


Output:

cared ----> care
university ----> univers
fairly ----> fair
easily ----> easili
singing ----> sing
sings ----> sing
sung ----> sung
singer ----> singer
sportingly ----> sport
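
Since the stemmer requires a language parameter, it can help to check which languages are available. The sketch below assumes the nltk package is installed and uses the languages attribute of SnowballStemmer; the exact list depends on the installed NLTK version, and ‘klingon’ is just a made-up unsupported name.

```python
from nltk.stem.snowball import SnowballStemmer

# Languages supported by the installed NLTK version
print(SnowballStemmer.languages)

# An unsupported language name is rejected with a ValueError
try:
    SnowballStemmer(language='klingon')
    rejected = False
except ValueError as err:
    rejected = True
    print('Error:', err)
```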

You can also quickly check which stem is returned for a given word or words on the Snowball site. Its demo section shows what the algorithm does for different words.

Other Stemming Algorithms:

  • Porter Stemmer: This is an older stemming algorithm, developed by Martin Porter in 1980. Compared to other algorithms, it is a very gentle stemming algorithm.
  • Lancaster Stemmer: It is the most aggressive of the stemming algorithms discussed here. When it is used through the NLTK package, we can also add our own custom rules. Because it is so aggressive, it can sometimes produce strange stems.

There are other stemming algorithms as well.
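
To see how the three stemmers mentioned in this article behave on the same input, a minimal side-by-side sketch follows. It assumes the nltk package is installed and uses the default settings of each class; the dictionary and variable names are illustrative only. No expected output is shown because the stems depend on each algorithm.

```python
from nltk.stem import LancasterStemmer, PorterStemmer
from nltk.stem.snowball import SnowballStemmer

stemmers = {
    'porter': PorterStemmer(),
    'snowball': SnowballStemmer(language='english'),
    'lancaster': LancasterStemmer(),
}

words = ['cared', 'university', 'fairly', 'sportingly']

# Collect the stem each algorithm produces for each word
table = {w: {name: s.stem(w) for name, s in stemmers.items()} for w in words}

for w, stems in table.items():
    print(w + ' ----> ' + ', '.join(name + ': ' + stem for name, stem in stems.items()))
```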

Difference Between Porter Stemmer and Snowball Stemmer:

  • Snowball Stemmer is more aggressive than Porter Stemmer.
  • Some issues in Porter Stemmer were fixed in Snowball Stemmer.
  • The two algorithms differ only slightly in how they work.
  • Words like ‘fairly’ and ‘sportingly’ are stemmed to ‘fair’ and ‘sport’ by the Snowball Stemmer, but the Porter Stemmer stems them to ‘fairli’ and ‘sportingli’.
  • The difference between the two algorithms is clearly visible in the way each stems the word ‘sportingly’: the Snowball Stemmer produces the more accurate stem.
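
The differences listed above can be checked with a short sketch that keeps only the words on which the two stemmers disagree. It assumes the nltk package is installed and that PorterStemmer is used with its default settings; the word list and variable names are illustrative.

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

words = ['fairly', 'sportingly', 'cared', 'singing']

# Keep only the words where the two stemmers return different stems
differences = {}
for w in words:
    p, s = porter.stem(w), snowball.stem(w)
    if p != s:
        differences[w] = (p, s)

for w, (p, s) in differences.items():
    print(w + ' ----> porter: ' + p + ', snowball: ' + s)
```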

Drawbacks of Stemming:

  • Over-stemming and under-stemming may produce stems that are not meaningful or are inappropriate.
  • Stemming does not consider how a word is being used. For example, the word ‘saw’ is stemmed to ‘saw’ itself, regardless of whether it is used as a noun or a verb in context. Lemmatization is used for this reason: it takes the usage into account and returns either ‘see’ or ‘saw’ depending on whether ‘saw’ was used as a verb or a noun.
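
The over-stemming and under-stemming issue from the first drawback can be observed with the Snowball Stemmer itself. The sketch below assumes the nltk package is installed; ‘universe’ is an extra word added here for illustration and does not appear in the examples above.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

# Over-stemming: words with different meanings collapse to one stem
over = {w: stemmer.stem(w) for w in ['university', 'universe']}

# Under-stemming: related forms of one verb keep different stems
under = {w: stemmer.stem(w) for w in ['sings', 'sung']}

print(over)
print(under)
```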
