Python – Lemmatization Approaches with Examples

The following is a step-by-step guide to the main lemmatization approaches in Python, with examples and code implementations. It is recommended that you follow the given order unless you already understand the topic, in which case you can jump to any of the approaches below.

What is Lemmatization?
In contrast to stemming, lemmatization is a lot more powerful. It looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words, aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

For clarity, look at the following examples given below:

Original Word ---> Root Word (lemma)      Feature

   meeting    --->   meet                (core-word extraction)
   was        --->    be                 (tense conversion to present tense)
   mice       --->   mouse               (plural to singular)

TIP: Always convert your text to lowercase before performing any NLP task, including lemmatization.
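Lowercasing is a one-liner in Python; since the lemmatizers below treat 'Mice' and 'mice' as different tokens, normalize case first:

```python
text = "The Mice Were Flying"

# Normalize case before any downstream NLP step
lowered = text.lower()

print(lowered)
#> the mice were flying
```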

Various Approaches to Lemmatization:
We will be going over 9 different approaches to perform Lemmatization along with multiple examples and code implementations.



  1. WordNet
  2. WordNet (with POS tag)
  3. TextBlob
  4. TextBlob (with POS tag)
  5. spaCy
  6. TreeTagger
  7. Pattern
  8. Gensim
  9. Stanford CoreNLP

1. WordNet Lemmatizer
WordNet is a publicly available lexical database, with versions in over 200 languages, that provides semantic relationships between words. It is one of the earliest and most commonly used lemmatization techniques.

  • It is available through the nltk library in Python.
  • WordNet links words through semantic relations (e.g. synonymy).
  • It groups synonyms into synsets.
    • synset: a group of data elements that are semantically equivalent.
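As a toy illustration of the synset idea (not the real WordNet API, whose synsets carry much richer structure, and with made-up entries rather than real WordNet data), a synset can be pictured as a named group of equivalent words:

```python
# Toy sketch of synsets: each synset name maps to semantically equivalent members.
# The entries below are illustrative, not pulled from the real WordNet database.
toy_synsets = {
    'car.n.01': ['car', 'auto', 'automobile', 'motorcar'],
    'talk.v.01': ['talk', 'speak'],
}

def toy_canonical(word):
    """Return the first member of the synset containing `word`, else the word itself."""
    for members in toy_synsets.values():
        if word in members:
            return members[0]
    return word

print(toy_canonical('automobile'))
#> car
```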

How to use:

  1. Install the nltk package : In your Anaconda prompt or terminal, type:
    pip install nltk
  2. Download WordNet data from nltk : In your Python console, run the following :
    import nltk
    nltk.download('wordnet')
    nltk.download('averaged_perceptron_tagger')

Code:

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
  
# Create WordNetLemmatizer object
wnl = WordNetLemmatizer()
  
# single word lemmatization examples
list1 = ['kites', 'babies', 'dogs', 'flying', 'smiling',
         'driving', 'died', 'tried', 'feet']
for words in list1:
    print(words + " ---> " + wnl.lemmatize(words))
      
#> kites ---> kite
#> babies ---> baby
#> dogs ---> dog
#> flying ---> flying
#> smiling ---> smiling
#> driving ---> driving
#> died ---> died
#> tried ---> tried
#> feet ---> foot

Code:

# sentence lemmatization examples
string = 'the cat is sitting with the bats on the striped mat under many flying geese'
  
# Converting String into tokens
list2 = nltk.word_tokenize(string)
print(list2)
#> ['the', 'cat', 'is', 'sitting', 'with', 'the', 'bats', 'on',
#   'the', 'striped', 'mat', 'under', 'many', 'flying', 'geese']
  
lemmatized_string = ' '.join([wnl.lemmatize(words) for words in list2])
  
print(lemmatized_string)   
#> the cat is sitting with the bat on the striped mat under many flying goose

2. WordNet Lemmatizer (with POS tag)
In the above approach, we observed that the WordNet results were not up to the mark. Words like ‘sitting’ and ‘flying’ remained unchanged after lemmatization, because the lemmatizer treats every word as a noun by default rather than using its actual role in the sentence. To overcome this, we supply POS (Part of Speech) tags.

We add a tag with a particular word defining its type (verb, noun, adjective etc). 
For Example,

Word      +    Type (POS tag)     —>     Lemmatized Word
driving    +    verb      ‘v’            —>     drive
dogs       +    noun      ‘n’           —>     dog
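The tag passed to the lemmatizer is a single WordNet letter derived from the first character of the Penn Treebank tag. A minimal sketch of that mapping (the same idea the pos_tagger functions in this article implement; here unknown tags fall back to 'n', matching WordNetLemmatizer's noun default):

```python
# Map a Penn Treebank tag (e.g. 'VBG', 'NNS') to a WordNet POS letter.
# Unknown tags fall back to 'n' (noun), WordNetLemmatizer's default.
PENN_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def to_wordnet_pos(penn_tag):
    return PENN_TO_WORDNET.get(penn_tag[0], 'n')

print(to_wordnet_pos('VBG'))  #> v
print(to_wordnet_pos('NNS'))  #> n
print(to_wordnet_pos('IN'))   #> n  (fallback)
```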

Code:


# WORDNET LEMMATIZER (with appropriate pos tags)
  
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
  
lemmatizer = WordNetLemmatizer()
  
# Define function to lemmatize each word with its POS tag
  
# POS_TAGGER_FUNCTION : TYPE 1
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None
  
sentence = 'the cat is sitting with the bats on the striped mat under many badly flying geese'
  
# tokenize the sentence and find the POS tag for each token
pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
  
print(pos_tagged)
#>[('the', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('with', 'IN'), 
# ('the', 'DT'), ('bats', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('striped', 'JJ'), 
# ('mat', 'NN'), ('under', 'IN'), ('many', 'JJ'), ('badly', 'RB'), ('flying', 'VBG'), 
# ('geese', 'JJ')]
  
# As you may have noticed, the above pos tags are a little confusing.
  
# we use our own pos_tagger function to make things simpler to understand.
wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged))
print(wordnet_tagged)
#>[('the', None), ('cat', 'n'), ('is', 'v'), ('sitting', 'v'), ('with', None), 
# ('the', None), ('bats', 'n'), ('on', None), ('the', None), ('striped', 'a'), 
# ('mat', 'n'), ('under', None), ('many', 'a'), ('badly', 'r'), ('flying', 'v'), 
# ('geese', 'a')]
  
lemmatized_sentence = []
for word, tag in wordnet_tagged:
    if tag is None:
        # if there is no available tag, append the token as is
        lemmatized_sentence.append(word)
    else:        
        # else use the tag to lemmatize the token
        lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
lemmatized_sentence = " ".join(lemmatized_sentence)
  
print(lemmatized_sentence)
#> the cat be sit with the bat on the striped mat under many badly fly geese

3. TextBlob
TextBlob is a Python library for processing textual data. It provides a simple API for performing common NLP tasks.
Install the TextBlob package : In your Anaconda prompt or terminal, type:
pip install textblob

Code:

from textblob import TextBlob, Word
  
my_word = 'cats'
  
# create a Word object
w = Word(my_word)
  
print(w.lemmatize())
#> cat
  
sentence = 'the bats saw the cats with stripes hanging upside down by their feet.'
  
s = TextBlob(sentence)
lemmatized_sentence = " ".join([w.lemmatize() for w in s.words])
  
print(lemmatized_sentence)
#> the bat saw the cat with stripe hanging upside down by their foot

4. TextBlob (with POS tag)
Just as with the WordNet approach without appropriate POS tags, we observe the same limitations here. So we use one of the more powerful features of the TextBlob module, ‘Part of Speech’ tagging, to overcome this problem.

Code:

from textblob import TextBlob
  
# Define function to lemmatize each word with its POS tag
  
# POS_TAGGER_FUNCTION : TYPE 2
def pos_tagger(sentence):
    sent = TextBlob(sentence)
    tag_dict = {"J": 'a', "N": 'n', "V": 'v', "R": 'r'}
    words_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]    
    lemma_list = [wd.lemmatize(tag) for wd, tag in words_tags]
    return lemma_list
  
# Lemmatize
sentence = "the bats saw the cats with stripes hanging upside down by their feet"
lemma_list = pos_tagger(sentence)
lemmatized_sentence = " ".join(lemma_list)
print(lemmatized_sentence)
#> the bat saw the cat with stripe hang upside down by their foot
  
# For comparison, lemmatize the same sentence without POS tags
t_blob = TextBlob(sentence)
lemmatized_sentence = " ".join([w.lemmatize() for w in t_blob.words])
print(lemmatized_sentence)
#> the bat saw the cat with stripe hanging upside down by their foot

A full table of POS tag abbreviations with their meanings can be found in the Penn Treebank tag set documentation.
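For quick reference, here is a small subset of the Penn Treebank tags that appear in the outputs above:

```python
# A few common Penn Treebank POS tags (subset, for reference)
PENN_TAGS = {
    'DT':  'determiner',
    'IN':  'preposition or subordinating conjunction',
    'JJ':  'adjective',
    'NN':  'noun, singular or mass',
    'NNS': 'noun, plural',
    'RB':  'adverb',
    'VBG': 'verb, gerund or present participle',
    'VBZ': 'verb, 3rd person singular present',
}

print(PENN_TAGS['VBG'])
#> verb, gerund or present participle
```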

5. spaCy
spaCy is an open-source Python library that parses and “understands” large volumes of text. Separate models are available for specific languages (English, French, German, etc.).

Install the spaCy package :(a) Open your Anaconda prompt or terminal as administrator and run the command:
                pip install -U spacy
                
            (b) Then, open your Anaconda prompt or terminal normally and run the command:
                python -m spacy download en

If successful, you should see a message like:

    Linking successful
    C:\Anaconda3\envs\spacyenv\lib\site-packages\en_core_web_sm -->
    C:\Anaconda3\envs\spacyenv\lib\site-packages\spacy\data\en

You can now load the model via nlp = spacy.load('en_core_web_sm')

Code:

import spacy
nlp = spacy.load('en_core_web_sm')
  
# Create a Doc object
doc = nlp(u'the bats saw the cats with best stripes hanging upside down by their feet')
  
# Create list of tokens from given string
tokens = []
for token in doc:
    tokens.append(token)
  
print(tokens)
#> [the, bats, saw, the, cats, with, best, stripes, hanging, upside, down, by, their, feet]
  
lemmatized_sentence = " ".join([token.lemma_ for token in doc])
  
print(lemmatized_sentence)
#> the bat see the cat with good stripe hang upside down by -PRON- foot

In the above code, we observed that this approach was more powerful than our previous approaches:

  • Even pronouns were detected (identified by -PRON-, the lemma spaCy 2.x assigns to pronouns).
  • Even ‘best’ was changed to ‘good’.

6. TreeTagger
The TreeTagger is a tool for annotating text with part-of-speech and lemma information. The TreeTagger has been successfully used to tag over 25 languages and is adaptable to other languages if a manually tagged training corpus is available.

Word         POS    Lemma
the          DT     the
TreeTagger   NP     TreeTagger
is           VBZ    be
easy         JJ     easy
to           TO     to
use          VB     use
.            SENT   .

How to use: 
1. Install the treetaggerwrapper package : In your Anaconda prompt or terminal, type:
                      pip install treetaggerwrapper

2. Download the TreeTagger software: go to the TreeTagger website and download the version for your OS. 
(Installation steps are given on the website.)

Code:



# 6. TREETAGGER LEMMATIZER
import pandas as pd
import treetaggerwrapper as tt
  
# use a raw string for the Windows path so backslashes are not treated as escapes
t_tagger = tt.TreeTagger(TAGLANG ='en', TAGDIR =r'C:\Windows\TreeTagger')
  
pos_tags = t_tagger.tag_text("the bats saw the cats with best stripes hanging upside down by their feet")
  
original = []
lemmas = []
tags = []
for t in pos_tags:
    original.append(t.split('\t')[0])
    tags.append(t.split('\t')[1])
    lemmas.append(t.split('\t')[-1])
  
Results = pd.DataFrame({'Original': original, 'Lemma': lemmas, 'Tags': tags})
print(Results)
  
#>      Original  Lemma Tags
# 0       the     the   DT
# 1      bats     bat  NNS
# 2       saw     see  VVD
# 3       the     the   DT
# 4      cats     cat  NNS
# 5      with    with   IN
# 6      best    good  JJS
# 7   stripes  stripe  NNS
# 8   hanging    hang  VVG
# 9    upside  upside   RB
# 10     down    down   RB
# 11       by      by   IN
# 12    their   their  PP$
# 13     feet    foot  NNS

7. Pattern
Pattern is a Python package commonly used for web mining, natural language processing, machine learning and network analysis. Among its many useful NLP capabilities, it has a special lemma-related feature which we discuss below.

How to use: 
Install the pattern package: In your anaconda prompt or terminal, type:
               pip install pattern

Code:

# PATTERN LEMMATIZER
import pattern
from pattern.en import lemma, lexeme
from pattern.en import parse
  
sentence = "the bats saw the cats with best stripes hanging upside down by their feet"
  
lemmatized_sentence = " ".join([lemma(word) for word in sentence.split()])
  
print(lemmatized_sentence)
#> the bat see the cat with best stripe hang upside down by their feet
  
# Special Feature : to get all possible lemmas for each word in the sentence
all_lemmas_for_each_word = [lexeme(wd) for wd in sentence.split()]
print(all_lemmas_for_each_word)
  
#> [['the', 'thes', 'thing', 'thed'], 
#   ['bat', 'bats', 'batting', 'batted'], 
#   ['see', 'sees', 'seeing', 'saw', 'seen'], 
#   ['the', 'thes', 'thing', 'thed'], 
#   ['cat', 'cats', 'catting', 'catted'], 
#   ['with', 'withs', 'withing', 'withed'], 
#   ['best', 'bests', 'besting', 'bested'], 
#   ['stripe', 'stripes', 'striping', 'striped'], 
#   ['hang', 'hangs', 'hanging', 'hung'], 
#   ['upside', 'upsides', 'upsiding', 'upsided'], 
#   ['down', 'downs', 'downing', 'downed'], 
#   ['by', 'bies', 'bying', 'bied'], 
#   ['their', 'theirs', 'theiring', 'theired'], 
#   ['feet', 'feets', 'feeting', 'feeted']]

NOTE : if the above code raises "RuntimeError: generator raised StopIteration", this is a known incompatibility between pattern and Python 3.7+ (PEP 479). Simply run it again; it usually works after a few tries.

8. Gensim
Gensim is designed to handle large text collections using data streaming. Its lemmatization facility is based on the pattern package we installed above. 

  • The gensim.utils.lemmatize() function can be used to perform lemmatization. (Note: this function was removed in Gensim 4.0, so the code below requires Gensim 3.x.)
  • It uses pattern's lemmatizer to extract UTF-8-encoded tokens in their base form, the lemma.
  • By default it only considers nouns, verbs, adjectives and adverbs (all other tokens are discarded).
  • For example
Word          --->  Lemmatized Word 
are/is/being  --->  be
saw           --->  see
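Each token returned by gensim.utils.lemmatize() is a UTF-8 bytes object of the form b'lemma/POS'. As a minimal sketch of the decoding step, using hardcoded sample tokens in place of real gensim output so it runs without the library:

```python
# Sample tokens shaped like gensim.utils.lemmatize() output
# (hardcoded stand-ins, not produced by gensim itself)
raw_tokens = [b'bat/NN', b'see/VB', b'cat/NN', b'best/JJ']

# Decode each bytes token and split off the POS suffix to keep just the lemma
lemmas = [tok.decode('utf-8').split('/')[0] for tok in raw_tokens]

print(lemmas)
#> ['bat', 'see', 'cat', 'best']
```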
How to use: 
1. Download Pattern package: In your anaconda prompt or terminal, type:
                  pip install pattern
                  
2. Download Gensim package: Open your anaconda prompt or terminal as administrator and type:
                 pip install -U gensim
                        OR
                 pip3 install -U gensim (if using python3)

Code:

from gensim.utils import lemmatize
  
sentence = "the bats saw the cats with best stripes hanging upside down by their feet"
  
lemmatized_sentence = [word.decode('utf-8') for word in lemmatize(sentence)]
  
print(lemmatized_sentence)
#> ['bat/NN', 'see/VB', 'cat/NN', 'best/JJ', 
#   'stripe/NN', 'hang/VB', 'upside/RB', 'foot/NN']

NOTE : as with Pattern above, if the above code raises "RuntimeError: generator raised StopIteration", this comes from pattern's incompatibility with Python 3.7+ (PEP 479). Simply run it again; it usually works after a few tries.

As you may have noticed in the above output, the gensim lemmatizer ignores words like ‘the’, ‘with’ and ‘by’, since they do not fall into the four lemma categories mentioned above (noun/verb/adjective/adverb).

9. Stanford CoreNLP
CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, sentiment, quote attributions, and relations. 

  • CoreNLP is your one stop shop for natural language processing in Java!
  • CoreNLP currently supports 6 languages: Arabic, Chinese, English, French, German, and Spanish.
How to use: 
1. Get JAVA 8 : Download Java 8 (as per your OS) and install it.

2. Get Stanford_coreNLP package : 
    2.1) Download Stanford_CoreNLP and unzip it.                   
    2.2) Open terminal 
                  
    (a) go to the directory where you extracted the above file by doing
    cd C:\Users\...\stanford-corenlp-4.1.0 on terminal
                        
    (b) then, start your Stanford CoreNLP server by executing the following command on terminal: 
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize, ssplit, pos, lemma, parse, sentiment" -port 9000 -timeout 30000
    **(leave your terminal open as long as you use this lemmatizer)** 
    
3. Install the stanfordcorenlp Python wrapper: Open your anaconda prompt or terminal and type:
                        pip install stanfordcorenlp

Code:

from stanfordcorenlp import StanfordCoreNLP
import json
  
# Connect to the CoreNLP server we just started
nlp = StanfordCoreNLP('http://localhost', port = 9000, timeout = 30000)
  
# Define properties needed to get lemmas
props = {'annotators': 'pos, lemma', 'pipelineLanguage': 'en', 'outputFormat': 'json'}
  
  
sentence = "the bats saw the cats with best stripes hanging upside down by their feet"
parsed_str = nlp.annotate(sentence, properties = props)
print(parsed_str)
  
#> "sentences": [{"index": 0,
#  "tokens": [
#        {
#          "index": 1,
#          "word": "the",
#          "originalText": "the",
#          "lemma": "the",           <--------------- LEMMA
#          "characterOffsetBegin": 0,
#          "characterOffsetEnd": 3,
#          "pos": "DT",
#          "before": "",
#          "after": " "
#        },
#        {
#          "index": 2,
#          "word": "bats",
#          "originalText": "bats",
#          "lemma": "bat",           <--------------- LEMMA
#          "characterOffsetBegin": 4,
#          "characterOffsetEnd": 8,
#          "pos": "NNS",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 3,
#          "word": "saw",
#          "originalText": "saw",
#          "lemma": "see",           <--------------- LEMMA
#          "characterOffsetBegin": 9,
#          "characterOffsetEnd": 12,
#          "pos": "VBD",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 4,
#          "word": "the",
#          "originalText": "the",
#          "lemma": "the",          <--------------- LEMMA 
#          "characterOffsetBegin": 13,
#          "characterOffsetEnd": 16,
#          "pos": "DT",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 5,
#          "word": "cats",
#          "originalText": "cats",
#          "lemma": "cat",          <--------------- LEMMA
#          "characterOffsetBegin": 17,
#          "characterOffsetEnd": 21,
#          "pos": "NNS",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 6,
#          "word": "with",
#          "originalText": "with",
#          "lemma": "with",          <--------------- LEMMA
#          "characterOffsetBegin": 22,
#          "characterOffsetEnd": 26,
#          "pos": "IN",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 7,
#          "word": "best",
#          "originalText": "best",
#          "lemma": "best",          <--------------- LEMMA
#          "characterOffsetBegin": 27,
#          "characterOffsetEnd": 31,
#          "pos": "JJS",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 8,
#          "word": "stripes",
#          "originalText": "stripes",
#          "lemma": "stripe",          <--------------- LEMMA
#          "characterOffsetBegin": 32,
#          "characterOffsetEnd": 39,
#          "pos": "NNS",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 9,
#          "word": "hanging",
#          "originalText": "hanging",
#          "lemma": "hang",          <--------------- LEMMA
#          "characterOffsetBegin": 40,
#          "characterOffsetEnd": 47,
#          "pos": "VBG",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 10,
#          "word": "upside",
#          "originalText": "upside",
#          "lemma": "upside",          <--------------- LEMMA
#          "characterOffsetBegin": 48,
#          "characterOffsetEnd": 54,
#          "pos": "RB",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 11,
#          "word": "down",
#          "originalText": "down",
#          "lemma": "down",          <--------------- LEMMA
#          "characterOffsetBegin": 55,
#          "characterOffsetEnd": 59,
#          "pos": "RB",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 12,
#          "word": "by",
#          "originalText": "by",
#          "lemma": "by",          <--------------- LEMMA
#          "characterOffsetBegin": 60,
#          "characterOffsetEnd": 62,
#          "pos": "IN",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 13,
#          "word": "their",
#          "originalText": "their",
#          "lemma": "they",          <--------------- LEMMA
#          "characterOffsetBegin": 63,
#          "characterOffsetEnd": 68,
#          "pos": "PRP$",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 14,
#          "word": "feet",
#          "originalText": "feet",
#          "lemma": "foot",          <--------------- LEMMA
#          "characterOffsetBegin": 69,
#          "characterOffsetEnd": 73,
#          "pos": "NNS",
#          "before": " ",
#          "after": ""
#        }
#      ]
#    }
#  ]

Code:

# To get the lemmatized sentence as output
  
# ** RUN THE ABOVE SCRIPT FIRST **
  
# annotate() returns a JSON string; parse it into a dict first
parsed_dict = json.loads(parsed_str)
  
lemma_list = []
for item in parsed_dict['sentences'][0]['tokens']:
    for key, value in item.items():
        if key == 'lemma':
            lemma_list.append(value)
          
print(lemma_list)
#> ['the', 'bat', 'see', 'the', 'cat', 'with', 'best', 'stripe', 'hang', 'upside', 'down', 'by', 'they', 'foot']
  
lemmatized_sentence = " ".join(lemma_list)
print(lemmatized_sentence)
#> the bat see the cat with best stripe hang upside down by they foot

Conclusion:
These are the various lemmatization approaches you can refer to while working on an NLP project. Which one to choose depends entirely on the project requirements, as each approach has its own pros and cons. Lemmatization is essential for projects where sentence structure matters, such as language-understanding applications.
