Python | Stemming words with NLTK
Stemming is the process of reducing the morphological variants of a word to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. For example, a stemming algorithm reduces the words “chocolates”, “chocolatey”, and “choco” to the root word “chocolate”, and reduces “retrieval”, “retrieved”, and “retrieves” to the stem “retrieve”.
Prerequisite: Introduction to Stemming
Some more examples of words that stem to the root word "like" include:
-> "likes"
-> "liked"
-> "likely"
-> "liking"
Errors in Stemming: There are mainly two errors in stemming: overstemming and understemming. Overstemming occurs when two words that have different stems are reduced to the same root (a false positive). Understemming occurs when two words that should be reduced to the same root are reduced to different ones (a false negative).
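Both kinds of error can be checked by comparing the stems of two words. The sketch below assumes NLTK's `PorterStemmer`; the helper `same_stem` and the word pairs are illustrative choices, not part of NLTK, and other stemmers may behave differently on the same pairs.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()


def same_stem(a, b):
    """Return True when the stemmer maps both words to the same stem (hypothetical helper)."""
    return ps.stem(a) == ps.stem(b)


# Overstemming candidate: unrelated meanings, but the stems collide.
print("universal / university:", same_stem("universal", "university"))

# Understemming candidate: related forms, but the stems stay apart.
print("alumnus / alumni:", same_stem("alumnus", "alumni"))
```

A result of `True` for the first pair signals overstemming, and `False` for the second pair signals understemming.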
Applications of stemming are:
- Stemming is used in information retrieval systems like search engines.
- It is used to determine domain vocabularies in domain analysis.
Stemming is desirable because it can reduce redundancy: most of the time, a word stem and its inflected or derived forms mean the same thing.
Below is the implementation of stemming words using NLTK:
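A minimal sketch of such a program, assuming NLTK's `PorterStemmer` (the choice of stemmer is an assumption). Stems produced by a real stemmer are not always dictionary words, so the exact strings printed can vary with the stemmer and the NLTK version.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words that share the root "program"
words = ["program", "programs", "programmer", "programming", "programmers"]

# Map each word to the stem the stemmer produces for it
stems = {w: ps.stem(w) for w in words}

for word, stem in stems.items():
    print(word, ":", stem)
```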
program : program
programs : program
programmer : program
programming : program
programmers : program
Code #2: Stemming words from sentences
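A sketch of stemming every word of a sentence: tokenize first, then stem each token. The input sentence is an assumption inferred from the tokens in the output, and `wordpunct_tokenize` is used here because it is a regex-based tokenizer that needs no extra NLTK data download; `word_tokenize` would work similarly once its tokenizer models are installed.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

ps = PorterStemmer()

# Assumed input sentence, reconstructed from the tokens in the output
sentence = "Programmers program with programming languages"

# Split the sentence into word tokens, then stem each one
tokens = wordpunct_tokenize(sentence)
pairs = [(token, ps.stem(token)) for token in tokens]

for token, stem in pairs:
    print(token, ":", stem)
```

`PorterStemmer.stem` lowercases its input by default, which is why a capitalized token such as "Programmers" yields a lowercase stem.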
Programmers : program
program : program
with : with
programming : program
languages : language