NLP | Likely Word Tags

nltk.probability.FreqDist is used to find the most common words by counting word frequencies in the treebank corpus. A ConditionalFreqDist is then built from the tagged words, counting how often every tag occurs for every word. These counts are used to construct a model: a dictionary with the most frequent words as keys and the most frequent tag for each word as the value.
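As a quick, self-contained illustration of what these two classes count, the sketch below uses a tiny hand-made tagged sample (the words and tags are made up for this example, not taken from the treebank):

# Illustrating FreqDist and ConditionalFreqDist on a hand-made tagged sample
from nltk.probability import FreqDist, ConditionalFreqDist

sample = [('the', 'DT'), ('dog', 'NN'), ('the', 'DT'),
          ('bark', 'VB'), ('bark', 'NN'), ('bark', 'NN')]

# FreqDist counts how often each word occurs
fd = FreqDist(word for word, tag in sample)
print(fd.most_common(2))   # [('bark', 3), ('the', 2)]

# ConditionalFreqDist counts, for each word, how often each tag occurs
cfd = ConditionalFreqDist(sample)
print(cfd['bark'].max())   # 'NN' is the most frequent tag for 'bark'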

Code #1 : Creating the function

# Loading libraries
from nltk.probability import FreqDist, ConditionalFreqDist

# Making the function
def word_tag_model(words, tagged_words, limit=200):

    # Count how often every word occurs
    fd = FreqDist(words)

    # For every word, count how often each tag occurs
    cfd = ConditionalFreqDist(tagged_words)

    # Take the 'limit' most frequent words
    most_freq = (word for word, count in fd.most_common(limit))

    # Map each frequent word to its most frequent tag
    return dict((word, cfd[word].max()) for word in most_freq)

Code #2 : Using the function with UnigramTagger



# Loading libraries
# word_tag_model is the function from Code #1, saved in a tag_util.py module
from tag_util import word_tag_model
from nltk.corpus import treebank
from nltk.tag import UnigramTagger

# Initializing the training and testing sets
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

# Building the word -> most likely tag model
model = word_tag_model(treebank.words(),
                       treebank.tagged_words())

# Initializing the UnigramTagger with the model
tag = UnigramTagger(model=model)

print("Accuracy : ", tag.evaluate(test_data))

Output :

Accuracy : 0.559680552557738
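Continuing from Code #2, the model-based tagger can also tag a tokenized sentence directly, which makes the model easy to inspect (the sentence below is only an illustration):

# Tagging a sample sentence with the model-based tagger
print(tag.tag(['The', 'company', 'reported', 'a', 'loss']))
# Words that are not among the 200 most frequent treebank words are tagged as None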

 
Code #3 : Let’s try a backoff chain

# Loading libraries
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag import DefaultTagger
# backoff_tagger is a small helper kept in tag_util (sketched after the output below)
from tag_util import backoff_tagger

# Tag every unknown word as a noun by default
default_tagger = DefaultTagger('NN')

# Model-based tagger that falls back to the default tagger
likely_tagger = UnigramTagger(
        model=model, backoff=default_tagger)

# Train a Unigram -> Bigram -> Trigram backoff chain on train_data,
# with likely_tagger as the final backoff
tag = backoff_tagger(train_data, [
        UnigramTagger, BigramTagger,
        TrigramTagger], backoff=likely_tagger)

print("Accuracy : ", tag.evaluate(test_data))

Output :

Accuracy : 0.8806820634578028

Note : The backoff chain has increased the accuracy. We can improve this result further by using the UnigramTagger class more effectively, as shown in Code #4 below.
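The backoff_tagger helper imported from tag_util above is not part of NLTK itself; it simply builds the taggers in order, handing each newly trained tagger the previous one as its backoff. A minimal sketch, assuming each class accepts train= and backoff= keyword arguments (as NLTK's sequential taggers do):

# Sketch of the backoff_tagger helper kept in tag_util.py
def backoff_tagger(train_sents, tagger_classes, backoff=None):
    for cls in tagger_classes:
        # Each newly trained tagger falls back to the previous one
        backoff = cls(train=train_sents, backoff=backoff)
    return backoff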
 
Code #4 : Manual Override of Trained Taggers

# Loading libraries
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag import DefaultTagger
from tag_util import backoff_tagger

default_tagger = DefaultTagger('NN')

# Train the backoff chain first, ending in the default tagger
tagger = backoff_tagger(train_data, [
        UnigramTagger, BigramTagger,
        TrigramTagger], backoff=default_tagger)

# The likely-tag model overrides the trained taggers, which are
# only consulted for words that are not in the model
likely_tag = UnigramTagger(model=model, backoff=tagger)

print("Accuracy : ", likely_tag.evaluate(test_data))

Output :

Accuracy : 0.8824088063889488

