NLP | Likely Word Tags
Last Updated : 14 Apr, 2022
nltk.probability.FreqDist is used to find the most common words by counting word frequencies in the treebank corpus. A ConditionalFreqDist is then built from the tagged words, counting how often each tag occurs for each word. These counts are used to construct a model that maps each frequent word (the key) to its most frequent tag (the value).
Code #1 : Creating function
Python3
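from nltk.probability import FreqDist, ConditionalFreqDist

def word_tag_model(words, tagged_words, limit=200):
    # Count how often each word occurs
    fd = FreqDist(words)
    # For each word, count how often each tag occurs
    cfd = ConditionalFreqDist(tagged_words)
    # Keep only the `limit` most frequent words
    most_freq = (word for word, count in fd.most_common(limit))
    # Map each frequent word to its most frequent tag
    return dict((word, cfd[word].max()) for word in most_freq)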
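To see what the function computes without loading a corpus, the same counting logic can be sketched with the standard library alone. This is only an illustration: `toy_word_tag_model` and the six tagged tokens are made up for the example and are not part of NLTK or the treebank data.

```python
from collections import Counter, defaultdict

def toy_word_tag_model(tagged_words, limit=2):
    # Word frequencies (the role FreqDist plays above)
    word_counts = Counter(word for word, _ in tagged_words)
    # Per-word tag frequencies (the role ConditionalFreqDist plays above)
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    # Map each of the `limit` most frequent words to its most frequent tag
    return {word: tag_counts[word].most_common(1)[0][0]
            for word, _ in word_counts.most_common(limit)}

# Hypothetical toy data: "run" is tagged VB twice and NN once
toy_tagged = [("the", "DT"), ("run", "NN"), ("the", "DT"),
              ("run", "VB"), ("run", "VB"), ("dog", "NN")]
toy_model = toy_word_tag_model(toy_tagged, limit=2)
print(toy_model)
```

With `limit=2`, "dog" is left out of the model because it is not among the two most frequent words, and "run" maps to VB because that tag outnumbers NN.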
Code #2 : Using the function with UnigramTagger
Python3
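from tag_util import word_tag_model  # the function from Code #1, saved in tag_util.py
from nltk.corpus import treebank
from nltk.tag import UnigramTagger

train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

# Build the word -> most likely tag model from the whole corpus
model = word_tag_model(treebank.words(), treebank.tagged_words())
tag = UnigramTagger(model=model)
# Note: recent NLTK releases deprecate evaluate() in favour of accuracy()
print("Accuracy : ", tag.evaluate(test_data))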
Output :
Accuracy : 0.559680552557738
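A probable reason for this modest accuracy is that a UnigramTagger built only from a model has no tag for any word outside that model, so such words are tagged None. The sketch below illustrates this with a hypothetical two-word model (`toy_model` is invented for the example, not taken from the treebank), and shows how a DefaultTagger backoff fills the gaps.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical two-word model, standing in for the 200-word model above
toy_model = {"the": "DT", "of": "IN"}
sentence = ["the", "price", "of", "oil"]

# Without a backoff, words missing from the model get the tag None
bare_tags = UnigramTagger(model=toy_model).tag(sentence)
print(bare_tags)

# With a DefaultTagger as backoff, unknown words fall back to 'NN'
default_tags = UnigramTagger(model=toy_model,
                             backoff=DefaultTagger("NN")).tag(sentence)
print(default_tags)
```

Every None counts as a wrong tag during evaluation, which is why chaining the model with other taggers helps.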
Code #3 : Let’s try a backoff chain
Python3
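from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger
# Assumption: backoff_tagger is a helper in the same tag_util module as
# word_tag_model; the article does not show its definition.
from tag_util import backoff_tagger

default_tagger = DefaultTagger('NN')
likely_tagger = UnigramTagger(model=model, backoff=default_tagger)
# train_data is the training split from Code #2
tag = backoff_tagger(train_data,
                     [UnigramTagger, BigramTagger, TrigramTagger],
                     backoff=likely_tagger)
print("Accuracy : ", tag.evaluate(test_data))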
Output :
Accuracy : 0.8806820634578028
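Note : The backoff chain has increased the accuracy. We can improve this result further by using the UnigramTagger class more effectively: placing the likely-tag model at the front of the chain lets it override the trained taggers for the most frequent words.
Code #4 : Manual Override of Trained Taggers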
Python3
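from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger
# Assumption: backoff_tagger comes from the same tag_util module as word_tag_model
from tag_util import backoff_tagger

default_tagger = DefaultTagger('NN')
# Train the n-gram chain first, with the default tagger at the bottom
tagger = backoff_tagger(train_data,
                        [UnigramTagger, BigramTagger, TrigramTagger],
                        backoff=default_tagger)
# Put the likely-tag model in front so it is consulted before the trained taggers
likely_tag = UnigramTagger(model=model, backoff=tagger)
print("Accuracy : ", likely_tag.evaluate(test_data))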
Output :
Accuracy : 0.8824088063889488
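Code #3 and Code #4 call backoff_tagger without showing its definition. The following is a minimal sketch of what such a helper might look like, with its signature inferred from the call sites above; it is an assumption, not a function shipped with NLTK. The one-sentence training set is invented so the example runs without downloading a corpus.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger

def backoff_tagger(train_sents, tagger_classes, backoff=None):
    # Train each tagger class in turn, handing it the previously built
    # tagger as its backoff; the last class ends up at the front of the chain.
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff

# Hypothetical one-sentence training set
toy_train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
chain = backoff_tagger(toy_train,
                       [UnigramTagger, BigramTagger, TrigramTagger],
                       backoff=DefaultTagger("NN"))
result = chain.tag(["the", "dog", "barks", "loudly"])
print(result)
```

The unseen word "loudly" falls through the trigram, bigram and unigram taggers and is finally tagged by the DefaultTagger at the end of the chain.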