NLP | Trigrams’n’Tags (TnT) Tagging

TnT Tagger : A statistical part-of-speech tagger based on second-order Markov models.

It is an efficient tagger that can be trained on different languages and on virtually any tagset.
Its parameters are estimated from tagged corpora, and it incorporates methods for smoothing and for handling unknown words.
Linear interpolation is used for smoothing; the respective weights are determined by deleted interpolation.
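
To make the smoothing step more concrete, here is a minimal sketch of deleted interpolation over tag sequences, assuming the usual formulation in which each observed trigram votes for the unigram, bigram or trigram estimate that explains it best. The function names deleted_interpolation and interpolate are hypothetical, and this is not NLTK's internal code.

```python
from collections import Counter


def deleted_interpolation(tag_sequences):
    """Estimate weights (l1, l2, l3) for unigram, bigram and trigram estimates."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
    total = sum(uni.values())

    def ratio(num, den):
        # A zero (or negative) denominator is treated as a zero estimate.
        return num / den if den > 0 else 0.0

    lambdas = [0.0, 0.0, 0.0]
    for (t1, t2, t3), count in tri.items():
        # Each estimate is computed with the current trigram "deleted" (count - 1).
        candidates = [
            ratio(uni[t3] - 1, total - 1),          # unigram estimate
            ratio(bi[(t2, t3)] - 1, uni[t2] - 1),   # bigram estimate
            ratio(count - 1, bi[(t1, t2)] - 1),     # trigram estimate
        ]
        best = candidates.index(max(candidates))
        lambdas[best] += count
    norm = sum(lambdas)
    # Fall back to equal weights when no trigram was observed (an assumption).
    return [l / norm for l in lambdas] if norm else [1 / 3] * 3


def interpolate(lambdas, p_uni, p_bi, p_tri):
    """Linear interpolation of the three maximum-likelihood estimates."""
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri
```

The weights sum to 1, so the interpolated value remains a valid probability whenever the three inputs are.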

The TnT tagger has a slightly different API from the other NLTK taggers: after creating it, one must explicitly call the train() method.

Code #1 : Using train() method

from nltk.tag import tnt
from nltk.corpus import treebank

# initializing training and testing set
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

# initializing tagger
tnt_tagging = tnt.TnT()

# training
tnt_tagging.train(train_data)

# evaluating (recent NLTK releases deprecate evaluate() in favor of accuracy())
a = tnt_tagging.evaluate(test_data)

print("Accuracy of TnT Tagging : ", a)

Output :

Accuracy of TnT Tagging : 0.8756313403842003

Understanding the working of TnT tagger :

It maintains a number of internal FreqDist and ConditionalFreqDist instances, built from the training data.
These frequency distributions count the unigrams, bigrams and trigrams.
The frequencies are used to calculate the probabilities of the possible tags for each word.
Instead of constructing a backoff chain of NgramTagger subclasses, the TnT tagger uses all the n-gram models together to choose the best tag.
Based on the probabilities of each possible tag, it chooses the most likely tag sequence for the entire sentence.
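
The counting step described above can be sketched with the standard library alone. This is only an illustration of the idea, assuming the n-grams are taken over tags and that word/tag pairs are counted separately; the names collect_counts and emission_prob are hypothetical, and plain Counter objects stand in for NLTK's FreqDist and ConditionalFreqDist.

```python
from collections import Counter, defaultdict


def collect_counts(tagged_sents):
    """Count tag unigrams, bigrams, trigrams and word/tag pairs."""
    uni, bi, tri = Counter(), Counter(), Counter()
    word_tags = defaultdict(Counter)
    for sent in tagged_sents:
        tags = [tag for _, tag in sent]
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
        for word, tag in sent:
            word_tags[word][tag] += 1
    return uni, bi, tri, word_tags


def emission_prob(word, tag, uni, word_tags):
    """Relative frequency estimate of P(word | tag)."""
    if not uni[tag]:
        return 0.0
    # .get() avoids creating entries for unseen words in the defaultdict.
    return word_tags.get(word, {}).get(tag, 0) / uni[tag]


sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
uni, bi, tri, word_tags = collect_counts(sents)
```

A word that never appeared in training gets an emission probability of 0.0 here, which is why a separate strategy for unknown words is needed.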

Code #2 : Using a separate tagger for unknown words (unk)

from nltk.tag import tnt
from nltk.corpus import treebank
from nltk.tag import DefaultTagger

# initializing training and testing set
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

# initializing tagger; unknown words are tagged as 'NN'
unk = DefaultTagger('NN')
tnt_tagging = tnt.TnT(unk = unk, Trained = True)

# training
tnt_tagging.train(train_data)

# evaluating
a = tnt_tagging.evaluate(test_data)

print("Accuracy of TnT Tagging : ", a)

Output :

Accuracy of TnT Tagging : 0.892467083962875

The TnT tagger accepts a separate tagger for unknown words through the unk parameter.
The unknown-word tagger's tag() method is only ever called with a single-word sentence.
Pass Trained = True if this tagger is already trained or, like DefaultTagger, needs no training.
Otherwise, TnT will call unk.train(data) with the same data that is passed to its own train() method.
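
Because the unknown-word tagger only ever sees one word at a time, it can be very simple. The sketch below is a hypothetical SuffixGuesser class that guesses a tag from the shape of the word. It assumes that all TnT needs from an already trained unk object is a tag() method that takes a list of tokens and returns (word, tag) pairs, as NLTK taggers do; the suffix rules themselves are purely illustrative.

```python
class SuffixGuesser:
    """Hypothetical unknown-word tagger that guesses a tag from word shape."""

    def tag(self, tokens):
        # Mirrors the shape of an NLTK tagger's tag(): list of (word, tag) pairs.
        return [(token, self._guess(token)) for token in tokens]

    @staticmethod
    def _guess(word):
        # Illustrative rules only; a real tagger would learn these from data.
        if word[:1].isdigit():
            return "CD"
        if word.endswith("ing"):
            return "VBG"
        if word.endswith("ly"):
            return "RB"
        if word.endswith("s"):
            return "NNS"
        return "NN"


guesser = SuffixGuesser()
single_word_result = guesser.tag(["quickly"])
```

Under that assumption, such an object would be passed as unk together with Trained = True, since it has no train() method for TnT to call.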

Controlling Beam Search :

Another parameter that can be modified for TnT is N, which controls the number of possible solutions the tagger maintains while searching for the best tag sequence.
By default, N = 1000.
Increasing N increases memory use without any notable gain in accuracy.
Decreasing N reduces memory use, but may also reduce accuracy.
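
The trade-off can be illustrated with a generic beam search, sketched below under the assumption that each partial tag sequence carries a log-probability score and that only the best beam_size sequences survive each step. The function beam_search and its score_fn argument are hypothetical names, not part of NLTK.

```python
import heapq


def beam_search(candidates_per_word, score_fn, beam_size):
    """Return (score, tags) for the best sequence found with a limited beam.

    candidates_per_word: one list of candidate tags per word position.
    score_fn(history, tag, position): log-probability added by choosing tag.
    """
    beam = [(0.0, [])]  # each hypothesis is (total score, tags chosen so far)
    for position, candidates in enumerate(candidates_per_word):
        expanded = [
            (score + score_fn(history, tag, position), history + [tag])
            for score, history in beam
            for tag in candidates
        ]
        # Keep only the beam_size highest-scoring hypotheses.
        beam = heapq.nlargest(beam_size, expanded, key=lambda hyp: hyp[0])
    return max(beam, key=lambda hyp: hyp[0])
```

With a very small beam, a sequence that starts out slightly worse but ends up best overall can be pruned too early, which is how lowering N can cost accuracy; a larger beam keeps more hypotheses in memory at every step.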

Code #3 : Using N = 100

from nltk.tag import tnt
from nltk.corpus import treebank

# initializing training and testing set
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

# initializing tagger with a smaller beam
tnt_tagging = tnt.TnT(N = 100)

# training
tnt_tagging.train(train_data)

# evaluating
a = tnt_tagging.evaluate(test_data)

print("Accuracy of TnT Tagging : ", a)

Output :

Accuracy of TnT Tagging : 0.8756313403842003
