NLP | Trigrams’n’Tags (TnT) Tagging

TnT Tagger : TnT (Trigrams'n'Tags) is a statistical part-of-speech tagger based on a second-order Markov model.

  • It is an efficient part-of-speech tagger that can be trained on different languages and on virtually any tagset.
  • For parameter generation, the component trains on tagged corpora and incorporates several methods of smoothing and of handling unknown words.
  • Smoothing is done by linear interpolation; the respective weights are determined by deleted interpolation.
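The interpolation idea can be sketched in plain Python (a toy illustration with made-up data and placeholder weights, not TnT's actual implementation): the smoothed trigram probability P(t3 | t1, t2) is a weighted sum of the unigram, bigram, and trigram maximum-likelihood estimates.

```python
from collections import Counter

# Toy tag sequence (assumed data, for illustration only)
tags = ["DT", "NN", "VB", "DT", "NN", "VB", "DT", "JJ", "NN"]

uni = Counter(tags)
bi = Counter(zip(tags, tags[1:]))
tri = Counter(zip(tags, tags[1:], tags[2:]))
n = len(tags)

def smoothed_trigram_prob(t1, t2, t3, lambdas=(0.1, 0.3, 0.6)):
    """P(t3 | t1, t2) as a linear interpolation of ML estimates.

    The lambda weights here are placeholders; TnT determines them
    from the training data by deleted interpolation."""
    l1, l2, l3 = lambdas
    p_uni = uni[t3] / n
    p_bi = bi[(t2, t3)] / uni[t2] if uni[t2] else 0.0
    p_tri = tri[(t1, t2, t3)] / bi[(t1, t2)] if bi[(t1, t2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

print(smoothed_trigram_prob("DT", "NN", "VB"))
```

Because the weighted estimates back off gracefully, a trigram never seen in training still receives a non-zero probability as long as its bigram or unigram components were observed.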

The TnT tagger has a different API from most other NLTK taggers: it is not trained at construction time, so one must call its train() method explicitly after creating it.

Code #1 : Using train() method

from nltk.tag import tnt
from nltk.corpus import treebank
  
# initializing training and testing set    
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]
  
# initializing tagger
tnt_tagging = tnt.TnT()
  
# training
tnt_tagging.train(train_data)
  
# evaluating
accuracy = tnt_tagging.evaluate(test_data)
  
print("Accuracy of TnT Tagging : ", accuracy)


Output :

Accuracy of TnT Tagging : 0.8756313403842003

Understanding the working of TnT tagger :

  • It maintains internal FreqDist and ConditionalFreqDist instances, built from the training data.
  • These frequency distributions count the unigrams, bigrams, and trigrams of tags.
  • The frequencies are used to calculate the probabilities of the possible tags for each word.
  • Instead of constructing a backoff chain of NgramTagger subclasses, the TnT tagger uses all of the n-gram models together to choose the best tag.
  • Based on the probabilities of each possible tag, it chooses the most likely tag sequence for the entire sentence.
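The counting step can be mimicked in plain Python (a toy sketch with a tiny hand-tagged corpus, not NLTK's internals): from tagged sentences we build a per-word tag distribution plus unigram, bigram, and trigram tag counts, which is the kind of information the tagger's FreqDist and ConditionalFreqDist objects hold.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus (assumed data, for illustration only)
tagged_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
]

word_tags = defaultdict(Counter)       # counts for P(tag | word)
uni, bi, tri = Counter(), Counter(), Counter()

for sent in tagged_sents:
    tags = [t for _, t in sent]
    for w, t in sent:
        word_tags[w][t] += 1           # how often each word got each tag
    uni.update(tags)                   # unigram tag counts
    bi.update(zip(tags, tags[1:]))     # bigram tag counts
    tri.update(zip(tags, tags[1:], tags[2:]))  # trigram tag counts

print(word_tags["the"])         # Counter({'DT': 2})
print(tri[("DT", "NN", "VB")])  # 2
```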

Code #2 : Passing a tagger for unknown words as unk


from nltk.tag import tnt
from nltk.corpus import treebank
from nltk.tag import DefaultTagger
  
# initializing training and testing set    
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]
  
# initializing tagger
unk = DefaultTagger('NN')
tnt_tagging = tnt.TnT(unk = unk, Trained = True)
  
# training 
tnt_tagging.train(train_data)
  
# evaluating
accuracy = tnt_tagging.evaluate(test_data)
  
print("Accuracy of TnT Tagging : ", accuracy)


Output :

Accuracy of TnT Tagging : 0.892467083962875

  • The unknown-word tagger’s tag() method is only ever called with a single-word sentence.
  • The TnT tagger accepts a tagger for unknown words via the unk parameter.
  • One can pass Trained = True if that tagger is already trained.
  • Otherwise, TnT will call unk.train(data) with the same data passed to its own train() method.
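The fallback behaviour can be sketched as follows (a simplified stand-in using a minimal DefaultTagger-like stub, not NLTK's actual classes): a word seen in training is tagged from the learned counts, while an unseen word is delegated to the unknown-word tagger's tag() method as a single-word sentence.

```python
class StubDefaultTagger:
    """Minimal stand-in for nltk.tag.DefaultTagger: tags everything NN-style."""
    def __init__(self, tag):
        self._tag = tag
    def tag(self, tokens):
        return [(tok, self._tag) for tok in tokens]

known = {"the": "DT", "dog": "NN"}   # toy lexicon learned from "training"
unk = StubDefaultTagger("NN")

def tag_word(word):
    # Known words use the learned mapping; unknown words are delegated
    # to unk.tag() with a single-word "sentence", mirroring TnT's fallback.
    if word in known:
        return (word, known[word])
    return unk.tag([word])[0]

print(tag_word("dog"))      # ('dog', 'NN')
print(tag_word("flibber"))  # ('flibber', 'NN')
```

Without such a fallback, NLTK's TnT assigns every out-of-vocabulary word a dummy tag, which is what drags down the accuracy in Code #1.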

Controlling Beam Search :

  • Another parameter worth adjusting is N, which controls the number of candidate solutions (partial tag sequences) the tagger maintains during beam search.
  • By default, N = 1000.
  • Increasing N increases memory usage without any guaranteed gain in accuracy.
  • Decreasing N reduces memory usage but can also reduce accuracy.
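The effect of N can be illustrated with a toy beam search (a sketch of the idea with made-up per-word tag probabilities, not TnT's actual algorithm): at each word, every surviving hypothesis is extended with every candidate tag, and only the N best-scoring sequences are kept.

```python
import math

# Toy per-word tag probabilities (assumed data, for illustration only)
word_tag_probs = {
    "the":  {"DT": 0.9, "NN": 0.1},
    "dog":  {"NN": 0.8, "VB": 0.2},
    "runs": {"VB": 0.7, "NN": 0.3},
}

def beam_tag(words, N):
    beams = [([], 0.0)]  # (tag sequence, log score)
    for w in words:
        candidates = []
        for seq, score in beams:
            for tag, p in word_tag_probs[w].items():
                candidates.append((seq + [tag], score + math.log(p)))
        # Keep only the N best hypotheses: larger N costs more memory,
        # smaller N risks pruning the eventual best sequence.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:N]
    return beams[0][0]

print(beam_tag(["the", "dog", "runs"], N=2))  # ['DT', 'NN', 'VB']
```

In this tiny example even N = 1 (greedy search) finds the same sequence; on real sentences with many plausible tags per word, a too-small N can prune the globally best sequence before it has a chance to win.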

Code #3 : Using N = 100


from nltk.tag import tnt
from nltk.corpus import treebank
  
# initializing training and testing set    
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]
  
# initializing tagger with a smaller beam
tnt_tagger = tnt.TnT(N = 100)
  
# training 
tnt_tagger.train(train_data)
  
# evaluating
accuracy = tnt_tagger.evaluate(test_data)
  
print("Accuracy of TnT Tagging : ", accuracy)


Output :

Accuracy of TnT Tagging : 0.8756313403842003

