NLP | Training Unigram Tagger

Last Updated : 04 Aug, 2022

A single token is referred to as a unigram, for example: hello, movie, coding. This article focuses on the unigram tagger, which determines a word's part-of-speech tag using only the word itself. UnigramTagger inherits from NgramTagger, which is a subclass of ContextTagger, which in turn inherits from SequentialBackoffTagger. So UnigramTagger is a single-word, context-based tagger.

Code #1 : Training UnigramTagger

Python3
# Loading Libraries
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
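
The inheritance chain above is easy to check. A quick sketch (the classes all live in nltk.tag.sequential, which is where nltk.tag re-exports them from):

Python3

from nltk.tag.sequential import (UnigramTagger, NgramTagger,
                                 ContextTagger, SequentialBackoffTagger)

# Each check prints True, confirming the chain:
# UnigramTagger -> NgramTagger -> ContextTagger -> SequentialBackoffTagger
print(issubclass(UnigramTagger, NgramTagger))
print(issubclass(NgramTagger, ContextTagger))
print(issubclass(ContextTagger, SequentialBackoffTagger))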


Code #2 : Training using the first 1000 tagged sentences of the treebank corpus as data.

Python3
# Using the first 1000 tagged sentences as training data
train_sents = treebank.tagged_sents()[:1000]

# Initializing the tagger with the training sentences
tagger = UnigramTagger(train_sents)

# Let's see the first sentence
# (of the treebank corpus) as a list
treebank.sents()[0]


Output : 

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

Code #3 : Finding the tagged results after training.

Python3
# Tagging the first treebank sentence with the trained tagger
tagger.tag(treebank.sents()[0])


Output : 

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]
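
To see how well the trained tagger generalizes, it can be scored on held-out sentences. Below is a small sketch, assuming sentences 1000-1500 of the treebank corpus as a test set; recent NLTK versions expose this metric as accuracy(), while older ones call it evaluate():

Python3

# Held-out test data: tagged sentences not used for training
test_sents = treebank.tagged_sents()[1000:1500]

# Fraction of tokens tagged correctly
# (older NLTK versions: tagger.evaluate(test_sents))
print(tagger.accuracy(test_sents))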

How does the code work? UnigramTagger builds a context model from the list of tagged sentences. Because UnigramTagger inherits from ContextTagger, instead of providing a choose_tag() method it must implement a context() method, which takes the same three arguments as choose_tag(). The context token is used both to build the model and to look up the best tag once the model is created.
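
For UnigramTagger, the context is just the current word, which is easy to confirm with the tagger trained above:

Python3

sent = treebank.sents()[0]

# context() returns the token at the given index; the history
# of previous tags is ignored by a unigram tagger
print(tagger.context(sent, 0, []))   # prints 'Pierre'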

Overriding the context model : All taggers inherited from ContextTagger can take a pre-built model instead of training their own. This model is simply a Python dictionary mapping a context key to a tag. The context keys (individual words, in the case of UnigramTagger) depend on what the ContextTagger subclass returns from its context() method.

Code #4 : Overriding the context model
Python3
# Providing a pre-built model that maps 'Pierre' to 'NN'
tagger = UnigramTagger(model={'Pierre': 'NN'})

tagger.tag(treebank.sents()[0])


Output : 

[('Pierre', 'NN'),
 ('Vinken', None),
 (',', None),
 ('61', None),
 ('years', None),
 ('old', None),
 (',', None),
 ('will', None),
 ('join', None),
 ('the', None),
 ('board', None),
 ('as', None),
 ('a', None),
 ('nonexecutive', None),
 ('director', None),
 ('Nov.', None),
 ('29', None),
 ('.', None)]
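
Since only 'Pierre' appears in the model, every other word is tagged None. Because UnigramTagger is ultimately a SequentialBackoffTagger, those gaps can instead be filled by a backoff tagger. A sketch below, using DefaultTagger with an arbitrarily chosen fallback tag of 'NN':

Python3

from nltk.tag import DefaultTagger

# Words missing from the model fall back to the default tag
tagger = UnigramTagger(model={'Pierre': 'NN'},
                       backoff=DefaultTagger('NN'))

tagger.tag(treebank.sents()[0])[:3]
# [('Pierre', 'NN'), ('Vinken', 'NN'), (',', 'NN')]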

