
NLP | Named Entity Chunker Training

Last Updated : 26 Feb, 2019

A named entity chunker can be trained using the ieer corpus, which stands for Information Extraction: Entity Recognition. The ieer corpus has chunk trees but no part-of-speech tags for the words, so the words have to be tagged before the trees can be used for training, which makes the job a bit tedious.

Named entity chunk trees can be created from the ieer corpus using the ieertree2conlltags() and ieer_chunked_sents() functions. These trees can then be used to train the ClassifierChunker class created in Classification-based chunking.

Code #1 : ieertree2conlltags()

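import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer

def ieertree2conlltags(tree, tag=nltk.tag.pos_tag):
    # tree.pos() yields (word, label) pairs, where label is the
    # label of the subtree directly containing the word
    words, ents = zip(*tree.pos())
    iobs = []
    prev = None
    for ent in ents:
        if ent == tree.label():
            # word sits directly under the root: outside any entity
            iobs.append('O')
            prev = None
        elif prev == ent:
            # same entity label as the previous word: inside the chunk
            iobs.append('I-%s' % ent)
        else:
            # a new entity label: beginning of a chunk
            iobs.append('B-%s' % ent)
            prev = ent
    # the corpus has no part-of-speech tags, so tag the words here
    words, tags = zip(*tag(words))
    return list(zip(words, tags, iobs))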
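
The tagging rule in ieertree2conlltags() can be tried without downloading the ieer corpus. The sketch below is a simplified illustration and not part of the original code: it takes a hand-written list of (word, label) pairs that stands in for the output of tree.pos(). The helper name labels_to_iob and the sample sentence are assumptions made for this example.

```python
def labels_to_iob(pairs, root_label='S'):
    """Map (word, label) pairs to (word, IOB tag) pairs.

    A label equal to root_label means the word sits directly under
    the root of the tree, i.e. outside any named entity.
    """
    iobs = []
    prev = None
    for word, ent in pairs:
        if ent == root_label:
            iobs.append((word, 'O'))
            prev = None
        elif prev == ent:
            iobs.append((word, 'I-%s' % ent))
        else:
            iobs.append((word, 'B-%s' % ent))
            prev = ent
    return iobs


# Hypothetical sample, shaped like the output of tree.pos()
sample = [('Pierre', 'PERSON'), ('Vinken', 'PERSON'), ('joined', 'S'),
          ('the', 'S'), ('board', 'S'), ('in', 'S'), ('November', 'DATE')]
iob_result = labels_to_iob(sample)
for word, iob in iob_result:
    print(word, iob)
```

Because the rule only compares each label with the previous one, two adjacent entities of the same type end up tagged as a single chunk.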

Code #2 : ieer_chunked_sents()

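import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer

def ieer_chunked_sents(tag=nltk.tag.pos_tag):
    # each parsed document carries its chunk tree in doc.text
    for doc in ieer.parsed_docs():
        # ieertree2conlltags() is the function from Code #1
        tagged = ieertree2conlltags(doc.text, tag)
        # rebuild a chunk tree from the (word, pos, iob) triples
        yield conlltags2tree(tagged)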
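
To see what the conversion back from IOB triples to chunks involves, the following sketch groups (word, pos, iob) triples into chunks using plain tuples and lists. It is a simplified stand-in written for illustration, not NLTK's conlltags2tree() implementation; the function name iob_to_chunks and the sample triples are assumptions.

```python
def iob_to_chunks(triples):
    """Group (word, pos, iob) triples into a flat list.

    Words tagged 'O' stay as (word, pos) tuples; every chunk becomes
    a (label, [(word, pos), ...]) pair.
    """
    result = []
    current = None  # (label, words) of the chunk being built, if any
    for word, pos, iob in triples:
        if iob == 'O':
            current = None
            result.append((word, pos))
        elif (iob.startswith('I-') and current is not None
                and current[0] == iob[2:]):
            # continue the open chunk with the same label
            current[1].append((word, pos))
        else:
            # a B- tag, or an I- tag with no matching open chunk,
            # starts a new chunk
            current = (iob[2:], [(word, pos)])
            result.append(current)
    return result


# Hypothetical sample triples
triples = [('Pierre', 'NNP', 'B-PERSON'), ('Vinken', 'NNP', 'I-PERSON'),
           ('joined', 'VBD', 'O'), ('in', 'IN', 'O'),
           ('November', 'NNP', 'B-DATE')]
chunks = iob_to_chunks(triples)
print(chunks)
```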

The first 80 of the 94 chunk trees (one per document) are used for training and the remaining 14 for testing.

Code #3 : How the classifier works on the first sentence of the treebank_chunk corpus.

from nltk.corpus import ieer
# chunkers is assumed to be a local module holding the functions
# above together with the ClassifierChunker class
from chunkers import ieer_chunked_sents, ClassifierChunker
from nltk.corpus import treebank_chunk

ieer_chunks = list(ieer_chunked_sents())
print("Length of ieer_chunks : ", len(ieer_chunks))

# initializing chunker with the first 80 trees
chunker = ClassifierChunker(ieer_chunks[:80])

# parsing the first tagged sentence of treebank_chunk
print("\nparsing : \n", chunker.parse(
    treebank_chunk.tagged_sents()[0]))

# evaluating on the remaining trees
score = chunker.evaluate(ieer_chunks[80:])
a = score.accuracy()
p = score.precision()
r = score.recall()

print("\nAccuracy : ", a)
print("\nPrecision : ", p)
print("\nRecall : ", r)

Output :

Length of ieer_chunks : 94

parsing : 
Tree('S', [Tree('LOCATION', [('Pierre', 'NNP'), ('Vinken', 'NNP')]),
(',', ','), Tree('DURATION', [('61', 'CD'), ('years', 'NNS')]),
Tree('MEASURE', [('old', 'JJ')]), (',', ','), ('will', 'MD'), ('join', 'VB'),
('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'),
('director', 'NN'), Tree('DATE', [('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])

Accuracy : 0.8829018388070625

Precision : 0.4088717454194793

Recall : 0.5053635280095352
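
The gap between the accuracy and the precision and recall figures is easier to see on a toy example. The sketch below assumes that accuracy is counted per token over IOB tags while precision and recall are counted over whole chunks, which is how NLTK's chunk scoring is understood to work. The helper chunk_spans and the two tag sequences are made up for illustration and do not come from the ieer data.

```python
def chunk_spans(iobs):
    """Return the set of (start, end, label) spans in an IOB tag list."""
    spans = set()
    start = label = None
    # the trailing 'O' is a sentinel that closes a chunk at the end
    for i, tag in enumerate(list(iobs) + ['O']):
        starts_new = tag.startswith('B-') or (
            tag.startswith('I-') and tag[2:] != label)
        if label is not None and (tag == 'O' or starts_new):
            spans.add((start, i, label))
            start = label = None
        if starts_new:
            start, label = i, tag[2:]
    return spans


# Toy gold and predicted tag sequences (hypothetical)
gold = ['B-PERSON', 'I-PERSON', 'O', 'O', 'O', 'O', 'O', 'O',
        'B-DATE', 'I-DATE']
guess = ['B-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
         'B-DATE', 'I-DATE']

# token-level accuracy: share of positions with the same tag
accuracy = sum(g == p for g, p in zip(gold, guess)) / len(gold)

# chunk-level precision and recall: a chunk counts only if its
# boundaries and label both match
gold_spans, guess_spans = chunk_spans(gold), chunk_spans(guess)
correct = len(gold_spans & guess_spans)
precision = correct / len(guess_spans)
recall = correct / len(gold_spans)

print(accuracy, precision, recall)
```

In this toy case a single wrong token costs one tenth of the accuracy but invalidates an entire chunk, so precision and recall fall much further than accuracy does.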

How it works?
The ieer trees generated by ieer_chunked_sents() are not entirely accurate. There are no explicit sentence breaks, so each document is a single tree. Also, the words are not explicitly tagged, so the part-of-speech tags are guesses made by nltk.tag.pos_tag().
