NLP | Named Entity Chunker Training

Last Updated : 26 Feb, 2019

A named entity chunker can be trained using the ieer corpus, whose name stands for Information Extraction: Entity Recognition. The ieer corpus has chunk trees but no part-of-speech tags for the words, so the tags have to be generated before the trees can be used for training, which makes the job a bit tedious.

Named entity chunk trees can be created from the ieer corpus using the ieertree2conlltags() and ieer_chunked_sents() functions. These trees can then be used to train the ClassifierChunker class created in the article on classification-based chunking.

Code #1 : ieertree2conlltags()

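import nltk.tag 
  
def ieertree2conlltags(tree, tag = nltk.tag.pos_tag): 
    # tree.pos() gives (word, parent label) pairs; the parent label is 
    # an entity type, or the root label for words outside any entity 
    words, ents = zip(*tree.pos()) 
    iobs = [] 
    prev = None
    for ent in ents: 
        if ent == tree.label(): 
            # word sits directly under the root : outside any entity 
            iobs.append('O') 
            prev = None
        elif prev == ent: 
            # same entity type as the previous word : continue the chunk 
            iobs.append('I-%s' % ent) 
        else: 
            # a new entity chunk starts here 
            iobs.append('B-%s' % ent) 
            prev = ent 
      
    # the corpus has no part-of-speech tags, so generate them 
    words, tags = zip(*tag(words)) 
      
    # return a list so the triples can be iterated more than once 
    return list(zip(words, tags, iobs)) 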
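
To see the IOB labelling on its own, the following is a minimal sketch that runs without NLTK or the ieer corpus. It assumes the input is already a list of (word, parent label) pairs, like the ones tree.pos() produces. The names pairs_to_conlltags and dummy_tag and the sample pairs are made up for illustration, and dummy_tag is only a crude stand-in for nltk.tag.pos_tag.

```python
def pairs_to_conlltags(pairs, root_label, tag):
    """Turn (word, parent_label) pairs into (word, pos, iob) triples."""
    words = [word for word, _ in pairs]
    iobs = []
    prev = None
    for _, ent in pairs:
        if ent == root_label:
            # word is outside any entity
            iobs.append('O')
            prev = None
        elif ent == prev:
            # same entity type as the previous word: inside the chunk
            iobs.append('I-%s' % ent)
        else:
            # first word of a new entity chunk
            iobs.append('B-%s' % ent)
            prev = ent
    tags = [pos for _, pos in tag(words)]
    return list(zip(words, tags, iobs))


def dummy_tag(words):
    # hypothetical stand-in tagger: capitalised words become NNP, the rest NN
    return [(w, 'NNP' if w[0].isupper() else 'NN') for w in words]


# invented sample data; 'DOCUMENT' plays the role of the root label
pairs = [('John', 'PERSON'), ('Smith', 'PERSON'), ('visited', 'DOCUMENT'),
         ('New', 'LOCATION'), ('York', 'LOCATION')]
result = pairs_to_conlltags(pairs, 'DOCUMENT', dummy_tag)
for triple in result:
    print(triple)
```

One consequence of this logic is that two separate entities of the same type that sit next to each other are merged into a single chunk, because only the entity label is compared with the previous one.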

Code #2 : ieer_chunked_sents()

import nltk.tag 
from nltk.chunk.util import conlltags2tree 
from nltk.corpus import ieer 
  
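def ieer_chunked_sents(tag = nltk.tag.pos_tag): 
    # every parsed document holds its chunk tree in doc.text 
    for doc in ieer.parsed_docs(): 
        # add part-of-speech tags and IOB tags to the words 
        tagged = ieertree2conlltags(doc.text, tag) 
        # rebuild a chunk tree whose leaves are (word, tag) pairs 
        yield conlltags2tree(tagged) 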

In the following code, 80 of the 94 trees are used for training and the remaining 14 for testing. Each tree here is a whole document rather than a single sentence.

Code #3 : How the classifier works on the first sentence of the treebank_chunk corpus.

from chunkers import ieer_chunked_sents, ClassifierChunker 
from nltk.corpus import treebank_chunk 
  
# building one chunk tree per ieer document 
ieer_chunks = list(ieer_chunked_sents()) 
  
print("Length of ieer_chunks : ", len(ieer_chunks)) 
  
# initializing chunker with the first 80 trees 
chunker = ClassifierChunker(ieer_chunks[:80]) 
print("\nparsing : \n", chunker.parse( 
        treebank_chunk.tagged_sents()[0])) 
  
# evaluating on the remaining trees 
score = chunker.evaluate(ieer_chunks[80:]) 
  
a = score.accuracy() 
p = score.precision() 
r = score.recall() 
  
print("\nAccuracy : ", a) 
print("\nPrecision : ", p) 
print("\nRecall : ", r) 

Output :

Length of ieer_chunks : 94

parsing : 
Tree('S', [Tree('LOCATION', [('Pierre', 'NNP'), ('Vinken', 'NNP')]),
(',', ','), Tree('DURATION', [('61', 'CD'), ('years', 'NNS')]),
Tree('MEASURE', [('old', 'JJ')]), (',', ','), ('will', 'MD'), ('join', 'VB'), 
('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'),
('director', 'NN'), Tree('DATE', [('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])

Accuracy : 0.8829018388070625

Precision : 0.4088717454194793

Recall : 0.5053635280095352
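
The accuracy is much higher than the precision and recall. A likely reason, assuming the score behaves like NLTK's ChunkScore, is that accuracy is counted per word on the IOB tags, while precision and recall are counted on whole chunks, so a chunk with a single wrong boundary counts as a miss. The sketch below illustrates that difference on invented toy tags. The helper chunk_spans and the gold and guess lists are hypothetical, and this is not NLTK's own implementation.

```python
def chunk_spans(iob_tags):
    """Return the set of (start, end, label) chunks in a list of IOB tags."""
    spans = set()
    start = label = None
    # a trailing 'O' acts as a sentinel that closes the last open chunk
    for i, tag in enumerate(list(iob_tags) + ['O']):
        continues = tag.startswith('I-') and tag[2:] == label
        if not continues:
            if label is not None:
                spans.add((start, i, label))
            # open a new chunk unless the current tag is 'O'
            start, label = (i, tag[2:]) if tag != 'O' else (None, None)
    return spans


# invented example: the guess gets one PERSON boundary wrong
# and adds a spurious LOCATION chunk
gold = ['B-PERSON', 'I-PERSON', 'O', 'O', 'B-DATE', 'I-DATE', 'O', 'O']
guess = ['B-PERSON', 'O', 'O', 'O', 'B-DATE', 'I-DATE', 'O', 'B-LOCATION']

gold_chunks = chunk_spans(gold)
guess_chunks = chunk_spans(guess)
correct = gold_chunks & guess_chunks

accuracy = sum(g == p for g, p in zip(gold, guess)) / len(gold)
precision = len(correct) / len(guess_chunks)
recall = len(correct) / len(gold_chunks)

print("Accuracy : ", accuracy)
print("Precision : ", precision)
print("Recall : ", recall)
```

In this toy case 6 of the 8 tags match, but only 1 of the 3 guessed chunks and 1 of the 2 gold chunks match exactly.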

How it works?
The ieer trees generated by ieer_chunked_sents() are not entirely accurate. There are no explicit sentence breaks, so each document becomes a single tree. The words are also not explicitly tagged, so the part-of-speech tags are guesswork produced by nltk.tag.pos_tag(). These limitations likely contribute to the low precision and recall reported above.
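
One possible way to work around the missing sentence breaks, which is not part of the recipe above, is to cut each document's (word, tag, IOB) triples at sentence-ending punctuation before building trees. The sketch below assumes that a '.', '?' or '!' token tagged 'O' ends a sentence. The function name split_on_sentence_end and the sample triples are hypothetical, and real text would need a proper sentence tokenizer.

```python
def split_on_sentence_end(triples, enders=('.', '?', '!')):
    """Yield lists of (word, pos, iob) triples, one list per guessed sentence."""
    sentence = []
    for word, pos, iob in triples:
        sentence.append((word, pos, iob))
        # only split on punctuation that lies outside every entity chunk,
        # so no chunk is ever cut in half
        if word in enders and iob == 'O':
            yield sentence
            sentence = []
    if sentence:
        # keep trailing words that have no closing punctuation
        yield sentence


# invented sample data standing in for the triples of one document
triples = [('Smith', 'NNP', 'B-PERSON'), ('left', 'VBD', 'O'), ('.', '.', 'O'),
           ('He', 'PRP', 'O'), ('returned', 'VBD', 'O'),
           ('Monday', 'NNP', 'B-DATE')]
sentences = list(split_on_sentence_end(triples))
for sent in sentences:
    print(sent)
```

Each resulting list could then be passed to conlltags2tree() in place of the full document, which would give the chunker sentence-sized training trees.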
