NLP | Named Entity Chunker Training

A named entity chunker can be trained on the ieer corpus, whose name stands for Information Extraction: Entity Recognition. The ieer corpus provides chunk trees but no part-of-speech tags for the words, so the tags have to be generated separately, which makes the process a bit tedious.

Named entity chunk trees can be created from the ieer corpus using the ieertree2conlltags() and ieer_chunked_sents() functions. These trees can then be used to train the ClassifierChunker class created in the article on classification-based chunking.

Code #1 : ieertree2conlltags()




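import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer

def ieertree2conlltags(tree, tag=nltk.tag.pos_tag):
    # tree.pos() pairs each word with the label of its parent node
    words, ents = zip(*tree.pos())
    iobs = []
    prev = None
    for ent in ents:
        if ent == tree.label():
            # word sits directly under the root, so it is outside any chunk
            iobs.append('O')
            prev = None
        elif prev == ent:
            # same entity label as the previous word: continue the chunk
            iobs.append('I-%s' % ent)
        else:
            # a different entity label: begin a new chunk
            iobs.append('B-%s' % ent)
            prev = ent

    # the corpus has no part-of-speech tags, so generate them with the tagger
    words, tags = zip(*tag(words))

    # return a list so the triples can be iterated more than once
    return list(zip(words, tags, iobs))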
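
To see what the labelling loop in ieertree2conlltags() produces, here is a minimal sketch that runs the same logic on hand-written (word, label) pairs shaped like the output of tree.pos(). The helper name iob_labels, the root label "S" and the sample words are illustrative assumptions, not data from the ieer corpus.

```python
def iob_labels(pairs, root_label="S"):
    """Assign IOB tags to (word, parent_label) pairs.

    root_label is a placeholder for the label of the tree's root node;
    words whose parent is the root are outside every chunk.
    """
    iobs = []
    prev = None
    for _word, ent in pairs:
        if ent == root_label:
            iobs.append("O")
            prev = None
        elif prev == ent:
            iobs.append("I-%s" % ent)
        else:
            iobs.append("B-%s" % ent)
            prev = ent
    return iobs


# Hypothetical sample, written by hand for illustration only
sample = [
    ("Pierre", "PERSON"), ("Vinken", "PERSON"),
    ("joined", "S"), ("on", "S"),
    ("Nov.", "DATE"), ("29", "DATE"),
]

labels = iob_labels(sample)
for (word, _), iob in zip(sample, labels):
    print(word, iob)
```

One consequence of this approach: because only the parent label of each word is inspected, two separate chunks of the same type that sit directly next to each other are tagged as a single chunk (the second one receives I- tags rather than a new B- tag).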



 
Code #2 : ieer_chunked_sents()


import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer

def ieer_chunked_sents(tag=nltk.tag.pos_tag):
    # each ieer document is converted into a single chunk tree
    for doc in ieer.parsed_docs():
        tagged = ieertree2conlltags(doc.text, tag)
        yield conlltags2tree(tagged)
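
conlltags2tree() turns the (word, tag, IOB) triples back into a chunk tree. The grouping idea behind that step can be sketched roughly in plain Python, using nested tuples in place of nltk.Tree. The function group_iob and the sample triples are hypothetical and are not NLTK's implementation.

```python
def group_iob(triples):
    """Group (word, pos, iob) triples into chunks.

    Tokens tagged 'O' stay as bare (word, pos) tuples; chunked tokens
    are collected into (label, [(word, pos), ...]) tuples.
    """
    chunks = []
    current = None  # label of the chunk currently being extended, if any
    for word, pos, iob in triples:
        if iob == "O":
            chunks.append((word, pos))
            current = None
            continue
        prefix, label = iob.split("-", 1)
        if prefix == "B" or label != current:
            # start a new chunk
            chunks.append((label, [(word, pos)]))
            current = label
        else:
            # continue the chunk that is already open
            chunks[-1][1].append((word, pos))
    return chunks


# Hypothetical triples, written by hand for illustration only
triples = [
    ("Pierre", "NNP", "B-PERSON"), ("Vinken", "NNP", "I-PERSON"),
    ("joined", "VBD", "O"),
    ("Nov.", "NNP", "B-DATE"), ("29", "CD", "I-DATE"),
]

grouped = group_iob(triples)
for item in grouped:
    print(item)
```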



The corpus yields 94 chunk trees, one per document. The first 80 are used for training and the remaining 14 for testing.
 
Code #3 : How the classifier works on the first sentence of the treebank_chunk corpus.


from chunkers import ieer_chunked_sents, ClassifierChunker
from nltk.corpus import treebank_chunk

ieer_chunks = list(ieer_chunked_sents())

print("Length of ieer_chunks : ", len(ieer_chunks))

# initializing chunker with the first 80 trees
chunker = ClassifierChunker(ieer_chunks[:80])
print("\nparsing : \n", chunker.parse(
        treebank_chunk.tagged_sents()[0]))

# evaluating on the remaining trees
score = chunker.evaluate(ieer_chunks[80:])

a = score.accuracy()
p = score.precision()
r = score.recall()

print("\nAccuracy : ", a)
print("\nPrecision : ", p)
print("\nRecall : ", r)



Output :

Length of ieer_chunks : 94

parsing : 
Tree('S', [Tree('LOCATION', [('Pierre', 'NNP'), ('Vinken', 'NNP')]),
(',', ','), Tree('DURATION', [('61', 'CD'), ('years', 'NNS')]),
Tree('MEASURE', [('old', 'JJ')]), (',', ','), ('will', 'MD'), ('join', 'VB'),
('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'),
('director', 'NN'), Tree('DATE', [('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])

Accuracy : 0.8829018388070625

Precision : 0.4088717454194793

Recall : 0.5053635280095352
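
The accuracy above is much higher than the precision and recall. A likely reason is that accuracy is measured per IOB tag, and most words are tagged 'O', while precision and recall are measured per whole chunk, so a chunk with a wrong label or wrong boundary counts as a complete miss. The sketch below illustrates the difference on made-up tag sequences; the helper chunk_spans and the data are assumptions for illustration, not NLTK code or ieer results.

```python
def chunk_spans(iobs):
    """Return the set of (start, end, label) spans encoded by an IOB tag list."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(iobs) + ["O"]):  # sentinel closes the last chunk
        inside = tag.startswith("I-") and tag[2:] == label
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag != "O" and not inside:
            start, label = i, tag[2:]
    return spans


# Made-up gold and predicted tags: the first chunk gets the wrong label
gold = ["B-PERSON", "I-PERSON"] + ["O"] * 6 + ["B-DATE", "I-DATE"]
pred = ["B-LOCATION", "I-LOCATION"] + ["O"] * 6 + ["B-DATE", "I-DATE"]

# Tag-level accuracy: fraction of positions with an identical IOB tag
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Chunk-level precision and recall: a chunk counts only if span and label match
gold_spans, pred_spans = chunk_spans(gold), chunk_spans(pred)
correct = gold_spans & pred_spans
precision = len(correct) / len(pred_spans)
recall = len(correct) / len(gold_spans)

print("accuracy :", accuracy)
print("precision:", precision)
print("recall   :", recall)
```

In this toy case eight of ten tags match, yet only one of the two predicted chunks is correct, so the tag-level score is noticeably higher than the chunk-level scores.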

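How it works?
The ieer trees generated by ieer_chunked_sents() are not entirely accurate. The corpus has no explicit sentence breaks, so each document becomes a single tree. In addition, the words are not explicitly tagged, so the part-of-speech tags are guesses made by nltk.tag.pos_tag(). Both factors add noise to the training data, which is consistent with the modest precision and recall above and with 'Pierre Vinken' being labelled as a LOCATION.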


