NLP | Named Entity Chunker Training
Last Updated :
26 Feb, 2019
A named entity chunker can be trained using the ieer corpus, whose name stands for Information Extraction: Entity Recognition. The ieer corpus has chunk trees but no part-of-speech tags for the words, so preparing the training data takes some extra work.
Named entity chunk trees can be created from the ieer corpus using the ieertree2conlltags() and ieer_chunked_sents() functions. These trees can then be used to train the ClassifierChunker class created in Classification-based chunking.
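Code #1 : ieertree2conlltags()
import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer

def ieertree2conlltags(tree, tag=nltk.tag.pos_tag):
    # tree.pos() yields (word, label) pairs; label is the node above the word
    words, ents = zip(*tree.pos())
    iobs = []
    prev = None
    for ent in ents:
        if ent == tree.label():
            # the word sits directly under the root, so it is outside any entity
            iobs.append('O')
            prev = None
        elif prev == ent:
            # same entity label as the previous word: continue the chunk
            iobs.append('I-%s' % ent)
        else:
            # a different entity label: begin a new chunk
            iobs.append('B-%s' % ent)
            prev = ent
    # the corpus has no part-of-speech tags, so the tagger supplies them
    words, tags = zip(*tag(words))
    return zip(words, tags, iobs)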
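The tagging loop in ieertree2conlltags() can be tried without the corpus or a tagger. The following is a minimal sketch, assuming a plain list of (word, label) pairs in place of tree.pos() and a root_label argument in place of tree.label(); the function name iob_tags and the sample data are made up for illustration.

```python
def iob_tags(pairs, root_label="DOCUMENT"):
    """Assign IOB tags to (word, label) pairs, mirroring the loop above.

    `pairs` stands in for tree.pos() and `root_label` for tree.label();
    both are assumptions made for this sketch.
    """
    iobs = []
    prev = None
    for _word, ent in pairs:
        if ent == root_label:
            # word is directly under the root: outside any entity
            iobs.append("O")
            prev = None
        elif prev == ent:
            # same label as the previous word: inside the current chunk
            iobs.append("I-%s" % ent)
        else:
            # new label: begin a chunk
            iobs.append("B-%s" % ent)
            prev = ent
    return iobs


# Hypothetical sample data, not taken from the ieer corpus
pairs = [
    ("Pierre", "PERSON"),
    ("Vinken", "PERSON"),
    ("joined", "DOCUMENT"),
    ("on", "DOCUMENT"),
    ("Nov.", "DATE"),
    ("29", "DATE"),
]
tags = iob_tags(pairs)
print(tags)
# → ['B-PERSON', 'I-PERSON', 'O', 'O', 'B-DATE', 'I-DATE']
```

Because the loop only compares each label with the previous one, two separate chunks of the same type that sit next to each other come out as a single chunk.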
Code #2 : ieer_chunked_sents()
import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer

def ieer_chunked_sents(tag=nltk.tag.pos_tag):
    for doc in ieer.parsed_docs():
        # doc.text is the chunk tree of the whole document
        tagged = ieertree2conlltags(doc.text, tag)
        yield conlltags2tree(tagged)
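To make the last step more concrete, here is a rough pure-Python sketch of how (word, tag, IOB) triples are grouped back into chunks. It is a simplified stand-in for illustration, not NLTK's conlltags2tree() implementation; the function name group_chunks and the output shape (plain tuples and lists instead of Tree objects) are assumptions of this sketch.

```python
def group_chunks(triples):
    """Group (word, pos, iob) triples into chunks and loose tokens.

    Returns a list whose items are either a (word, pos) tuple for a token
    outside any chunk, or a (label, [(word, pos), ...]) pair for a chunk.
    """
    result = []
    current = None        # token list of the chunk being built, if any
    current_label = None  # label of that chunk
    for word, pos, iob in triples:
        if iob == "O":
            result.append((word, pos))
            current, current_label = None, None
            continue
        prefix, label = iob.split("-", 1)
        if prefix == "I" and current is not None and label == current_label:
            # continue the open chunk
            current.append((word, pos))
        else:
            # B- tag, or an I- tag with no matching open chunk: start a chunk
            current, current_label = [(word, pos)], label
            result.append((label, current))
    return result


# Hypothetical sample triples
triples = [
    ("Pierre", "NNP", "B-PERSON"),
    ("Vinken", "NNP", "I-PERSON"),
    ("will", "MD", "O"),
    ("join", "VB", "O"),
    ("Nov.", "NNP", "B-DATE"),
    ("29", "CD", "I-DATE"),
]
chunks = group_chunks(triples)
for item in chunks:
    print(item)
```

In the real pipeline, conlltags2tree() returns an nltk Tree whose subtrees carry the entity labels, which is the form ClassifierChunker is trained on.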
The corpus yields 94 chunk trees, one per document. The first 80 are used for training and the remaining 14 for testing.
Code #3 : How the classifier works on the first sentence of the treebank_chunk corpus.
from nltk.corpus import ieer
from chunkers import ieer_chunked_sents, ClassifierChunker
from nltk.corpus import treebank_chunk

ieer_chunks = list(ieer_chunked_sents())
print("Length of ieer_chunks : ", len(ieer_chunks))

# train on the first 80 trees
chunker = ClassifierChunker(ieer_chunks[:80])
print("\nparsing : \n", chunker.parse(
    treebank_chunk.tagged_sents()[0]))

# evaluate on the remaining trees
score = chunker.evaluate(ieer_chunks[80:])
a = score.accuracy()
p = score.precision()
r = score.recall()
print("\nAccuracy : ", a)
print("\nPrecision : ", p)
print("\nRecall : ", r)
Output :
Length of ieer_chunks : 94
parsing :
Tree('S', [Tree('LOCATION', [('Pierre', 'NNP'), ('Vinken', 'NNP')]),
(',', ','), Tree('DURATION', [('61', 'CD'), ('years', 'NNS')]),
Tree('MEASURE', [('old', 'JJ')]), (',', ','), ('will', 'MD'), ('join', 'VB'),
('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'),
('director', 'NN'), Tree('DATE', [('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])
Accuracy : 0.8829018388070625
Precision : 0.4088717454194793
Recall : 0.5053635280095352
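Accuracy is much higher than precision and recall here because the score measures them on different units: accuracy counts individual IOB tags, most of which are 'O', while precision and recall count whole chunks that must match exactly in span and label. The sketch below illustrates the difference on made-up tag sequences; chunk_spans and the sample data are hypothetical, and this is not the NLTK scoring code.

```python
def chunk_spans(iobs):
    """Return the set of (start, end, label) spans encoded by an IOB sequence."""
    spans = set()
    start, label = None, None
    # a trailing "O" acts as a sentinel that closes any chunk still open
    for i, tag in enumerate(list(iobs) + ["O"]):
        if tag.startswith("I-") and start is not None and tag[2:] == label:
            continue  # still inside the open chunk
        if start is not None:
            spans.add((start, i, label))
            start, label = None, None
        if tag != "O":
            start, label = i, tag[2:]
    return spans


# Hypothetical gold and predicted tags for a 10-token text
gold = ["B-PERSON", "I-PERSON", "O", "O", "O", "O", "O", "O", "B-DATE", "I-DATE"]
pred = ["B-PERSON", "I-PERSON", "O", "B-MEASURE", "O", "O", "O", "O", "B-DATE", "O"]

# tag-level accuracy: fraction of positions where the tags agree
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# chunk-level precision and recall: a chunk counts only if span and label match
gold_spans, pred_spans = chunk_spans(gold), chunk_spans(pred)
correct = gold_spans & pred_spans
precision = len(correct) / len(pred_spans)
recall = len(correct) / len(gold_spans)

print("accuracy :", accuracy)
print("precision:", precision)
print("recall   :", recall)
```

In this toy case only two of ten tags are wrong, giving an accuracy of 0.8, yet just one of three predicted chunks is correct (precision 1/3) and one of two gold chunks is found (recall 0.5).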
How it works?
The ieer trees generated by ieer_chunked_sents() are not entirely accurate. There are no explicit sentence breaks, so each document is a single tree. Also, the words are not explicitly tagged, so the part-of-speech tags are guesses produced by nltk.tag.pos_tag().