A named entity chunker can be trained using the ieer corpus, whose name stands for Information Extraction: Entity Recognition. The ieer corpus has chunk trees but no part-of-speech tags for the words, so the tags have to be added before the trees can be used for training, which makes the job a bit tedious.
Named entity chunk trees can be created from the ieer corpus using the ieertree2conlltags() and ieer_chunked_sents() functions. These trees can then be used to train the ClassifierChunker class created in Classification-based chunking.
Code #1 : ieertree2conlltags()
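    import nltk.tag
    from nltk.chunk.util import conlltags2tree
    from nltk.corpus import ieer

    def ieertree2conlltags(tree, tag=nltk.tag.pos_tag):
        # tree.pos() pairs each word with the label of its parent node
        words, ents = zip(*tree.pos())
        iobs = []
        prev = None

        for ent in ents:
            if ent == tree.label():
                # word sits directly under the root: outside any entity
                iobs.append('O')
                prev = None
            elif prev == ent:
                # same label as the previous word: continue the chunk
                iobs.append('I-%s' % ent)
            else:
                # a new entity label: begin a chunk
                iobs.append('B-%s' % ent)
                prev = ent

        # the corpus has no POS tags, so guess them with the tagger
        words, tags = zip(*tag(words))
        return zip(words, tags, iobs)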
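To see the B-/I-/O labelling on its own, here is a minimal sketch that applies the same rule to hand-written (word, label) pairs of the kind tree.pos() returns. The helper name iob_labels, the sample pairs, and the root label 'DOCUMENT' are assumptions made for illustration. No NLTK data is needed to run it.

```python
def iob_labels(pairs, root_label):
    """Map (word, parent_label) pairs to IOB tags.

    A word whose parent is the root is outside any chunk ('O'); the first
    word under an entity label gets 'B-', and following words with the
    same label get 'I-'.
    """
    iobs = []
    prev = None
    for _word, ent in pairs:
        if ent == root_label:
            iobs.append('O')
            prev = None
        elif prev == ent:
            iobs.append('I-%s' % ent)
        else:
            iobs.append('B-%s' % ent)
            prev = ent
    return iobs


# Hypothetical sample, shaped like the output of tree.pos()
sample = [
    ('Pierre', 'PERSON'), ('Vinken', 'PERSON'),
    ('joined', 'DOCUMENT'), ('on', 'DOCUMENT'),
    ('Nov.', 'DATE'), ('29', 'DATE'),
]
labels = iob_labels(sample, 'DOCUMENT')
print(labels)
# ['B-PERSON', 'I-PERSON', 'O', 'O', 'B-DATE', 'I-DATE']
```

One consequence of this rule is that two adjacent entities with the same label are merged into a single chunk, because the second entity continues with I- instead of starting a new B- tag.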
Code #2 : ieer_chunked_sents()
    import nltk.tag
    from nltk.chunk.util import conlltags2tree
    from nltk.corpus import ieer

    def ieer_chunked_sents(tag=nltk.tag.pos_tag):
        # each parsed document yields one chunk tree
        for doc in ieer.parsed_docs():
            tagged = ieertree2conlltags(doc.text, tag)
            yield conlltags2tree(tagged)
The corpus yields 94 chunk trees, one per document. The first 80 are used for training and the remaining 14 for testing.
Code #3 : How the classifier works on the first sentence of the treebank_chunk corpus.
    from nltk.corpus import ieer
    from chunkers import ieer_chunked_sents, ClassifierChunker
    from nltk.corpus import treebank_chunk

    ieer_chunks = list(ieer_chunked_sents())
    print("Length of ieer_chunks : ", len(ieer_chunks))

    # initializing chunker
    chunker = ClassifierChunker(ieer_chunks[:80])
    print("\nparsing : \n", chunker.parse(treebank_chunk.tagged_sents()[0]))

    # evaluating
    score = chunker.evaluate(ieer_chunks[80:])

    a = score.accuracy()
    p = score.precision()
    r = score.recall()

    print("\nAccuracy : ", a)
    print("\nPrecision : ", p)
    print("\nRecall : ", r)
Output :
    Length of ieer_chunks : 94

    parsing :
    Tree('S', [Tree('LOCATION', [('Pierre', 'NNP'), ('Vinken', 'NNP')]),
    (',', ','), Tree('DURATION', [('61', 'CD'), ('years', 'NNS')]),
    Tree('MEASURE', [('old', 'JJ')]), (',', ','), ('will', 'MD'), ('join', 'VB'),
    ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'),
    ('nonexecutive', 'JJ'), ('director', 'NN'),
    Tree('DATE', [('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])

    Accuracy : 0.8829018388070625

    Precision : 0.4088717454194793

    Recall : 0.5053635280095352
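The gap between accuracy and precision/recall in this output is expected: accuracy is counted per token, and most tokens lie outside any entity, while precision and recall are counted per whole chunk. The sketch below illustrates the difference on a hand-made example. The helper chunk_spans and the gold/pred tag sequences are assumptions for illustration only, not NLTK's own scoring code.

```python
def chunk_spans(iobs):
    """Return the set of (label, start, end) chunks in an IOB tag sequence."""
    spans = set()
    start = label = None
    # a trailing 'O' sentinel closes a chunk that runs to the end
    for i, tag in enumerate(list(iobs) + ['O']):
        inside = tag.startswith('I-') and tag[2:] == label
        if label is not None and not inside:
            spans.add((label, start, i))
            start = label = None
        if tag.startswith('B-') or (tag.startswith('I-') and label is None):
            start, label = i, tag[2:]
    return spans


# Hypothetical gold and predicted tags for a ten-token text
gold = ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-DATE', 'I-DATE', 'O', 'O']
pred = ['B-PER', 'O',     'O', 'O', 'O', 'O', 'B-DATE', 'I-DATE', 'O', 'O']

# Token-level accuracy: one wrong tag out of ten
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Chunk-level scores: a chunk counts only if label and both ends match
gold_spans, pred_spans = chunk_spans(gold), chunk_spans(pred)
correct = len(gold_spans & pred_spans)
precision = correct / len(pred_spans)
recall = correct / len(gold_spans)

print(accuracy, precision, recall)
# 0.9 0.5 0.5
```

A single wrong tag costs only one token of accuracy but invalidates an entire chunk, so chunk-level scores fall much faster than token accuracy.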
How it works?
The ieer trees generated by ieer_chunked_sents() are not entirely accurate. There are no explicit sentence breaks, so each document is a single tree. Also, the words are not explicitly tagged, so the part-of-speech tags are guesswork produced by nltk.tag.pos_tag().