NLP | Classifier-based Chunking | Set 1
Unlike most part-of-speech taggers, the ClassifierBasedTagger class learns from features. A ClassifierChunker class can therefore be built that learns from both the words and the part-of-speech tags, instead of only the part-of-speech tags as the TagChunker class does.
The chunk_trees2train_chunks() function first converts each chunk tree into (word, pos, iob) 3-tuples with tree2conlltags(), then regroups them into ((word, pos), iob) 2-tuples. This keeps the data compatible with the 2-tuple (token, tag) format required for training a ClassifierBasedTagger: each (word, pos) pair acts as the token and the IOB tag as its tag.
Code #1 : Let’s understand
from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import ClassifierBasedTagger

def chunk_trees2train_chunks(chunk_sents):
    # Each chunk tree becomes a list of (word, pos, iob) 3-tuples
    tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
    # 3-tuple is converted to 2-tuple
    return [[((w, t), c) for (w, t, c) in sent] for sent in tag_sents]
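To make the reshaping concrete, here is a minimal sketch of the same list comprehension applied to hand-written IOB triples. The sample sentence and the helper name to_train_chunks are illustrative assumptions; in the real function the triples come from tree2conlltags(), which is skipped here so the snippet runs without NLTK.

```python
# Hypothetical sample: one sentence already in (word, pos, iob) form,
# i.e. the shape that tree2conlltags() returns for a chunk tree.
tag_sents = [[
    ("the", "DT", "B-NP"),
    ("cat", "NN", "I-NP"),
    ("sat", "VBD", "O"),
]]

def to_train_chunks(tag_sents):
    """Regroup (word, pos, iob) 3-tuples into ((word, pos), iob) 2-tuples."""
    return [[((w, t), c) for (w, t, c) in sent] for sent in tag_sents]

train_chunks = to_train_chunks(tag_sents)
print(train_chunks[0])
# [(('the', 'DT'), 'B-NP'), (('cat', 'NN'), 'I-NP'), (('sat', 'VBD'), 'O')]
```

Each training item now pairs a (word, pos) token with the IOB tag the classifier should learn to predict for it.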
Now, a feature detector function is needed to pass into ClassifierBasedTagger. Any feature detector function used with the ClassifierChunker class (defined next) should recognize that tokens are a list of (word, pos) tuples, and have the same function signature as prev_next_pos_iob(). To give the classifier as much information as we can, this feature set contains the current, previous, and next word and part-of-speech tag, along with the previous IOB tag.
Code #2 : detector function
def prev_next_pos_iob(tokens, index, history):
    word, pos = tokens[index]

    # Previous token and IOB tag, or a <START> marker at the sentence start
    if index == 0:
        prevword, prevpos, previob = ('<START>',) * 3
    else:
        prevword, prevpos = tokens[index - 1]
        previob = history[index - 1]

    # Next token, or an <END> marker at the sentence end
    if index == len(tokens) - 1:
        nextword, nextpos = ('<END>',) * 2
    else:
        nextword, nextpos = tokens[index + 1]

    feats = {'word': word,
             'pos': pos,
             'nextword': nextword,
             'nextpos': nextpos,
             'prevword': prevword,
             'prevpos': prevpos,
             'previob': previob
             }
    return feats
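As a quick check of what the classifier actually sees, the sketch below calls the detector by hand on a three-token sentence. The sentence and the history list are made-up sample values; during real tagging, history holds the IOB tags the tagger has already predicted for the earlier tokens. The function is repeated so the snippet runs on its own.

```python
def prev_next_pos_iob(tokens, index, history):
    """Same feature detector as in the article, repeated for a standalone run."""
    word, pos = tokens[index]
    if index == 0:
        prevword, prevpos, previob = ('<START>',) * 3
    else:
        prevword, prevpos = tokens[index - 1]
        previob = history[index - 1]
    if index == len(tokens) - 1:
        nextword, nextpos = ('<END>',) * 2
    else:
        nextword, nextpos = tokens[index + 1]
    return {'word': word, 'pos': pos,
            'nextword': nextword, 'nextpos': nextpos,
            'prevword': prevword, 'prevpos': prevpos,
            'previob': previob}

# Hypothetical sample sentence as (word, pos) tuples
tokens = [("the", "DT"), ("cat", "NN"), ("sat", "VBD")]
# Hypothetical IOB tags already assigned to tokens 0 and 1
history = ["B-NP", "I-NP"]

first = prev_next_pos_iob(tokens, 0, [])
last = prev_next_pos_iob(tokens, 2, history)

print(first["prevword"], first["nextword"])  # <START> cat
print(last["previob"], last["nextword"])     # I-NP <END>
```

The boundary markers give the classifier an explicit signal for sentence-initial and sentence-final tokens instead of leaving those features missing.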
Now a ClassifierChunker class is needed. It uses an internal ClassifierBasedTagger, trained on sentences from chunk_trees2train_chunks() with features extracted by prev_next_pos_iob(). As a subclass of ChunkParserI, ClassifierChunker implements the parse() method, which converts the ((w, t), c) tuples produced by the internal tagger into Trees using conlltags2tree().
Code #3 :
class ClassifierChunker(ChunkParserI):
    def __init__(self, train_sents,
                 feature_detector=prev_next_pos_iob, **kwargs):
        # Fall back to the default detector if None is passed explicitly
        # (ChunkParserI itself defines no feature_detector attribute)
        if not feature_detector:
            feature_detector = prev_next_pos_iob
        train_chunks = chunk_trees2train_chunks(train_sents)
        self.tagger = ClassifierBasedTagger(train=train_chunks,
            feature_detector=feature_detector, **kwargs)

    def parse(self, tagged_sent):
        if not tagged_sent:
            return None
        # Tag each (word, pos) token with an IOB tag
        chunks = self.tagger.tag(tagged_sent)
        # Flatten back to (word, pos, iob) triples and build a Tree
        return conlltags2tree(
            [(w, t, c) for ((w, t), c) in chunks])
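The last step of parse() can be pictured without NLTK. The sketch below flattens hypothetical tagger output back into (word, pos, iob) triples, using the same list comprehension as parse(), and then groups the words by their IOB tags. The helper group_chunks is a simplified stand-in written for illustration; it is not NLTK's conlltags2tree(), which builds a Tree rather than plain lists.

```python
# Hypothetical tagger output: ((word, pos), iob) pairs for one sentence.
chunks = [
    (("the", "DT"), "B-NP"),
    (("cat", "NN"), "I-NP"),
    (("sat", "VBD"), "O"),
    (("on", "IN"), "O"),
    (("the", "DT"), "B-NP"),
    (("mat", "NN"), "I-NP"),
]

# Same flattening that parse() performs before calling conlltags2tree().
triples = [(w, t, c) for ((w, t), c) in chunks]

def group_chunks(triples):
    """Group triples into (label, [(word, pos), ...]) chunks.

    Tokens tagged 'O' are kept as bare (word, pos) tuples.
    """
    result = []
    current = None  # the chunk being built, or None outside a chunk
    for word, pos, iob in triples:
        if iob == "O":
            current = None
            result.append((word, pos))
        elif iob.startswith("B-") or current is None or current[0] != iob[2:]:
            # Start a new chunk on B-, or on an I- tag with no matching open chunk
            current = (iob[2:], [(word, pos)])
            result.append(current)
        else:
            # I- tag that continues the open chunk of the same type
            current[1].append((word, pos))
    return result

print(group_chunks(triples))
# [('NP', [('the', 'DT'), ('cat', 'NN')]), ('sat', 'VBD'), ('on', 'IN'), ('NP', [('the', 'DT'), ('mat', 'NN')])]
```

A B- tag opens a chunk, I- tags extend it, and O tokens sit outside any chunk, which is the same reading of IOB tags that the tree conversion relies on.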
Last Updated : 23 Feb, 2019