NLP | Classifier-based Chunking | Set 1

The ClassifierBasedTagger class learns from the features, unlike most part-of-speech taggers. ClassifierChunker class can be created such that it can learn from both the words and part-of-speech tags, instead of just from the part-of-speech tags as the TagChunker class does.

The (word, pos, iob) 3-tuples is converted into ((word, pos), iob) 2-tuples using the chunk_trees2train_chunks() from tree2conlltags(), to remain compatible with the 2-tuple (word, pos) format required for training a ClassiferBasedTagger class.
 
Code #1 : Let’s understand

filter_none

edit
close

play_arrow

link
brightness_4
code

# Loading Libraries
from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import ClassifierBasedTagger
  
def chunk_trees2train_chunks(chunk_sents):
  
    # Using tree2conlltags
    tag_sents = [tree2conlltags(sent) for 
                 sent in chunk_sents]
  
    3-tuple is converted to 2-tuple
    return [[((w, t), c) for 
             (w, t, c) in sent] for sent in tag_sents]

chevron_right


Now, a feature detector function is needed to pass into ClassifierBasedTagger. Any feature detector function used with the ClassifierChunker class (defined next) should recognize that tokens are a list of (word, pos) tuples, and have the same function signature as prev_next_pos_iob(). To give the classifier as much information as we can, this feature set contains the current, previous, and next word and part-of-speech tag, along with the previous IOB tag.
 

Code #2 : detector function

filter_none

edit
close

play_arrow

link
brightness_4
code

def prev_next_pos_iob(tokens, index, history):
      
    word, pos = tokens[index]
    if index == 0:
        prevword, prevpos, previob = ('<START>', )*3
    else:
        prevword, prevpos = tokens[index-1]
        previob = history[index-1]
          
    if index == len(tokens) - 1:
        nextword, nextpos = ('<END>', )*2
    else:
        nextword, nextpos = tokens[index + 1]
        feats = {'word': word,
                 'pos': pos,
                 'nextword': nextword,
                 'nextpos': nextpos,
                 'prevword': prevword,
                 'prevpos': prevpos,
                 'previob': previob
                 }
    return feats

chevron_right


Now, ClassifierChunker class is need which uses an internal ClassifierBasedTagger with training sentences from chunk_trees2train_chunks() and features extracted using prev_next_pos_iob(). As a subclass of ChunkerParserI, ClassifierChunker implements the parse() method to convert the ((w, t), c) tuples, produced by the internal tagger into Trees using conlltags2tree()



Code #3 :

filter_none

edit
close

play_arrow

link
brightness_4
code

class ClassifierChunker(ChunkParserI):
    def __init__(self, train_sents, 
                 feature_detector = prev_next_pos_iob, **kwargs):
          
        if not feature_detector:
            feature_detector = self.feature_detector
            train_chunks = chunk_trees2train_chunks(train_sents)
            self.tagger = ClassifierBasedTagger(train = train_chunks,
            feature_detector = feature_detector, **kwargs)
              
    def parse(self, tagged_sent):
          
        if not tagged_sent: return None
        chunks = self.tagger.tag(tagged_sent)
          
        return conlltags2tree(
                [(w, t, c) for ((w, t), c) in chunks])

chevron_right


Attention geek! Strengthen your foundations with the Python Programming Foundation Course and learn the basics.

To begin with, your interview preparations Enhance your Data Structures concepts with the Python DS Course.




My Personal Notes arrow_drop_up

Aspire to Inspire before I expire

If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to contribute@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.