NLP | Training Tagger Based Chunker | Set 1

To train a chunker is an alternative to manually specifying regular expression (regex) chunk patterns. But manually training to specify the expression is a tedious task to do as it follows the hit and trial method to get the exact right patterns. So, existing corpus data can be used to train chunkers.

In the codes below, we are using treebank_chunk corpus to produce chunked sentences in the form of trees.
-> To train a tagger-based chunker – chunked_sents() methods are used by a TagChunker class.
-> To extract a list of (pos, iob) tuples from a list of Trees – the TagChunker class uses a helper function, conll_tag_chunks().

These tuples are then finally used to train a tagger. and it learns IOB tags for part-of-speech tags.



Code #1 : Let’s understand the Chunker class for training.

filter_none

edit
close

play_arrow

link
brightness_4
code

from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import UnigramTagger, BigramTagger
from tag_util import backoff_tagger
  
  
def conll_tag_chunks(chunk_data):
      
    tagged_data = [tree2conlltags(tree) for 
                    tree in chunk_data]
      
    return [[(t, c) for (w, t, c) in sent] 
            for sent in tagged_data]
      
class TagChunker(ChunkParserI):
      
    def __init__(self, train_chunks, 
                 tagger_classes =[UnigramTagger, BigramTagger]):
          
        train_data = conll_tag_chunks(train_chunks)
        self.tagger = backoff_tagger(train_data, tagger_classes)
          
    def parse(self, tagged_sent):
        if not tagged_sent: 
            return None
          
        (words, tags) = zip(*tagged_sent)
        chunks = self.tagger.tag(tags)
        wtc = zip(words, chunks)
          
        return conlltags2tree([(w, t, c) for (w, (t, c)) in wtc])

chevron_right


Output :

Training TagChunker

 
Code #2 : Using the Tag Chunker.

filter_none

edit
close

play_arrow

link
brightness_4
code

# loading libraries
from chunkers import TagChunker
from nltk.corpus import treebank_chunk
  
# data from treebank_chunk corpus
train_data = treebank_chunk.chunked_sents()[:3000]
test_data = treebank_chunk.chunked_sents()[3000:]
  
# Initailazing 
chunker = TagChunker(train_data)

chevron_right


 
Code #3 : Evaluating the TagChunker

filter_none

edit
close

play_arrow

link
brightness_4
code

# testing
score = chunker.evaluate(test_data)
  
a = score.accuracy()
p = score.precision()
r = recall
  
print ("Accuracy of TagChunker : ", a)
print ("\nPrecision of TagChunker : ", p)
print ("\nRecall of TagChunker : ", r)

chevron_right


Output :

Accuracy of TagChunker : 0.9732039335251428

Precision of TagChunker : 0.9166534370535006

Recall of TagChunker : 0.9465573770491803


My Personal Notes arrow_drop_up

Aspire to Inspire before I expire

If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to contribute@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.