
NLP | Chunking using Corpus Reader

Last Updated : 24 Dec, 2021

What are Chunks? 
Chunks are made up of words, and the kinds of words that form them are defined using part-of-speech tags. One can also define a pattern of words that cannot be part of a chunk; such words are known as chinks. A ChunkRule class specifies which words or patterns to include in, and exclude from, a chunk.
How it works : 

  • The ChunkedCorpusReader class works like the TaggedCorpusReader for getting tagged tokens, and additionally provides three new methods for getting chunks.
  • Each chunk is represented by an instance of nltk.tree.Tree.
  • Noun-phrase trees look like Tree('NP', [...]), whereas sentence-level trees look like Tree('S', [...]).
  • chunked_sents() returns a list of sentence trees, with each noun phrase as a subtree of its sentence.
  • chunked_words() returns a list of noun-phrase trees alongside the tagged tokens of words that were not in any chunk.
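The reader expects chunked corpus files on disk, so the examples below assume a `.chunk` file exists in the current directory. As a self-contained sketch, the snippet below writes a small sample file in NLTK's default bracketed word/TAG format (the sentence used here is taken from the outputs below; the file and directory names are illustrative) and reads it back:

```python
import os
import tempfile

from nltk.corpus.reader import ChunkedCorpusReader

# One sentence per line; noun phrases are wrapped in [...] and
# each token is written as word/TAG (the reader's default format).
sample = ("[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP "
          "trimmed/VBN about/IN [300/CD jobs/NNS] ./.\n")

root = tempfile.mkdtemp()
with open(os.path.join(root, 'sample.chunk'), 'w') as f:
    f.write(sample)

reader = ChunkedCorpusReader(root, r'.*\.chunk')
words = reader.chunked_words()
# The first element is a Tree('NP', ...) covering the first noun phrase
print(words)
```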

Diagram listing the major methods: chunked_words(), chunked_sents() and chunked_paras().

Code #1 : Creating a ChunkedCorpusReader for words 

Python3
# Using ChunkedCorpusReader
from nltk.corpus.reader import ChunkedCorpusReader

# initializing: read every .chunk file in the current directory
x = ChunkedCorpusReader('.', r'.*\.chunk')

words = x.chunked_words()
print("Words : \n", words)


Output : 

Words : 
[Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), 
('moves', 'NNS')]), ('have', 'VBP'), ...]

Code #2 : For sentences 

Python3
chunked_sent = x.chunked_sents()
print("Chunked Sentence : \n", chunked_sent)


Output : 

Chunked Sentence : 
[Tree('S', [Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), 
('moves', 'NNS')]), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), 
Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]), (',', ','),
Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]), ('said', 'VBD'), ('.', '.')])]
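Each sentence tree can be processed with the standard nltk.tree.Tree API. As a small illustration (building the tree by hand from the output above rather than reading it from a corpus), this pulls the words of every NP subtree out of the sentence:

```python
from nltk.tree import Tree

# Rebuild the sentence tree shown in the output above
sent = Tree('S', [
    Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]),
    ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'),
    Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]),
    (',', ','),
    Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]),
    ('said', 'VBD'), ('.', '.'),
])

# Collect the words of every NP subtree
phrases = [' '.join(word for word, tag in np.leaves())
           for np in sent.subtrees(lambda t: t.label() == 'NP')]
print(phrases)
# ['Earlier staff-reduction moves', '300 jobs', 'the spokesman']
```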

Code #3 : For paragraphs 

Python3
para = x.chunked_paras()
print("para : \n", para)


Output : 

[[Tree('S', [Tree('NP', [('Earlier', 'JJR'), ('staff-reduction',
'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ('trimmed', 'VBN'),
('about', 'IN'), 
Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]), (',', ','), 
Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]), ('said', 'VBD'), ('.', '.')])]] 
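A common next step, not part of the corpus reader itself but worth a sketch, is flattening a chunk tree into IOB (inside/outside/begin) tags with nltk.chunk.tree2conlltags, which is the format many chunk taggers train on:

```python
from nltk.chunk import tree2conlltags
from nltk.tree import Tree

# A fragment of the sentence tree from the outputs above
sent = Tree('S', [
    Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]),
    ('said', 'VBD'), ('.', '.'),
])

# Each token becomes a (word, pos, iob) triple
iob = tree2conlltags(sent)
print(iob)
# [('the', 'DT', 'B-NP'), ('spokesman', 'NN', 'I-NP'), ('said', 'VBD', 'O'), ('.', '.', 'O')]
```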

