NLP | Customization Using Tagged Corpus Reader

  • Last Updated : 17 Jun, 2021

How can we use the Tagged Corpus Reader? TaggedCorpusReader can be customized in the following ways:
 

  • Customizing word tokenizer
  • Customizing sentence tokenizer
  • Customizing paragraph block reader
  • Customizing tag separator
  • Converting tags to a universal tagset

 

Code #1 : Customizing word tokenizer 
 

Python3






# Loading the libraries
from nltk.tokenize import SpaceTokenizer
from nltk.corpus.reader import TaggedCorpusReader
 
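# Read every .pos file in the current directory,
# splitting each sentence into words on single spaces
x = TaggedCorpusReader('.', r'.*\.pos',
                       word_tokenizer=SpaceTokenizer())
 
x.words()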

Output : 
 

['The', 'expense', 'and', 'time', 'involved', 'are', ...]

Code #2 : Customizing sentence tokenizer 
 

Python3




# Loading the libraries
from nltk.tokenize import LineTokenizer
from nltk.corpus.reader import TaggedCorpusReader
 
# Treat each line of the file as one sentence
x = TaggedCorpusReader('.', r'.*\.pos',
                       sent_tokenizer=LineTokenizer())
 
x.sents()

Output : 
 

[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]

Customizing paragraph block reader 
 

  • By default, paragraphs are assumed to be separated by blank lines.
  • This is handled by the para_block_reader argument, whose default value is the function nltk.corpus.reader.util.read_blankline_block.
  • nltk.corpus.reader.util contains a number of other block reader functions, each of which reads blocks of text from a stream.
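
The effect of the block reader can be seen on a small throwaway corpus. The sketch below is an illustration, not part of the original example: the file names blank.pos and lines.pos and their contents are made up, and read_line_block is picked here as one of the other readers in nltk.corpus.reader.util, on the assumption that treating every line as its own paragraph is a useful contrast.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader
from nltk.corpus.reader.util import read_blankline_block, read_line_block

# Hypothetical sample data, written to a temporary directory.
root = tempfile.mkdtemp()
with open(os.path.join(root, 'blank.pos'), 'w', encoding='utf8') as f:
    # Two paragraphs separated by a blank line.
    f.write('The/DT dog/NN barks/VBZ ./.\nIt/PRP is/VBZ loud/JJ ./.\n'
            '\n'
            'Cats/NNS sleep/VBP ./.\n')
with open(os.path.join(root, 'lines.pos'), 'w', encoding='utf8') as f:
    # No blank lines: one sentence per line.
    f.write('The/DT dog/NN barks/VBZ ./.\nCats/NNS sleep/VBP ./.\n')

# Default behaviour, spelled out explicitly: a blank line ends a paragraph.
blank_reader = TaggedCorpusReader(root, ['blank.pos'],
                                  para_block_reader=read_blankline_block)
blank_paras = list(blank_reader.paras())

# Alternative block reader: every line becomes its own paragraph.
line_reader = TaggedCorpusReader(root, ['lines.pos'],
                                 para_block_reader=read_line_block)
line_paras = list(line_reader.paras())

print(len(blank_paras), len(line_paras))
```

Each paragraph returned by paras() is a list of sentences, and each sentence is a list of words, so the choice of block reader only changes how sentences are grouped, not how they are tokenized.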

Customizing tag separator 
 

  • If the corpus does not use '/' as the word/tag separator, pass the alternative string to TaggedCorpusReader as the sep argument.
  • The default is sep='/'. To split words and tags on '|', as in 'word|tag', pass sep='|'.
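
A minimal sketch of the separator in action follows. The file pipe.pos and its single sentence are hypothetical and exist only for this illustration; the same file is read once with the default separator and once with sep='|' to show the difference.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical sample file that uses '|' between word and tag.
root = tempfile.mkdtemp()
with open(os.path.join(root, 'pipe.pos'), 'w', encoding='utf8') as f:
    f.write('The|DT dog|NN barks|VBZ .|.\n')

# With the default sep='/', no separator is found in the tokens,
# so each whole token is kept as the word and the tag is None.
default_reader = TaggedCorpusReader(root, r'.*\.pos')
untagged = list(default_reader.tagged_words())

# Passing sep='|' splits each token into a (word, tag) pair.
pipe_reader = TaggedCorpusReader(root, r'.*\.pos', sep='|')
tagged = list(pipe_reader.tagged_words())

print(untagged[0])
print(tagged[0])
```

A tag of None on every word is a quick sign that sep does not match the separator actually used in the corpus files.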

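Converting tags to a universal tagset 
Tagset : A list of POS tags used by one or more corpora. 
Universal Tagset : A simplified, condensed tagset composed of only 12 part-of-speech tags. 
To convert tags, the reader must be told which tagset the corpus itself uses (the tagset argument of the constructor); the target tagset is then passed to tagged_words(). 
Code #3 : Mapping corpus tags to the universal tagset 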
 

Python3






from nltk.corpus.reader import TaggedCorpusReader
 
# tagset names the corpus's own tagset, so its tags can be mapped
x = TaggedCorpusReader('.', r'.*\.pos', tagset='en-brown')
x.tagged_words(tagset='universal')

Output : 
 

[('The', 'DET'), ('expense', 'NOUN'), ('and', 'CONJ'), ...] 

Code #4 : Mapping treebank tags to the universal tagset and to an unsupported tagset 
 

Python3




from nltk.corpus import treebank
 
# Native Penn Treebank tags
treebank.tagged_words()
 
# Tags mapped to the universal tagset
treebank.tagged_words(tagset='universal')
 
# No treebank-to-brown mapping exists, so every tag comes back as 'UNK'
treebank.tagged_words(tagset='brown')

Output : 
 

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]

[('Pierre', 'UNK'), ('Vinken', 'UNK'), (',', 'UNK'), ...]

 
