NLP | Customization Using Tagged Corpus Reader

How can we use the Tagged Corpus Reader?

  • Customizing word tokenizer
  • Customizing sentence tokenizer
  • Customizing paragraph block reader
  • Customizing tag separator
  • Converting tags to a universal tagset

Code #1 : Customizing word tokenizer

# Loading the libraries
from nltk.tokenize import SpaceTokenizer
from nltk.corpus.reader import TaggedCorpusReader
  
# Read every .pos file in the current directory,
# splitting words on spaces only
x = TaggedCorpusReader('.', r'.*\.pos',
                       word_tokenizer=SpaceTokenizer())
  
x.words()

Output :

['The', 'expense', 'and', 'time', 'involved', 'are', ...]

Code #2 : Customizing sentence tokenizer

# Loading the libraries
from nltk.tokenize import LineTokenizer
from nltk.corpus.reader import TaggedCorpusReader
  
# Read every .pos file in the current directory,
# treating each line as one sentence
x = TaggedCorpusReader('.', r'.*\.pos',
                       sent_tokenizer=LineTokenizer())
  
x.sents()

Output :

[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]

Customizing paragraph block reader

  • Paragraphs are assumed to be separated by blank lines.
  • This is handled by the para_block_reader argument, which defaults to nltk.corpus.reader.util.read_blankline_block.
  • nltk.corpus.reader.util provides a number of other block readers, all of which read blocks of text from a stream.
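
As a rough illustration of the points above, the sketch below writes a small, made-up two-paragraph corpus to a temporary directory and reads it back with para_block_reader set explicitly to read_blankline_block. The file name sample.pos and its contents are assumptions chosen for the example, not part of the original corpus.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader
from nltk.corpus.reader.util import read_blankline_block

# Hypothetical sample corpus: two paragraphs separated by a blank line.
SAMPLE = (
    "The/DT cat/NN sat/VBD ./.\n"
    "It/PRP purred/VBD ./.\n"
    "\n"
    "Dogs/NNS bark/VBP ./.\n"
)

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sample.pos"), "w", encoding="utf8") as f:
        f.write(SAMPLE)

    # Passing read_blankline_block explicitly; it is also the default.
    reader = TaggedCorpusReader(
        root, r".*\.pos", para_block_reader=read_blankline_block
    )
    # Materialize the lazy corpus view while the file still exists.
    paragraphs = [list(para) for para in reader.paras()]

# Each paragraph is a list of sentences; each sentence is a list of words.
print(len(paragraphs))
print(paragraphs[1])
```

Swapping in another block reader from nltk.corpus.reader.util would change how the stream is cut into paragraphs, while the word and sentence tokenizers stay the same.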

Customizing tag separator

  • If '/' is not the word/tag separator in the corpus, an alternative string can be passed to TaggedCorpusReader as sep.
  • The default is sep='/'. To split words and tags on '|', as in 'word|tag', pass sep='|'.
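
A minimal sketch of the sep argument, assuming a made-up one-line corpus that uses '|' between each word and its tag. The file name pipes.pos and the sentence in it are hypothetical and only serve the example.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical sample corpus using '|' instead of '/' as the separator.
SAMPLE = "The|DT cat|NN sat|VBD .|.\n"

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "pipes.pos"), "w", encoding="utf8") as f:
        f.write(SAMPLE)

    # sep tells the reader where to split each token into (word, tag).
    reader = TaggedCorpusReader(root, r".*\.pos", sep="|")
    # Materialize the lazy corpus view while the file still exists.
    tagged = list(reader.tagged_words())

print(tagged)
```

If sep were left at its default here, the reader would find no '/' in a token such as 'The|DT' and would return the whole token as the word with a tag of None.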

Converting tags to a universal tagset
Tagset : A list of part-of-speech (POS) tags used by one or more corpora.
Universal Tagset : A simplified, condensed tagset made up of only 12 part-of-speech tags.

Code #3 : Mapping corpus tags to the universal tagset

from nltk.corpus.reader import TaggedCorpusReader
  
# Declare the corpus's own tagset so the tags can be mapped later
x = TaggedCorpusReader('.', r'.*\.pos', tagset='en-brown')
x.tagged_words(tagset='universal')

Output :

[('The', 'DET'), ('expense', 'NOUN'), ('and', 'CONJ'), ...] 

Code #4 : Mapping treebank tags to the universal tagset and to an unknown tagset

from nltk.corpus import treebank
  
# Original treebank tags
treebank.tagged_words()
  
# Tags mapped to the universal tagset
treebank.tagged_words(tagset='universal')
  
# An unknown target tagset maps every tag to 'UNK'
treebank.tagged_words(tagset='brown')

Output :

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]

[('Pierre', 'UNK'), ('Vinken', 'UNK'), (',', 'UNK'), ...]

