
NLP | Customization Using Tagged Corpus Reader

How can we use the Tagged Corpus Reader?
 

 



Code #1 : Customizing word tokenizer 
 






# Loading the libraries
from nltk.tokenize import SpaceTokenizer
from nltk.corpus.reader import TaggedCorpusReader
 
x = TaggedCorpusReader('.', r'.*\.pos',
                       word_tokenizer = SpaceTokenizer())
 
x.words()

Output : 
 

['The', 'expense', 'and', 'time', 'involved', 'are', ...]

Code #2 : Customizing sentence tokenizer
 




# Loading the libraries
from nltk.tokenize import LineTokenizer
from nltk.corpus.reader import TaggedCorpusReader
 
x = TaggedCorpusReader('.', r'.*\.pos',
                       sent_tokenizer = LineTokenizer())
 
x.sents()

Output : 
 

[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]

Customizing paragraph block reader
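By default, TaggedCorpusReader assumes paragraphs are separated by blank lines and reads them with read_blankline_block from nltk.corpus.reader.util. A different block reader can be supplied through the para_block_reader parameter. The sketch below is only an illustration: the file name sample.pos and its two-line contents are hypothetical, written to a temporary directory so the example is self-contained. It passes read_line_block so that each line is treated as its own paragraph.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader
from nltk.corpus.reader.util import read_line_block

# Hypothetical sample data: two tagged sentences with no blank line between them
sample = "The/at expense/nn ./.\nIt/pps works/vbz ./.\n"

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sample.pos"), "w", encoding="utf-8") as f:
        f.write(sample)

    # Default block reader: paragraphs are separated by blank lines
    default_reader = TaggedCorpusReader(root, r".*\.pos")
    default_paras = [list(p) for p in default_reader.paras()]

    # Custom block reader: every line becomes its own paragraph
    line_reader = TaggedCorpusReader(root, r".*\.pos",
                                     para_block_reader=read_line_block)
    line_paras = [list(p) for p in line_reader.paras()]

print(len(default_paras))  # number of paragraphs with the default reader
print(len(line_paras))     # number of paragraphs with read_line_block
```

With the default reader, both lines fall into a single paragraph because no blank line separates them. With read_line_block, each line is returned as a separate paragraph.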
 

Customizing Tag separator 
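TaggedCorpusReader expects each token in the form word/tag, with '/' as the default separator. If a corpus uses a different character, it can be passed through the sep parameter. The following is a minimal sketch under assumptions: the file sample.pos and its contents are hypothetical, and '|' is chosen only as an example separator.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical sample data that uses '|' instead of the default '/'
sample = "The|at expense|nn and|cc time|nn .|.\n"

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sample.pos"), "w", encoding="utf-8") as f:
        f.write(sample)

    # Tell the reader which character separates a word from its tag
    reader = TaggedCorpusReader(root, r".*\.pos", sep="|")
    tagged = list(reader.tagged_words())

print(tagged)
```

The tags come back upper-cased because NLTK converts the tag to upper case when it splits each token into a (word, tag) pair.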
 

Converting tags to a universal tagset 
Tagset : A set of POS tags used by one or more corpora.
Universal Tagset : A simplified, condensed tagset made up of only 12 part-of-speech tags.
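To make the idea of tag mapping concrete, here is a small pure-Python sketch. It is not NLTK's implementation: BROWN_TO_UNIVERSAL is a hand-written subset of the Brown-to-universal mapping, and to_universal is a hypothetical helper written only for this illustration. NLTK ships complete mapping tables and applies them when tagset='universal' is requested.

```python
# The 12 tags of the universal tagset
UNIVERSAL_TAGS = ("ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN",
                  "NUM", "PRON", "PRT", "VERB", ".", "X")

# Hand-written subset of a Brown-to-universal mapping (illustration only)
BROWN_TO_UNIVERSAL = {
    "AT": "DET",
    "NN": "NOUN",
    "CC": "CONJ",
    "VBN": "VERB",
    "BER": "VERB",
    "JJ": "ADJ",
    ".": ".",
}


def to_universal(tagged_words, mapping):
    """Replace each corpus tag with its universal tag.

    Tags missing from the mapping become 'UNK' in this sketch.
    """
    return [(word, mapping.get(tag, "UNK")) for word, tag in tagged_words]


brown_tagged = [("The", "AT"), ("expense", "NN"), ("and", "CC"),
                ("time", "NN"), ("involved", "VBN")]
universal_tagged = to_universal(brown_tagged, BROWN_TO_UNIVERSAL)
print(universal_tagged)
```

Many fine-grained corpus tags collapse onto one coarse tag. For example, both VBN and BER map to VERB here.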
Code #3 : Mapping corpus tags to the universal tagset
 




from nltk.corpus.reader import TaggedCorpusReader
 
x = TaggedCorpusReader('.', r'.*\.pos', tagset ='en-brown')
x.tagged_words(tagset ='universal')

Output : 
 

[('The', 'DET'), ('expense', 'NOUN'), ('and', 'CONJ'), ...] 

Code #4 : Mapping tags of the treebank corpus
 




from nltk.corpus import treebank

# Original treebank tags
treebank.tagged_words()

# Tags mapped to the universal tagset
treebank.tagged_words(tagset ='universal')

# Requesting the 'brown' tagset returns 'UNK' for every tag
treebank.tagged_words(tagset ='brown')

Output : 
 

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]

[('Pierre', 'UNK'), ('Vinken', 'UNK'), (',', 'UNK'), ...]

 

