NLP | Categorized Text Corpus
  • Last Updated : 20 Feb, 2019

If we have a large amount of text data, we can categorize it into separate sections.

Code #1 : Categorization




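# Loading brown corpus
# (may need a one-time nltk.download('brown') if the corpus is not installed)
from nltk.corpus import brown

# List the categories defined for the corpus
print(brown.categories())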

Output :

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government',
'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion',
'reviews', 'romance', 'science_fiction']

How to categorize a corpus?
The easiest way is to have one file for each category. The following are two excerpts from the movie_reviews corpus:



  • movie_pos.txt
  • movie_neg.txt

Using these two files, we’ll have two categories – pos and neg.

Code #2 : Let’s categorize




from nltk.corpus.reader import CategorizedPlaintextCorpusReader
  
# Read every movie_*.txt file in the current directory; the capture
# group in cat_pattern is used as the category name of each file
reader = CategorizedPlaintextCorpusReader(
    '.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')

print("Categorize : ", reader.categories())

print("\nNegative field : ", reader.fileids(categories=['neg']))

print("\nPositive field : ", reader.fileids(categories=['pos']))

Output :

Categorize : ['neg', 'pos']

Negative field : ['movie_neg.txt']

Positive field : ['movie_pos.txt']
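
To see how a cat_pattern turns file names into categories, here is a minimal sketch that uses only Python's re module. It assumes the category is the first capture group of the pattern matched against each file name, which is how the reader appears to behave in the example above. The helper name categories_from_pattern is hypothetical and not part of NLTK, and unlike a real reader it silently skips names that do not match.

```python
import re
from collections import defaultdict


def categories_from_pattern(fileids, cat_pattern):
    """Hypothetical helper: group file names by the first capture group."""
    mapping = defaultdict(list)
    for fileid in fileids:
        match = re.match(cat_pattern, fileid)
        if match:  # assumption: ignore names the pattern does not match
            mapping[match.group(1)].append(fileid)
    return dict(mapping)


files = ['movie_neg.txt', 'movie_pos.txt']
result = categories_from_pattern(files, r'movie_(\w+)\.txt')

print(sorted(result))   # category names
print(result['neg'])    # files in the 'neg' category
```

Adding a third file such as movie_mixed.txt would, under the same assumption, produce a third category named mixed without any change to the pattern.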

Code #3 : Using a cat_map instead of cat_pattern




from nltk.corpus.reader import CategorizedPlaintextCorpusReader
  
# cat_map lists the categories of each file explicitly
reader = CategorizedPlaintextCorpusReader(
    '.', r'movie_.*\.txt',
    cat_map={'movie_pos.txt': ['pos'],
             'movie_neg.txt': ['neg']})

print("Categorize : ", reader.categories())

Output :

Categorize : ['neg', 'pos']
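
Because each value in a cat_map is a list, one file can be placed in several categories. The sketch below, in plain Python, inverts such a mapping to show which files end up under each category. The function fileids_by_category and the extra file movie_mixed.txt are hypothetical, introduced only for illustration; this is not NLTK's own implementation.

```python
from collections import defaultdict


def fileids_by_category(cat_map):
    """Hypothetical helper: invert {fileid: [categories]} to {category: [fileids]}."""
    inverted = defaultdict(list)
    for fileid, cats in cat_map.items():
        for cat in cats:
            inverted[cat].append(fileid)
    # Sort file names so the result does not depend on dict ordering
    return {cat: sorted(ids) for cat, ids in inverted.items()}


cat_map = {
    'movie_pos.txt': ['pos'],
    'movie_neg.txt': ['neg'],
    'movie_mixed.txt': ['pos', 'neg'],  # hypothetical file in both categories
}
by_category = fileids_by_category(cat_map)

print(sorted(by_category))
print(by_category['pos'])
```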
