NLP | Categorized Text Corpus

When a corpus contains a large amount of text, it can be organized into categories so that each section can be accessed separately.

Code #1 : Categorization

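# Load the Brown corpus (the corpus data must be installed,
# for example with nltk.download('brown'))
from nltk.corpus import brown

# List the categories defined in the corpus
print(brown.categories())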

Output :

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government',
'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion',
'reviews', 'romance', 'science_fiction']

How to categorize a corpus?
The easiest way is to have one file for each category. The following two files are excerpts from the movie_reviews corpus:

  • movie_pos.txt
  • movie_neg.txt

Using these two files, we’ll have two categories – pos and neg.
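To follow along without the original excerpts, a minimal sketch like the one below can create the two files. The review text and the temporary directory are assumptions made for illustration; they are not taken from the movie_reviews corpus.

```python
import tempfile
from pathlib import Path

# Placeholder review text (invented for illustration only)
samples = {
    "movie_pos.txt": "a gripping film with a strong cast .\n",
    "movie_neg.txt": "a dull plot and flat dialogue .\n",
}

# Write one file per category into a fresh temporary directory
corpus_dir = Path(tempfile.mkdtemp())
for name, text in samples.items():
    (corpus_dir / name).write_text(text, encoding="utf-8")

# List the files that a pattern such as movie_*.txt would pick up
created = sorted(p.name for p in corpus_dir.glob("movie_*.txt"))
print(created)
```

If the files are created this way, the corpus root passed to the reader would be str(corpus_dir) rather than '.'.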

Code #2 : Let’s categorize

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Read every movie_*.txt file in the current directory; the group in
# cat_pattern captures the category name from each file name
reader = CategorizedPlaintextCorpusReader(
    '.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')

print("Categories:", reader.categories())

print("\nNegative file IDs:", reader.fileids(categories=['neg']))

print("\nPositive file IDs:", reader.fileids(categories=['pos']))


Output :

Categories: ['neg', 'pos']

Negative file IDs: ['movie_neg.txt']

Positive file IDs: ['movie_pos.txt']
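The cat_pattern argument is a regular expression whose capture group yields the category for each file name. The sketch below reproduces that idea with the standard re module only. It is an illustration of the matching step, not NLTK's actual implementation, and the fileids list is hard-coded here as an assumption.

```python
import re

# Same pattern as the cat_pattern argument used above
cat_pattern = r'movie_(\w+)\.txt'

# File names assumed to be present in the corpus directory
fileids = ['movie_neg.txt', 'movie_pos.txt']

# Map each file name to the text captured by the first group
categories = {}
for fileid in fileids:
    match = re.match(cat_pattern, fileid)
    if match:
        categories[fileid] = match.group(1)

print(categories)
```

A file name that does not match the pattern is simply skipped in this sketch.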

Code #3 : Using cat_map instead of cat_pattern

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# cat_map assigns categories explicitly: each file name maps to a list
# of category names
reader = CategorizedPlaintextCorpusReader(
    '.', r'movie_.*\.txt', cat_map={'movie_pos.txt': ['pos'],
                                    'movie_neg.txt': ['neg']})

print("Categories:", reader.categories())


Output :

Categories: ['neg', 'pos']
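Because each cat_map value is a list, a single file can belong to more than one category. As a rough sketch of the lookup this enables (plain Python, not the reader's own code), the mapping can be inverted to find the files for a given category. The name files_by_category is hypothetical.

```python
from collections import defaultdict

# Same mapping as the cat_map argument used above
cat_map = {'movie_pos.txt': ['pos'], 'movie_neg.txt': ['neg']}

# Invert file -> categories into category -> files
files_by_category = defaultdict(list)
for fileid, cats in cat_map.items():
    for cat in cats:
        files_by_category[cat].append(fileid)

print(sorted(files_by_category))
print(files_by_category['pos'])
```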

