NLP | Categorized Text Corpus

If we have a large amount of text data, we can organize it into separate categories.

Code #1 : Categorization

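# Load the Brown corpus
# (the corpus data must be installed once, e.g. with nltk.download('brown'))
from nltk.corpus import brown

# List the categories defined in the corpus
print(brown.categories())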

Output :

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government',
'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion',
'reviews', 'romance', 'science_fiction']

How to categorize a corpus?
The easiest way is to have one file for each category. The following are two excerpts from the movie_reviews corpus:



  • movie_pos.txt
  • movie_neg.txt

Using these two files, we’ll have two categories – pos and neg.
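To follow along, the two files need to exist in the directory the reader is pointed at. The sketch below writes two placeholder files using only the standard library. The helper name write_sample_files and the review sentences are made up for illustration; they are not the actual excerpts from the movie_reviews corpus.

```python
import tempfile
from pathlib import Path

# Placeholder contents (hypothetical, not the real corpus excerpts)
SAMPLES = {
    'movie_pos.txt': 'a thoughtful and well acted film .\n',
    'movie_neg.txt': 'a dull plot with flat characters .\n',
}

def write_sample_files(root):
    """Write one text file per category into root and return the file names."""
    root = Path(root)
    for name, text in SAMPLES.items():
        (root / name).write_text(text, encoding='utf-8')
    return sorted(p.name for p in root.glob('movie_*.txt'))

# Demo in a temporary directory; pass '.' instead to use the current one
with tempfile.TemporaryDirectory() as tmp:
    created = write_sample_files(tmp)
    print(created)
```

Because the category is encoded in the file name, adding another category is just a matter of adding another movie_<name>.txt file.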

Code #2 : Let’s categorize

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Read every movie_*.txt file in the current directory; the group
# captured by cat_pattern becomes the category name
reader = CategorizedPlaintextCorpusReader(
    '.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')

print("Categories :", reader.categories())

print("\nNegative files :", reader.fileids(categories=['neg']))

print("\nPositive files :", reader.fileids(categories=['pos']))

Output :

Categories : ['neg', 'pos']

Negative files : ['movie_neg.txt']

Positive files : ['movie_pos.txt']
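The category names above come from the capture group in the cat_pattern regular expression. A minimal sketch of that idea using only the re module is shown here. The helper category_of is hypothetical and only mirrors the matching step; it is not NLTK's own code.

```python
import re

# Same pattern as in the reader above: the part matched by (\w+) is the category
cat_pattern = r'movie_(\w+)\.txt'
file_ids = ['movie_neg.txt', 'movie_pos.txt']

def category_of(file_id, pattern=cat_pattern):
    """Return the text captured by the first group, or None if there is no match."""
    match = re.match(pattern, file_id)
    return match.group(1) if match else None

categories = {fid: category_of(fid) for fid in file_ids}
print(categories)
# {'movie_neg.txt': 'neg', 'movie_pos.txt': 'pos'}
```

A file name that does not fit the pattern, such as readme.txt, yields no category in this sketch.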

Code #3 : Using cat_map instead of cat_pattern

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# cat_map maps each file name to an explicit list of categories
reader = CategorizedPlaintextCorpusReader(
    '.', r'movie_.*\.txt',
    cat_map={'movie_pos.txt': ['pos'],
             'movie_neg.txt': ['neg']})

print("Categories :", reader.categories())

Output :

Categories : ['neg', 'pos']
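Since cat_map gives a list of categories for each file, a single file can belong to more than one category. The sketch below uses a plain dictionary to show how such a mapping can be inverted to find the files for each category. The helper files_by_category is hypothetical and is only an illustration of the idea, not how NLTK implements it.

```python
from collections import defaultdict

# Same mapping as passed to the reader above
cat_map = {'movie_pos.txt': ['pos'], 'movie_neg.txt': ['neg']}

def files_by_category(mapping):
    """Invert a file -> categories mapping into category -> sorted file list."""
    inverted = defaultdict(list)
    for file_id, cats in mapping.items():
        for cat in cats:
            inverted[cat].append(file_id)
    return {cat: sorted(files) for cat, files in sorted(inverted.items())}

print(files_by_category(cat_map))
# {'neg': ['movie_neg.txt'], 'pos': ['movie_pos.txt']}
```

Compared with cat_pattern, an explicit map is more verbose but does not depend on the category being part of the file name.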
