When we have a large amount of text data, we can organize it into separate categories.
Code #1 : Categorization
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
How to categorize a corpus?
The easiest way is to keep one file per category. Take two excerpts from the movie_reviews corpus, one positive and one negative, each saved in its own file: movie_pos.txt and movie_neg.txt.
Using these two files, we’ll have two categories – pos and neg.
Code #2 : Let’s categorize
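A minimal sketch of this step using `CategorizedPlaintextCorpusReader`. It assumes the two files live in a scratch directory, created here with placeholder contents because the real excerpts are not reproduced. The `cat_pattern` regex is also an assumption: its capture group is what becomes the category name.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Placeholder contents (hypothetical); only the file names matter for categorization.
samples = {
    "movie_pos.txt": "placeholder text for a positive review",
    "movie_neg.txt": "placeholder text for a negative review",
}

with tempfile.TemporaryDirectory() as root:
    for name, text in samples.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write(text)

    # Second argument: regex selecting the corpus files.
    # cat_pattern: regex whose first group is taken as the file's category.
    reader = CategorizedPlaintextCorpusReader(
        root, r"movie_.*\.txt", cat_pattern=r"movie_(\w+)\.txt"
    )
    categories = reader.categories()
    neg_files = reader.fileids(categories=["neg"])
    pos_files = reader.fileids(categories=["pos"])

print("Categorize :", categories)
print("Negative field :", neg_files)
print("Positive field :", pos_files)
```

Run as written, this prints the category list and the per-category file ids shown next.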
Categorize : ['neg', 'pos']
Negative field : ['movie_neg.txt']
Positive field : ['movie_pos.txt']
Code #3 : Using cat_map instead of cat_pattern
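A minimal sketch of the same reader built with an explicit mapping, again assuming a scratch directory with placeholder files. `cat_map` is a dictionary from file id to a list of categories, so a file can belong to more than one category if needed.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

with tempfile.TemporaryDirectory() as root:
    # Empty placeholder files (hypothetical): categories come from cat_map,
    # not from the file contents.
    for name in ("movie_pos.txt", "movie_neg.txt"):
        open(os.path.join(root, name), "w", encoding="utf-8").close()

    reader = CategorizedPlaintextCorpusReader(
        root,
        r"movie_.*\.txt",
        cat_map={"movie_pos.txt": ["pos"], "movie_neg.txt": ["neg"]},
    )
    mapped_categories = reader.categories()

print("Categorize :", mapped_categories)
```

An explicit map is more verbose than a pattern, but it does not depend on the file names following a naming scheme.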
Categorize : ['neg', 'pos']