NLP | Custom corpus

  • Last Updated : 20 Feb, 2019

What is a corpus?
A corpus is a collection of text documents. It can be thought of as a set of text files in a directory, often alongside many other directories of text files.

How is it done?
NLTK already defines a list of data paths, or directories, in nltk.data.path. Our custom corpora must be placed within one of these paths so that NLTK can find them.
We can also create a custom nltk_data directory in our home directory and verify that it is in the list of known paths specified by nltk.data.path.
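
The lookup idea can be sketched without NLTK itself. The helper below, find_resource, is a hypothetical name and only a simplified stand-in for what NLTK does internally: it walks a list of directories in order and returns the first location that contains the requested relative path. The temporary directories are assumptions standing in for real data paths.

```python
import os
import tempfile

def find_resource(resource, search_paths):
    """Return the full path of the first match for `resource`, or None.

    Simplified illustration only; NLTK's real lookup does more,
    such as searching inside zip archives.
    """
    for root in search_paths:
        candidate = os.path.join(root, *resource.split('/'))
        if os.path.exists(candidate):
            return candidate
    return None

# Two stand-in data directories; only the second holds the corpus.
empty_root = tempfile.mkdtemp()
data_root = tempfile.mkdtemp()
corpus_dir = os.path.join(data_root, 'corpora', 'cookbook')
os.makedirs(corpus_dir)

found = find_resource('corpora/cookbook', [empty_root, data_root])
missing = find_resource('corpora/other', [empty_root, data_root])
print(found == corpus_dir)
print(missing)
```

A corpus placed outside every listed directory is not found, which is why the custom directory has to be one of the known paths.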

Code #1 : Creating a custom directory and verifying it.

# importing libraries
import os, os.path
import nltk.data
# using the given path
path = os.path.expanduser('~/nltk_data')
# creating the directory if it does not exist yet
if not os.path.exists(path):
    os.mkdir(path)
# checking
print ("Does path exists : ", os.path.exists(path))
print ("\nDoes path exists in nltk : ",
       path in nltk.data.path)

Output :

Does path exists : True
Does path exists in nltk : True

Code #2 : Loading a wordlist file.

# loading libraries
import nltk.data
# loading the word file as raw bytes
nltk.data.load('corpora/cookbook/word_file.txt', format ='raw')

Output : the contents of word_file.txt, returned as a byte string.


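The load call above assumes that word_file.txt already exists under corpora/cookbook inside one of the data directories. A minimal sketch of creating such a file follows. It uses a temporary directory as a stand-in for ~/nltk_data, and the word list is made up for illustration, since the article does not show the file's contents.

```python
import os
import tempfile

# Stand-in for ~/nltk_data so the sketch does not touch the home directory.
data_root = tempfile.mkdtemp()
corpus_dir = os.path.join(data_root, 'corpora', 'cookbook')
os.makedirs(corpus_dir, exist_ok=True)

# Hypothetical word list, one word per line.
words = ['alpha', 'beta', 'gamma']
word_file = os.path.join(corpus_dir, 'word_file.txt')
with open(word_file, 'w', encoding='utf-8', newline='\n') as f:
    f.write('\n'.join(words) + '\n')

# Reading in binary mode gives unprocessed bytes, similar in spirit
# to what format='raw' asks for.
with open(word_file, 'rb') as f:
    raw = f.read()
print(raw)
```

For NLTK to find the file, it would have to live under a directory listed in nltk.data.path, for example ~/nltk_data instead of the temporary directory used here.
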
How does all this work?

  • nltk.data.load() recognizes several formats, including ‘raw’, ‘pickle’ and ‘yaml’.
  • If no format is given, it guesses the format from the file’s extension.
  • In the code above, the format is given explicitly as ‘raw’, so the file contents are returned unprocessed.
  • If the file name ends in ‘.yaml’, there is no need to specify the format.
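
The extension-based guessing described in the list can be sketched as follows. This is an illustration, not NLTK's own code: guess_format and EXTENSION_FORMATS are hypothetical names, and the table covers only the formats named in this article.

```python
import os

# Hypothetical extension-to-format table; NLTK's own table is larger.
EXTENSION_FORMATS = {'.pickle': 'pickle', '.yaml': 'yaml'}

def guess_format(resource, fmt=None):
    """Return `fmt` if given, otherwise guess it from the extension."""
    if fmt is not None:
        return fmt
    ext = os.path.splitext(resource)[1].lower()
    if ext not in EXTENSION_FORMATS:
        raise ValueError('cannot guess format for %r; pass one explicitly'
                         % resource)
    return EXTENSION_FORMATS[ext]

# An explicit format is used as-is; a .yaml file needs none.
print(guess_format('corpora/cookbook/word_file.txt', fmt='raw'))
print(guess_format('corpora/cookbook/synonyms.yaml'))
```

An explicit format always takes priority over the guess, which matches the way ‘raw’ is passed in Code #2.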

Code #3 : How to load a YAML file

# loading file using the path
import nltk.data
nltk.data.load('corpora/cookbook/synonyms.yaml')

Output :

{'bday': 'birthday'}
