NLP | Custom corpus

  • Last Updated : 20 Feb, 2019

What is a corpus?
A corpus can be defined as a collection of text documents. It can be thought of as a set of text files in a directory, often alongside many other directories of text files.

How is it done?
NLTK already defines a list of data paths, or directories, in nltk.data.path. Our custom corpora must be present within one of these paths so that NLTK can find them.
We can also create a custom nltk_data directory in our home directory and verify that it is in the list of known paths specified by nltk.data.path.

Code #1 : Creating a custom directory and verifying it.

# importing libraries
import os, os.path
import nltk.data
# using the given path
path = os.path.expanduser('~/nltk_data')
# creating the directory if it does not exist yet
if not os.path.exists(path):
    os.mkdir(path)
# checking
print("Does path exists :", os.path.exists(path))
print("\nDoes path exists in nltk :", path in nltk.data.path)

Output :

Does path exists : True
Does path exists in nltk : True
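
The next example expects a file at corpora/cookbook/word_file.txt inside one of the NLTK data directories. A minimal sketch of creating such a file with the standard library alone could look like this. The helper name create_wordlist and its base_dir parameter are hypothetical, and the single word written is only a placeholder.

```python
import os
import tempfile


def create_wordlist(base_dir, words):
    """Write one word per line to corpora/cookbook/word_file.txt under base_dir.

    create_wordlist and base_dir are illustrative names, not part of NLTK.
    """
    corpus_dir = os.path.join(base_dir, 'corpora', 'cookbook')
    # exist_ok=True makes the call safe when the directory is already there
    os.makedirs(corpus_dir, exist_ok=True)
    file_path = os.path.join(corpus_dir, 'word_file.txt')
    with open(file_path, 'w', encoding='utf-8') as handle:
        for word in words:
            handle.write(word + '\n')
    return file_path


# A temporary directory stands in for ~/nltk_data so nothing real is modified.
demo_dir = tempfile.mkdtemp()
wordlist_path = create_wordlist(demo_dir, ['nltk'])
print("Wordlist created at :", wordlist_path)
```

To place the file where NLTK would look for it, pass os.path.expanduser('~/nltk_data') as base_dir instead of the temporary directory.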

Code #2 : Loading a wordlist file.

# loading libraries
import nltk.data
# loading the wordlist file as raw, unparsed content
nltk.data.load('corpora/cookbook/word_file.txt', format='raw')

Output :


How does all this work?

  • nltk.data.load() recognizes formats such as ‘raw’, ‘pickle’ and ‘yaml’.
  • It guesses the format from the file’s extension if no format is given.
  • As in the code above, the ‘raw’ format has to be specified explicitly.
  • If the file name ends in ‘.yaml’, there is no need to specify the format.
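
The extension-based guessing described above can be illustrated with a small stand-alone sketch. This is not NLTK's own code: guess_format, its fmt parameter, and the KNOWN_FORMATS table are hypothetical names, and the table covers only the two extension-driven formats mentioned in this article.

```python
import os.path

# Illustrative lookup table only; it is not NLTK's internal table.
KNOWN_FORMATS = {'.pickle': 'pickle', '.yaml': 'yaml'}


def guess_format(resource_path, fmt=None):
    """Return the explicit format if given, else guess it from the extension."""
    if fmt is not None:
        return fmt
    extension = os.path.splitext(resource_path)[1].lower()
    # None means "could not guess", so the caller has to name the format
    return KNOWN_FORMATS.get(extension)


print(guess_format('corpora/cookbook/synonyms.yaml'))             # yaml
print(guess_format('corpora/cookbook/word_file.txt'))             # None
print(guess_format('corpora/cookbook/word_file.txt', fmt='raw'))  # raw
```

In this sketch the ‘.txt’ file yields no guess, which mirrors why the format is passed explicitly when loading the wordlist.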

Code #3 : How to load a YAML file

# loading file using the path
nltk.data.load('corpora/cookbook/synonyms.yaml')

Output :

{'bday': 'birthday'}
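
The output above implies a YAML file holding a single key and value pair. A minimal sketch of writing such a file without any YAML library is shown here. The helper write_synonyms and its base_dir parameter are hypothetical names, and the sketch assumes plain string keys and values that need no YAML quoting.

```python
import os
import tempfile


def write_synonyms(base_dir, synonyms):
    """Write simple 'key: value' YAML lines to corpora/cookbook/synonyms.yaml."""
    corpus_dir = os.path.join(base_dir, 'corpora', 'cookbook')
    os.makedirs(corpus_dir, exist_ok=True)
    file_path = os.path.join(corpus_dir, 'synonyms.yaml')
    with open(file_path, 'w', encoding='utf-8') as handle:
        for word, synonym in synonyms.items():
            # Plain scalars only; values needing YAML quoting are not handled
            handle.write('{}: {}\n'.format(word, synonym))
    return file_path


# A temporary directory stands in for ~/nltk_data so nothing real is modified.
demo_dir = tempfile.mkdtemp()
synonyms_path = write_synonyms(demo_dir, {'bday': 'birthday'})
with open(synonyms_path, encoding='utf-8') as handle:
    print(handle.read())
```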
