NLP | Custom corpus

What is a corpus?
A corpus is a collection of text documents. It can be thought of as a set of text files in a directory, often alongside many other directories of text files.

How is it done?
NLTK defines a list of data paths (directories) in nltk.data.path. A custom corpus must be placed inside one of these paths so that NLTK can find it.
We can also create a custom nltk_data directory in our home directory and verify that it appears in the list of known paths in nltk.data.path.

Code #1 : Creating a custom directory and verifying it.




# importing libraries
import os

# using the given path
path = os.path.expanduser('~/nltk_data')

# creating the directory if it does not exist yet
if not os.path.exists(path):
    os.mkdir(path)

print("Does path exist :", os.path.exists(path))

import nltk.data

# checking that NLTK knows about the directory
print("Does path exist in nltk :", path in nltk.data.path)


Output :

Does path exist : True
Does path exist in nltk : True

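Code #2 : Loading a wordlist file. The file corpora/cookbook/word_file.txt is assumed to already exist inside the nltk_data directory and to contain the single word nltk.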


# loading libraries
import nltk.data

# 'raw' returns the file contents as bytes
nltk.data.load('corpora/cookbook/word_file.txt', format='raw')



Output :

b'nltk\n'
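The load call above only works if the wordlist file is already in place. As a minimal sketch of that preparation step, the helper below writes one word per line to corpora/cookbook/word_file.txt under a given base directory. The name create_wordlist and its parameters are hypothetical and are not part of NLTK; the directory layout is taken from the resource path used in Code #2.

```python
import os


def create_wordlist(base_dir, words, name="word_file.txt"):
    """Write one word per line to <base_dir>/corpora/cookbook/<name>.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    corpus_dir = os.path.join(base_dir, "corpora", "cookbook")
    # Create corpora/cookbook (and any missing parents) if needed.
    os.makedirs(corpus_dir, exist_ok=True)

    file_path = os.path.join(corpus_dir, name)
    # newline="\n" keeps the line endings identical on every platform.
    with open(file_path, "w", encoding="utf-8", newline="\n") as handle:
        for word in words:
            handle.write(word + "\n")
    return file_path
```

Calling create_wordlist(os.path.expanduser('~/nltk_data'), ['nltk']) would place the file where Code #2 expects to find it.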

How does all this work?

  • nltk.data.load() recognizes several formats, including ‘raw’, ‘pickle’ and ‘yaml’.
  • If no format is given, it guesses the format from the file’s extension.
  • In Code #2, the ‘raw’ format is specified explicitly, so the file contents come back as bytes.
  • If the file name ends in ‘.yaml’, the format does not need to be specified.
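The extension-based guessing described above can be pictured with a small stand-alone sketch. This is a simplified illustration, not NLTK's actual implementation: guess_format and EXTENSION_FORMATS are hypothetical names, the table is only a subset, and the mapping of ‘.txt’ to a ‘text’ format is an assumption about recent NLTK versions.

```python
import os

# Illustrative subset of extension-to-format rules (assumption, not NLTK's table).
EXTENSION_FORMATS = {"pickle": "pickle", "yaml": "yaml", "txt": "text"}


def guess_format(resource_name, format="auto"):
    """Return the format to use for a resource name.

    An explicitly given format always wins; otherwise the file
    extension decides. Hypothetical helper, not part of NLTK.
    """
    if format != "auto":
        return format

    extension = os.path.splitext(resource_name)[1].lstrip(".").lower()
    if extension not in EXTENSION_FORMATS:
        raise ValueError(
            "Could not determine format for %r; pass format explicitly"
            % resource_name
        )
    return EXTENSION_FORMATS[extension]
```

Under these assumptions, guess_format('corpora/cookbook/synonyms.yaml') yields 'yaml' without any format argument, while passing format='raw' for the wordlist bypasses the guess entirely.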

Code #3 : How to load a YAML file


import nltk.data
  
# loading file using the path
nltk.data.load('corpora/cookbook/synonyms.yaml')



Output :

{'bday': 'birthday'}

