NLP | Wordlist Corpus

  • Last Updated : 20 Feb, 2019

What is a corpus?
A corpus can be defined as a collection of text documents. It can be thought of as a set of text files in a directory, often alongside many other directories of text files.

How to create a wordlist corpus?


  • WordListCorpusReader – It is one of the simplest CorpusReader classes.
  • This class provides access to files that contain a list of words, one word per line.
  • A wordlist file can be a CSV file or a txt file with one word on each line. The wordlist file used here contains the words shown in the output of Code #1.
  • The constructor takes two arguments:
      • the path of the directory containing the files
      • a list of filenames

Code #1 : Creating a wordlist corpus

from nltk.corpus.reader import WordListCorpusReader
# First argument: directory containing the wordlist; second: list of filenames
x = WordListCorpusReader('C:\\Users\\dell\\Desktop', ['wordlist.txt'])
print(x.words())

Output :

['geeks', 'for', 'geeks', 'welcomes', 'you', 'to', 'nlp', 'articles']
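
Code #1 relies on a machine-specific Windows path. The following is a minimal, portable sketch of the same idea, assuming NLTK is installed: it writes a sample wordlist to a temporary directory and reads it back. The names sample_words, root, and reader are illustrative choices, and the file contents are assumed to match the output shown above.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Assumed file contents: one word per line, matching the output of Code #1.
sample_words = ["geeks", "for", "geeks", "welcomes", "you", "to", "nlp", "articles"]

with tempfile.TemporaryDirectory() as root:
    # Write the wordlist file into the temporary directory.
    with open(os.path.join(root, "wordlist.txt"), "w", encoding="utf8") as fh:
        fh.write("\n".join(sample_words) + "\n")

    # root is the directory path; the second argument is the list of filenames.
    reader = WordListCorpusReader(root, ["wordlist.txt"])
    file_ids = reader.fileids()
    loaded_words = reader.words()  # read while the directory still exists

print(file_ids)
print(loaded_words)
```

Passing the directory and the filename separately keeps the file IDs short, so the same filename can later be given to words() to select a single file.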


Code #2 : Accessing the raw text

from nltk.tokenize import line_tokenize
# raw() returns the file contents as one string; line_tokenize splits it into lines
print("Wordlist : ", line_tokenize(x.raw()))

Output :

Wordlist :  ['geeks', 'for', 'geeks', 'welcomes', 'you', 'to', 'nlp', 'articles']
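
To make the role of line_tokenize concrete, here is a rough plain-Python approximation of what it does to the raw text: split on line breaks and drop blank lines. This is a sketch for illustration, not NLTK's implementation; split_wordlist and the sample string are hypothetical.

```python
def split_wordlist(raw_text):
    """Split raw text into one entry per line, dropping blank lines."""
    return [line for line in raw_text.splitlines() if line.strip()]


# Hypothetical raw contents of a wordlist file, including one blank line.
raw_text = "geeks\nfor\ngeeks\n\nwelcomes\nyou\n"
words_from_raw = split_wordlist(raw_text)
print("Wordlist : ", words_from_raw)
```

The blank line in the sample is discarded, so only the five words remain in the result.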

Code #3 : Accessing the Names wordlist corpus

# Accessing pre-defined wordlist
# Requires the corpus data: nltk.download('names')
from nltk.corpus import names
print("Path : ", names.fileids())
print("\nNo. of female names : ", len(names.words('female.txt')))
print("\nNo. of male names : ", len(names.words('male.txt')))

Output :

Path :  ['female.txt', 'male.txt']

No. of female names :  5001

No. of male names :  2943

Code #4 : Accessing the English wordlist corpus

# Accessing pre-defined wordlist
# Requires the corpus data: nltk.download('words')
from nltk.corpus import words
print("File : ", words.fileids())
print("\nNo. of words in en-basic : ", len(words.words('en-basic')))
print("\nNo. of words in en : ", len(words.words('en')))

Output :

File :  ['en', 'en-basic']

No. of words in en-basic :  850

No. of words in en :  235886
