NLP | Wordlist Corpus

  • Last Updated : 20 Feb, 2019

What is a corpus?
A corpus is a collection of text documents. It can be thought of as a set of text files in a directory, often alongside many other directories of text files.

How to create a wordlist corpus?

  • WordListCorpusReader is one of the simplest CorpusReader classes.
  • It provides access to files that contain a list of words, one word per line.
  • The wordlist file can be a CSV file or a txt file, as long as it has one word on each line. Our wordlist file
    contains:
    geeks
    for
    geeks
    welcomes
    you
    to
    nlp
    articles
  • The constructor takes two arguments:
      • the path of the directory containing the files
      • a list of filenames
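The two arguments can be tried out without touching any real directory. The following is a minimal sketch, assuming NLTK is installed; the temporary directory and the file name wordlist.txt are illustrative choices, not part of the original setup.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Sample words from the article, written one per line.
WORDS = ["geeks", "for", "geeks", "welcomes", "you", "to", "nlp", "articles"]

with tempfile.TemporaryDirectory() as root:
    # Hypothetical file name, used only for this sketch.
    with open(os.path.join(root, "wordlist.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(WORDS))

    # First argument: the directory; second: file names relative to it.
    reader = WordListCorpusReader(root, ["wordlist.txt"])
    file_ids = reader.fileids()
    words = reader.words()

print(file_ids)
print(words)
```

Passing the directory as the first argument keeps the file IDs short (just the file names), which makes later calls such as words('wordlist.txt') easier to read.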

Code #1 : Creating a wordlist corpus




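from nltk.corpus.reader import WordListCorpusReader

# First argument: root directory; second: list of file names (file IDs)
x = WordListCorpusReader('.', ['C:\\Users\\dell\\Desktop\\wordlist.txt'])

# Words in the corpus, one per line of the file
print(x.words())

# File IDs the reader was created with
print(x.fileids())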

Output :

['geeks', 'for', 'geeks', 'welcomes', 'you', 'to', 'nlp', 'articles']

['C:\\Users\\dell\\Desktop\\wordlist.txt']

Code #2 : Accessing the raw text




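# raw() returns the unprocessed file contents as a single string;
# repr() makes the line endings visible
print(repr(x.raw()))

# line_tokenize splits the raw string into one token per line
from nltk.tokenize import line_tokenize
print("Wordlist : ", line_tokenize(x.raw()))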

Output :

'geeks\r\nfor\r\ngeeks\r\nwelcomes\r\nyou\r\nto\r\nnlp\r\narticles'

Wordlist :  ['geeks', 'for', 'geeks', 'welcomes', 'you', 'to', 'nlp', 'articles']
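The raw string above still carries Windows line endings, while the tokenized list does not. A small sketch of that behaviour, assuming NLTK is installed; the sample string is made up for illustration and includes a blank line to show that line_tokenize discards blank lines by default.

```python
from nltk.tokenize import line_tokenize

# Illustrative raw text: mixed line endings and one blank line.
raw = "geeks\r\nfor\r\n\r\ngeeks\n"

# Splits on line boundaries and drops blank lines (the default behaviour).
tokens = line_tokenize(raw)
print(tokens)
```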

Code #3 : Accessing the Names wordlist corpus




# Accessing a pre-defined wordlist
from nltk.corpus import names

# fileids() lists the files in the corpus, not their paths
print("Files : ", names.fileids())

print("\nNo. of female names : ", len(names.words('female.txt')))

print("\nNo. of male names : ", len(names.words('male.txt')))

Output :

Files :  ['female.txt', 'male.txt']

No. of female names :  5001

No. of male names :  2943

Code #4 : Accessing the English Words wordlist corpus




# Accessing a pre-defined wordlist
from nltk.corpus import words

print("File : ", words.fileids())

# 'en-basic' is the short Basic English list; 'en' is the full English list
print("\nNo. of words in en-basic : ", len(words.words('en-basic')))

print("\nNo. of words in en : ", len(words.words('en')))

Output :

File :  ['en', 'en-basic']

No. of words in en-basic :  850

No. of words in en :  235886
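The per-file counting in Code #3 and Code #4 follows one pattern: call fileids(), then words(fileid) for each file. A sketch of that pattern as a helper, assuming NLTK is installed; the function words_per_file and the two sample files are hypothetical, created in a temporary directory so that no corpus download is needed.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader


def words_per_file(corpus):
    """Map each file ID of a wordlist corpus to its number of words."""
    return {fid: len(corpus.words(fid)) for fid in corpus.fileids()}


# Hypothetical sample wordlists, used only for this sketch.
samples = {
    "colors.txt": ["red", "green"],
    "fruits.txt": ["apple", "banana", "cherry"],
}

with tempfile.TemporaryDirectory() as root:
    for name, items in samples.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write("\n".join(items))

    corpus = WordListCorpusReader(root, sorted(samples))
    counts = words_per_file(corpus)

print(counts)
```

The same helper should also accept the names and words corpora used above, provided their data has been downloaded.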