NLP | Wordlist Corpus

What is a corpus?
A corpus is a collection of text documents. It can be thought of as a set of text files in a directory, often alongside many other directories of text files.
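To make the "text files in a directory" picture concrete, here is a minimal sketch that uses only the Python standard library. The file names (`news.txt`, `reviews.txt`), their contents, and the helper names are hypothetical, invented purely for illustration.

```python
import tempfile
from pathlib import Path


def build_toy_corpus(root):
    """Write two small text files; names and contents are made up."""
    (root / "news.txt").write_text("markets rose today\n", encoding="utf-8")
    (root / "reviews.txt").write_text("great phone\nbad battery\n", encoding="utf-8")


def list_documents(root):
    """Return the sorted names of the .txt files that make up the corpus."""
    return sorted(p.name for p in root.glob("*.txt"))


with tempfile.TemporaryDirectory() as tmp:
    corpus_root = Path(tmp)
    build_toy_corpus(corpus_root)
    documents = list_documents(corpus_root)

print(documents)
```

Each file name plays the role of a document identifier, which is the same idea that corpus readers expose as file IDs.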

How to create a wordlist corpus?

  • WordListCorpusReader is one of the simplest CorpusReader classes.
  • It provides access to files that contain a list of words, one word per line.
  • The wordlist file can be a CSV file or a txt file with one word on each line. Our wordlist file
    contains:
    geeks
    for
    geeks
    welcomes
    you
    to
    nlp
    articles
  • The constructor takes two arguments:
      • the directory path containing the files
      • a list of filenames
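The two arguments can be sketched as follows. This is a minimal, self-contained variant that assumes NLTK is installed; it writes the article's eight words to a temporary directory so it runs on any machine. The temporary directory and the variable names are assumptions made for illustration, not part of the original setup.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# The same eight words used in this article, one per line.
WORDS = ["geeks", "for", "geeks", "welcomes", "you", "to", "nlp", "articles"]

with tempfile.TemporaryDirectory() as corpus_dir:
    # Write the wordlist file into the (temporary) corpus directory.
    with open(os.path.join(corpus_dir, "wordlist.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(WORDS))

    # Argument 1: directory path; argument 2: list of filenames inside it.
    reader = WordListCorpusReader(corpus_dir, ["wordlist.txt"])
    loaded_words = reader.words()
    file_ids = reader.fileids()

print(loaded_words)
print(file_ids)
```

Passing the directory as the first argument and a bare filename as the second keeps the file IDs short and independent of where the directory lives.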

Code #1 : Creating a wordlist corpus

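from nltk.corpus.reader import WordListCorpusReader

# Arguments: the corpus root directory and a list of wordlist filenames
x = WordListCorpusReader('.', ['C:\\Users\\dell\\Desktop\\wordlist.txt'])

# List of words, one per line of the file
print(x.words())

# Identifiers of the files in the corpus
print(x.fileids())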

Output :

['geeks', 'for', 'geeks', 'welcomes', 'you', 'to', 'nlp', 'articles']

['C:\\Users\\dell\\Desktop\\wordlist.txt']

Code #2 : Accessing the raw text

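from nltk.tokenize import line_tokenize

# Raw contents of the wordlist file as a single string
print(repr(x.raw()))

# Split the raw text into one token per line
print("Wordlist : ", line_tokenize(x.raw()))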

Output :

'geeks\r\nfor\r\ngeeks\r\nwelcomes\r\nyou\r\nto\r\nnlp\r\narticles'

Wordlist :  ['geeks', 'for', 'geeks', 'welcomes', 'you', 'to', 'nlp', 'articles']
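The raw text above uses Windows line endings (`\r\n`), yet the tokenized list contains clean words. The following is a minimal standard-library sketch of line-based splitting, similar in spirit to the result shown here. It is not NLTK's implementation, and the helper name `split_lines` is hypothetical.

```python
def split_lines(raw_text):
    """Split text into non-blank lines, accepting \\n and \\r\\n endings."""
    return [line for line in raw_text.splitlines() if line.strip()]


raw = 'geeks\r\nfor\r\ngeeks\r\nwelcomes\r\nyou\r\nto\r\nnlp\r\narticles'
wordlist = split_lines(raw)
print("Wordlist : ", wordlist)
```

Dropping blank lines means a trailing newline or an empty line in the file does not produce an empty "word".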

Code #3 : Accessing the Names wordlist corpus

# Accessing a pre-defined wordlist (requires the NLTK 'names' corpus data)
from nltk.corpus import names

print("Path : ", names.fileids())

print("\nNo. of female names : ", len(names.words('female.txt')))

print("\nNo. of male names : ", len(names.words('male.txt')))

Output :

Path :  ['female.txt', 'male.txt']

No. of female names :  5001

No. of male names :  2943

Code #4 : Accessing the English Words wordlist corpus

# Accessing a pre-defined wordlist (requires the NLTK 'words' corpus data)
from nltk.corpus import words

print("File : ", words.fileids())

print("\nNo. of words in en-basic : ", len(words.words('en-basic')))

print("\nNo. of words in en : ", len(words.words('en')))


Output :

File :  ['en', 'en-basic']

No. of words in en-basic :  850

No. of words in en :  235886

