NLP | Wordlist Corpus
Last Updated :
20 Feb, 2019
What is a corpus?
A corpus is a collection of text documents. It can be thought of as a set of text files in a directory, often alongside many other directories of text files.
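To make the "text files in a directory" idea concrete, here is a minimal sketch that uses only the Python standard library. The file names `doc1.txt` and `doc2.txt` and the helper functions are hypothetical, chosen for illustration; this is not how NLTK stores its corpora internally.

```python
import tempfile
from pathlib import Path


def build_toy_corpus(root):
    """Write two tiny text files under root (hypothetical file names)."""
    (root / "doc1.txt").write_text("geeks for geeks\n", encoding="utf-8")
    (root / "doc2.txt").write_text("welcomes you to nlp articles\n", encoding="utf-8")


def load_corpus(root):
    """Map each .txt file name under root to its text content."""
    return {p.name: p.read_text(encoding="utf-8") for p in sorted(root.glob("*.txt"))}


# Build the toy corpus in a temporary directory and read it back
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_toy_corpus(root)
    corpus = load_corpus(root)

# The file names play the same role as the file IDs of an NLTK corpus reader
print(sorted(corpus))
```

Each file name here acts like a file ID: it identifies one document inside the corpus directory.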
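How to create a wordlist corpus?
A wordlist corpus is a plain text file that contains one word per line. NLTK reads it with the WordListCorpusReader class, which takes a root directory and a list of file names.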
Code #1 : Creating a wordlist corpus
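from nltk.corpus.reader import WordListCorpusReader
# The file is given by its absolute path; adjust it to your own wordlist file
x = WordListCorpusReader('.', ['C:\\Users\\dell\\Desktop\\wordlist.txt'])
# Words in the file, one per line
print(x.words())
# File IDs the reader was created with
print(x.fileids())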
Output :
['geeks', 'for', 'geeks', 'welcomes', 'you', 'to', 'nlp', 'articles']
['C:\\Users\\dell\\Desktop\\wordlist.txt']
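The example above depends on a file that exists only on the author's machine. As a portable variant, the sketch below first writes the wordlist into a temporary directory and then passes that directory as the root with a relative file ID. The file name `wordlist.txt` and the variable names are assumptions for illustration; NLTK must be installed, but no corpus download is needed.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

sample_words = ["geeks", "for", "geeks", "welcomes", "you", "to", "nlp", "articles"]

with tempfile.TemporaryDirectory() as root:
    # Write one word per line into a hypothetical wordlist file
    path = os.path.join(root, "wordlist.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sample_words))

    # The root is the directory; the file ID is relative to that root
    reader = WordListCorpusReader(root, ["wordlist.txt"])
    words = reader.words()
    file_ids = reader.fileids()

print(words)
print(file_ids)
```

Passing a real root directory keeps the file IDs short and lets the same code run on any operating system.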
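Code #2 : Accessing the raw text
from nltk.tokenize import line_tokenize
# Raw file contents as a single string (displayed when run in an interactive session)
x.raw()
# Split the raw text into one token per line
print("Wordlist :", line_tokenize(x.raw()))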
Output :
'geeks\r\nfor\r\ngeeks\r\nwelcomes\r\nyou\r\nto\r\nnlp\r\narticles'
Wordlist : ['geeks', 'for', 'geeks', 'welcomes', 'you', 'to', 'nlp', 'articles']
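The raw string above uses Windows line endings (`\r\n`), yet the tokenized list contains clean words. The sketch below approximates that step with the standard library only: it splits on any line ending and drops blank lines. The function name `split_wordlist` is hypothetical, and this is an approximation of the behaviour, not NLTK's own implementation.

```python
def split_wordlist(raw):
    """Split raw text into lines, discarding lines that are blank."""
    # str.splitlines() handles '\n', '\r\n' and '\r' line endings alike
    return [line for line in raw.splitlines() if line.strip()]


raw = 'geeks\r\nfor\r\ngeeks\r\nwelcomes\r\nyou\r\nto\r\nnlp\r\narticles'
tokens = split_wordlist(raw)
print(tokens)
```

Because the split is done per line, a multi-word line such as `new york` would stay a single entry, which is what you want in a wordlist.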
Code #3 : Accessing the Names wordlist corpus
from nltk.corpus import names
# Requires the corpus data: run nltk.download('names') once beforehand
print("Path :", names.fileids())
# Each file holds one name per line
print("No. of female names :", len(names.words('female.txt')))
print("No. of male names :", len(names.words('male.txt')))
Output :
Path : ['female.txt', 'male.txt']
No. of female names : 5001
No. of male names : 2943
Code #4 : Accessing the English wordlist corpus
from nltk.corpus import words
# Requires the corpus data: run nltk.download('words') once beforehand
print("File :", words.fileids())
# 'en-basic' is the short basic-English list; 'en' is the full English wordlist
print("No. of words in en-basic :", len(words.words('en-basic')))
print("No. of words in en :", len(words.words('en')))
Output :
File : ['en', 'en-basic']
No. of words in en-basic : 850
No. of words in en : 235886