Prerequisite: Introduction to NLP
In this article, we are going to discuss how we can obtain text from online text files and extract the required data from them. For the purpose of this article, we will be using the text file available here.
The following must be available in the current working environment:
- NLTK library (installable with pip install nltk)
- urllib library (part of Python's standard library, so no separate installation is required)
- BeautifulSoup library (installable with pip install beautifulsoup4)
Step #1: Import the required libraries.

```python
import nltk
from bs4 import BeautifulSoup
from urllib.request import urlopen
```
Some basic information about the above-mentioned libraries:
- NLTK library: A collection of libraries and programs for processing natural language text, written in the Python programming language.
- urllib library: A URL handling library for Python. Know more about it here.
- BeautifulSoup library: A library used for extracting data out of HTML and XML documents.
Step #2: Extract all the contents of the online text file, as sketched below.
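The article does not reproduce the exact address of the text file, so the snippet below is a minimal sketch: the URL is a placeholder and should be replaced with the address of the file you actually want to process. The file is fetched with urlopen and its bytes are decoded into a string.

```python
# Minimal sketch: fetch the online text file and load its raw contents.
# NOTE: the URL below is a placeholder; substitute the address of the
# text file you actually want to process.
url = "https://www.example.com/sample.txt"

# urlopen returns a file-like response object; read() gives raw bytes,
# which we decode (assuming UTF-8) into a Python string.
raw = urlopen(url).read().decode("utf-8")
```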
Thus, the unprocessed data is loaded into the variable raw.
Step #3: Next, we parse the data with BeautifulSoup to remove any HTML/XML tags that might be present in the ‘raw’ variable:

```python
# Parse the raw string; naming the parser explicitly avoids the
# "no parser was explicitly specified" warning from BeautifulSoup.
raw1 = BeautifulSoup(raw, 'html.parser')
```
Step #4: Now we obtain the plain text from the parsed ‘raw1’ object.

```python
# get_text() strips the markup and returns only the human-readable text.
raw2 = raw1.get_text()
```
Output: the plain text of the document, with all HTML/XML markup removed.
Step #5: Next, we tokenize the text into words.

```python
# Split the text into individual word and punctuation tokens.
token = nltk.word_tokenize(raw2)
```
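Note that nltk.word_tokenize relies on NLTK's ‘punkt’ tokenizer models, which are not bundled with the library itself. If they are not already present in your environment, a one-time download is needed; a minimal sketch:

```python
# One-time download of the tokenizer models used by nltk.word_tokenize.
# (Newer NLTK releases may also require the 'punkt_tab' resource.)
nltk.download('punkt')
```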
Output: a Python list of the individual word and punctuation tokens.
This tokenization is done as preprocessing for the next step, where we will obtain the final text.
Step #6: Finally, we obtain our final text by joining the tokens back together with spaces.

```python
# Re-join the tokens into a single cleaned-up string, separated by spaces.
text2 = ' '.join(token)
```
Output: a single string in which the tokens are separated by single spaces.
Below is the complete code (using the same placeholder URL as in Step #2, to be replaced with the address of the file you want to process):

```python
# importing libraries
import nltk
from bs4 import BeautifulSoup
from urllib.request import urlopen

# extract all the contents of the text file
# NOTE: placeholder URL; replace it with the actual text file's address
url = "https://www.example.com/sample.txt"
raw = urlopen(url).read().decode("utf-8")

# remove any HTML/XML tags
raw1 = BeautifulSoup(raw, 'html.parser')

# obtain the text present in 'raw1'
raw2 = raw1.get_text()

# tokenize the text into words
token = nltk.word_tokenize(raw2)

# join the tokens back into the final text
text2 = ' '.join(token)
```
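Once the pipeline has run, a quick way to sanity-check the result is to look at the number of tokens and the beginning of the cleaned text; a small illustrative snippet:

```python
# Quick inspection of the processed text.
print("Number of tokens:", len(token))  # total word/punctuation tokens
print(text2[:200])                       # first 200 characters of the final text
```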