Processing text using NLP | Basics
  • Last Updated : 08 Apr, 2019

Prerequisite: Introduction to NLP

In this article, we discuss how to obtain text from an online text file and extract the required data from it. For the purpose of this article, we will be using the text file available here.

The following must be installed in the current working environment:

  • NLTK library
  • urllib library (included in Python's standard library)
  • BeautifulSoup library

Step #1: Import the required libraries.

import nltk
from bs4 import BeautifulSoup
from urllib.request import urlopen

Some basic information about the above-mentioned libraries:

  • NLTK library: NLTK (Natural Language Toolkit) is a suite of libraries and programs for processing natural-language text, written in Python.
  • urllib library: This is a URL-handling module in Python's standard library. Know more about it here.
  • BeautifulSoup library: This is a library used for extracting data out of HTML and XML documents.
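To make the role of these libraries concrete, the tag-stripping that BeautifulSoup's get_text() performs can be sketched with Python's standard-library html.parser module (a simplified illustration only, not how BeautifulSoup is actually implemented):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # called for every run of text found between tags
        self.parts.append(data)

    def text(self):
        return ''.join(self.parts)

extractor = TextExtractor()
extractor.feed("<html><body><p>Hello, <b>NLP</b> world!</p></body></html>")
print(extractor.text())  # Hello, NLP world!
```

BeautifulSoup does the same kind of traversal for us, while also tolerating malformed markup.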

Step #2: Extract all the contents of the text file.

# the URL below is a placeholder for the text file linked above
url = "https://example.com/sample.txt"
raw = urlopen(url).read().decode('utf8')

Thus, the unprocessed data is loaded into the variable raw.

Step #3: Next, we process the data to remove any HTML/XML tags which might be present in the ‘raw’ variable using:

raw1 = BeautifulSoup(raw, 'html.parser')

Step #4: Now we obtain the text present in the ‘raw1’ variable.

raw2 = raw1.get_text()

Step #5: Next, we tokenize the text into words.

nltk.download('punkt')  # tokenizer data required once for word_tokenize
token = nltk.word_tokenize(raw2)

This is done as preprocessing for the next step, where we will obtain final text.
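Conceptually, word_tokenize splits text into word and punctuation tokens. A rough standard-library approximation using a regular expression (this is not NLTK's actual algorithm, which handles contractions and abbreviations far more carefully) looks like:

```python
import re

def simple_tokenize(text):
    # a word is a run of word characters; any other non-whitespace
    # character becomes its own single-character token
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello, NLP world!"))  # ['Hello', ',', 'NLP', 'world', '!']
```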

Step #6: Finally, we obtain our final text.

text2 = ' '.join(token)
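Note that rejoining the tokens with spaces also places a space before each punctuation mark, which normalises the spacing of the text. With a hypothetical token list:

```python
# hypothetical token list, as word_tokenize might produce
token = ['Hello', ',', 'NLP', 'world', '!']
text2 = ' '.join(token)
print(text2)  # Hello , NLP world !
```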

Below is the complete code:




# importing libraries
import nltk
from bs4 import BeautifulSoup
from urllib.request import urlopen

# extract all the contents of the text file
# (placeholder URL -- substitute the link to the text file)
url = "https://example.com/sample.txt"
raw = urlopen(url).read().decode('utf8')

# remove any html/xml tags
raw1 = BeautifulSoup(raw, 'html.parser')

# obtain the text present in 'raw1'
raw2 = raw1.get_text()

# tokenize the text into words
nltk.download('punkt')
token = nltk.word_tokenize(raw2)

# rejoin the tokens into the final text
text2 = ' '.join(token)
