Processing text using NLP | Basics

Prerequisite: Introduction to NLP

In this article, we discuss how to obtain text from an online text file and extract the required data from it. For the purpose of this article, we will be using the text file available here.

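The following must be available in the current working environment:

  • NLTK library
  • urllib library (part of the Python 3 standard library, so no separate installation is needed)
  • BeautifulSoup library (imported as bs4)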
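
Before going further, it can help to confirm that these modules are importable. The following is a minimal sketch, not part of the original walkthrough; the helper name missing_modules is hypothetical, and it only checks top-level module names.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# "nltk" and "bs4" are the import names used in this article;
# "urllib" is listed only to show that it is already present.
print(missing_modules(["nltk", "bs4", "urllib"]))
```

An empty list means everything is in place; any name that is printed still needs to be installed.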

Step #1: Import the required libraries.

import nltk
from bs4 import BeautifulSoup
from urllib.request import urlopen

Some basic information about the above-mentioned libraries:

  • NLTK library: nltk (Natural Language Toolkit) is a collection of libraries and programs, written in Python, for natural language processing of English text.
  • urllib library: This is the URL handling library for Python. Know more about it here
  • BeautifulSoup library: This is a library used for extracting data out of HTML and XML documents.

Step #2: Extract all the contents of the text file.

Thus, the unprocessed data is loaded into the variable raw.
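
A minimal sketch of this step is shown below, under stated assumptions. The helper name fetch_raw is hypothetical, and the data: URL is only a stand-in for the address of the actual text file, chosen so that the example runs without network access. Note that read() returns bytes, not str.

```python
from urllib.request import urlopen


def fetch_raw(url):
    """Download the resource at `url` and return its contents as bytes."""
    with urlopen(url) as response:
        return response.read()


# Hypothetical stand-in for the article's text file: a data: URL that
# carries its own content, so the sketch runs without network access.
raw = fetch_raw("data:text/plain;charset=utf-8,Hello%20NLP")
print(raw)
```

To use it on a real file, pass the file's http or https address to fetch_raw instead of the data: URL.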

Step #3: Next, we parse the data with BeautifulSoup so that any HTML/XML tags present in the ‘raw’ variable can be separated from the text:

raw1 = BeautifulSoup(raw, "html.parser")

Step #4: Now we extract the text, without tags, from the parsed ‘raw1’ object.

raw2 = raw1.get_text()
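
To see roughly what extracting text from markup involves, here is a small illustration that uses only the standard library's html.parser instead of BeautifulSoup. It is a deliberately simplified stand-in, not how get_text() is implemented; the names TextExtractor and strip_tags and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found outside of tags.
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of `markup` with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


sample = "<p>Hello <b>NLP</b> world</p>"
print(strip_tags(sample))
```

For this sample the tags are dropped and only the sentence "Hello NLP world" remains, which is the kind of result the parse and get_text() steps produce for the downloaded data.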

Step #5: Next, we tokenize the text into words.

token = nltk.word_tokenize(raw2)

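This is done as preprocessing for the next step, where we obtain the final text. Note that nltk.word_tokenize relies on NLTK's Punkt tokenizer data, which may need to be downloaded once with nltk.download() before this call succeeds.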
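
To give a feel for what a token list looks like, the sketch below uses a simple regular expression as a rough stand-in for a word tokenizer. This is not NLTK's algorithm: nltk.word_tokenize treats contractions, abbreviations and other cases differently. The function name rough_tokenize and the sample sentence are hypothetical.

```python
import re


def rough_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(rough_tokenize("Hello, world! This is NLP."))
# ['Hello', ',', 'world', '!', 'This', 'is', 'NLP', '.']
```

The point to notice is that punctuation marks become tokens of their own rather than staying attached to the neighbouring word.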

Step #6: Finally, we join the tokens with single spaces to obtain our final text.

text2 = ' '.join(token)
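
As a small illustration of what the join does, consider a hand-written token list (hypothetical, not taken from the article's text file):

```python
# A hand-written token list, standing in for the output of a tokenizer.
tokens = ["Hello", ",", "world", "!"]

# Joining with a single space puts a space between every pair of tokens,
# including before punctuation marks.
text = " ".join(tokens)
print(text)  # Hello , world !
```

The joined string is therefore a normalized version of the input in which every token is separated by exactly one space, rather than an exact copy of the original text.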

Below is the complete code:

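# import the required libraries
import nltk
from bs4 import BeautifulSoup
from urllib.request import urlopen

# extract all the contents of the text file
# (url must be set to the address of the text file; it is not given here)
raw = urlopen(url).read()

# parse the data so that any html/xml tags can be removed
raw1 = BeautifulSoup(raw, "html.parser")

# obtain the text present in raw1, without tags
raw2 = raw1.get_text()

# tokenize the text into words
token = nltk.word_tokenize(raw2)

# join the tokens back into a single string
text2 = ' '.join(token)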
