Processing text using NLP | Basics
In this article, we will walk through the steps used to preprocess text data before it is used to train a machine learning model.
Importing Libraries
The following must be installed in the current working environment:
- NLTK library: The Natural Language Toolkit (NLTK) is a collection of Python libraries and programs for processing natural language text, primarily English.
- urllib library: A URL-handling library included in Python's standard library.
- BeautifulSoup library: A library for extracting data out of HTML and XML documents.
Python3
import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from urllib.request import urlopen

# Download the NLTK data used later for tokenization and stopword removal
nltk.download('punkt')
nltk.download('stopwords')
Once all the libraries are imported, we need to extract the text to process. The text can be a Python string or the contents of a file.
Extracting Data
For this article, we use web scraping to read a webpage, then call the get_text() function to convert the parsed HTML to a string.
Python3
# Fetch the page to scrape (the URL below is just an example placeholder)
url = 'https://en.wikipedia.org/wiki/Natural_language_processing'
raw = urlopen(url).read()

# Parse the HTML and extract the visible text as a string
raw1 = BeautifulSoup(raw, 'html.parser')
raw2 = raw1.get_text()
raw2
Data Preprocessing
Once the data extraction is done, the data is now ready to process. For that follow these steps :
1. Removing punctuation and numbers
Python3
def punc(raw2):
    # Replace every character that is not a letter with a space
    raw2 = re.sub('[^a-zA-Z]', ' ', raw2)
    return raw2
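For example, applying the same substitution to a made-up sample sentence shows punctuation and digits being replaced by spaces:

```python
import re

# Same pattern as punc(): any character that is not a letter becomes a space
cleaned = re.sub('[^a-zA-Z]', ' ', 'Hello, World! It is 2022.')
print(cleaned.split())  # ['Hello', 'World', 'It', 'is']
```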
2. Creating Tokens
Python3
def token(raw2):
    # Split the cleaned text into word tokens
    tokens = nltk.word_tokenize(raw2)
    return tokens
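Note that nltk.word_tokenize relies on the punkt tokenizer models, which must be downloaded first with nltk.download('punkt'). As a sketch of what tokenization produces, NLTK's TreebankWordTokenizer (which needs no extra downloads) splits words and punctuation into separate tokens; the sample sentence below is made up:

```python
from nltk.tokenize import TreebankWordTokenizer

# Tokenize a short sample sentence; punctuation becomes separate tokens
tokens = TreebankWordTokenizer().tokenize('Hello, World! It is raining.')
print(tokens)
```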
3. Removing Stopwords
Python3
def remove_(tokens):
    # Lowercase each token and drop English stopwords
    stop_words = set(stopwords.words('english'))
    final = [word.lower() for word in tokens
             if word.lower() not in stop_words]
    return final
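To see the effect of this filtering without downloading the NLTK stopword corpus, here is the same logic run against a small hand-picked stopword set (a stand-in for stopwords.words('english')) and a made-up token list:

```python
# A tiny stand-in for NLTK's English stopword list
stop_words = {'the', 'is', 'in', 'a', 'of'}

tokens = ['The', 'cat', 'is', 'in', 'the', 'garden']
final = [word.lower() for word in tokens if word.lower() not in stop_words]
print(final)  # ['cat', 'garden']
```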
4. Lemmatization
Python3
from textblob import TextBlob

def lemma(final):
    # Join the tokens into one string, lemmatize each word with
    # TextBlob, and return the list of lemmatized tokens
    str1 = ' '.join(final)
    s = TextBlob(str1)
    final = [w.lemmatize() for w in s.words]
    return final
5. Joining the final tokens
Python3
def join_(final):
    # Reassemble the processed tokens into a single string
    review = ' '.join(final)
    return review
To execute the above functions, run the following code:
Python3
raw2 = punc(raw2)
tokens = token(raw2)
final = remove_(tokens)
final = lemma(final)
ans = join_(final)
ans
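To see the whole pipeline end to end without fetching a live webpage, the same steps can be sketched on an in-line sample string, using a tiny hand-picked stopword list and a whitespace split in place of the NLTK resources:

```python
import re

text = 'The 3 cats are sitting in the garden!'

# 1. Remove punctuation and numbers
text = re.sub('[^a-zA-Z]', ' ', text)

# 2. Tokenize (simple whitespace split instead of nltk.word_tokenize)
tokens = text.split()

# 3. Lowercase and remove stopwords (tiny stand-in list)
stop_words = {'the', 'are', 'in', 'a', 'is'}
final = [word.lower() for word in tokens if word.lower() not in stop_words]

# 4. Lemmatization is skipped here; with TextBlob, 'cats' would become 'cat'

# 5. Join the tokens back into one string
ans = ' '.join(final)
print(ans)  # cats sitting garden
```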
Last Updated: 22 Sep, 2022