
Amazon Product Reviews Sentiment Analysis in Python

Last Updated : 21 Nov, 2022

Amazon gives small businesses and companies with modest resources a platform to grow. Because of its popularity, people take the time to write detailed reviews about brands and products. Analyzing that data can tell companies a lot about their products and about ways to improve their quality. But such a large amount of data cannot be analyzed by hand.


This is where machine learning comes in: Natural Language Processing (NLP) lets us analyze datasets that are too large to read manually. Our task is to predict whether a given review is positive or negative. A real dataset scraped from the website might include millions of reviews, so we have preprocessed the data for you.

Before starting the code, download the dataset by clicking the link

Steps to be followed

  1. Importing Libraries and Datasets
  2. Preprocessing and cleaning the reviews 
  3. Analysis of the Dataset
  4. Converting text into Vectors
  5. Model training, Evaluation, and Prediction

Let’s start with the code now.

Importing Libraries and Datasets

The libraries used are : 

Python3




import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud


For the NLP part, we will use the NLTK library. From it we need the stopwords corpus and the punkt tokenizer data, so let's download and import them using the commands below.

Python3




import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords


After that, load the downloaded dataset using the code below.

Python3




data = pd.read_csv('AmazonReview.csv')
data.head()


Output :

 

Preprocessing and cleaning the reviews 

Python3




data.info()


Output:

Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Review     24999 non-null  object
 1   Sentiment  25000 non-null  int64 

The output above shows one missing value in the Review column. To drop the null values, run the command below.

Python3




data.dropna(inplace=True)
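
Before dropping rows, it can help to count how many values are missing per column. The sketch below is a minimal illustration on a hypothetical three-row `toy` frame (not the real dataset); it uses the non-inplace form of `dropna` so the original frame stays available for comparison.

```python
import pandas as pd

# Hypothetical toy frame with one missing review
toy = pd.DataFrame({'Review': ['good', None, 'bad'],
                    'Sentiment': [5, 1, 2]})

# Number of missing values in each column
missing = toy.isna().sum()
print(missing)

# dropna() returns a new frame without the rows that contain nulls
cleaned = toy.dropna()
print(len(toy), len(cleaned))
```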


To predict the sentiment as positive (numerical value = 1) or negative (numerical value = 0), we need to map the existing values to those two categories. The condition is: if the sentiment value is less than or equal to 3, it is negative (0); otherwise it is positive (1). For a better understanding, refer to the code below.

Python3




#1,2,3->negative(i.e 0)
data.loc[data['Sentiment']<=3,'Sentiment'] = 0
 
#4,5->positive(i.e 1)
data.loc[data['Sentiment']>3,'Sentiment'] = 1
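
The effect of this mapping can be checked on a tiny hand-made frame. The sketch below assumes star ratings from 1 to 5 and uses a hypothetical `toy` DataFrame with an extra `Label` column (both are illustrative names, not part of the dataset). It computes the label in one step with a boolean comparison, which should give the same result as the two `.loc` assignments.

```python
import pandas as pd

# Hypothetical toy data: star ratings from 1 to 5
toy = pd.DataFrame({'Sentiment': [1, 2, 3, 4, 5]})

# Ratings above 3 become 1 (positive); everything else becomes 0 (negative)
toy['Label'] = (toy['Sentiment'] > 3).astype(int)

print(toy['Label'].tolist())
```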


Now that the dataset is ready, we will clean the Review column by removing the stopwords. The code for that is given below.

Python3




# A set gives fast membership tests
stp_words = set(stopwords.words('english'))

def clean_review(review):
    # Keep only the words that are not English stopwords
    return " ".join(word for word in review.split()
                    if word not in stp_words)

data['Review'] = data['Review'].apply(clean_review)
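
One detail worth knowing: NLTK's English stopword list is all lowercase, and the comparison above is case-sensitive, so a capitalized "The" or "This" is not removed. The sketch below illustrates this with a hypothetical mini stopword set (`toy_stop_words`, standing in for the NLTK list so that no download is needed) and shows a possible variant that lowercases the review first. Whether lowercasing helps on this dataset is an assumption to verify, not something the article tests.

```python
# Hypothetical mini stopword set, standing in for stopwords.words('english')
toy_stop_words = {'the', 'is', 'a', 'this', 'and'}

def clean_review_demo(review):
    # Same logic as clean_review: drop tokens found in the stopword set
    return " ".join(w for w in review.split() if w not in toy_stop_words)

def clean_review_lower(review):
    # Variant: lowercase first so capitalized stopwords are matched too
    return " ".join(w for w in review.lower().split()
                    if w not in toy_stop_words)

sample = "This phone is great and The battery lasts"
case_sensitive = clean_review_demo(sample)
lowercased = clean_review_lower(sample)
print(case_sensitive)
print(lowercased)
```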


Once preprocessing is done, let's look at the top 5 rows of the improved dataset.

Python3




data.head()


Output :

 

Analysis of the Dataset

Let’s check how many reviews there are for the positive and negative sentiments.

Python3




data['Sentiment'].value_counts()


Output : 

0    15000
1     9999
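
The counts are easier to interpret as proportions. The sketch below simply re-enters the two counts from the output above into a hypothetical `counts` Series and divides by the total; on the real frame, `data['Sentiment'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the output above (0 = negative, 1 = positive)
counts = pd.Series({0: 15000, 1: 9999})

# Share of each class in the whole dataset
shares = counts / counts.sum()
print(shares.round(3))
```

With roughly 60% negative reviews, a model that always predicts the negative class would already be right about 60% of the time, which is a useful baseline when reading an accuracy score.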

To get a better picture of which words matter, let’s create a word cloud of all the words in reviews with sentiment = 0, i.e. negative.

Python3




consolidated=' '.join(word for word in data['Review'][data['Sentiment']==0].astype(str))
wordCloud=WordCloud(width=1600,height=800,random_state=21,max_font_size=110)
plt.figure(figsize=(15,10))
plt.imshow(wordCloud.generate(consolidated),interpolation='bilinear')
plt.axis('off')
plt.show()


Output :

WordCloud

 

Let’s do the same for all the words with sentiment = 1, i.e. positive.

Python3




consolidated=' '.join(word for word in data['Review'][data['Sentiment']==1].astype(str))
wordCloud=WordCloud(width=1600,height=800,random_state=21,max_font_size=110)
plt.figure(figsize=(15,10))
plt.imshow(wordCloud.generate(consolidated),interpolation='bilinear')
plt.axis('off')
plt.show()


Output :

 

Now we have a clear picture of the words we have in both the categories.
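
A word cloud is visual only. As a numeric complement, the most frequent words per class can be counted with `collections.Counter`. The sketch below uses a hypothetical four-row `toy` frame and a hypothetical helper `top_words`; the column names match the dataset, everything else is illustrative.

```python
from collections import Counter
import pandas as pd

# Hypothetical toy reviews with the same two columns as the dataset
toy = pd.DataFrame({
    'Review': ['great phone great battery', 'bad screen',
               'great price', 'bad bad charger'],
    'Sentiment': [1, 0, 1, 0],
})

def top_words(frame, label, n=3):
    # Join all reviews of one class and count whitespace-separated tokens
    text = ' '.join(frame.loc[frame['Sentiment'] == label, 'Review'].astype(str))
    return Counter(text.split()).most_common(n)

print(top_words(toy, 1))
print(top_words(toy, 0))
```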

Let’s create the vectors.

Converting text into Vectors

TF-IDF (term frequency-inverse document frequency) measures how relevant a word is to a document within a corpus. A word's weight increases in proportion to the number of times it appears in the document, but is offset by how many documents in the corpus (dataset) contain that word. We will implement this with the code below.

Python3




cv = TfidfVectorizer(max_features=2500)
X = cv.fit_transform(data['Review'] ).toarray()
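
To see the weighting at work, here is a minimal sketch on a hypothetical three-review corpus (`docs`), using `TfidfVectorizer` with its default settings. The word "good" appears in every review, so it should receive a lower weight than "phone", which appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-review corpus
docs = ['good phone', 'good battery', 'good screen']

vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()

# vocabulary_ maps each word to its column index in the matrix
idx = vec.vocabulary_

# Weights of two words in the first review, 'good phone'
w_good = weights[0, idx['good']]
w_phone = weights[0, idx['phone']]
print(round(w_good, 3), round(w_phone, 3))
```

The `max_features=2500` setting used in the article keeps only the 2500 most frequent terms, which limits the width of the dense array produced by `.toarray()`.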


Model training, Evaluation, and Prediction

Once analysis and vectorization are done, we can explore any machine learning model to train on the data. But before that, perform the train-test split.

Python3




from sklearn.model_selection import train_test_split
x_train ,x_test,y_train,y_test=train_test_split(X,data['Sentiment'],
                                                test_size=0.25 ,
                                                random_state=42)
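
Because the two classes are not balanced, one option (an addition here, not something the article's split does) is to pass `stratify` so that the train and test sets keep the same class ratio. The sketch below uses hypothetical arrays `X_toy` and `y_toy` with a 60/40 class mix to show the resulting shapes and the share of positives in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and labels with a 60/40 class imbalance
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)

print(X_tr.shape, X_te.shape)
print(y_te.mean())
```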


Now we can train any model. Let’s explore Logistic Regression.

Python3




from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()

# Model fitting
model.fit(x_train, y_train)

# Testing the model
pred = model.predict(x_test)

# Model accuracy
print(accuracy_score(y_test, pred))


Output :

0.81632
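
To predict the sentiment of a new, unseen review, the text has to go through the same fitted vectorizer before it reaches the model. The sketch below is a self-contained illustration on hypothetical toy data (`reviews`, `labels`, `new_reviews`). It uses scikit-learn's `make_pipeline` to bundle the vectorizer and the classifier, which is a convenience chosen here and not the article's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (1 = positive, 0 = negative)
reviews = ['great phone love it', 'excellent battery great value',
           'love the screen', 'terrible phone broke fast',
           'awful battery waste of money', 'broke after a week terrible']
labels = [1, 1, 1, 0, 0, 0]

# The pipeline applies the fitted vectorizer before the classifier,
# so raw strings can be passed straight to predict()
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reviews, labels)

new_reviews = ['great value love it', 'terrible waste of money']
preds = clf.predict(new_reviews)
print(preds)
```

With the article's own objects, the equivalent call would be `model.predict(cv.transform([clean_review(text)]).toarray())`, where `text` is the new review string.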

Let’s see the confusion matrix for the results.

Python3




from sklearn import metrics

# confusion_matrix lives in sklearn.metrics, so call it through the module
cm = metrics.confusion_matrix(y_test, pred)

cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                            display_labels=[False, True])

cm_display.plot()
plt.show()


Output : 
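
The four cells of a binary confusion matrix can also be read out as numbers and turned into precision and recall. The sketch below uses hypothetical labels (`y_true`, `y_hat`) to show how the cells map to those metrics; it is not computed from the article's model.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = [0, 0, 0, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes;
# ravel() flattens the 2x2 matrix into tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)

precision = precision_score(y_true, y_hat)  # tp / (tp + fp)
recall = recall_score(y_true, y_hat)        # tp / (tp + fn)
print(precision, recall)
```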

 


