Amazon Product Reviews Sentiment Analysis in Python

Last Updated : 21 Nov, 2022

Amazon gives small businesses and companies with modest resources a platform to grow. Because of its popularity, customers take the time to write detailed reviews of brands and products. Analyzing that data can tell companies a lot about their products and about how to improve their quality. But such a large amount of data cannot be analyzed by hand.


This is where machine learning comes in, specifically Natural Language Processing (NLP), which makes it possible to analyze large datasets. Our task is to predict whether a given review is positive or negative. The real dataset, after scraping the website, might include millions of reviews, so we have preprocessed the data for you.

Before starting the code, download the dataset by clicking the link.

Steps to be followed

Importing Libraries and Datasets
Preprocessing and cleaning the reviews
Analysis of the Dataset
Converting text into Vectors
Model training, Evaluation, and Prediction

Let’s start with the code now.

Importing Libraries and Datasets

The libraries used are :

Pandas : For importing the dataset.
Scikit-learn : For importing the model, accuracy module, and TfidfVectorizer.
warnings : To suppress all warning messages.
Matplotlib : To plot the visualizations. WordCloud is also used for that.

Python3

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud

For the NLP part, we will use the NLTK library. From it we need the stopwords corpus and punkt, so let’s download and import them using the commands below.

Python3

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords

After that, import the downloaded dataset using the code below.

Python3

data = pd.read_csv('AmazonReview.csv')
data.head()

Output :

Preprocessing and cleaning the reviews

Python3

data.info()

Output:

Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Review     24999 non-null  object
 1   Sentiment  25000 non-null  int64

The Review column has one null value (24999 non-null out of 25000 rows). To drop the null values, run the command below.

Python3

data.dropna(inplace=True)

To predict the sentiment as positive (numerical value = 1) or negative (numerical value = 0), we need to convert the values in the Sentiment column, which are ratings from 1 to 5, into those two categories. The rule is: if the sentiment value is less than or equal to 3, it is negative (0); otherwise it is positive (1). For a better understanding, refer to the code below.

Python3

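# 1,2,3 -> negative (i.e. 0)
data.loc[data['Sentiment'] <= 3, 'Sentiment'] = 0

# 4,5 -> positive (i.e. 1)
# Order matters: this must run after the line above, because swapping
# the two would turn the freshly written 1s into 0s (1 <= 3).
data.loc[data['Sentiment'] > 3, 'Sentiment'] = 1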
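The two-step relabeling above can also be written as a single comparison, which avoids depending on the order of the assignments. The following is a minimal sketch on made-up ratings; the `toy` DataFrame and the `Label` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy ratings standing in for the 'Sentiment' column (hypothetical data).
toy = pd.DataFrame({'Sentiment': [1, 2, 3, 4, 5]})

# One comparison gives the binary label: rating > 3 -> 1, otherwise 0.
toy['Label'] = (toy['Sentiment'] > 3).astype(int)

print(toy['Label'].tolist())
```

Writing the result to a separate column also keeps the original ratings available for later inspection.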

Now that the labels are ready, we will clean the Review column by removing the stopwords. The code for that is given below.

Python3

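# A set gives fast membership checks for each word
stp_words = set(stopwords.words('english'))

def clean_review(review):
    # Keep only the words that are not in the stopword list
    cleanreview = " ".join(word for word in review.split()
                           if word not in stp_words)
    return cleanreview

data['Review'] = data['Review'].apply(clean_review)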
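One thing to keep in mind is that the comparison above is case-sensitive, while NLTK’s English stopword list is lowercase, so a capitalized word such as "This" passes through the filter. The sketch below shows a case-insensitive variant. It assumes a tiny hand-written stopword set (`toy_stopwords`) so that it runs without downloading anything, and `clean_review_lower` is a hypothetical helper name, not part of the tutorial code.

```python
# Small hand-written stopword set for illustration only (an assumption);
# the tutorial itself uses stopwords.words('english') from NLTK.
toy_stopwords = {'the', 'is', 'a', 'this', 'and', 'it'}

def clean_review_lower(review):
    # Lowercase each token before the lookup so 'This' and 'this' both match.
    kept = [w for w in str(review).split() if w.lower() not in toy_stopwords]
    return ' '.join(kept)

cleaned = clean_review_lower('This is a great phone and it works')
print(cleaned)
```

Switching the tutorial to this variant would change the cleaned text, so the accuracy reported later may differ slightly.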

Once we are done with the preprocessing, let’s look at the top 5 rows to see the cleaned dataset.

Python3

data.head()

Output :

Analysis of the Dataset

Let’s check how many reviews there are for the positive and negative sentiments.

Python3

data['Sentiment'].value_counts()

Output :

0    15000
1     9999
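The counts show that the classes are not balanced, which matters when reading an accuracy score later: a model that always predicts the majority class already gets a certain share right. A short sketch of that baseline, rebuilding a label series from the two counts printed above (`toy_sent` is a hypothetical stand-in for `data['Sentiment']`):

```python
import pandas as pd

# Rebuild a label series with the counts printed above:
# 15000 negative (0) and 9999 positive (1) reviews.
toy_sent = pd.Series([0] * 15000 + [1] * 9999)

# normalize=True returns shares instead of raw counts.
share = toy_sent.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class.
baseline = share.max()
print(round(baseline, 4))
```

Any trained model should be judged against this baseline of roughly 60% rather than against 50%.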

To get a better picture of the importance of the words, let’s create a word cloud of all the words with sentiment = 0, i.e. negative.

Python3

consolidated=' '.join(word for word in data['Review'][data['Sentiment']==0].astype(str))
wordCloud=WordCloud(width=1600,height=800,random_state=21,max_font_size=110)
plt.figure(figsize=(15,10))
plt.imshow(wordCloud.generate(consolidated),interpolation='bilinear')
plt.axis('off')
plt.show()

Output :

Let’s do the same for all the words with sentiment = 1, i.e. positive.

Python3

consolidated=' '.join(word for word in data['Review'][data['Sentiment']==1].astype(str))
wordCloud=WordCloud(width=1600,height=800,random_state=21,max_font_size=110)
plt.figure(figsize=(15,10))
plt.imshow(wordCloud.generate(consolidated),interpolation='bilinear')
plt.axis('off')
plt.show()

Output :

Now we have a clear picture of the words we have in both the categories.
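A word cloud is a visual summary; when exact numbers are needed, the same information can be read from a plain frequency count. A minimal sketch using the standard library, on three made-up negative reviews (`toy_negative` is hypothetical data, not taken from the dataset):

```python
from collections import Counter

# Made-up negative reviews, already cleaned of stopwords (hypothetical data).
toy_negative = ['battery died fast',
                'screen cracked battery weak',
                'battery bad']

# Count every word across all reviews.
counts = Counter(word for review in toy_negative for word in review.split())

# The single most frequent word and its count.
top = counts.most_common(1)
print(top)
```

With the tutorial’s data, the list passed in would be something like `data['Review'][data['Sentiment'] == 0]`.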

Let’s create the vectors.

Converting text into Vectors

TF-IDF measures how relevant a word is to a document within a collection of documents (a corpus). A word’s weight increases proportionally to the number of times it appears in the document but is offset by how frequently the word appears across the corpus (dataset). We will implement this with the code below.

Python3

cv = TfidfVectorizer(max_features=2500)
X = cv.fit_transform(data['Review'] ).toarray()
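To make the weighting concrete, here is a small sketch on three made-up reviews (hypothetical data). With scikit-learn’s default settings, the inverse document frequency of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term, and each row is then scaled to unit length.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up reviews (hypothetical data, not from AmazonReview.csv).
toy_corpus = ['good phone good battery', 'bad phone', 'good screen']

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_corpus)  # sparse matrix: one row per review

# vocabulary_ maps each term to its column index; sort terms by that index.
terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
weights = toy_X.toarray()

# In 'bad phone', 'bad' occurs in 1 review and 'phone' in 2,
# so the rarer word 'bad' receives the larger weight.
row = dict(zip(terms, weights[1]))
print(terms)
print(round(row['bad'], 3), round(row['phone'], 3))
```

The same reasoning explains why very common words contribute little to the 2500 features kept by `max_features`.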

Model training, Evaluation, and Prediction

Once the analysis and vectorization are done, we can explore any machine learning model to train on the data. But before that, perform the train-test split.

Python3

from sklearn.model_selection import train_test_split
x_train ,x_test,y_train,y_test=train_test_split(X,data['Sentiment'],
                                                test_size=0.25 ,
                                                random_state=42)
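Because the two classes are not equally frequent, one optional refinement is a stratified split, which keeps the class proportions the same in the training and test sets. This is not what the tutorial code does; it is a sketch of the `stratify` parameter of `train_test_split` on made-up arrays (`toy_X` and `toy_y` are hypothetical).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 60/40 imbalance, similar to the dataset (hypothetical data).
toy_y = np.array([0] * 60 + [1] * 40)
toy_X = np.arange(100).reshape(-1, 1)

# stratify=toy_y preserves the 60/40 ratio in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y)

# Share of positive labels in each part.
print(y_tr.mean(), y_te.mean())
```

In the tutorial’s call, the equivalent would be passing `stratify=data['Sentiment']`.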

Now we can train any model. Let’s explore Logistic Regression.

Python3

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
 
model=LogisticRegression()
 
#Model fitting
model.fit(x_train,y_train)
 
#testing the model
pred=model.predict(x_test)
 
#model accuracy
print(accuracy_score(y_test,pred))

Output :

0.81632
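The section title also mentions prediction, so here is a sketch of how a trained model can score text it has not seen before. To stay self-contained it trains a tiny pipeline on six made-up reviews (hypothetical data) instead of the real dataset; `make_pipeline` chains the vectorizer and the classifier so that raw strings can be passed directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Six made-up reviews with binary labels (hypothetical data).
toy_reviews = ['great product love it', 'excellent quality works great',
               'terrible waste of money', 'awful broke after one day',
               'love the excellent quality', 'terrible awful product']
toy_labels = [1, 1, 0, 0, 1, 0]

# The pipeline vectorizes the text and then fits the classifier.
toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_model.fit(toy_reviews, toy_labels)

# New, unseen reviews go through the same vectorizer automatically.
new_pred = toy_model.predict(['excellent love it', 'awful waste'])
print(new_pred)
```

With the objects from this tutorial, the corresponding call would be along the lines of `model.predict(cv.transform([clean_review(text)]).toarray())`, where `text` is the new review string.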

Let’s see the confusion matrix for the results.

Python3

from sklearn import metrics
cm = metrics.confusion_matrix(y_test,pred)
 
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = cm, 
                                            display_labels = [False, True])
 
cm_display.plot()
plt.show()

Output :
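The four cells of a confusion matrix can be turned into precision and recall, which say more than accuracy alone when the classes are imbalanced. A minimal sketch on made-up labels (`toy_true` and `toy_pred` are hypothetical, not the tutorial’s results):

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (hypothetical data).
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

# Precision: share of predicted positives that are truly positive.
precision = tp / (tp + fp)
# Recall: share of true positives that the model found.
recall = tp / (tp + fn)

print(precision, recall)
```

For the tutorial’s model, the same lines can be applied to `y_test` and `pred`.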
