Amazon Product Reviews Sentiment Analysis in Python

Amazon gives small businesses and companies with modest resources a platform to grow. Because of its popularity, people take the time to write detailed reviews about brands and products. Analyzing that data can tell companies a lot about their products and how to improve their quality. But that volume of data cannot be analyzed by a person.

This is where machine learning comes in, specifically Natural Language Processing (NLP), which makes it possible to analyze large datasets. Our task is to predict whether a given review is positive or negative. The real dataset obtained by scraping the website might include millions of reviews, so we have preprocessed the data for you.

Before starting the code, download the dataset by clicking the link

Steps to be followed

  1. Importing Libraries and Datasets
  2. Preprocessing and cleaning the reviews 
  3. Analysis of the Dataset
  4. Converting text into Vectors
  5. Model training, Evaluation, and Prediction

Let’s start with the code now.

Importing Libraries and Datasets

The libraries used are:

import warnings
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud

For the NLP part, we will use the NLTK library. From it we need the stopwords and punkt resources, so let’s download and import them using the commands below.

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords

After that import the downloaded dataset using the below code.

data = pd.read_csv('AmazonReview.csv')

Output :


Preprocessing and cleaning the reviews


The summary printed by data.info() shows that one review is missing:

Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Review     24999 non-null  object
 1   Sentiment  25000 non-null  int64 

Now, to drop the null values (if any), run the command below.
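
The command itself does not appear in the text. With pandas, a common choice is dropna; the sketch below applies it to a small made-up frame (the name toy and its rows are illustrative, not part of the dataset). On the real data the equivalent call would presumably be data.dropna(inplace=True).

```python
import pandas as pd

# Toy stand-in for the review data; None marks a missing review.
toy = pd.DataFrame({
    "Review": ["Great product", None, "Stopped working"],
    "Sentiment": [5, 3, 1],
})

# Drop every row that has a null in any column, modifying the frame in place.
toy.dropna(inplace=True)

print(len(toy))  # number of rows left after dropping
```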


To predict the sentiment as positive (numerical value 1) or negative (numerical value 0), we need to convert the rating values into those two categories. The rule is: if the sentiment value is less than or equal to 3, the review is negative (0); otherwise it is positive (1). For a better understanding, refer to the code below.

#1,2,3->negative(i.e 0)
data.loc[data['Sentiment']<=3,'Sentiment'] = 0
#4,5->positive(i.e 1)
data.loc[data['Sentiment']>3,'Sentiment'] = 1

Now, once the dataset is ready, we will clean the review column by removing the stopwords. The code for that is given below.

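stp_words = set(stopwords.words('english'))

def clean_review(review):
  # Keep only the words that are not stopwords
  return " ".join(word for word in review.split()
                  if word not in stp_words)

data['Review'] = data['Review'].apply(clean_review)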

Once preprocessing is done, let’s look at the top 5 rows of the improved dataset.
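
The call that displays those rows is not shown; data.head() is the usual pandas way. Below is a self-contained sketch of the cleaning step followed by head(), run on a made-up frame. The five-word stop list is a hypothetical stand-in for NLTK's English stopwords, and the lowercasing is an optional extra: stop lists are typically lowercase, so without it "The" would not be removed.

```python
import pandas as pd

# Hypothetical mini stop list; the tutorial uses NLTK's English stopwords instead.
stp_words = {"the", "is", "a", "and", "it"}

def clean_review(review):
    # Lowercase first so that "The" and "the" are treated alike.
    return " ".join(word for word in review.lower().split()
                    if word not in stp_words)

toy = pd.DataFrame({
    "Review": ["The battery is great", "It broke and the seller ignored me"],
    "Sentiment": [1, 0],
})

# Apply the cleaner to every review, then show the first rows.
toy["Review"] = toy["Review"].apply(clean_review)
print(toy.head())
```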


Output :


Analysis of the Dataset

Let’s check how many positive and negative sentiments there are.
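
The counting code is not shown. One likely form is data['Sentiment'].value_counts(); here is a small sketch on made-up labels (the toy frame is an assumption for illustration).

```python
import pandas as pd

# Made-up labels: three negative (0) and two positive (1).
toy = pd.DataFrame({"Sentiment": [0, 0, 1, 0, 1]})

# value_counts returns the number of rows per distinct label.
counts = toy["Sentiment"].value_counts()
print(counts)
```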


Output : 

0    15000
1     9999

To get a better picture of which words matter, let’s create a word cloud of all the words in reviews with sentiment = 0, i.e. negative.

consolidated=' '.join(word for word in data['Review'][data['Sentiment']==0].astype(str))

Output :


Let’s do the same for all the words with sentiment = 1 i.e. positive

consolidated=' '.join(word for word in data['Review'][data['Sentiment']==1].astype(str))

Output :


Now we have a clear picture of the words we have in both the categories.

Let’s create the vectors.

Converting text into Vectors

TF-IDF (term frequency-inverse document frequency) measures how relevant a word is to a document within a corpus. The weight increases in proportion to the number of times the word appears in the document, but is offset by how frequently the word occurs across the whole corpus (dataset). We will implement this with the code below.

cv = TfidfVectorizer(max_features=2500)
X = cv.fit_transform(data['Review'] ).toarray()
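
To see the weighting described above on a small scale, here is a sketch with a made-up three-review corpus. The corpus and the variable names vec, weights and idf are assumptions for illustration; the vectorizer runs with its default settings rather than max_features=2500.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "good" occurs in every review, "battery" in only one.
corpus = [
    "good phone good battery",
    "good screen",
    "good price",
]

vec = TfidfVectorizer()
weights = vec.fit_transform(corpus).toarray()  # one row per review

# Map each word to its inverse document frequency.
idf = {word: vec.idf_[col] for word, col in vec.vocabulary_.items()}

# A word found in every review gets a smaller IDF than a rare one.
print(idf["good"] < idf["battery"])
print(weights.shape)  # (number of reviews, vocabulary size)
```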

Model training, Evaluation, and Prediction

Once analysis and vectorization are done, we can explore any machine learning model to train on the data. But before that, perform the train-test split.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, data['Sentiment'],
                                                    test_size=0.25)

Now we can train any model. Let’s explore Logistic Regression.

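from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(x_train, y_train)            # model fitting
pred = model.predict(x_test)           # testing the model
print(accuracy_score(y_test, pred))    # model accuracy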
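
For readers who want to try the whole flow without the dataset, here is a self-contained sketch that goes from text to prediction on eight made-up reviews. The reviews, the seed random_state=0 and the stratified split are assumptions for illustration, and the accuracy on such a tiny sample says nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up reviews: 1 = positive, 0 = negative.
reviews = [
    "great product works perfectly", "love it excellent quality",
    "very good value highly recommend", "excellent and fast delivery",
    "terrible product broke quickly", "poor quality waste of money",
    "very bad do not recommend", "awful stopped working",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Turn the text into TF-IDF vectors.
vec = TfidfVectorizer()
X = vec.fit_transform(reviews)

# Hold out 25% of the reviews, keeping both classes in the test set.
x_train, x_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels)

model = LogisticRegression()
model.fit(x_train, y_train)

pred = model.predict(x_test)
acc = accuracy_score(y_test, pred)
print(acc)

# Predict the sentiment of a new, unseen review.
new_pred = model.predict(vec.transform(["excellent quality"]))
print(new_pred[0])
```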

Output :


Let’s see the confusion matrix for the results.

from sklearn import metrics
cm = metrics.confusion_matrix(y_test, pred)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                            display_labels=[False, True])
cm_display.plot()
plt.show()
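
To make the matrix easier to read, here is a small sketch on made-up labels (y_true and y_pred are illustrative, not model output). Rows correspond to the true class and columns to the predicted class, so the diagonal holds the correct predictions.

```python
from sklearn import metrics

# Made-up true labels and predictions.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]
#  [1 2]]

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of predictions on the diagonal.
accuracy = (tp + tn) / cm.sum()
print(accuracy)  # 0.6
```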

Output : 


