Flipkart Reviews Sentiment Analysis using Python

Last Updated : 26 Oct, 2022

This article analyzes the reviews and ratings that users leave on Flipkart to share their experience and their opinion of the quality of a product and its brand. Analyzing this data can tell other users a lot about a product and can also point to ways of improving its quality.

Today we will use Machine Learning to analyze this data, making it easier to understand and ready for prediction.

Our task is to predict whether the review given is positive or negative.

Before starting the code, download the dataset by clicking this link.

Importing Libraries and Datasets

The libraries used are :

Pandas : For importing the dataset.
Scikit-learn : For importing the model, accuracy module, and TfidfVectorizer.
Warnings : To ignore all the warnings.
Matplotlib : To plot the visualizations. WordCloud is also used for this.
Seaborn : For data visualization.

Python3

import warnings 
warnings.filterwarnings('ignore') 
import pandas as pd 
import re 
import seaborn as sns 
from sklearn.feature_extraction.text import TfidfVectorizer 
import matplotlib.pyplot as plt 
from wordcloud import WordCloud

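For text analysis, we will use the NLTK library. From it we need the stopword list, plus the tokenizer data that word_tokenize relies on later, so let's download and import them using the commands below.

Python3

import nltk 
nltk.download('stopwords') 
# word_tokenize needs tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt') 
from nltk.corpus import stopwords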

After that, import the downloaded dataset using the code below.

Python3

data = pd.read_csv('flipkart_data.csv') 
data.head()

Output :

Preprocessing and cleaning the reviews

The raw data has multiple rating labels, so we first explore those labels and then convert them into two classes.

Python3

# unique ratings 
pd.unique(data['rating'])

Output:

array([5, 4, 1, 3, 2], dtype=int64)

Let’s see the countplot for the same.

Python3

sns.countplot(data=data, 
              x='rating', 
              order=data.rating.value_counts().index)

Output :

To predict the sentiment as positive (numerical value = 1) or negative (numerical value = 0), we need to convert the rating column into a new column of 0s and 1s. The rule is: if the rating is less than or equal to 4, the review is negative (0); otherwise it is positive (1). For a better understanding, refer to the code below.

Python3

# rating label(final) 
pos_neg = [] 
for i in range(len(data['rating'])): 
    if data['rating'][i] >= 5: 
        pos_neg.append(1) 
    else: 
        pos_neg.append(0) 
  
data['label'] = pos_neg 

Let’s create the function to preprocess the dataset

Python3

from tqdm import tqdm 
  
  
def preprocess_text(text_data): 
    preprocessed_text = [] 
    # Build the stopword set once instead of reloading the list for every token 
    stop_words = set(stopwords.words('english')) 
  
    for sentence in tqdm(text_data): 
        # Removing punctuation 
        sentence = re.sub(r'[^\w\s]', '', sentence) 
  
        # Converting to lowercase and removing stopwords 
        preprocessed_text.append(' '.join(token.lower() 
                                          for token in nltk.word_tokenize(sentence) 
                                          if token.lower() not in stop_words)) 
  
    return preprocessed_text 
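To see what this cleaning does to a single review, here is a minimal sketch that runs without any NLTK downloads. The stopword set and the `clean_review` helper are hypothetical stand-ins for illustration, and a plain whitespace split replaces `nltk.word_tokenize`, so results on real reviews may differ slightly from the function above.

```python
import re

# Hypothetical mini stopword list, used only so the example runs without
# downloading NLTK data; the article uses stopwords.words('english').
DEMO_STOPWORDS = {"the", "is", "a", "and", "it", "this", "but", "was"}


def clean_review(sentence, stop_words=DEMO_STOPWORDS):
    """Strip punctuation, lowercase, and drop stopwords from one review."""
    # Keep only word characters and whitespace
    sentence = re.sub(r"[^\w\s]", "", sentence)
    # A whitespace split stands in for nltk.word_tokenize here
    tokens = (token.lower() for token in sentence.split())
    return " ".join(token for token in tokens if token not in stop_words)


cleaned = clean_review("This is a GOOD product, and the sound is awesome!")
print(cleaned)  # good product sound awesome
```

Only the content-bearing words survive, which is what the vectorizer will later count.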

Now we can apply this function to the dataset. The code for that is given below.

Python3

preprocessed_review = preprocess_text(data['review'].values) 
data['review'] = preprocessed_review

Once preprocessing is done, let’s look at the top 5 rows of the cleaned dataset.

Python3

data.head()

Output :

Analysis of the Dataset

Let’s check how many positive and negative sentiments there are.

Python3

data["label"].value_counts()

Output :

1    5726
0    4250

To get a better picture of which words matter, let’s create a WordCloud of all the words with label = 1, i.e. positive.

Python3

consolidated = ' '.join( 
    word for word in data['review'][data['label'] == 1].astype(str)) 
wordCloud = WordCloud(width=1600, height=800, 
                      random_state=21, max_font_size=110) 
plt.figure(figsize=(15, 10)) 
plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear') 
plt.axis('off') 
plt.show() 

Output :

Now it’s clear that words like good, nice, and product have a high frequency in positive reviews, which matches our assumptions.
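A word cloud shows frequency only visually. As a rough numerical cross-check, the same counts can be read off with `collections.Counter`. The reviews below are hypothetical, already-cleaned examples and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical, already-preprocessed positive reviews (not from the dataset)
positive_reviews = [
    "good product nice sound",
    "nice product good battery",
    "good quality",
]

# Count every token across all positive reviews
word_counts = Counter(
    word for review in positive_reviews for word in review.split()
)

# The three most frequent words with their counts
top_words = word_counts.most_common(3)
print(top_words)
```

On the real data, the list passed in would be the reviews with label = 1, i.e. the same text that was joined to build the word cloud.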

Let’s create the vectors.

Converting text into Vectors

TF-IDF measures how relevant a word is to a document within a collection (corpus) of documents. The weight increases proportionally to the number of times the word appears in the document but is offset by how frequently the word occurs across the corpus (dataset). We will implement this with the code below.

Python3

cv = TfidfVectorizer(max_features=2500) 
X = cv.fit_transform(data['review'] ).toarray()

To print the generated X:

Python3

X

Output:

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
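The matrix is mostly zeros because each review contains only a few of the 2500 vocabulary words. To make the weighting concrete, here is a small hand-rolled sketch on a hypothetical three-review corpus. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 normalisation that scikit-learn documents as the TfidfVectorizer defaults. It is an illustration of the idea, not the library's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-cleaned reviews
corpus = ["good product", "bad product", "good good sound"]

n_docs = len(corpus)
tokenized = [doc.split() for doc in corpus]
vocab = sorted({word for doc in tokenized for word in doc})

# Document frequency: in how many reviews each word appears
doc_freq = Counter(word for doc in tokenized for word in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_vector(tokens):
    """Return the L2-normalised TF-IDF vector of one tokenised review."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


vectors = [tfidf_vector(doc) for doc in tokenized]

# Weights for the second review, "bad product"
for word, weight in zip(vocab, vectors[1]):
    print(f"{word}: {weight:.3f}")
```

In the review "bad product", the word "bad" gets a higher weight than "product" because "product" appears in two of the three reviews while "bad" appears in only one.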

Model training, Evaluation, and Prediction

Once analysis and vectorization are done, we can explore any machine learning model to train on the data. But before that, perform the train-test split.

Python3

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, data['label'], 
                                                    test_size=0.33, 
                                                    stratify=data['label'], 
                                                    random_state = 42)
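The `stratify` argument keeps the ratio of positive to negative labels roughly the same in the train and test sets. The sketch below illustrates that idea in plain Python: it shuffles the indices of each class separately and takes the same fraction from each. The `stratified_split` helper and the labels are hypothetical, and this is not scikit-learn's actual algorithm.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.33, seed=42):
    """Return (train_idx, test_idx) with similar class ratios in both."""
    rng = random.Random(seed)

    # Group row indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical labels: 60 positive and 40 negative reviews
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both parts keep roughly the 60/40 balance of the full label list, so neither set ends up dominated by one class.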

Now we can train any model. Let’s explore the Decision Tree for the prediction.

Python3

from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score 
  
model = DecisionTreeClassifier(random_state=0) 
model.fit(X_train,y_train) 
  
# accuracy on the training data 
pred = model.predict(X_train) 
print(accuracy_score(y_train,pred))

Output :

0.9244351339218914
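Note that this score is computed on the same rows the model was trained on, and a model that fits its training data closely can score much lower on rows it has not seen. The toy sketch below illustrates the gap with a hypothetical `MemorisingClassifier` that simply recalls training labels. The reviews and labels are made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


class MemorisingClassifier:
    """Toy model: recalls training labels exactly, guesses otherwise."""

    def fit(self, texts, labels):
        self.seen = dict(zip(texts, labels))
        # Fall back to the most common training label for unseen text
        self.fallback = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.seen.get(text, self.fallback) for text in texts]


# Hypothetical reviews and labels, for illustration only
train_texts = ["good product", "bad quality", "nice sound",
               "worst purchase", "great value"]
train_labels = [1, 0, 1, 0, 1]
test_texts = ["good product", "poor battery", "awesome bass", "bad quality"]
test_labels = [1, 0, 1, 0]

toy_model = MemorisingClassifier().fit(train_texts, train_labels)
train_acc = accuracy(train_labels, toy_model.predict(train_texts))
test_acc = accuracy(test_labels, toy_model.predict(test_texts))
print(train_acc, test_acc)  # 1.0 0.75
```

For the decision tree in this article, the analogous held-out check would be `accuracy_score(y_test, model.predict(X_test))`, using the test split created earlier.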

Let’s see the confusion matrix for the results.

Python3

from sklearn import metrics 
cm = metrics.confusion_matrix(y_train,pred) 
  
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = cm,  
                                            display_labels = [False, True]) 
  
cm_display.plot() 
plt.show()

Output :
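The four cells of a binary confusion matrix are plain counts. A minimal sketch with hypothetical labels and predictions shows how they are tallied and how precision and recall follow from them. The helper `confusion_counts` is illustrative, and the order (tn, fp, fn, tp) matches the row-by-row layout of a 2x2 matrix whose rows are true labels and whose columns are predicted labels.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            tp += 1  # positive review predicted positive
        elif truth == 1:
            fn += 1  # positive review predicted negative
        elif guess == 1:
            fp += 1  # negative review predicted positive
        else:
            tn += 1  # negative review predicted negative
    return tn, fp, fn, tp


# Hypothetical labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
print(tn, fp, fn, tp, precision, recall)  # 3 1 1 3 0.75 0.75
```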

Conclusion

The Decision Tree Classifier performs well on this data, with the caveat that the accuracy above is measured on the training set. In the future, we can also work with larger datasets by scraping reviews from the website.
