
Customer Segmentation using Unsupervised Machine Learning in Python

Last Updated : 21 Nov, 2022

Companies today work hard to keep their customers happy. They launch new technologies and services so that customers use their products more, and they try to stay in touch with each customer so that they can offer suitable goods. In practice, however, keeping in touch with every individual customer is difficult and unrealistic. This is where customer segmentation comes in.

Customer segmentation means grouping customers on the basis of similar characteristics, behavior, and needs. This helps a company in several ways: it can launch products or enhance features to suit a particular group, and it can target a particular segment according to its behavior. All of this can improve the overall market value of the company.


Today we will be using Machine Learning to implement the task of Customer Segmentation.

Import Libraries

The libraries we will need are:

  • Pandas – This library loads the data into a 2D tabular structure (a DataFrame) and provides functions for analyzing it.
  • Numpy – Numpy arrays are very fast and can perform large computations.
  • Matplotlib / Seaborn – These libraries are used to draw visualizations.
  • Sklearn – This package contains multiple modules with pre-implemented functions for tasks ranging from data preprocessing to model development and evaluation.

Python3




import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
 
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
 
import warnings
warnings.filterwarnings('ignore')


Importing Dataset

The dataset used for this task contains customer details such as marital status, income, number of items purchased, types of items purchased, and so on.

Python3




df = pd.read_csv('new.csv')
df.head()


Output:  

[Figure: first five rows of the dataset]

 

To check the shape of the dataset we can use the df.shape attribute.

Python3




df.shape


Output:

(2240, 25)

Data Preprocessing

To inspect the dataset (column data types, non-null counts, and so on), we will use the .info() method.

Python3




df.info()


Output:

[Figure: output of df.info()]

 

Python3




df.describe().T


Output:

[Figure: transposed summary statistics from df.describe()]

 

Next, we clean the values in the Accepted column by stripping the redundant 'Accepted' text from each entry.

Python3




df['Accepted'] = df['Accepted'].str.replace('Accepted', '')


Next, we check each column for null values.

Python3




for col in df.columns:
    temp = df[col].isnull().sum()
    if temp > 0:
        print(f'Column {col} contains {temp} null values.')


Output:

 Column Income contains 24 null values.

Now that we have the count of null values and know that there are very few of them, we can drop the affected rows (this will not affect the dataset much).

Python3




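df = df.dropna()
# len(df) is the number of rows that remain, not the number of missing values
print("Total rows after dropping missing values:", len(df))


Output:

 Total rows after dropping missing values: 2216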

To find the total number of unique values in each column we can use the df.nunique() method.

Python3




df.nunique()


Output:

[Figure: number of unique values in each column]

 

Here we can observe that some columns contain a single value throughout, so they have no relevance for model development.
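As a minimal sketch of how such columns could be located programmatically, the snippet below uses a small hypothetical frame (the values are made up, and the assumption that Z_CostContact and Z_Revenue are the constant columns is inferred from the columns dropped later in this article).

```python
import pandas as pd

# Hypothetical toy frame; the two Z_ columns mimic constant columns.
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

# A column with exactly one distinct value cannot help separate customers.
constant_cols = toy.columns[toy.nunique() == 1].tolist()
print(constant_cols)  # ['Z_CostContact', 'Z_Revenue']
```

The same expression applied to df would list the candidates for removal without reading the nunique() output by eye.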

The dataset also has a column, Dt_Customer, which contains a date. We can split it into three columns: day, month, and year.

Python3




parts = df["Dt_Customer"].str.split("-", n=3, expand=True)
df["day"] = parts[0].astype('int')
df["month"] = parts[1].astype('int')
df["year"] = parts[2].astype('int')
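An alternative to splitting the string by hand is to parse it as a date. The sketch below assumes the values are in day-month-year order (as the split above implies) and uses made-up sample dates; it is an illustration, not the approach this article relies on.

```python
import pandas as pd

# Hypothetical sample values in day-month-year order.
toy = pd.DataFrame({"Dt_Customer": ["04-09-2012", "21-08-2013"]})

# Parsing with an explicit format fails loudly on malformed dates,
# whereas a plain string split would silently accept them.
dates = pd.to_datetime(toy["Dt_Customer"], format="%d-%m-%Y")
toy["day"] = dates.dt.day
toy["month"] = dates.dt.month
toy["year"] = dates.dt.year
print(toy)
```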


Now that we have all the important features, we can drop columns such as Z_CostContact, Z_Revenue, and Dt_Customer.

Python3




df.drop(['Z_CostContact', 'Z_Revenue', 'Dt_Customer'],
        axis=1,
        inplace=True)


Data Visualization and Analysis

Data visualization is the representation of information and data in a graphical format. Here we will use count plots to examine the categorical columns.

Python3




floats, objects = [], []
for col in df.columns:
    if df[col].dtype == object:
        objects.append(col)
    elif df[col].dtype == float:
        floats.append(col)
 
print(objects)
print(floats)


Output:

['Education', 'Marital_Status', 'Accepted']
['Income']

To draw count plots for the columns of datatype object, refer to the code below.

Python3




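plt.figure(figsize=(15, 10))
for i, col in enumerate(objects):
    plt.subplot(2, 2, i + 1)
    # pass the column by keyword; newer seaborn versions treat the
    # first positional argument as `data`
    sb.countplot(x=df[col])
plt.show()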


Output:

[Figure: count plots of the Education, Marital_Status, and Accepted columns]

 

Let’s check the value_counts of the Marital_Status of the data.

Python3




df['Marital_Status'].value_counts()


Output:

[Figure: counts of each Marital_Status category]

 

Now let's compare these features with respect to the values of the Response column.

Python3




plt.figure(figsize=(15, 10))
for i, col in enumerate(objects):
    plt.subplot(2, 2, i + 1)
    # x and hue are passed by keyword for compatibility with newer seaborn
    sb.countplot(x=df[col], hue=df['Response'])
plt.show()


Output:

[Figure: count plots of the categorical columns, split by Response]

 

Label Encoding  

Label encoding converts categorical values into numerical values so that the model can work with them.

Python3




for col in df.columns:
    if df[col].dtype == object:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
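To make the mapping concrete, here is a small sketch on made-up Education values (the category names are hypothetical sample data). LabelEncoder sorts the distinct classes and assigns each one an integer code in that order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of a categorical column.
education = ["Graduation", "PhD", "Master", "Graduation", "Basic"]

le = LabelEncoder()
codes = le.fit_transform(education)

print(list(le.classes_))  # ['Basic', 'Graduation', 'Master', 'PhD']
print(codes.tolist())     # [1, 3, 2, 1, 0]

# The original labels can be recovered from the codes.
print(list(le.inverse_transform(codes)))
```

Note that the codes follow alphabetical order, not any meaningful ranking, so a distance-based algorithm such as KMeans will still treat them as if they were ordered numbers.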


A heatmap is a convenient way to visualize the correlation among the different features of the dataset. Here we apply a threshold of 0.8, so that only feature pairs with a correlation above 0.8 are highlighted.

Python3




plt.figure(figsize=(15, 15))
sb.heatmap(df.corr() > 0.8, annot=True, cbar=False)
plt.show()


Output:

[Figure: heatmap marking feature pairs with correlation above 0.8]
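With many features, a boolean heatmap can be hard to read. As a rough sketch, the highly correlated pairs can also be listed directly. The data below is synthetic and the column names a, b, and c are placeholders, not columns of the article's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),  # nearly a multiple of a
    "c": rng.normal(size=200),                     # unrelated to a and b
})

corr = toy.corr()
# Keep only the upper triangle (k=1 skips the diagonal) so that each
# pair appears once and self-correlations are excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high = pairs[pairs > 0.8]
print(high)
```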

 

Standardization

Standardization is a feature scaling method and an integral part of feature engineering. It puts all features on a comparable scale, which makes it easier for a machine learning model to learn from them. Each feature is rescaled to have a mean of 0 and a standard deviation of 1.

Python3




scaler = StandardScaler()
data = scaler.fit_transform(df)
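The transformation applied to each column is z = (x - mean) / std. The sketch below checks this against StandardScaler on a tiny made-up array (the numbers are hypothetical income and purchase counts).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: one large-range and one small-range feature.
X = np.array([[20000.0, 1.0],
              [40000.0, 3.0],
              [90000.0, 8.0]])

scaled = StandardScaler().fit_transform(X)

# Manual version; np.std uses the population standard deviation
# (ddof=0) by default, which matches StandardScaler.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))  # True
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```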


Segmentation

We will be using t-distributed Stochastic Neighbor Embedding (t-SNE), which helps in visualizing high-dimensional data. It converts similarities between data points into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data.

Python3




from sklearn.manifold import TSNE
model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(df)
plt.figure(figsize=(7, 7))
plt.scatter(tsne_data[:, 0], tsne_data[:, 1])
plt.show()


Output:

[Figure: 2-D t-SNE scatter plot of the data]
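For readers who want to experiment with t-SNE in isolation, here is a small sketch on synthetic blobs rather than on the customer data. The perplexity value of 15 is an arbitrary choice for this toy example; scikit-learn requires it to be smaller than the number of samples.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic 6-dimensional data with three groups.
X, y = make_blobs(n_samples=90, n_features=6, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

# Reduce to two dimensions for plotting.
emb = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
print(emb.shape)  # (90, 2)
```

The embedding is meant for visualization only; distances between far-apart groups in the 2-D plot should not be read as exact distances in the original space.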

 

Some clusters are clearly visible in the 2-D representation of the data. Let's use the KMeans algorithm to find those clusters in the original high-dimensional space.

To choose the number of clusters, we fit KMeans for k = 1 to 20 and record the inertia of each model.

Python3




error = []
for n_clusters in range(1, 21):
    model = KMeans(init='k-means++',
                   n_clusters=n_clusters,
                   max_iter=500,
                   random_state=22)
    model.fit(df)
    error.append(model.inertia_)


Here, inertia is the sum of squared distances from each point to the center of its assigned cluster.
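To see what this number means, the sketch below recomputes the inertia by hand on synthetic blobs and compares it with the inertia_ attribute of a fitted model. The data and the choice of three clusters are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=22).fit(X)

# Look up the center assigned to every point, then sum the squared
# Euclidean distances between points and their centers.
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(np.isclose(manual_inertia, model.inertia_))  # True
```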

Python3




plt.figure(figsize=(10, 5))
sb.lineplot(x=range(1, 21), y=error)
sb.scatterplot(x=range(1, 21), y=error)
plt.show()


Output:

[Figure: inertia plotted against the number of clusters (1 to 20)]

 

Using the elbow method, we can say that k = 6 is the optimal number of clusters, because beyond k = 6 the inertia no longer decreases drastically.

Python3




# create the clustering model with the optimal k=6 found by the elbow method
model = KMeans(init='k-means++',
               n_clusters=6,
               max_iter=500,
               random_state=22)
segments = model.fit_predict(df)
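One point worth checking when adapting this code: the KMeans models in this article are fitted on df, while the standardized array data from the Standardization step is not passed to them. Because KMeans relies on Euclidean distance, a feature with a large range (such as Income) can dominate the result on unscaled input. The sketch below illustrates the effect on synthetic data; the feature names, group sizes, and distributions are all hypothetical, and the elbow position on the real dataset may differ if data is used in place of df.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Two true groups that differ only in the small-range feature.
group = np.repeat([0, 1], n)
purchases = np.where(group == 0, 2.0, 10.0) + rng.normal(scale=0.5, size=2 * n)
# Large-range feature that is unrelated to the groups.
income = rng.normal(loc=50000.0, scale=20000.0, size=2 * n)
X = np.column_stack([income, purchases])

def cluster(features):
    """Fit a two-cluster KMeans model and return the labels."""
    km = KMeans(n_clusters=2, n_init=10, random_state=22)
    return km.fit_predict(features)

# Adjusted Rand index: close to 1 means the true groups were
# recovered, close to 0 means no better than chance.
ari_raw = adjusted_rand_score(group, cluster(X))
ari_scaled = adjusted_rand_score(group, cluster(StandardScaler().fit_transform(X)))
print(f"raw: {ari_raw:.2f}, scaled: {ari_scaled:.2f}")
```

On the unscaled input the split follows income, so the agreement with the true groups is near chance, while the standardized input recovers them.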


A scatter plot of the t-SNE embedding, colored by segment, shows the 6 clusters formed by KMeans clustering.

Python3




plt.figure(figsize=(7, 7))
# pass x and y by keyword; newer seaborn versions do not accept them positionally
sb.scatterplot(x=tsne_data[:, 0], y=tsne_data[:, 1], hue=segments)
plt.show()


Output:

 


