
Customer Segmentation using Unsupervised Machine Learning in Python

Companies today work hard to keep their customers happy. They launch new products and services so customers engage more, and they try to stay in touch with each customer so they can serve them better. In practice, though, keeping in touch with every individual customer is unrealistic. This is where customer segmentation comes in.

Customer Segmentation means dividing customers into groups on the basis of similar characteristics, behavior, and needs. This helps the company in many ways: it can launch a product or enhance features for a specific group, and it can target its marketing at a particular segment based on that segment's behavior. All of this leads to an enhancement in the overall market value of the company.


In this article, we will use unsupervised machine learning to implement customer segmentation.

Import Libraries

The libraries we will need are:






import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
 
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
 
import warnings
warnings.filterwarnings('ignore')

Importing Dataset

The dataset used for this task contains customer details such as marital status, income, number of items purchased, types of items purchased, and so on.




# Load the customer dataset.
df = pd.read_csv('new.csv')
df.head()

Output: (first five rows of the dataframe)

To check the shape of the dataset we can use the df.shape attribute.




df.shape

Output:

(2240, 25)

Data Preprocessing

To get information about the dataset, such as column datatypes and the count of non-null values, we will use the .info() method.




df.info()

Output: (column datatypes and non-null counts)




df.describe().T

Output: (summary statistics for each numerical column)

Cleaning the values in the Accepted column by stripping the redundant 'Accepted' substring from each entry.




df['Accepted'] = df['Accepted'].str.replace('Accepted', '')

To check for null values in the dataset:




for col in df.columns:
    temp = df[col].isnull().sum()
    if temp > 0:
        print(f'Column {col} contains {temp} null values.')

Output:

 Column Income contains 24 null values.
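
A more concise equivalent, for reference, is to sum the null flags over the whole frame at once:

null_counts = df.isnull().sum()
# Show only the columns that actually contain nulls.
print(null_counts[null_counts > 0])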

Now that we know there are only 24 null values, a tiny fraction of the 2,240 rows, we can drop them without affecting the dataset much.




df = df.dropna()
print("Total rows remaining:", len(df))

Output:

Total rows remaining: 2216

To find the number of unique values in each column we can use the df.nunique() method.




df.nunique()

Output: (count of unique values per column)

Here we can observe that there are columns that contain only a single value across all rows, so they have no relevance for model development.
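
As a quick sketch, we can confirm this programmatically by listing the columns whose unique-value count is 1 (for this dataset these are expected to be Z_CostContact and Z_Revenue):

# Columns with a single unique value carry no information for clustering.
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
print(constant_cols)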

The dataset also has a column Dt_Customer which contains dates; we can convert it into 3 columns, i.e. day, month, year.




# Split the 'dd-mm-yyyy' strings into three integer columns.
parts = df["Dt_Customer"].str.split("-", n=2, expand=True)
df["day"] = parts[0].astype('int')
df["month"] = parts[1].astype('int')
df["year"] = parts[2].astype('int')

Now that we have extracted the important features, we can drop Z_CostContact, Z_Revenue, and Dt_Customer.




df.drop(['Z_CostContact', 'Z_Revenue', 'Dt_Customer'],
        axis=1,
        inplace=True)

Data Visualization and Analysis

Data visualization is the graphical representation of information and data. Here we will use bar plots and count plots for better insight.




# Separate the column names by datatype.
floats, objects = [], []
for col in df.columns:
    if df[col].dtype == object:
        objects.append(col)
    elif df[col].dtype == float:
        floats.append(col)
 
print(objects)
print(floats)

Output:

['Education', 'Marital_Status', 'Accepted']
['Income']

To get count plots for the columns of object datatype, refer to the code below.




plt.subplots(figsize=(15, 10))
for i, col in enumerate(objects):
    plt.subplot(2, 2, i + 1)
    sb.countplot(x=df[col])
plt.show()

Output: (count plots for the Education, Marital_Status, and Accepted columns)

Let’s check the value_counts of the Marital_Status column.




df['Marital_Status'].value_counts()

Output: (counts of each Marital_Status category)

Now let’s see how these features compare with respect to the values of Response.




plt.subplots(figsize=(15, 10))
for i, col in enumerate(objects):
    plt.subplot(2, 2, i + 1)
    sb.countplot(x=df[col], hue=df['Response'])
plt.show()

Output: (count plots of the categorical columns, split by Response)

Label Encoding  

Label Encoding converts categorical values into numerical values so that the model can understand them.




# Fit a LabelEncoder on each object-typed column and replace it in place.
for col in df.columns:
    if df[col].dtype == object:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
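
To make the mapping concrete, here is a minimal sketch with made-up category values showing what a LabelEncoder does to a single column:

le = LabelEncoder()
# Classes are sorted alphabetically, then mapped to 0, 1, 2, ...
print(le.fit_transform(['Single', 'Married', 'Single', 'Divorced']))  # [2 1 2 0]
print(le.classes_)  # ['Divorced' 'Married' 'Single']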

A heatmap is a good way to visualize correlation among the features of a dataset. Here we threshold the correlation matrix at 0.8, so that only the highly correlated feature pairs are highlighted.




plt.figure(figsize=(15, 15))
sb.heatmap(df.corr() > 0.8, annot=True, cbar=False)
plt.show()

Output: (heatmap highlighting feature pairs with correlation above 0.8)
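
To list the highly correlated pairs explicitly instead of eyeballing the heatmap, a small sketch over the upper triangle of the correlation matrix:

# Report every feature pair whose absolute correlation exceeds 0.8.
corr = df.corr()
cols = corr.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if abs(corr.iloc[i, j]) > 0.8:
            print(cols[i], '<->', cols[j], round(corr.iloc[i, j], 2))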

Standardization

Standardization is a feature-scaling method and an integral part of feature engineering. It rescales the data, making it easier for the machine learning model to learn from it: each feature ends up with a mean of 0 and a standard deviation of 1.




# Scale every feature to zero mean and unit variance.
scaler = StandardScaler()
data = scaler.fit_transform(df)
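
As a quick sanity check, every column of the scaled array should now have a mean close to 0 and a standard deviation close to 1:

# Per-column mean and standard deviation after scaling.
print(data.mean(axis=0).round(2))  # ~0 everywhere
print(data.std(axis=0).round(2))   # ~1 everywhere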

Segmentation

We will be using T-distributed Stochastic Neighbor Embedding (t-SNE). It helps in visualizing high-dimensional data by converting similarities between data points into joint probabilities and minimizing the divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.




from sklearn.manifold import TSNE
model = TSNE(n_components=2, random_state=0)
# Project the standardized data down to two dimensions.
tsne_data = model.fit_transform(data)
plt.figure(figsize=(7, 7))
plt.scatter(tsne_data[:, 0], tsne_data[:, 1])
plt.show()

Output: (2-D t-SNE scatter plot of the data)

There are certainly some clusters clearly visible in this 2-D representation of the data. Let’s use the KMeans algorithm to find those clusters in the high-dimensional space itself. KMeans clustering partitions the data points into a chosen number of groups by minimizing within-cluster variance.




error = []
for n_clusters in range(1, 21):
    model = KMeans(init='k-means++',
                   n_clusters=n_clusters,
                   max_iter=500,
                   random_state=22)
    # Fit on the standardized data and record the inertia for each k.
    model.fit(data)
    error.append(model.inertia_)

Here, inertia is the sum of squared distances of the samples to their nearest cluster center.
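
As a quick sketch, we can verify this definition by recomputing the inertia by hand for the last fitted model (here the k = 20 fit from the loop above):

# Distance of every point to its own cluster center, squared and summed.
labels = model.labels_
centers = model.cluster_centers_
manual_inertia = ((data - centers[labels]) ** 2).sum()
print(np.isclose(manual_inertia, model.inertia_))  # True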




plt.figure(figsize=(10, 5))
sb.lineplot(x=range(1, 21), y=error)
sb.scatterplot(x=range(1, 21), y=error)
plt.show()

Output: (inertia versus number of clusters)

Here, by using the elbow method, we can say that k = 6 is the optimal number of clusters, since after k = 6 the inertia stops decreasing drastically.
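
One way to make the elbow more explicit (a small diagnostic sketch) is to print the percentage drop in inertia from each k to the next and look for where it levels off:

# Relative decrease in inertia between successive values of k.
for k in range(2, 21):
    drop = (error[k - 2] - error[k - 1]) / error[k - 2] * 100
    print(f'k={k}: inertia decreased by {drop:.1f}%')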




# Create the clustering model with the optimal k = 6.
model = KMeans(init='k-means++',
               n_clusters=6,
               max_iter=500,
               random_state=22)
segments = model.fit_predict(data)
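
To interpret what each segment represents, one possible follow-up (assuming columns such as Income, Kidhome, and Response from this dataset) is to attach the labels back to the dataframe and compare per-cluster means:

# Per-segment averages of a few key features.
df_profile = df.copy()
df_profile['Segment'] = segments
print(df_profile.groupby('Segment')[['Income', 'Kidhome', 'Response']].mean())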

A scatterplot will be used to see all 6 clusters formed by KMeans clustering.




plt.figure(figsize=(7, 7))
sb.scatterplot(x=tsne_data[:, 0], y=tsne_data[:, 1], hue=segments)
plt.show()

Output: (t-SNE scatter plot colored by cluster)

