One of the primary disadvantages of any clustering technique is that its performance is difficult to evaluate. The V-Measure metric was developed to address this problem. Computing the V-Measure first requires computing two terms:
- Homogeneity: A perfectly homogeneous clustering is one in which every cluster contains only data points belonging to a single class. Homogeneity measures how close a clustering comes to this ideal.
- Completeness: A perfectly complete clustering is one in which all data points belonging to the same class are assigned to the same cluster. Completeness measures how close a clustering comes to this ideal.
Trivial Homogeneity: This is the case where the number of clusters equals the number of data points, so each point sits in its own cluster. It is the extreme case in which homogeneity is at its highest while completeness is at its minimum.
Step 1: Importing the required libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score
Step 2: Loading and Cleaning the data
# Changing the working location to the location of the file
cd C:\Users\Dev\Desktop\Kaggle\Credit Card Fraud

# Loading the data
df = pd.read_csv('creditcard.csv')

# Separating the dependent and independent variables
y = df['Class']
X = df.drop('Class', axis = 1)

X.head()
Step 3: Building the clustering models and calculating the V-Measure scores
# List of V-Measure Scores for different models
v_scores = []

# List of the numbers of clusters to try
N_Clusters = [2, 3, 4, 5, 6]
a) n_clusters = 2
# Building the clustering model
kmeans2 = KMeans(n_clusters = 2)

# Training the clustering model
kmeans2.fit(X)

# Storing the predicted Clustering labels
labels2 = kmeans2.predict(X)

# Evaluating the performance
v_scores.append(v_measure_score(y, labels2))
b) n_clusters = 3
# Building the clustering model
kmeans3 = KMeans(n_clusters = 3)

# Training the clustering model
kmeans3.fit(X)

# Storing the predicted Clustering labels
labels3 = kmeans3.predict(X)

# Evaluating the performance
v_scores.append(v_measure_score(y, labels3))
c) n_clusters = 4
# Building the clustering model
kmeans4 = KMeans(n_clusters = 4)

# Training the clustering model
kmeans4.fit(X)

# Storing the predicted Clustering labels
labels4 = kmeans4.predict(X)

# Evaluating the performance
v_scores.append(v_measure_score(y, labels4))
d) n_clusters = 5
# Building the clustering model
kmeans5 = KMeans(n_clusters = 5)

# Training the clustering model
kmeans5.fit(X)

# Storing the predicted Clustering labels
labels5 = kmeans5.predict(X)

# Evaluating the performance
v_scores.append(v_measure_score(y, labels5))
e) n_clusters = 6
# Building the clustering model
kmeans6 = KMeans(n_clusters = 6)

# Training the clustering model
kmeans6.fit(X)

# Storing the predicted Clustering labels
labels6 = kmeans6.predict(X)

# Evaluating the performance
v_scores.append(v_measure_score(y, labels6))
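The five nearly identical blocks above can also be written as a single loop. The sketch below uses make_blobs as a stand-in dataset, since creditcard.csv is not bundled here; n_init and random_state are added only to make the run deterministic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Stand-in data: three well-separated blobs in place of the credit card dataset
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

v_scores = []
N_Clusters = [2, 3, 4, 5, 6]
for k in N_Clusters:
    # Build and train the clustering model, then store the predicted labels
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    # Evaluate the performance
    v_scores.append(v_measure_score(y, labels))
```

fit_predict combines the fit and predict calls from the step-by-step version into one, which is idiomatic when the labels for the training data are all that is needed.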
Step 4: Visualizing the results and comparing the performances
# Plotting a Bar Graph to compare the models
plt.bar(N_Clusters, v_scores)
plt.xlabel('Number of Clusters')
plt.ylabel('V-Measure Score')
plt.title('Comparison of different Clustering Models')
plt.show()
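Beyond reading the winner off the bar chart, the best-scoring cluster count can be picked programmatically. The scores below are hypothetical placeholders, not results from the credit card dataset.

```python
# Hypothetical V-Measure scores, one per model
v_scores = [0.10, 0.42, 0.31, 0.22, 0.18]
N_Clusters = [2, 3, 4, 5, 6]

# Pick the number of clusters with the highest V-Measure
best_k = N_Clusters[v_scores.index(max(v_scores))]
print(best_k)  # 3
```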