
Demonstration of K-Means Assumptions

In this Scikit-learn investigation of K-means assumptions, we explore scenarios that reveal the strengths and limitations of the algorithm. Using synthetic datasets, we study the sensitivity of K-means to an incorrect number of clusters, the difficulties it faces with anisotropic distributions and with differing cluster variances, and the problems caused by unevenly sized clusters. This visual treatment of the assumptions should clarify when K-means is applicable and emphasize the importance of choosing a clustering algorithm suited to the characteristics of the data.

K-Means Clustering

K-means clustering is an unsupervised machine learning technique used to divide a dataset into discrete groups, or clusters, according to patterns of similarity in the data. The method iteratively assigns data points to clusters and updates each cluster's centroid so as to reduce the total squared distance between the data points and their assigned centroids. K-means is a scalable and effective way to find underlying structure in data and is widely used for segmentation and pattern-recognition tasks. For all its simplicity, K-means can struggle on datasets with complex structure because it is sensitive to the initial cluster centroids and assumes spherical, equally sized clusters.
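As a quick refresher, a minimal sketch of this workflow is shown below; it uses a small illustrative synthetic dataset, and the parameter values are arbitrary rather than taken from the article's demonstration.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# A small illustrative dataset with three well-separated blobs
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit the model and assign each point to its nearest centroid
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_demo = km.fit_predict(X_demo)

print(labels_demo[:10])      # cluster index assigned to the first few points
print(km.cluster_centers_)   # one centroid (mean) per cluster
print(km.inertia_)           # total squared distance to the assigned centroids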



The Assumptions in K-Means Clustering

Before we dive into the code, let’s thoroughly explain the fundamental assumptions that underlie K-Means clustering:

  1. Spherical and isotropic clusters: K-means assumes that clusters are spherical and isotropic, meaning that their spread is approximately equal in all directions. The algorithm places each cluster center at the mean of the data points assigned to that cluster. Because of this assumption, K-means struggles with non-spherical or elongated clusters.
  2. Equal variance: K-means assumes that all clusters have the same variance, i.e. that the distribution of data points around each cluster center is roughly the same. K-means may not work well if the cluster variances differ noticeably (the short sketch after this list illustrates the objective behind these first two assumptions).
  3. Similar cluster sizes: The K-means algorithm assumes that clusters contain similar numbers of points. Because each data point is assigned to the cluster with the closest mean, larger clusters pull their means more strongly, and the algorithm may misrepresent the underlying data distribution when cluster sizes are badly imbalanced.
  4. Anisotropically distributed data: When data points are anisotropically distributed, they form non-spherical, elongated clusters whose spread differs along different dimensions. This violates the spherical-cluster assumption of K-means and reduces its accuracy. For such intricate data structures, other techniques such as Gaussian Mixture Models may be more appropriate.
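The first two assumptions follow from the objective K-means optimizes: the within-cluster sum of squared Euclidean distances, exposed in scikit-learn as the inertia_ attribute. The sketch below (with illustrative parameter values, not part of the article's demonstration code) recomputes this objective by hand:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_toy, _ = make_blobs(n_samples=500, centers=3, random_state=0)
km_toy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# Within-cluster sum of squared distances, computed by hand
sse = sum(np.sum((X_toy[km_toy.labels_ == k] - km_toy.cluster_centers_[k]) ** 2)
          for k in range(3))
print(np.isclose(sse, km_toy.inertia_))  # True: inertia_ is exactly this sum

Because the objective only measures squared Euclidean distance to the nearest centroid, it implicitly favours compact, roughly spherical clusters of similar spread.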

Implementation: Demonstrating the K-means Assumptions in Scikit-Learn

Importing Libraries




# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
 
plt.figure(figsize=(10, 10))
 
# Custom parameters
n_samples_custom = 1600
random_state_custom = 42

This code imports the libraries used throughout the demonstration: NumPy for numerical operations, Matplotlib for plotting, and KMeans and make_blobs from scikit-learn for clustering and data generation. It also creates a 10x10 figure that will hold the four subplots, and defines the custom parameters n_samples_custom (the dataset size, 1600 samples) and random_state_custom (the random seed, fixed for reproducibility).



Generating Blobs




# Generate blobs with different characteristics
X_custom, y_custom = make_blobs(
    n_samples=n_samples_custom, random_state=random_state_custom)

This code creates a synthetic dataset with n_samples_custom samples using scikit-learn’s make_blobs function. The random_state_custom parameter fixes the random seed, which guarantees reproducibility. The features and labels of the resulting dataset are assigned to X_custom and y_custom, respectively.
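As a quick sanity check (an optional addition that reuses the variables defined above), the shape of the data and the label counts can be inspected; by default make_blobs draws three blobs of roughly equal size:

print(X_custom.shape)                           # (1600, 2): 1600 samples with 2 features
print(np.unique(y_custom, return_counts=True))  # three labels of roughly equal size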

Incorrect Number of Clusters




# Incorrect number of clusters
kmeans_1 = KMeans(n_clusters=2, random_state=random_state_custom)
y_pred_custom_1 = kmeans_1.fit_predict(X_custom)
 
plt.subplot(221)
plt.scatter(X_custom[:, 0], X_custom[:, 1], c=y_pred_custom_1)
plt.title("Incorrect Number of Blobs")

Output:

This code applies K-means clustering with n_clusters=2 to the synthetic dataset X_custom, even though the data was generated with three blobs. The resulting cluster assignments (y_pred_custom_1) are shown in a scatter plot created with plt.scatter. The "Incorrect Number of Blobs" subplot occupies the top-left position of the larger figure via plt.subplot(221).
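For contrast, a minimal sketch (reusing X_custom and random_state_custom from above; this comparison is an addition and is not part of the plotted 2x2 figure) fits K-means with the correct number of clusters:

# For comparison: the data was generated with three blobs, so n_clusters=3
kmeans_correct = KMeans(n_clusters=3, random_state=random_state_custom)
y_pred_correct = kmeans_correct.fit_predict(X_custom)
print(kmeans_correct.cluster_centers_)  # three centroids, one per generated blob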

Anisotropically Distributed Clusters




# Anisotropically distributed data
transformation_custom = [[0.5, -0.8], [-0.3, 0.6]]
X_aniso_custom = np.dot(X_custom, transformation_custom)
kmeans_2 = KMeans(n_clusters=3, random_state=random_state_custom)
y_pred_custom_2 = kmeans_2.fit_predict(X_aniso_custom)
 
plt.subplot(222)
plt.scatter(X_aniso_custom[:, 0], X_aniso_custom[:, 1], c=y_pred_custom_2)
plt.title("Anisotropicly Distributed Blobs")

Output:

This code introduces anisotropy by applying a linear transformation (transformation_custom) to the original features of the dataset. K-means with n_clusters=3 is then used to cluster the transformed data (X_aniso_custom), and the resulting cluster assignments (y_pred_custom_2) are displayed in a scatter plot in the subplot titled "Anisotropically Distributed Blobs".
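Since Gaussian Mixture Models were mentioned above as a better fit for elongated clusters, a minimal sketch of that alternative (reusing X_aniso_custom; this comparison is an addition and is not part of the original figure) fits a mixture whose components each have their own full covariance matrix:

from sklearn.mixture import GaussianMixture

# Each component gets its own full covariance matrix, so it can follow the
# elongated, rotated shape of a blob that a spherical K-means cluster cannot
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=random_state_custom)
y_pred_gmm = gmm.fit_predict(X_aniso_custom)
print(gmm.covariances_.shape)  # (3, 2, 2): one 2x2 covariance per component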

Unequal Variance




# Different variance
X_varied_custom, _ = make_blobs(n_samples=n_samples_custom, cluster_std=[
                                1.0, 3.0, 0.5], random_state=random_state_custom)
kmeans_3 = KMeans(n_clusters=3, random_state=random_state_custom)
y_pred_custom_3 = kmeans_3.fit_predict(X_varied_custom)
 
plt.subplot(223)
plt.scatter(X_varied_custom[:, 0], X_varied_custom[:, 1], c=y_pred_custom_3)
plt.title("Unequal Variance")

Output:

Using the make_blobs function, this code creates a dataset (X_varied_custom) whose clusters have different standard deviations (1.0, 3.0 and 0.5). K-means clustering with n_clusters=3 is then applied to the dataset, and the cluster assignments (y_pred_custom_3) are visualized in a scatter plot titled "Unequal Variance".

Unevenly Sized Blobs




# Unevenly sized blobs
X_filtered_custom = np.vstack(
    (X_custom[y_custom == 0][:500], X_custom[y_custom == 1][:100], X_custom[y_custom == 2][:10]))
kmeans_4 = KMeans(n_clusters=3, random_state=random_state_custom)
y_pred_custom_4 = kmeans_4.fit_predict(X_filtered_custom)
 
plt.subplot(224)
plt.scatter(X_filtered_custom[:, 0],
            X_filtered_custom[:, 1], c=y_pred_custom_4)
plt.title("Unevenly Sized Blobs")
 
plt.show()

Output:

This code takes the original dataset (X_custom) and selects a different number of samples from each cluster (500, 100 and 10 points) to create an unevenly sized dataset (X_filtered_custom). K-means clustering with n_clusters=3 is then applied to the modified dataset, and the resulting cluster assignments (y_pred_custom_4) are shown in a scatter plot in the subplot titled "Unevenly Sized Blobs". Finally, plt.show() displays the complete figure.
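As an optional quantitative check (an addition to the article's code; silhouette_score is part of scikit-learn's metrics module), the mean silhouette score of each of the four clusterings can be compared. Higher values indicate more compact and better separated clusters:

from sklearn.metrics import silhouette_score

print(silhouette_score(X_custom, y_pred_custom_1))           # incorrect number of clusters
print(silhouette_score(X_aniso_custom, y_pred_custom_2))     # anisotropic blobs
print(silhouette_score(X_varied_custom, y_pred_custom_3))    # unequal variance
print(silhouette_score(X_filtered_custom, y_pred_custom_4))  # unevenly sized blobs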

Conclusion

In this Scikit-Learn demonstration of K-means assumptions, we methodically investigated scenarios in which the algorithm’s assumptions can be violated. We first observed the sensitivity of K-means to the chosen number of clusters by fitting it with an incorrect number of clusters. Introducing an anisotropic distribution highlighted the limitations of K-means in handling non-spherical clusters, since the algorithm forms spherical clusters by default. The investigation of clusters with differing variances showed the difficulty K-means faces with unevenly spread clusters, and blobs of different sizes showed how sensitive the algorithm is to imbalanced cluster sizes. Each scenario revealed potential pitfalls, emphasizing how important it is to understand the K-means assumptions and to select a clustering method suited to the characteristics of the dataset. Alternative techniques such as Gaussian Mixture Models may provide more reliable solutions for complex structures like anisotropic or unevenly sized clusters. This demonstration underscores the importance of carefully selecting algorithms based on data properties and offers practical insight for practitioners navigating the subtleties of clustering algorithms.

