Prerequisite: K-means Clustering – Introduction
Drawback of standard K-means algorithm:
One disadvantage of the K-means algorithm is that it is sensitive to the initialization of the centroids or the mean points. So, if a centroid is initialized to be a “far-off” point, it might just end up with no points associated with it and at the same time more than one clusters might end up linked with a single centroid. Similarly, more than one centroids might be initialized into the same cluster resulting in poor clustering. For example, consider the images shown below.
A poor initialisation of centroids resulted in poor clustering.
This is how the clustering should have been:
To overcome the above-mentioned drawback we use K-means++. This algorithm ensures a smarter initialization of the centroids and improves the quality of the clustering. Apart from initialization, the rest of the algorithm is the same as the standard K-means algorithm. That is K-means++ is the standard K-means algorithm coupled with a smarter initialization of the centroids.
The steps involved are:
- Randomly select the first centroid from the data points.
- For each data point compute its distance from the nearest, previously choosen centroid.
- Select the next centroid from the data points such that the probability of choosing a point as centroid is directly proportional to its distance from the nearest, previously chosen centroid. (i.e. the point having maximum distance from the nearest centroid is most likely to be selected next as a centroid)
- Repeat steps 2 and 3 untill k centroids have been sampled
By following the above procedure for initialization, we pick up centroids which are far away from one another. This increases the chances of initially picking up centroids that lie in different clusters. Also, since centroids are picked up from the data points, each centroid has some data points associated with it at the end.
Consider a data-set having the following distribution:
Code : Python code for KMean++ Algorithm
Note: Although the initialization in K-means++ is computationally more expensive than the standard K-means algorithm, the run-time for convergence to optimum is drastically reduced for K-means++. This is because the centroids that are initially chosen are likely to lie in different clusters already.
- Different Types of Clustering Algorithm
- Asynchronous Advantage Actor Critic (A3C) algorithm
- Facebook News Feed Algorithm
- Gradient Descent algorithm and its variants
- k-nearest neighbor algorithm in Python
- ML | T-distributed Stochastic Neighbor Embedding (t-SNE) Algorithm
- ML | Mini Batch K-means clustering algorithm
- ML | Expectation-Maximization Algorithm
- ML | Reinforcement Learning Algorithm : Python Implementation using Q-learning
- Genetic Algorithm for Reinforcement Learning : Python implementation
- Silhouette Algorithm to determine the optimal value of k
- Elbow Method for optimal value of k in KMeans
- Implementing DBSCAN algorithm using Sklearn
- ML | ECLAT Algorithm
- Implementing Apriori algorithm in Python
If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to firstname.lastname@example.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.
Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.