Clustering is an unsupervised machine learning technique that groups the data points of a dataset into clusters based on how similar they are. Data points in the same cluster are similar to each other in some way, while data points in different clusters are dissimilar.
K-means and DBScan (Density Based Spatial Clustering of Applications with Noise) are two of the most popular clustering algorithms in unsupervised machine learning.
1. K-Means Clustering :
K-means is a centroid-based, or partition-based, clustering algorithm. It partitions all the points in the sample space into K groups of similar points. Similarity is usually measured using Euclidean distance.
The algorithm is as follows :
- K centroids are randomly placed, one for each cluster.
- Distance of each point from each centroid is calculated
- Each data point is assigned to its closest centroid, forming a cluster.
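- The position of each of the K centroids is recalculated as the mean of the points assigned to it.
- The distance, assignment and recalculation steps are repeated until the assignments stop changing or a maximum number of iterations is reached.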
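The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `k_means`, the `max_iters` and `seed` parameters, and the choice to start the centroids at K randomly chosen data points are all assumptions made for the example. The loop repeats the assignment and update steps until the assignments stop changing.

```python
import math
import random


def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Returns (labels, centroids), where labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    # Place K centroids at random (assumption: start from K distinct data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Compute the distance to every centroid and assign each point
        # to its closest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Recalculate each centroid as the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Usage: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random initialisation, so only the grouping itself is meaningful.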
2. DBScan Clustering :
DBScan is a density-based clustering algorithm. The key idea is that a cluster is a dense region: the neighbourhood of a point inside a cluster, taken within a given radius (R), must contain at least a minimum number of points (M). This algorithm has proved very effective at detecting outliers and handling noise.
The algorithm is as follows :
- The type of each point is determined. Each data point in the dataset is one of the following:
  - Core Point: A data point is a core point if there are at least M points in its neighborhood, i.e., within the specified radius (R).
  - Border Point: A data point is classified as a border point if:
    - Its neighborhood contains fewer than M data points, and
    - It is reachable from some core point, i.e., it is within distance R of a core point.
  - Outlier Point: An outlier is a point that is not a core point and is not close enough to any core point to be reachable from it.
- The outlier points are eliminated.
- Core points that are neighbors are connected and put in the same cluster.
- Each border point is assigned to the cluster of a core point from which it is reachable.
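The procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `dbscan`, the `NOISE = -1` label, and the decision to count a point as a member of its own neighborhood are choices made for this example. It uses a brute-force neighbor search, so it is only suitable for small datasets.

```python
import math

NOISE = -1  # label used in this sketch for outlier points (an assumption)


def dbscan(points, radius, min_points):
    """Return one cluster label per point; outliers are labelled NOISE."""
    n = len(points)
    # Neighborhood of each point: indices of all points within `radius`
    # (assumption: the point itself is included in the count).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]
    # A core point has at least `min_points` points in its neighborhood.
    is_core = [len(nb) >= min_points for nb in neighbors]

    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        # Start a new cluster from a core point that has no cluster yet.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster  # core or border point joins the cluster
                    if is_core[q]:
                        stack.append(q)  # only core points extend the cluster
        cluster += 1
    # Points still labelled NOISE are neither core nor reachable from a core point.
    return labels


# Usage: two dense squares of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, radius=1.5, min_points=3))
# prints [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

A border point that lies within R of core points from two different clusters is given to whichever cluster reaches it first in this sketch, so its label can depend on the order of the input.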
There are some notable differences between K-means and DBScan.
|S.No.||K-means Clustering||DBScan Clustering|
|1.||Clusters formed are more or less spherical or convex in shape and must have the same feature size.||Clusters formed are arbitrary in shape and may not have the same feature size.|
|2.||K-means clustering is sensitive to the number of clusters specified.||Number of clusters need not be specified.|
|3.||K-means Clustering is more efficient for large datasets.||DBScan Clustering cannot efficiently handle high-dimensional datasets.|
|4.||K-means Clustering does not work well with outliers and noisy datasets.||DBScan clustering efficiently handles outliers and noisy datasets.|
|5.||In the domain of anomaly detection, K-means causes problems because anomalous points are assigned to the same clusters as “normal” data points.||DBScan, on the other hand, locates regions of high density that are separated from one another by regions of low density, so anomalous points in low-density regions are left out of the clusters as outliers.|
|6.||It requires one parameter: Number of clusters (K).||It requires two parameters: Radius (R) and Minimum Points (M). R determines a chosen radius such that if it includes enough points within it, it is a dense area. M determines the minimum number of data points required in a neighborhood to be defined as a cluster.|
|7.||Varying densities of the data points have comparatively little effect on the K-means clustering algorithm.||DBScan clustering does not work very well for sparse datasets or for data points with varying density.|