Projected clustering in data analytics
This article discusses projected clustering in data analytics.
Projected Clustering :
Projected clustering (PROCLUS) is the first top-down partitioning projected clustering algorithm. It is based on the notion of k-medoid clustering and was presented by Aggarwal et al. (1999). It determines medoids for each cluster iteratively on a sample of the data using a greedy hill-climbing technique, and then iteratively improves the results. Cluster quality in projected clustering is a function of the average distance between data points and their closest medoid. The subspace dimensionality is also an input parameter, which tends to produce clusters of similar sizes.
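The quality measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation: it assumes plain Manhattan distance over all dimensions, and the function names `manhattan` and `average_distance_to_closest_medoid` are hypothetical.

```python
def manhattan(a, b):
    """Full-space Manhattan (L1) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def average_distance_to_closest_medoid(points, medoids):
    """Mean distance from each point to its nearest medoid (lower is better)."""
    total = sum(min(manhattan(p, m) for m in medoids) for p in points)
    return total / len(points)


# Toy data: two tight groups, one near (0, 0) and one near (10, 10).
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
good_medoids = [(0, 0), (10, 10)]   # one medoid per group
poor_medoids = [(0, 0), (0, 2)]     # both medoids in the same group

good_score = average_distance_to_closest_medoid(points, good_medoids)  # 1.0
poor_score = average_distance_to_closest_medoid(points, poor_medoids)  # 9.5
```

A hill-climbing search would compare candidate medoid sets by a score of this kind and keep a replacement only when the score improves.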
Features of Projected Clustering :
- Projected clustering is a typical dimension-reduction subspace clustering method. That is, instead of starting from single-dimensional spaces, it begins by finding an initial approximation of the clusters in the high-dimensional attribute space.
- Each dimension is then assigned a weight for each cluster, and the updated weights are used in the next iteration to regenerate the clusters. This leads to the exploration of dense regions in all subspaces of some desired dimensionality.
- It avoids generating a huge number of overlapping clusters in lower-dimensional subspaces.
- Projected clustering finds the best set of medoids by a hill-climbing technique that is generalized to deal with projected clustering.
- It uses a distance measure called the Manhattan segmental distance.
- The algorithm consists of three phases: initialization, iteration, and cluster refinement.
- Projected clustering is faster than CLIQUE because it samples large datasets, though using a small number of representative points can cause the algorithm to miss some clusters entirely.
- Experiments on projected clustering show that the method is efficient and scalable at finding high-dimensional clusters. The algorithm finds non-overlapping partitions of points.
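To make the Manhattan segmental distance concrete: relative to a set of dimensions D, it is the Manhattan distance restricted to D and divided by the number of dimensions in D, so that clusters with different numbers of dimensions remain comparable. The sketch below assumes points are plain tuples and that each medoid already has its own list of dimensions. The helper names are hypothetical.

```python
def manhattan_segmental_distance(x, y, dims):
    """Average absolute difference between x and y over the dimensions in dims."""
    return sum(abs(x[j] - y[j]) for j in dims) / len(dims)


def assign_points(points, medoids, medoid_dims):
    """Assign each point to the medoid that is nearest when the distance is
    measured only in that medoid's own dimensions."""
    labels = []
    for p in points:
        dists = [
            manhattan_segmental_distance(p, m, dims)
            for m, dims in zip(medoids, medoid_dims)
        ]
        labels.append(dists.index(min(dists)))
    return labels


x, y = (1, 5, 9), (2, 5, 1)
d_sub = manhattan_segmental_distance(x, y, [0, 1])      # (1 + 0) / 2 = 0.5
d_all = manhattan_segmental_distance(x, y, [0, 1, 2])   # (1 + 0 + 8) / 3 = 3.0

medoids = [(0, 0, 0), (10, 10, 10)]
medoid_dims = [[0, 1], [1, 2]]            # each cluster lives in its own subspace
points = [(1, 0, 50), (50, 9, 10)]
labels = assign_points(points, medoids, medoid_dims)    # [0, 1]
```

The first point is far from the first medoid in dimension 2, but that dimension is not part of the first cluster's subspace, so the point is still assigned to it.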
Input and Output for Projected Clustering :
- Input: the set of data points.
- Input: the number of clusters, denoted by k.
- Input: the average number of dimensions per cluster, denoted by L.
- Output: the clusters identified, and the dimensions associated with each cluster.
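The way k and L interact can be illustrated with a small dimension-selection sketch. It assumes a per-medoid table of average distances along each dimension has already been computed from the points near each medoid. Dimensions are then ranked by a z-score within each medoid, and roughly k × L of them are picked in total, with at least two per cluster. This selection rule, the function name `select_dimensions`, and the input layout are assumptions made for illustration only.

```python
from statistics import mean, stdev


def select_dimensions(avg_dist, avg_dims):
    """avg_dist[i][j]: average distance along dimension j between medoid i
    and the points near it. Returns one sorted list of dimensions per medoid,
    about k * avg_dims in total, with at least two per medoid.
    Assumes each row has some spread (non-zero standard deviation)."""
    k = len(avg_dist)
    scored = []
    for i, row in enumerate(avg_dist):
        mu, sigma = mean(row), stdev(row)
        for j, value in enumerate(row):
            # Strongly negative z-score: points are unusually tight along j.
            scored.append(((value - mu) / sigma, i, j))
    scored.sort()

    # First pass: every medoid gets its two best (lowest-scoring) dimensions.
    chosen = []
    for i in range(k):
        best = sorted((z, j) for z, m, j in scored if m == i)[:2]
        chosen.append([j for _, j in best])

    # Second pass: hand out the remaining budget greedily by z-score.
    remaining = int(round(k * avg_dims)) - 2 * k
    for _, i, j in scored:
        if remaining <= 0:
            break
        if j not in chosen[i]:
            chosen[i].append(j)
            remaining -= 1
    return [sorted(dims) for dims in chosen]


avg_dist = [
    [0.5, 0.4, 9.0, 8.0],   # medoid 0: tight in dimensions 0 and 1
    [7.0, 0.3, 0.2, 0.1],   # medoid 1: tight in dimensions 1, 2 and 3
]
dims = select_dimensions(avg_dist, avg_dims=2.5)   # [[0, 1], [1, 2, 3]]
```

With k = 2 and L = 2.5 the total budget is five dimensions, so one cluster ends up with two dimensions and the other with three. L is an average, not a fixed size per cluster.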