In this article, we discuss the different phases of projected clustering in data analytics in detail.
Three Phases for Projected Clustering :
- Initialization Phase
- Iterative Phase
- Refinement Phase
Each phase is explained below.
1. Initialization Phase :
This phase selects a superset of candidate medoids in two steps.
- In the first step, it picks a random sample of data points whose size is proportional to the number of clusters the user wishes to produce:
S = random sample of size A.k,
where A is a constant and k is the number of clusters.
- The second step uses a greedy method to reduce this sample to a final set of B.k points, where B is a small constant.
This set is denoted M, and the hill climbing technique is applied to it during the next phase.
In summary:
- Pick a sample set of data points at random.
- From that sample, pick a set of data points that probably contains the medoids of the clusters.
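The two steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `greedy_init`, the default values of A and B, and the choice of Euclidean distance are all assumptions made for the example. The greedy step is written here as farthest-point selection, where each new candidate is the sampled point farthest from the candidates chosen so far.

```python
import math
import random


def greedy_init(points, k, A=30, B=5, seed=0):
    """Return the candidate medoid set M (hypothetical helper).

    points: list of equal-length tuples of floats.
    A, B:   illustrative constants, assumed for this sketch.
    """
    rng = random.Random(seed)
    # Step 1: random sample S of size A*k (capped at the data size).
    sample = rng.sample(points, min(A * k, len(points)))
    # Step 2: greedily keep B*k points that are far apart from each other.
    target = min(B * k, len(sample))
    chosen = [sample[rng.randrange(len(sample))]]
    # dist[i] = distance from sample[i] to its nearest chosen point.
    dist = [math.dist(p, chosen[0]) for p in sample]
    while len(chosen) < target:
        idx = max(range(len(sample)), key=dist.__getitem__)
        chosen.append(sample[idx])
        dist = [min(d, math.dist(p, sample[idx]))
                for d, p in zip(dist, sample)]
    return chosen
```

Calling `greedy_init(points, k=3, A=20, B=4)` on a list of at least 60 distinct points returns 12 well-separated candidates, which play the role of M in the next phase.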
2. Iterative Phase :
The initialization phase yields a set of data points M that should contain the medoids. In this phase, we find the best medoids from M. The algorithm randomly picks a set of k points from M, denoted M current, and replaces the “bad” medoids with other points from M whenever doing so improves the cluster quality. The best medoid set found in this way is denoted M best.
For the current medoids, the following steps are carried out.
- Identify the dimensions associated with each medoid.
- Assign data points to the medoids.
- Evaluate the clusters formed.
- Identify the poor medoids and evaluate the result of replacing them.
- Repeat the above procedure until a satisfactory result is obtained.
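The first two steps of the loop (identifying dimensions and assigning points) can be sketched as below. This is a simplified illustration under stated assumptions: every medoid receives exactly `l` dimensions, chosen as those along which the points near the medoid deviate least from it, and points are assigned with a Manhattan distance restricted to those dimensions. Published projected clustering algorithms such as PROCLUS distribute the dimensions across medoids in a more elaborate way. The function names are hypothetical, medoids are assumed to be members of `points`, and at least two medoids are required.

```python
import math


def find_dimensions(points, medoids, l):
    """For each medoid, pick the l dimensions where nearby points are tightest."""
    dims = []
    for i, m in enumerate(medoids):
        # Locality radius: distance to the nearest other medoid.
        delta = min(math.dist(m, o) for j, o in enumerate(medoids) if j != i)
        local = [p for p in points if math.dist(p, m) <= delta]
        # Average absolute deviation from the medoid along each dimension.
        avg = [sum(abs(p[d] - m[d]) for p in local) / len(local)
               for d in range(len(m))]
        best = sorted(range(len(m)), key=avg.__getitem__)[:l]
        dims.append(sorted(best))
    return dims


def segmental_distance(p, m, dims):
    """Manhattan distance over dims only, divided by the number of dims."""
    return sum(abs(p[d] - m[d]) for d in dims) / len(dims)


def assign_points(points, medoids, dims):
    """Label each point with the index of its closest medoid."""
    return [min(range(len(medoids)),
                key=lambda i: segmental_distance(p, medoids[i], dims[i]))
            for p in points]
```

Dividing by the number of dimensions keeps distances comparable between medoids that end up with different numbers of dimensions, which matters once the per-medoid counts are allowed to differ.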
3. Refinement Phase (Handling Outliers) :
- The final step of the algorithm is the refinement phase, which improves the quality of the clusters formed.
- The clusters C1, C2, C3, ..., Ck formed during the iterative phase are the input to this phase.
- The original data set is passed over one or more times to improve the quality of the clusters.
- The dimension sets Di found during the iterative phase are discarded, and new dimension sets are computed for each cluster Ci.
- Once the new dimensions have been computed for the clusters, the points are reassigned to the medoids relative to these new sets of dimensions.
- Outliers are determined in the last pass over the data.
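The refinement steps can be sketched as follows. The text above does not state how outliers are decided, so the rule used here is an assumption: a point is flagged as an outlier when it lies outside every medoid's sphere of influence, where the radius of a medoid is its distance to the nearest other medoid measured in its own dimensions. The function names are hypothetical, each cluster is assumed to be non-empty, and `l` dimensions are again kept per cluster for simplicity.

```python
def segmental_distance(p, m, dims):
    """Manhattan distance over dims only, divided by the number of dims."""
    return sum(abs(p[d] - m[d]) for d in dims) / len(dims)


def refine_dimensions(clusters, medoids, l):
    """Recompute each medoid's dimensions from its own cluster members."""
    dims = []
    for members, m in zip(clusters, medoids):
        avg = [sum(abs(p[d] - m[d]) for p in members) / len(members)
               for d in range(len(m))]
        best = sorted(range(len(m)), key=avg.__getitem__)[:l]
        dims.append(sorted(best))
    return dims


def find_outliers(points, medoids, dims):
    """Return the points that fall outside every medoid's radius."""
    # Radius of medoid i: distance to its nearest other medoid in dims[i].
    radius = [min(segmental_distance(medoids[j], m, dims[i])
                  for j in range(len(medoids)) if j != i)
              for i, m in enumerate(medoids)]
    return [p for p in points
            if all(segmental_distance(p, m, dims[i]) > radius[i]
                   for i, m in enumerate(medoids))]
```

The difference from the iterative phase is the input to the dimension computation: the cluster members Ci are used instead of the points in a neighbourhood around each medoid, so the new dimension sets reflect the clusters actually formed.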
Major Drawback :
- The algorithm requires the average number of dimensions per cluster as an input parameter. The performance of projected clustering is highly sensitive to the value of this parameter.
- If the average number of dimensions is estimated incorrectly, the performance of projected clustering degrades significantly.