Different phases of projected clustering in data analytics
In this article, we discuss the different phases of projected clustering in data analytics in detail.
Three Phases for Projected Clustering :
- Initialization Phase
- Iterative Phase
- Refinement Phase
These are explained below.
1. Initialization Phase :
This phase consists of two steps that select a superset of candidate medoids.
- In the first step, a random sample of data points is drawn whose size is proportional to the number of clusters the user wishes to produce:
S = random sample of size A.k,
where A is a constant and k represents the number of clusters.
- In the second step, a greedy method selects from this sample a final set of B.k points, where B is a small constant.
This set is denoted M, and the hill climbing technique is applied to it during the next phase.
In short:
- Pick a sample set of data points at random.
- From it, pick a set of data points that is likely to contain the medoids of the clusters.
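The two steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the text does not say which greedy rule is used, so the sketch assumes farthest-point selection (each new candidate is the sample point farthest from those already chosen). The function name `initialize_candidates`, the default values of `A` and `B`, and the use of Euclidean distance are all assumptions made for the example.

```python
import random

def euclidean(p, q):
    """Full-dimensional Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def initialize_candidates(points, k, A=30, B=5, seed=0):
    """Return a candidate medoid set M of size at most B*k (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 1: random sample S of size A*k, capped at the data set size.
    sample = rng.sample(points, min(A * k, len(points)))
    # Step 2: greedy farthest-point selection of B*k points from S (assumed rule).
    target = min(B * k, len(sample))
    first = rng.randrange(len(sample))
    chosen = [first]
    # dist[i] holds the distance from sample[i] to its nearest chosen point.
    dist = [euclidean(p, sample[first]) for p in sample]
    while len(chosen) < target:
        nxt = max(range(len(sample)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, euclidean(p, sample[nxt])) for d, p in zip(dist, sample)]
    return [sample[i] for i in chosen]
```

Farthest-point selection tends to spread the candidates out, which makes it more likely that every true cluster contributes at least one point to M. If the data contains exact duplicate points, this sketch may return duplicates in M.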
2. Iterative Phase :
The initialization phase yields a set of data points M that should contain the medoids. In this phase, we find the best medoids from M. A set of k points, M current, is picked at random from M, and the “bad” medoids in it are replaced with other points from M whenever doing so improves cluster quality. The best medoid set found so far is denoted M best.
For the medoids, the following steps are carried out.
- Identify the dimensions associated with each medoid.
- Assign data points to the medoids.
- Evaluate the clusters formed.
- Identify the bad medoid, and try the result of replacing it.
- Repeat the above procedure until a satisfactory result is obtained.
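The loop above can be sketched as a simple hill climb. The sketch below makes several simplifying assumptions that are not stated in the text: each medoid keeps exactly `l` dimensions (the ones with the smallest average spread in its locality), distances are Manhattan distances averaged over the selected dimensions, cluster quality is the mean distance of points to their medoid, and the "bad" medoid is the one with the smallest cluster. All function names and the `max_stale` stopping rule are hypothetical. It also assumes k is at least 2 and that the points in M come from the data set.

```python
import random

def segmental_distance(p, q, dims):
    """Manhattan distance averaged over the selected dimensions only."""
    return sum(abs(p[d] - q[d]) for d in dims) / len(dims)

def find_dimensions(points, medoids, l):
    """For each medoid, keep the l dimensions with the smallest local spread."""
    all_dims = range(len(medoids[0]))
    result = []
    for i, m in enumerate(medoids):
        # Locality radius: distance to the nearest other medoid.
        radius = min(segmental_distance(m, o, all_dims)
                     for j, o in enumerate(medoids) if j != i)
        local = [p for p in points if segmental_distance(p, m, all_dims) <= radius]
        spread = [sum(abs(p[d] - m[d]) for p in local) / len(local) for d in all_dims]
        result.append(sorted(all_dims, key=spread.__getitem__)[:l])
    return result

def assign(points, medoids, dims):
    """Assign each point to the medoid that is closest in that medoid's dimensions."""
    clusters = [[] for _ in medoids]
    for p in points:
        best = min(range(len(medoids)),
                   key=lambda i: segmental_distance(p, medoids[i], dims[i]))
        clusters[best].append(p)
    return clusters

def evaluate(clusters, medoids, dims):
    """Lower is better: mean segmental distance of points to their medoid."""
    total = sum(segmental_distance(p, m, d)
                for c, m, d in zip(clusters, medoids, dims) for p in c)
    return total / sum(len(c) for c in clusters)

def iterative_phase(points, M, k, l, max_stale=20, seed=0):
    rng = random.Random(seed)
    current = rng.sample(M, k)  # M current
    best, best_score, best_clusters, best_dims = None, float("inf"), None, None
    stale = 0
    while stale < max_stale:
        dims = find_dimensions(points, current, l)
        clusters = assign(points, current, dims)
        score = evaluate(clusters, current, dims)
        if score < best_score:  # keep the improved set as M best
            best, best_score = list(current), score
            best_clusters, best_dims = clusters, dims
            stale = 0
        else:
            stale += 1
        # Assumed rule: the bad medoid is the one with the smallest cluster.
        bad = min(range(k), key=lambda i: len(best_clusters[i]))
        spare = [p for p in M if p not in best]
        if not spare:
            break
        current = list(best)
        current[bad] = rng.choice(spare)
    return best, best_dims, best_clusters
```

Each trial always starts again from M best, so a replacement that makes the clustering worse is simply discarded on the next round.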
3. Refinement Phase - Handle Outliers :
- The final step of the algorithm is the refinement phase. This phase improves the quality of the clusters formed.
- The clusters C1, C2, C3, …, Ck formed during the iterative phase are the input to this phase.
- The original data set is passed over one or more times to improve the quality of the clusters.
- The dimension sets Di found during the iterative phase are discarded, and new dimension sets are computed for each cluster Ci.
- Once the new dimensions are computed for the clusters, the points are reassigned to the medoids relative to these new sets of dimensions.
- Outliers are determined in the last pass over the data.
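A single refinement pass might look like the sketch below. The text does not say how outliers are detected, so the sketch assumes one plausible rule: a point is an outlier when, for every medoid, its distance in that medoid's dimensions exceeds the medoid's distance to the nearest other medoid. The function `refine`, the fixed count `l` of dimensions per cluster, and the averaged Manhattan distance are assumptions made for illustration, and at least two medoids are required.

```python
def segmental_distance(p, q, dims):
    """Manhattan distance averaged over the selected dimensions only."""
    return sum(abs(p[d] - q[d]) for d in dims) / len(dims)

def refine(points, medoids, clusters, l):
    """One refinement pass: new dimension sets, reassignment, outlier detection."""
    n_dims = len(medoids[0])
    # Discard the old dimension sets and recompute them from the cluster members.
    new_dims = []
    for m, c in zip(medoids, clusters):
        members = c or [m]  # guard against an empty cluster
        spread = [sum(abs(p[d] - m[d]) for p in members) / len(members)
                  for d in range(n_dims)]
        new_dims.append(sorted(range(n_dims), key=spread.__getitem__)[:l])
    # Assumed outlier threshold: distance to the nearest other medoid,
    # measured in the medoid's own dimensions.
    radius = [min(segmental_distance(m, o, new_dims[i])
                  for j, o in enumerate(medoids) if j != i)
              for i, m in enumerate(medoids)]
    new_clusters = [[] for _ in medoids]
    outliers = []
    for p in points:
        d = [segmental_distance(p, m, new_dims[i]) for i, m in enumerate(medoids)]
        if all(di > ri for di, ri in zip(d, radius)):
            outliers.append(p)  # too far from every medoid
        else:
            new_clusters[d.index(min(d))].append(p)
    return new_dims, new_clusters, outliers
```

Computing the dimensions from the actual cluster members, rather than from the localities used during the iterative phase, is what lets this pass correct dimension sets that were chosen from a poor neighbourhood estimate.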
Major Drawback :
- The algorithm requires the average number of dimensions per cluster as an input parameter. The performance of projected clustering is highly sensitive to the value of this parameter.
- If the average number of dimensions is estimated incorrectly, the performance of projected clustering degrades significantly.