Clustering High-Dimensional Data in Data Mining

Last Updated : 22 Mar, 2022

Clustering is a type of unsupervised learning method. An unsupervised learning method draws inferences from datasets consisting of input data without labeled responses.

Clustering is the task of dividing a population or set of data points into groups such that the points in one group are more similar to each other than to the points in other groups.

Clustering in Data Mining

Challenges of Clustering High-Dimensional Data:

Clustering high-dimensional data returns groups of objects, which are the clusters. Cluster analysis requires grouping similar objects together, but a high-dimensional data space is huge and contains complex data types and attributes. A major challenge is finding the set of attributes that is relevant to each cluster, because a cluster is defined and characterized by those attributes. When clustering high-dimensional data, we therefore need to search both for the clusters and for the subspaces in which they exist.

High-dimensional data is often reduced to low-dimensional data to simplify clustering and the search for clusters. Some applications, however, need cluster models suited specifically to high-dimensional data. Clusters in high-dimensional data are often significantly small, and conventional distance measures can be ineffective because the distances between points tend to become nearly equal as the number of dimensions grows. To find the hidden clusters in high-dimensional data, we instead need to apply sophisticated techniques that can model correlations among the objects in subspaces.
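The weakness of conventional distance measures can be illustrated with a small experiment. The sketch below is only an illustration under stated assumptions: the points are drawn uniformly at random from the unit hypercube, and the helper name relative_contrast is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points uniform random points in [0, 1)^n_dims."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


# The contrast between nearest and farthest neighbour shrinks as the
# number of dimensions grows.
for dims in (2, 10, 100, 1000):
    print(dims, round(relative_contrast(200, dims), 3))
```

When the contrast is small, the nearest and farthest points are almost equally far away, so a full-space distance says little about which objects belong together.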

Subspace Clustering Methods:

There are three subspace clustering methods:

Subspace search methods
Correlation-based clustering methods
Biclustering methods

Subspace clustering approaches search for clusters existing in subspaces of the given high-dimensional data space, where a subspace is defined using a subset of attributes in the full space.

Overview of the different high-dimensional data clustering

1. Subspace Search Methods: A subspace search method searches a series of subspaces for clusters. Here, a cluster is a group of similar objects in a subspace, and the similarity between objects is measured using distance or density features. The CLIQUE algorithm is an example of a subspace clustering method. There are two approaches to subspace search. The bottom-up approach starts from low-dimensional subspaces and moves on to higher-dimensional subspaces only when they may contain clusters. The top-down approach starts from the high-dimensional space and then searches progressively lower-dimensional subspaces. Top-down approaches are effective only if the subspace of a cluster can be determined from its local neighborhood.
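The bottom-up idea can be sketched with a grid, in the spirit of CLIQUE. This is a simplified illustration rather than the full algorithm: it covers only the first two levels of the search, assumes attribute values are already scaled to the range 0 to 1, and the function name dense_units and the parameters n_intervals and min_points are hypothetical.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, n_intervals=4, min_points=3):
    """Return the dense 1-D and 2-D grid units for points scaled to [0, 1].

    A 1-D unit is a (dimension, interval) pair. A 2-D unit is a frozenset
    of two such pairs taken from different dimensions.
    """
    def cell(value):
        # Clamp so that a value of exactly 1.0 falls in the last interval.
        return min(int(value * n_intervals), n_intervals - 1)

    cells = [[cell(v) for v in p] for p in points]
    n_dims = len(points[0])

    # Level 1: count the points in every interval of every dimension.
    counts_1d = Counter((d, row[d]) for row in cells for d in range(n_dims))
    dense_1d = {unit for unit, c in counts_1d.items() if c >= min_points}

    # Level 2: a 2-D unit can be dense only if both of its 1-D projections
    # are dense, so only those candidates are counted.
    counts_2d = Counter()
    for row in cells:
        kept = [(d, row[d]) for d in range(n_dims) if (d, row[d]) in dense_1d]
        for a, b in combinations(kept, 2):
            counts_2d[frozenset((a, b))] += 1
    dense_2d = {unit for unit, c in counts_2d.items() if c >= min_points}
    return dense_1d, dense_2d


# Four objects agree on attributes 0 and 1 but are spread out on attribute 2.
points = [
    (0.10, 0.80, 0.05),
    (0.12, 0.85, 0.40),
    (0.15, 0.78, 0.65),
    (0.20, 0.90, 0.95),
    (0.60, 0.30, 0.30),
    (0.90, 0.10, 0.70),
]
dense_1d, dense_2d = dense_units(points)
print(sorted(dense_1d))
print(dense_2d)
```

In this toy data the only dense 2-D unit lies in the subspace formed by attributes 0 and 1, and attribute 2 is never combined with anything because none of its intervals is dense. Pruning candidates this way is what keeps a bottom-up search tractable.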

2. Correlation-Based Clustering: Correlation-based approaches discover hidden clusters by developing advanced correlation models. They are preferred when it is not possible to cluster the objects using subspace search methods. Correlation-based clustering includes advanced mining techniques for correlation cluster analysis. Biclustering methods are correlation-based clustering methods in which both the objects and the attributes are clustered.
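One common ingredient of correlation models is principal component analysis: if a group of objects lies close to a line or plane, almost all of its variance falls along a few directions. The sketch below is a minimal two-attribute illustration of that test, not a complete clustering algorithm. The function names and the max_minor_share threshold are assumptions made for this example.

```python
import math


def covariance_eigenvalues(xs, ys):
    """Return (larger, smaller) eigenvalues of the 2x2 sample covariance matrix."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of [[var_x, cov_xy], [cov_xy, var_y]].
    centre = (var_x + var_y) / 2
    radius = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    return centre + radius, centre - radius


def is_correlation_cluster(xs, ys, max_minor_share=0.05):
    """True if nearly all variance lies along one direction (hypothetical threshold)."""
    major, minor = covariance_eigenvalues(xs, ys)
    return minor / (major + minor) <= max_minor_share


# Objects that roughly follow y = 2x + 1, and objects with no clear relationship.
line_x = [1, 2, 3, 4, 5, 6, 7, 8]
line_y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.0, 17.1]
scatter_x = [1, 2, 3, 4, 5, 6, 7, 8]
scatter_y = [5, 1, 7, 2, 8, 3, 6, 4]

print(is_correlation_cluster(line_x, line_y))
print(is_correlation_cluster(scatter_x, scatter_y))
```

The first group is spread widely along both attributes, so a distance-based method would not see it as compact, yet the correlation test recognizes it as a single pattern.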

3. Biclustering Methods:

Biclustering means clustering the data along two dimensions at once: in some applications, we can cluster both objects and attributes at the same time. The resulting clusters are called biclusters. Biclustering applies when four requirements hold:

Only a small set of objects participate in a cluster.
A cluster only involves a small number of attributes.
A data object can take part in multiple clusters, or may not be included in any cluster.
An attribute may be involved in multiple clusters.

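In conventional clustering, objects and attributes are not treated in the same way: objects are clustered according to their attribute values. In biclustering analysis, objects and attributes are treated symmetrically, so a bicluster is a subset of objects together with a subset of attributes.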
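One way to judge whether a chosen set of objects and attributes forms a coherent bicluster is the mean squared residue used in the Cheng and Church approach. That measure is not described in this article, so the sketch below should be read as an illustration under that assumption. The function name and the data matrix are made up for the example.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix selected by rows and cols.

    A value near 0 means every selected row follows the same pattern
    across the selected columns, up to a constant shift.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Rows are objects, columns are attributes (illustrative values).
data = [
    [1.0, 5.0, 2.0, 9.0],
    [2.0, 0.0, 3.0, 4.0],
    [4.0, 7.0, 5.0, 1.0],
    [8.0, 3.0, 6.0, 2.0],
]

# Objects 0 to 2 on attributes 0 and 2 differ only by a constant shift.
print(mean_squared_residue(data, [0, 1, 2], [0, 2]))
# The full matrix shows no such shared pattern.
print(mean_squared_residue(data, [0, 1, 2, 3], [0, 1, 2, 3]))
```

The selected submatrix uses only three of the four objects and two of the four attributes, which matches the requirements that a bicluster involves a small set of objects and a small number of attributes.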
