Principal Component Analysis in Data Mining

Dimension reduction is a necessary step in the effective analysis of massive high-dimensional datasets. It may be the main objective in Data Mining for the analysis and visualization of the high-dimensional data or it may be an intermediate step that enables some other analysis such as clustering.

Principal Component Analysis (PCA) is a data reduction technique that transforms a large number of correlated variables into a smaller set of uncorrelated variables called principal components. In simple terms, PCA extracts a set of low-dimensional features from a high-dimensional dataset, with the goal of capturing as much of the information (variance) in the data as possible. Each principal component is a linear combination of the original variables, so PCA constructs new features rather than selecting a subset of the existing ones.

Principal Component Analysis is mainly used as a dimensionality reduction technique in applications such as computer vision and image compression. It can also be used to find hidden patterns in data with many dimensions. Fields that use Principal Component Analysis include finance, data mining, and psychology.

Steps Involved in the Principal Component Analysis:

The main steps involved in Principal Component Analysis are given below:

Standardize the dataset.
Compute the covariance matrix for the features in the dataset.
Compute the eigenvalues and eigenvectors for the covariance matrix.
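Sort the eigenvalues and their corresponding eigenvectors in descending order of eigenvalue.
Choose the k eigenvectors with the largest eigenvalues to form the projection matrix.
Transform the standardized data by projecting it onto the selected eigenvectors.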
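
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not a reference implementation: the function name pca, the toy data, and the choice k=2 are assumptions made for the example, and production code would more likely rely on an established library.

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components.

    Hypothetical helper written for illustration only.
    """
    # 1. Standardize: zero mean and unit variance for each feature.
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # guard against constant features
    Z = (X - mean) / std

    # 2. Covariance matrix of the standardized features.
    cov = np.cov(Z, rowvar=False)

    # 3. Eigenvalues and eigenvectors; eigh is meant for symmetric
    #    matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 4. Sort eigenpairs by descending eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]

    # 5. Keep the k eigenvectors with the largest eigenvalues.
    W = eigvecs[:, :k]

    # 6. Project the standardized data onto those eigenvectors.
    scores = Z @ W
    return scores, eigvals, W

# Toy data (assumed): 200 samples, 5 features, two of them strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

scores, eigvals, W = pca(X, k=2)
print(scores.shape)
print(eigvals)
```

In this sketch the columns of scores are uncorrelated with each other, and the variance of each column equals the corresponding eigenvalue.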

Uses:

There are many uses of Principal Component Analysis in Data Mining. Some of them are listed below:

It is used to find interrelations between variables in the data.
It is used to interpret and visualize data.
It reduces the number of variables, which makes further analysis simpler.
It is often used to visualize genetic distance and relatedness between populations.

Advantages:

It helps in data compression and removes correlation between features.
It helps in speeding up other data mining algorithms.
It converts high-dimensional data into low-dimensional data, which makes visualization easier.

Disadvantages:

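It may lead to some loss of information, since the variance in the discarded components is not retained.
It captures only linear relationships between variables, so nonlinear structure in the data may be missed.
It fails in cases where the mean and covariance are not enough to describe the dataset, for example when the data is far from Gaussian.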
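
The information loss mentioned above can be quantified. The sketch below, which uses assumed toy data and centers the features without scaling them to keep the example short, computes the fraction of variance explained by each component and checks that the variance lost by keeping k components equals the sum of the discarded eigenvalues.

```python
import numpy as np

# Toy data (assumed): the third feature is almost a linear
# combination of the first two, so two components should suffice.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
X = np.column_stack([a, b, a + b + 0.05 * rng.normal(size=300)])

# Center the data and decompose its covariance matrix.
Z = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of total variance captured by each component,
# and the running total as more components are kept.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

# Keep k components, then map back to the original feature space.
k = 2
W = eigvecs[:, :k]
Z_hat = (Z @ W) @ W.T

# Variance lost in the reconstruction (same n - 1 divisor as np.cov).
lost = np.sum((Z - Z_hat) ** 2) / (Z.shape[0] - 1)

print(cumulative)
print(lost, eigvals[k:].sum())
```

A common heuristic, not stated in the original text, is to pick the smallest k whose cumulative explained variance exceeds a chosen threshold.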
