ML | Principal Component Analysis (PCA)


This method was introduced by Karl Pearson. It works on the condition that when data in a higher-dimensional space is mapped to a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximized.

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a large dataset. It is a commonly used method in machine learning, data science, and other fields that deal with large datasets.

  1. PCA works by identifying patterns in the data and then creating new variables that capture as much of the variation in the data as possible. These new variables, known as principal components, are linear combinations of the original variables in the dataset.
  2. The first principal component captures the most variation in the data, the second captures the second most, and so on. The number of principal components is at most equal to the number of original variables in the dataset.
  3. PCA can be used for a variety of purposes, including data visualization, feature selection, and data compression. In data visualization, PCA can be used to plot high-dimensional data in two or three dimensions, making it easier to interpret. In feature selection, PCA can be used to identify the most important variables in a dataset. In data compression, PCA can be used to reduce the size of a dataset without losing important information.

Overall, PCA is a powerful tool for data analysis and can help to simplify complex datasets, making them easier to understand and work with.

The goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set, that retains most of the sample's information and is useful for the compression and classification of data.

In PCA, it is assumed that the information is carried in the variance of the features, that is, the higher the variation in a feature, the more information that feature carries.

Hence, PCA employs a linear transformation that preserves the most variance in the data using the least number of dimensions. It involves the following steps:

1. Construct the covariance matrix of the (mean-centered) data.

2. Compute the eigenvectors of this matrix.

3. Use the eigenvectors corresponding to the largest eigenvalues to reconstruct a large fraction of the variance of the original data.

The data instances are projected onto a lower-dimensional space where the new features best represent the entire data in the least-squares sense.

It can be shown that the optimal approximation, in the least-squares-error sense, of a d-dimensional random vector x ∈ ℝ^d by a linear combination of independent vectors is obtained by projecting x onto the eigenvectors e_i corresponding to the largest eigenvalues λ_i of the covariance matrix (or the scatter matrix) of the data from which x is drawn.

The eigenvectors of the covariance matrix of the data are referred to as the principal axes of the data, and the projections of the data instances onto these principal axes are called the principal components. Dimensionality reduction is then obtained by retaining only those axes (dimensions) that account for most of the variance and discarding all others.
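To make these steps concrete, here is a minimal NumPy sketch of eigendecomposition-based PCA (the function name pca_transform and the toy data are illustrative, not from any library):

Python3

import numpy as np

def pca_transform(X, n_components):
    # center the data so the covariance matrix is meaningful
    X_centered = X - X.mean(axis=0)

    # Step 1: covariance matrix of the data (features in columns)
    cov = np.cov(X_centered, rowvar=False)

    # Step 2: eigenvalues and eigenvectors (eigh suits symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Step 3: keep the eigenvectors with the largest eigenvalues
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]

    # project the instances onto the principal axes
    return X_centered @ components

X = np.random.rand(100, 5)        # toy data: 100 samples, 5 features
print(pca_transform(X, 2).shape)  # (100, 2)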

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA is one of the most widely used tools in exploratory data analysis and in machine learning for predictive models. Moreover, PCA is an unsupervised statistical technique used to examine the interrelations among a set of variables. It is also related to general factor analysis, in which regression determines a line of best fit. Modules needed:

Python3

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# display plots inline when running in a Jupyter notebook
%matplotlib inline

Code #1: 

Python3

# Here we are using the built-in breast cancer dataset of scikit-learn
from sklearn.datasets import load_breast_cancer

# loading the dataset
cancer = load_breast_cancer()

# creating a dataframe of the 30 numeric features
df = pd.DataFrame(cancer['data'], columns = cancer['feature_names'])

# checking the head of the dataframe
df.head()

Output:

The first five rows of the dataframe, with one column for each of the 30 numeric features.

Code #2:

Python3

# Importing the StandardScaler module
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fitting the scaler and standardizing the data
scaler.fit(df)
scaled_data = scaler.transform(df)

# Importing PCA
from sklearn.decomposition import PCA

# Let's say, components = 2
pca = PCA(n_components = 2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)

x_pca.shape

Output:

(569, 2)
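As a quick, optional check (not part of the original walkthrough), the explained_variance_ratio_ attribute of the fitted scikit-learn PCA shows how much of the total variance the two retained components capture, and the covariance of the projected data confirms that the new variables are (nearly) uncorrelated:

Python3

# fraction of the total variance captured by each retained component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())

# the covariance matrix of the projected data is (close to) diagonal,
# i.e. the principal components are uncorrelated
print(np.cov(x_pca, rowvar = False))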

Python3

# giving a larger plot
plt.figure(figsize =(8, 6))

# scatter plot of the data projected onto the first two components,
# colored by tumor class
plt.scatter(x_pca[:, 0], x_pca[:, 1], c = cancer['target'], cmap ='plasma')

# labeling x and y axes
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()

Output:

A scatter plot of the data projected onto the first two principal components, colored by target class.

Python3

# the principal axes: each row holds one component's weights
# over the 30 original features
pca.components_

Output:

A 2 x 30 array; each row holds one principal component's loadings on the 30 original features.

Python3

# each row of pca.components_ relates one component
# to the original features
df_comp = pd.DataFrame(pca.components_, columns = cancer['feature_names'])

plt.figure(figsize =(14, 6))

# plotting a heatmap of the component loadings
sns.heatmap(df_comp)

Output:

A heatmap relating the two principal components to the 30 original features.

Advantages of PCA:

  1. Dimensionality Reduction: PCA is a popular technique for dimensionality reduction, the process of reducing the number of variables in a dataset. By reducing the number of variables, PCA simplifies data analysis, improves the performance of downstream models, and makes data easier to visualize.
  2. Feature Selection: PCA can be used for feature selection, the process of identifying the most important variables in a dataset. This is useful in machine learning, where the number of variables can be very large and the most important ones are hard to identify.
  3. Data Visualization: PCA can be used for data visualization. By reducing the data to two or three dimensions, it makes high-dimensional data possible to plot and easier to interpret.

Disadvantages of PCA:

  1. Interpretation of Principal Components: The principal components created by PCA are linear combinations of the original variables, and it is often difficult to interpret them in terms of the original variables. This can make it difficult to explain the results of PCA to others.
  2. Data Scaling: PCA is sensitive to the scale of the data. If the data is not properly scaled, then PCA may not work well. Therefore, it is important to scale the data before applying PCA.
  3. Information Loss: PCA can result in information loss. While PCA reduces the number of variables, it can also lead to a loss of information. The degree of information loss depends on the number of principal components selected, so it is important to choose that number carefully, as in the sketch after this list.
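To make the trade-off in point 3 concrete, here is a short sketch (the 95% threshold is an illustrative choice; pca, x_pca, and scaled_data are reused from the examples above):

Python3

# fit a full PCA on the scaled data from the examples above
pca_full = PCA().fit(scaled_data)

# cumulative fraction of variance explained as components are added
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# smallest number of components retaining, say, 95% of the variance
n_keep = int(np.searchsorted(cumulative, 0.95)) + 1
print(n_keep)

# the reconstruction error from only 2 components quantifies the
# information lost by the earlier pca = PCA(n_components = 2)
reconstructed = pca.inverse_transform(x_pca)
print(np.mean((scaled_data - reconstructed) ** 2))

scikit-learn also accepts a variance fraction directly, e.g. PCA(n_components = 0.95), which performs this selection internally.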

Reference Book:

“An Introduction to Statistical Learning” by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani is a great reference book for learning about PCA and other machine learning techniques. It provides a comprehensive overview of statistical learning and covers topics such as linear regression, logistic regression, classification, and clustering. The book is written in a clear and concise manner, making it accessible to both students and professionals in the field of machine learning. Additionally, the book provides practical examples and exercises to help readers apply the concepts they have learned.

