
Implementing PCA in Python with scikit-learn

In this article, we walk through Principal Component Analysis (PCA) in Python with scikit-learn, one step at a time.

WHY PCA?



How does PCA work?

Python Implementation:



1. Import all the libraries




# import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Jupyter/IPython only: render plots inline (omit in a plain Python script)
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

2. Loading Data

Load the breast cancer dataset from sklearn.datasets with load_breast_cancer. The dataset has 569 samples with 30 input attributes and two output classes, benign and malignant. With 30 input features, the data cannot be plotted directly.




# import the breast_cancer dataset
from sklearn.datasets import load_breast_cancer
data=load_breast_cancer()
print(data.keys())
 
# Check the output classes
print(data['target_names'])
 
# Check the input attributes
print(data['feature_names'])

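Output: the keys of the dataset object, the two class names (malignant and benign) and the names of the 30 input features.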
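
The sample, feature and class counts quoted above can be checked directly on the loaded object. A minimal sketch, using only the `data`, `target` and `target_names` entries that `load_breast_cancer()` returns; the variable name `class_counts` is introduced here for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Rows are samples, columns are input features
print(data['data'].shape)

# Number of samples per class, in the same order as target_names
class_counts = np.bincount(data['target'])
for name, count in zip(data['target_names'], class_counts):
    print(name, count)
```

The first line printed is the (samples, features) shape, followed by one line per output class.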

3. Apply PCA 

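PCA takes the number of components to keep through its n_components parameter. Let us set it to 3. Before applying PCA, we standardize the features with StandardScaler so that attributes measured on larger scales do not dominate the components. After executing this code, x has dimensions (569, 3), while the original data has dimensions (569, 30). PCA has thus reduced the number of dimensions from 30 to 3. Had we chosen n_components=2, the data would be reduced to 2 dimensions.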




# construct a dataframe using pandas
df1=pd.DataFrame(data['data'],columns=data['feature_names'])
 
# Scale data before applying PCA
scaling=StandardScaler()
 
# Use fit and transform method
scaling.fit(df1)
Scaled_data=scaling.transform(df1)
 
# Set the n_components=3
principal=PCA(n_components=3)
principal.fit(Scaled_data)
x=principal.transform(Scaled_data)
 
# Check the dimensions of data after PCA
print(x.shape)

Output:

(569, 3)
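
The separate fit and transform calls above can also be chained. The following is a sketch of one alternative, not the approach the article itself uses: it relies on scikit-learn's `make_pipeline` helper, and the names `features`, `pipeline` and `reduced` are chosen here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = load_breast_cancer()['data']

# Chain scaling and PCA so that both steps are fitted in a single call
pipeline = make_pipeline(StandardScaler(), PCA(n_components=3))
reduced = pipeline.fit_transform(features)

# Shape before and after the reduction
print(features.shape, reduced.shape)
```

A pipeline keeps the scaler and the PCA model together, so the same fitted object can later be applied to new samples with `pipeline.transform`.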

4. Check Components

The principal.components_ attribute is an array with one row per principal component and one column per feature of the original data. Here it has three rows, because n_components was set to 3, and each row has 30 columns, one for each original feature.




# Check the values of the eigenvectors
# produced by the principal components
principal.components_
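
To see why these rows are called eigenvectors, they can be compared against an eigen-decomposition of the covariance matrix of the scaled data. This is a minimal sketch with NumPy, assuming the same scaling and n_components=3 as above; the names `scaled`, `pca`, `top_vectors`, `top_values` and `alignment` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(load_breast_cancer()['data'])
pca = PCA(n_components=3).fit(scaled)

# Eigen-decomposition of the 30 x 30 covariance matrix of the scaled data
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(scaled, rowvar=False))

# eigh returns eigenvalues in ascending order, so pick the three largest
order = np.argsort(eigenvalues)[::-1][:3]
top_values = eigenvalues[order]
top_vectors = eigenvectors[:, order].T  # shape (3, 30), one eigenvector per row

# An eigenvector is only defined up to sign, so compare absolute dot products
alignment = np.abs(np.sum(top_vectors * pca.components_, axis=1))

print(pca.components_.shape)
print(np.allclose(alignment, 1.0))
print(np.allclose(top_values, pca.explained_variance_))
```

Each row of components_ should line up with one of the top eigenvectors, possibly with the opposite sign, and the matching eigenvalues should equal explained_variance_.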

5. Plot the components (Visualization) 

Plot the principal components to visualize the data. Although we chose n_components=3, we draw both a 2d graph, using the first two principal components, and a 3d graph, using all three. The colors show the two output classes of the original dataset, benign and malignant. The principal components separate the two classes visibly, although some points overlap.




plt.figure(figsize=(10,10))
plt.scatter(x[:,0],x[:,1],c=data['target'],cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')

Output: a 2d scatter plot of pc1 against pc2, with points colored by class.
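
The colormap alone does not say which color belongs to which class. One way to make that explicit is to draw each class separately and add a legend. This is a sketch of a variant of the 2d plot, not the article's original code; the names `scores`, `mask` and `ax` are chosen here for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
scaled = StandardScaler().fit_transform(data['data'])
scores = PCA(n_components=2).fit_transform(scaled)

fig, ax = plt.subplots(figsize=(8, 8))

# Draw one scatter per class so that each class gets its own legend entry
for label, name in enumerate(data['target_names']):
    mask = data['target'] == label
    ax.scatter(scores[mask, 0], scores[mask, 1], label=name, alpha=0.7)

ax.set_xlabel('pc1')
ax.set_ylabel('pc2')
ax.legend(title='class')
```

In a plain Python script, a call to plt.show() at the end is needed to display the figure.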

For three principal components, we need to plot a 3d graph. x[:,0] holds the first principal component, while x[:,1] and x[:,2] hold the second and third principal components.




# import relevant libraries for 3d graph
# (the Axes3D import is only required on Matplotlib older than 3.2)
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10,10))
 
# choose projection 3d for creating a 3d graph
axis = fig.add_subplot(111, projection='3d')
 
# x[:,0] is pc1, x[:,1] is pc2 and x[:,2] is pc3
axis.scatter(x[:,0],x[:,1],x[:,2], c=data['target'],cmap='plasma')
axis.set_xlabel("PC1", fontsize=10)
axis.set_ylabel("PC2", fontsize=10)
axis.set_zlabel("PC3", fontsize=10)

Output: a 3d scatter plot of PC1, PC2 and PC3, with points colored by class.

6. Calculate variance ratio

The explained_variance_ratio_ attribute gives the fraction of the total variance explained by each principal component.




# check how much variance is explained by each principal component
print(principal.explained_variance_ratio_)

Output:

[0.44272026 0.18971182 0.09393163]

Together, the three components explain about 72.6% of the total variance.
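
The same attribute can guide the choice of n_components. The sketch below goes one step beyond the article: it fits PCA with all components to get the cumulative explained variance, then passes a float to n_components so that scikit-learn keeps the fewest components reaching that share of variance. The 95% threshold and the names `full_pca`, `cumulative` and `pca_95` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(load_breast_cancer()['data'])

# Fit with all 30 components to see the full variance profile
full_pca = PCA().fit(scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
print(cumulative[:3])

# A float n_components keeps the fewest components reaching that variance share
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(scaled)
print(pca_95.n_components_)
```

The third value of the cumulative sum matches the total of the three ratios printed above, and n_components_ reports how many components the 95% threshold needs.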

 

