Python – Variations of Principal Component Analysis

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction and visualisation technique. It is called a linear technique because the new features are obtained simply by multiplying the original features by the matrix of PCA eigenvectors. PCA works by identifying the hyperplane that lies closest to the data and then projecting the data onto it so that the variance of the projections is maximised. Because of this simple approach, PCA is widely used in data mining, bioinformatics, psychology, etc. What is less well known is that there are several variations of this algorithm that improve on the conventional approach in specific situations. Let's look at them one by one.
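
To see the "linear technique" claim concretely, here is a minimal sketch (using scikit-learn's PCA on the same toy array as the examples below) showing that projecting the data is nothing more than centring it and multiplying by the matrix of eigenvectors stored in components_:

# minimal check: PCA.transform is just a matrix multiplication
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

pca = PCA(n_components=2).fit(X)

# centre the data and multiply by the eigenvector matrix
manual_projection = (X - pca.mean_) @ pca.components_.T

# matches the library's own transform (up to floating point error)
print(np.allclose(manual_projection, pca.transform(X)))  # True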

Randomized PCA:
This is an extension of PCA which uses an approximated Singular Value Decomposition (SVD) of the data. Conventional PCA runs in O(n·p²) + O(p³) time, where n is the number of data points and p is the number of features, whereas the randomized version runs in O(n·d²) + O(d³), where d is the number of principal components. It is therefore dramatically faster when d is much smaller than p.
sklearn provides the function randomized_svd in sklearn.utils.extmath, which can be used to perform randomized PCA directly (note that randomized_svd works on the data as given and does not centre it the way PCA does). Given a data matrix of shape m x n and a requested number of components k, it returns three arrays: U of shape m x k, the singular values Sigma of length k, and V^T (the transposed matrix of right singular vectors) of shape k x n. Another way is to use sklearn.decomposition.PCA and set the svd_solver hyperparameter to 'randomized' instead of the default 'auto' (setting it to 'full' forces the exact SVD instead). With svd_solver='auto', Scikit-learn automatically switches to randomized PCA when the data has more than 500 rows and more than 500 columns and the number of requested components is less than 80% of the smaller of the two dimensions.

Code:

# Python3 program to show the working of
# randomized PCA
  
# importing libraries
import numpy as np
from sklearn.decomposition import PCA
from sklearn.utils.extmath import randomized_svd
  
# dummy data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
  
# create an instance of PCA with a randomized svd_solver
pca = PCA(n_components=2, svd_solver='randomized')

# randomized_svd takes a matrix of shape (m, n) and returns
# the truncated factors U (m x k), S (length k) and V^T (k x n),
# where k = n_components
U, S, VT = randomized_svd(X, n_components=2)

# matrices returned by randomized_svd
print(f"Matrix U of shape (m, k): {U}\n")
print(f"Singular values S (length k): {S}\n")
print(f"Matrix V^T of shape (k, n): {VT}\n")
  
# fitting the pca model
pca.fit(X)
  
# printing the explained variance ratio
print("Explained Variance using PCA with randomized svd_solver:", pca.explained_variance_ratio_)

Output:

Matrix U of shape (m, k): [[ 0.21956688 -0.53396977]
 [ 0.35264795  0.45713538]
 [ 0.57221483 -0.07683439]
 [-0.21956688  0.53396977]
 [-0.35264795 -0.45713538]
 [-0.57221483  0.07683439]]

Singular values S (length k): [6.30061232 0.54980396]

Matrix V^T of shape (k, n): [[-0.83849224 -0.54491354]
 [-0.54491354  0.83849224]]

Explained Variance using PCA with randomized svd_solver: [0.99244289 0.00755711]
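
As a quick sanity check (a sketch, not part of the original example), the three arrays returned by randomized_svd can be multiplied back together to recover the input matrix, since the toy data has rank 2:

import numpy as np
from sklearn.utils.extmath import randomized_svd

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

U, S, VT = randomized_svd(X, n_components=2)

# U (6 x 2) times diag(S) (2 x 2) times V^T (2 x 2) reconstructs X
X_reconstructed = U @ np.diag(S) @ VT
print(np.allclose(X, X_reconstructed))  # True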

Incremental PCA:
The major problem with PCA, and with most dimensionality reduction algorithms, is that it requires the whole dataset to fit in memory at once, which becomes difficult when the dataset is very large.

Fortunately, there is an algorithm called Incremental PCA which works well for large training datasets: it splits the data into mini-batches and feeds them to Incremental PCA one batch at a time. This is called on-the-fly (or out-of-core) learning. Because only one mini-batch is held in memory at a time, memory usage stays under control.

Scikit-Learn provides a class, sklearn.decomposition.IncrementalPCA, with which we can implement this; its partial_fit method processes one mini-batch at a time.

Code:

# Python3 program to show the working of
# incremental PCA
  
# importing libraries
import numpy as np
from sklearn.decomposition import IncrementalPCA
  
# dummy data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
  
# specify the number of batches
no_of_batches = 3
  
# create an instance of IncrementalPCA
incremental_pca = IncrementalPCA(n_components=2)

# feed the data one mini-batch at a time
for batch in np.array_split(X, no_of_batches):
  incremental_pca.partial_fit(batch)

# transform the data with the fitted model
final = incremental_pca.transform(X)

# prints a 2-D array (as n_components = 2)
print(final)

Output:

A 6 x 2 array: the projection of each sample onto the two principal components learnt across all three mini-batches. Since the model has seen every batch through partial_fit, the result approximates the projection an ordinary PCA fitted on the full dataset would produce.
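
For genuinely large datasets, the same idea is usually combined with data that lives on disk rather than in RAM. The following is a rough sketch, assuming a hypothetical file big_data.dat holding a float32 array of shape (10000, 50) previously written with NumPy; IncrementalPCA.fit reads it through a memory map in chunks of batch_size rows, calling partial_fit internally on each chunk:

import numpy as np
from sklearn.decomposition import IncrementalPCA

# hypothetical on-disk dataset (filename and shape are assumptions)
X_mm = np.memmap("big_data.dat", dtype="float32",
                 mode="r", shape=(10000, 50))

# only batch_size rows are held in memory at any one time
inc_pca = IncrementalPCA(n_components=10, batch_size=500)
inc_pca.fit(X_mm)

print(inc_pca.explained_variance_ratio_)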

Kernel PCA:
Kernel PCA is yet another extension of PCA, this time using a kernel. A kernel is a mathematical technique that implicitly maps instances into a very high-dimensional space (the feature space); it is what enables non-linear classification and regression with Support Vector Machines (SVMs). Kernel PCA applies the same trick to obtain non-linear projections for dimensionality reduction, and it is often employed in novelty detection and image de-noising.
Scikit-Learn provides a class KernelPCA in sklearn.decomposition which can be used to perform Kernel PCA.

Code:

# Python3 program to show the working of
# Kernel PCA
  
# importing libraries
import numpy as np
from sklearn.decomposition import KernelPCA
  
# dummy data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
  
# creating an instance of KernelPCA using rbf kernel
kernel_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.03)
  
# fit and transform the data
final = kernel_pca.fit_transform(X)
  
# prints a 2-D array (as n_components = 2)
print(final)

Output:

[[-0.3149893  -0.17944928]
 [-0.46965347 -0.0475298 ]
 [-0.62541667  0.22697909]
 [ 0.3149893  -0.17944928]
 [ 0.46965347 -0.0475298 ]
 [ 0.62541667  0.22697909]]
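
Kernel PCA can also learn an approximate inverse mapping (the "pre-image"), which is what the de-noising use case mentioned above relies on. A minimal sketch, assuming the same toy data and hyperparameters as the example above with fit_inverse_transform=True enabled:

import numpy as np
from sklearn.decomposition import KernelPCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# fit_inverse_transform=True learns a mapping back to the original space
kernel_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.03,
                       fit_inverse_transform=True)

X_reduced = kernel_pca.fit_transform(X)
X_preimage = kernel_pca.inverse_transform(X_reduced)

# mean squared reconstruction (pre-image) error
print(np.mean((X - X_preimage) ** 2))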

KernelPCA is unsupervised, so there is no obvious performance measure for selecting the best kernel. However, dimensionality reduction is usually just one step in a supervised learning workflow, so we can put KernelPCA and a classifier into a Pipeline and use GridSearchCV to find the kernel and gamma that lead to the best classification accuracy, as sketched below.
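
A rough sketch of that idea, assuming a labelled toy dataset from make_moons and a LogisticRegression classifier (both are illustrative choices, not part of the original example):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# labelled toy data (illustrative assumption)
X, y = make_moons(n_samples=100, noise=0.1, random_state=0)

# dimensionality reduction followed by a classifier
clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression())
])

# search over the kernel and gamma of the KernelPCA step
param_grid = [{
    "kpca__kernel": ["rbf", "sigmoid"],
    "kpca__gamma": np.linspace(0.03, 0.05, 10)
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)

# hyperparameters that give the best classification accuracy
print(grid_search.best_params_)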



