Python – Variations of Principal Component Analysis

• Last Updated : 26 Jun, 2021

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction and visualisation technique. It is often referred to as a linear technique because the new features are obtained by multiplying the original features by the matrix of PCA eigenvectors. It works by identifying the hyperplane that lies closest to the data and projecting the data onto it, so that as much of the variance as possible is preserved. Because the approach is so simple, PCA is widely used in data mining, bioinformatics, psychology, etc. Less well known is that there are several variants of the algorithm which can do better than the conventional approach in particular situations, such as very large datasets or data with non-linear structure. Let’s look at them one by one.
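To make the "multiplication by the matrix of eigenvectors" concrete, here is a minimal NumPy sketch of conventional PCA. It is only an illustration under simple assumptions (small dense data, covariance eigendecomposition), not how scikit-learn implements it internally; the variable names are chosen for this example.

```python
import numpy as np

# small dummy dataset: 6 samples, 2 features
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

# PCA works on centered data
X_centered = X - X.mean(axis=0)

# covariance matrix of the features (p x p)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# reorder so that the direction with the largest variance comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# the linear mapping: multiply the centered features by the eigenvector matrix
X_projected = X_centered @ eigenvectors

# share of the total variance carried by each principal component
explained_variance_ratio = eigenvalues / eigenvalues.sum()

print(X_projected)
print(explained_variance_ratio)
```

For this dataset the two ratios come out at roughly 0.992 and 0.008, so almost all of the variance lies along the first principal component. The sign of each projected column is arbitrary, because an eigenvector can point in either direction.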

Randomized PCA:
This is an extension to PCA which uses an approximate Singular Value Decomposition (SVD) of the data. Conventional PCA runs in O(n*p^2) + O(p^3) time, where n is the number of data points and p is the number of features, whereas the randomized version runs in O(n*d^2) + O(d^3), where d is the number of principal components. Thus, it is much faster when d is much smaller than p.
sklearn provides a function randomized_svd in sklearn.utils.extmath which can be used to do randomized PCA. For an m x n input matrix and n_components = k, it returns three arrays: U, an m x k matrix; Sigma, a 1-D array holding the k largest singular values; and V^T, a k x n matrix (the transpose of V). Another way is to use sklearn.decomposition.PCA and change the svd_solver hyperparameter from ‘auto’ to ‘randomized’ (‘full’ forces the exact SVD instead). With ‘auto’, the scikit-learn releases available when this article was written pick the randomized solver when the larger of n and p exceeds 500 and the number of principal components is less than 80% of the smaller of n and p. Newer releases may apply a different rule, so set svd_solver explicitly if you need a particular solver.

Code:

Python3

# Python3 program to show the working of
# randomized PCA

# importing libraries
import numpy as np
from sklearn.decomposition import PCA
from sklearn.utils.extmath import randomized_svd

# dummy data (m = 6 samples, n = 2 features)
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# creates instance of PCA with randomized svd_solver
pca = PCA(n_components=2, svd_solver='randomized')

# This function takes an m x n matrix and returns the truncated factors:
# U (m x k), the k singular values S, and V^T (k x n), where k = n_components
U, S, VT = randomized_svd(X, n_components=2)

# arrays returned by randomized_svd
print(f"Matrix U of size m x k: {U}\n")
print(f"Singular values S of length k: {S}\n")
print(f"Matrix V^T of size k x n: {VT}\n")

# fitting the pca model
pca.fit(X)

# printing the explained variance ratio
print("Explained Variance using PCA with randomized svd_solver:", pca.explained_variance_ratio_)

Output:

Matrix U of size m x k: [[ 0.21956688 -0.53396977]
 [ 0.35264795  0.45713538]
 [ 0.57221483 -0.07683439]
 [-0.21956688  0.53396977]
 [-0.35264795 -0.45713538]
 [-0.57221483  0.07683439]]

Singular values S of length k: [6.30061232 0.54980396]

Matrix V^T of size k x n: [[-0.83849224 -0.54491354]
 [-0.54491354  0.83849224]]

Explained Variance using PCA with randomized svd_solver: [0.99244289 0.00755711]
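The toy dataset is too small to show why the randomized solver is useful. The sketch below compares the full and randomized solvers on a larger synthetic matrix with a few dominant directions. The data, its sizes, and the variable names are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic data (an assumption for this sketch): 2000 samples, 100 features,
# generated from 5 latent directions plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 5))
mixing = rng.normal(size=(5, 100))
X_large = latent @ mixing + 0.01 * rng.normal(size=(2000, 100))

# exact SVD versus randomized SVD, both keeping d = 5 components
pca_full = PCA(n_components=5, svd_solver="full").fit(X_large)
pca_rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X_large)

# the two solvers should report nearly the same explained variance ratios
max_gap = np.abs(
    pca_full.explained_variance_ratio_ - pca_rand.explained_variance_ratio_
).max()
print("Largest difference in explained variance ratio:", max_gap)
```

Here d = 5 is much smaller than p = 100, which is the situation where the randomized solver is expected to pay off. Passing random_state makes the randomized result reproducible.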

Incremental PCA:
The major problem with PCA and most other dimensionality reduction algorithms is that they require the whole dataset to fit in memory at once, which becomes difficult or impossible when the dataset is very large.
Fortunately, there is an algorithm called Incremental PCA which is useful for large training datasets: it splits the data into mini-batches and feeds them to the model one batch at a time. This is called on-the-fly learning. Since only one batch is held in memory at a time, memory usage stays under control.
Scikit-Learn provides a class called sklearn.decomposition.IncrementalPCA with which we can implement this.

Code:

Python3

# Python3 program to show the working of
# incremental PCA

# importing libraries
import numpy as np
from sklearn.decomposition import IncrementalPCA

# dummy data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# specify the number of batches
no_of_batches = 3

# create an instance of IncrementalPCA
incremental_pca = IncrementalPCA(n_components=2)

# update the model one batch at a time;
# partial_fit keeps what was learned from the earlier batches
for batch in np.array_split(X, no_of_batches):
    incremental_pca.partial_fit(batch)

# transform the data with the fitted model
final = incremental_pca.transform(X)

# prints a 2d-array (as n_components = 2)
print(final)

Each call to partial_fit updates the model with one more batch. Calling fit inside the loop instead would refit the model from scratch on every batch, so only the last batch would count. The printed array has one row per sample and two columns, one per principal component.
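As a rough check that batch-wise learning loses little, the sketch below streams a larger synthetic dataset through partial_fit and compares the result with ordinary PCA fitted on the whole array. The data and the iter_batches helper are hypothetical, written only for this illustration. In practice the batches would come from disk or a database rather than from an in-memory array.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

# synthetic data (an assumption for this sketch): 1000 samples, 20 features,
# driven by 3 latent directions plus a little noise
rng = np.random.default_rng(0)
X_big = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 20))
X_big += 0.01 * rng.normal(size=(1000, 20))


def iter_batches(data, batch_size):
    """Hypothetical helper: yield consecutive row blocks of `data`."""
    for start in range(0, data.shape[0], batch_size):
        yield data[start:start + batch_size]


# feed the model one mini-batch at a time
ipca = IncrementalPCA(n_components=3)
for batch in iter_batches(X_big, batch_size=100):
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X_big)

# reference: ordinary PCA fitted on the whole array at once
pca = PCA(n_components=3).fit(X_big)
ratio_gap = np.abs(
    ipca.explained_variance_ratio_ - pca.explained_variance_ratio_
).max()

print("Reduced shape:", X_reduced.shape)
print("Largest difference in explained variance ratio:", ratio_gap)
```

Each batch must contain at least n_components samples. IncrementalPCA also accepts a batch_size argument, in which case a single call to fit splits the array into batches internally.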

Kernel PCA:
Kernel PCA is yet another extension of PCA, this time using a kernel. A kernel is a mathematical technique that implicitly maps instances into a very high-dimensional space called the feature space. It is the same trick that enables non-linear classification and regression with Support Vector Machines (SVM). Applied to PCA, it makes non-linear projections possible for dimensionality reduction. Kernel PCA is commonly employed in novelty detection and image de-noising.
Scikit-Learn provides a class KernelPCA in sklearn.decomposition which can be used to perform Kernel PCA.

Code:

Python3

# Python3 program to show the working of
# Kernel PCA

# importing libraries
import numpy as np
from sklearn.decomposition import KernelPCA

# dummy data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# creating an instance of KernelPCA using rbf kernel
kernel_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.03)

# fit and transform the data
final = kernel_pca.fit_transform(X)

# prints a 2d-array (as n_components = 2)
print(final)

Output:

[[-0.3149893  -0.17944928]
[-0.46965347 -0.0475298 ]
[-0.62541667  0.22697909]
[ 0.3149893  -0.17944928]
[ 0.46965347 -0.0475298 ]
[ 0.62541667  0.22697909]]
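The benefit of the kernel is easier to see on data that a straight line cannot separate. The sketch below uses two concentric circles and compares a linear classifier trained on an ordinary PCA projection with one trained on an RBF Kernel PCA projection. The dataset, the gamma value, and the choice of three components are assumptions picked for this illustration, not recommended settings.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# two concentric circles: the classes are not linearly separable in 2-D
X_c, y_c = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# ordinary PCA only rotates the 2-D data, so the circles stay nested
X_lin = PCA(n_components=2).fit_transform(X_c)

# Kernel PCA can extract more components than there are input features
# (here 3 from 2-D data), because it works in the kernel feature space
kpca = KernelPCA(n_components=3, kernel="rbf", gamma=10, random_state=0)
X_rbf = kpca.fit_transform(X_c)

# training accuracy of a linear classifier on each projection
acc_lin = LogisticRegression().fit(X_lin, y_c).score(X_lin, y_c)
acc_rbf = LogisticRegression().fit(X_rbf, y_c).score(X_rbf, y_c)

print("Accuracy on linear PCA projection:", acc_lin)
print("Accuracy on RBF Kernel PCA projection:", acc_rbf)
```

A linear classifier should do noticeably better on the Kernel PCA projection, since the non-linear mapping pulls the inner and outer circle apart. The result depends on gamma: a poorly chosen value can make the projection no better than the linear one.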

KernelPCA is unsupervised, so there is no obvious performance measure for selecting the best kernel and hyperparameters. However, dimensionality reduction is usually a preparation step for a supervised learning algorithm, so we can put KernelPCA and the supervised model in a pipeline and use GridSearchCV to find the hyperparameters (kernel and gamma) that give the best classification accuracy.
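A minimal sketch of that idea follows, assuming a toy two-circles dataset, a logistic regression classifier, and a small hand-picked grid. The step names "kpca" and "clf" and the grid values are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# toy classification data (an assumption for this sketch)
X_c, y_c = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# dimensionality reduction followed by a classifier
pipeline = Pipeline([
    ("kpca", KernelPCA(n_components=3, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# candidate kernels and gamma values, addressed as <step name>__<parameter>
param_grid = {
    "kpca__kernel": ["rbf", "poly"],
    "kpca__gamma": [0.1, 1.0, 10.0],
}

# 3-fold cross-validated search, scored by classification accuracy
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
grid_search.fit(X_c, y_c)

print("Best hyperparameters:", grid_search.best_params_)
print("Best cross-validated accuracy:", grid_search.best_score_)
```

After the search, grid_search.best_estimator_ holds the whole pipeline refitted on all the data with the selected kernel and gamma, so it can be used directly for prediction.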
