Principal Component Analysis (PCA) is an unsupervised technique for dimensionality reduction and visualisation. It is called a linear technique because the new features are obtained by multiplying the original features by the matrix of PCA eigenvectors. It works by identifying the hyperplane that lies closest to the data and projecting the data onto it, which preserves as much of the variance as possible. Because the approach is so simple, PCA is widely used in data mining, bioinformatics, psychology, and other fields. Less well known is that several variants of the algorithm exist, each of which improves on the conventional approach in particular situations. Let’s look at them one by one.
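To make the "multiply by the matrix of eigenvectors" idea concrete, here is a minimal NumPy sketch of conventional PCA. The helper name pca_project and the six-point sample array are assumptions made for illustration; this is one common way to compute the projection (via SVD of the centered data), not the only one.

```python
import numpy as np


def pca_project(X, n_components):
    """Project X onto its top principal components (hypothetical helper)."""
    # PCA assumes the data is centered around the origin.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions (eigenvectors of the covariance matrix).
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # The linear mapping: multiply the features by the matrix of eigenvectors.
    projected = X_centered @ components.T
    # Share of the total variance captured by each kept component.
    explained_variance_ratio = (S ** 2 / np.sum(S ** 2))[:n_components]
    return projected, components, explained_variance_ratio


# Assumed toy data: six points in two dimensions.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
projected, components, ratio = pca_project(X, n_components=1)
print(projected.shape)
print(ratio)
```

Keeping a single component here retains almost all of the variance, because the six points lie close to a line.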
Randomized PCA is an extension of PCA that uses an approximate Singular Value Decomposition (SVD) of the data. Conventional PCA runs in O(n·p²) + O(p³), where n is the number of data points and p is the number of features, whereas the randomized version runs in O(n·d²) + O(d³), where d is the number of principal components. It is therefore much faster when d is much smaller than p.
sklearn provides a function, randomized_svd, in sklearn.utils.extmath which can be used to do randomized PCA. It returns three arrays: U, of shape n x d; Sigma, a one-dimensional array holding the d singular values; and V^T, of shape d x p, where T denotes the transpose. Another way is to use sklearn.decomposition.PCA and set the svd_solver hyperparameter to ‘randomized’ instead of the default ‘auto’ (or to ‘full’ to force the exact SVD). With ‘auto’, Scikit-learn picks the randomized solver by itself when the data is large (more than 500 rows or columns) and the number of principal components is less than 80% of the smaller of p and n; the exact rule depends on the Scikit-learn version.
Matrix U of size m*m:
[[ 0.21956688 -0.53396977]
 [ 0.35264795  0.45713538]
 [ 0.57221483 -0.07683439]
 [-0.21956688  0.53396977]
 [-0.35264795 -0.45713538]
 [-0.57221483  0.07683439]]
Matrix S of size m*n:
[6.30061232 0.54980396]
Matrix V^T of size n*n:
[[-0.83849224 -0.54491354]
 [-0.54491354  0.83849224]]
Explained Variance using PCA with randomized svd_solver:
[0.99244289 0.00755711]
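The code that produced the output above is not reproduced here. The following is a minimal sketch of the two approaches just described, run on an assumed six-point array whose singular values are consistent with the printed ones. The signs of individual columns may differ from the output above depending on the Scikit-learn version.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.utils.extmath import randomized_svd

# Assumed toy data: six points in two dimensions.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

# Approach 1: call the randomized SVD routine directly.
U, S, Vt = randomized_svd(X, n_components=2, random_state=0)
print("U shape:", U.shape)      # (n_samples, n_components)
print("S:", S)                  # 1-D array of singular values
print("V^T shape:", Vt.shape)   # (n_components, n_features)

# Approach 2: ask PCA to use the randomized solver.
pca = PCA(n_components=2, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

Passing random_state fixes the random projection used internally, so repeated runs give the same result.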
The major problem with PCA and most other dimensionality reduction algorithms is that they require the whole dataset to fit in memory at once, which becomes difficult when the data is very large.
Fortunately, there is an algorithm called Incremental PCA which is useful for large training datasets: the data is split into mini-batches and fed to Incremental PCA one batch at a time. This is called on-the-fly learning. Because only a small part of the data is in memory at any time, memory usage stays under control.
Scikit-Learn provides a class, sklearn.decomposition.IncrementalPCA, with which we can implement this.
[[-4.24264069e+00  7.07106781e-01]
 [-4.94974747e+00  1.41421356e+00]
 [-6.36396103e+00  1.41421356e+00]
 [-1.41421356e+00  7.07106781e-01]
 [-7.07106781e-01 -5.55111512e-17]
 [ 7.07106781e-01  5.55111512e-17]]
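The data behind the output above is not given, so the sketch below does not try to reproduce it. It only illustrates the mini-batch workflow on an assumed random dataset: each batch is passed to partial_fit in turn, so the full array never has to be processed in one call. The dataset size, batch count, and number of components are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Assumed synthetic data standing in for a dataset too large for memory.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))

n_batches = 10
ipca = IncrementalPCA(n_components=3)

# Feed the model one mini-batch at a time.
for batch in np.array_split(X, n_batches):
    ipca.partial_fit(batch)

# Once fitted, the model can project data as usual.
X_reduced = ipca.transform(X)
print(X_reduced.shape)
```

In a real out-of-core setting the batches would typically be read from disk one at a time rather than sliced from an in-memory array.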
Kernel PCA is yet another extension of PCA, this time using a kernel. The kernel trick is a mathematical technique that implicitly maps instances into a very high-dimensional space called the feature space; it is what enables non-linear classification and regression with Support Vector Machines (SVM). Applied to PCA, the same trick makes non-linear projections possible. Kernel PCA is often employed in novelty detection and image de-noising.
Scikit-Learn provides a class, KernelPCA, in sklearn.decomposition which can be used to perform Kernel PCA.
[[-0.3149893  -0.17944928]
 [-0.46965347 -0.0475298 ]
 [-0.62541667  0.22697909]
 [ 0.3149893  -0.17944928]
 [ 0.46965347 -0.0475298 ]
 [ 0.62541667  0.22697909]]
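The kernel and data used for the output above are not stated, so the following is only a minimal sketch of the KernelPCA API. It assumes an RBF kernel, an arbitrary gamma value, and a synthetic two-circles dataset that a linear projection cannot untangle.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Assumed synthetic data: two concentric circles (not linearly separable).
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The kernel and gamma values are illustrative, not tuned.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)
```

Other built-in kernels can be selected through the kernel parameter; the right choice depends on the data.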
KernelPCA is unsupervised, so there is no obvious performance measure for selecting the best kernel. However, dimensionality reduction is usually a preparation step for a supervised learning algorithm, so we can build a pipeline, use GridSearchCV to search over the hyperparameters (kernel and gamma), and keep the combination that gives the best classification accuracy.
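A minimal sketch of that idea follows. The dataset (make_moons), the classifier (LogisticRegression), the step names, and the candidate kernel and gamma values are all assumptions chosen for illustration; any supervised estimator could take the classifier's place.

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Assumed synthetic classification data.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Step 1 reduces dimensionality, step 2 classifies the reduced data.
pipeline = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("clf", LogisticRegression()),
])

# Candidate hyperparameters; "<step name>__<parameter>" targets a pipeline step.
param_grid = {
    "kpca__kernel": ["rbf", "poly"],
    "kpca__gamma": [0.03, 0.1, 1.0, 10.0],
}

# Cross-validated accuracy of the whole pipeline scores each combination.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

Because the score comes from the downstream classifier, the kernel is chosen for how well it serves the supervised task rather than by any intrinsic property of the projection.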