Principal Component Analysis (PCA) is an unsupervised technique for dimensionality reduction and visualisation. It is called a linear technique because the new features are obtained by multiplying the original (centered) features by the matrix of PCA eigenvectors. It works by identifying the hyperplane that lies closest to the data and projecting the data onto it, which preserves as much of the variance as possible. Because the approach is so simple, PCA is widely used in data mining, bioinformatics, psychology, and other fields. Less well known is that several variants of the algorithm exist that can outperform the conventional approach in particular situations, such as very large datasets or non-linear structure. Let's look at them one by one.
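Before turning to the variants, the linear mapping of conventional PCA can be sketched with NumPy. This is only an illustration: the six-point toy dataset and the variable names are assumptions, not part of the original article.

```python
import numpy as np

# Toy data (assumed for illustration): 6 samples, 2 features.
X = np.array([[-1, -1], [-2, -1], [-3, -2],
              [1, 1], [2, 1], [3, 2]], dtype=float)

# PCA assumes the data is centered around the origin.
X_centered = X - X.mean(axis=0)

# The rows of Vt are the principal directions, i.e. the eigenvectors
# of the covariance matrix of X.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

d = 1                        # number of principal components to keep
W = Vt[:d].T                 # p x d matrix of eigenvectors
X_reduced = X_centered @ W   # the linear mapping: (n x p) times (p x d)

# Share of the total variance carried by each component.
explained_variance_ratio = s**2 / np.sum(s**2)

print(X_reduced.shape)       # (6, 1)
print(explained_variance_ratio)
```

Keeping only the first d rows of Vt is what turns the decomposition into a dimensionality reduction: the projection discards the directions with the least variance.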
Randomized PCA is an extension of PCA that uses an approximate Singular Value Decomposition (SVD) of the data. Conventional PCA runs in O(n*p^2) + O(p^3), where n is the number of data points and p is the number of features, whereas the randomized version runs in O(n*d^2) + O(d^3), where d is the number of principal components. It is therefore much faster when d is much smaller than p.
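To get a feel for the gap, the two cost expressions, n*p^2 + p^3 for the full SVD and n*d^2 + d^3 for the randomized one, can be evaluated for a concrete case. The sizes below are hypothetical, and the numbers are rough operation counts with constant factors ignored, not measured running times.

```python
def full_svd_cost(n, p):
    """Dominant terms for conventional PCA: n*p^2 + p^3."""
    return n * p**2 + p**3


def randomized_svd_cost(n, d):
    """Dominant terms for randomized PCA: n*d^2 + d^3."""
    return n * d**2 + d**3


# Hypothetical problem size: 100,000 points, 1,000 features, 10 components.
n, p, d = 100_000, 1_000, 10

full = full_svd_cost(n, p)
rand = randomized_svd_cost(n, d)
speedup = full / rand

print(f"full: {full:.2e}, randomized: {rand:.2e}, ratio: {speedup:.0f}x")
```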
sklearn provides the function randomized_svd in sklearn.utils.extmath, which can be used to do randomized PCA. For an m x n input and k requested components, it returns three arrays: U of shape m x k, the k singular values as a one-dimensional array (the diagonal of Sigma), and V^T of shape k x n (the transpose of V). Another way is to use sklearn.decomposition.PCA and set the svd_solver hyperparameter to 'randomized' (the default is 'auto'; 'full' forces the exact SVD). With 'auto', Scikit-learn chooses the solver from the shape of the data and the number of components: randomized PCA has been selected when the data is larger than 500 x 500 and the number of principal components is less than 80% of the smaller of p and n, although the exact policy depends on the Scikit-learn version.
Output (m = 6 samples, n = 2 features, k = 2 components):
Matrix U of size m*k:
[[ 0.21956688 -0.53396977]
 [ 0.35264795  0.45713538]
 [ 0.57221483 -0.07683439]
 [-0.21956688  0.53396977]
 [-0.35264795 -0.45713538]
 [-0.57221483  0.07683439]]
Matrix S (the k singular values):
[6.30061232 0.54980396]
Matrix V^T of size k*n:
[[-0.83849224 -0.54491354]
 [-0.54491354  0.83849224]]
Explained Variance using PCA with randomized svd_solver:
[0.99244289 0.00755711]
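The original listing is not available, so the following is a minimal sketch of both approaches. The input X is an assumption: it is the six-point toy dataset from the Scikit-learn PCA documentation, whose singular values and explained variance ratio are consistent with the numbers printed above. The signs of individual entries in U and V^T may differ from run to run of a different setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.utils.extmath import randomized_svd

# Assumed input: 6 samples, 2 features.
X = np.array([[-1, -1], [-2, -1], [-3, -2],
              [1, 1], [2, 1], [3, 2]], dtype=float)

# Approach 1: call the randomized SVD routine directly.
# S is a 1-D array of singular values, not a full diagonal matrix.
U, S, Vt = randomized_svd(X, n_components=2, random_state=0)
print("U shape:", U.shape)      # (6, 2)
print("S:", S)
print("V^T shape:", Vt.shape)   # (2, 2)

# Approach 2: let PCA use the randomized solver.
pca = PCA(n_components=2, svd_solver="randomized", random_state=0)
pca.fit(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

Note that randomized_svd does not center the data, while PCA does. The two agree here only because this X already has zero column means.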
A major problem with PCA and most other dimensionality reduction algorithms is that they require the whole dataset to fit in memory at once, which becomes difficult when the data is very large.
Fortunately, there is an algorithm called Incremental PCA that is useful for large training datasets: the data is split into mini-batches, which are fed to Incremental PCA one batch at a time. This is called on-the-fly (online) learning. Because only one batch is in memory at a time, memory usage stays under control.
Scikit-Learn provides the class sklearn.decomposition.IncrementalPCA, with which we can implement this.
Output:
[[-4.24264069e+00  7.07106781e-01]
 [-4.94974747e+00  1.41421356e+00]
 [-6.36396103e+00  1.41421356e+00]
 [-1.41421356e+00  7.07106781e-01]
 [-7.07106781e-01 -5.55111512e-17]
 [ 7.07106781e-01  5.55111512e-17]]
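A minimal sketch of the batch-wise workflow is given below. The random data, the number of batches, and the variable names are assumptions for illustration, so this code does not reproduce the values printed above.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Hypothetical data: 1000 samples, 5 features. In practice each batch
# would be read from disk instead of being sliced from an in-memory array.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))

n_batches = 10
ipca = IncrementalPCA(n_components=2)

# Feed the model one mini-batch at a time.
for batch in np.array_split(X, n_batches):
    ipca.partial_fit(batch)

# Once fitted, the model projects data like ordinary PCA.
X_reduced = ipca.transform(X)
print(X_reduced.shape)   # (1000, 2)
```

Each batch must contain at least n_components samples. As an alternative to the explicit loop, IncrementalPCA also accepts a batch_size argument and does the splitting itself when fit is called.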
Kernel PCA is yet another extension of PCA, this time using a kernel. The kernel trick is a mathematical technique that implicitly maps instances to a very high-dimensional space called the feature space; it is what enables non-linear classification and regression with Support Vector Machines (SVM). Applied to PCA, the same trick makes it possible to perform complex non-linear projections for dimensionality reduction. Kernel PCA is commonly employed in novelty detection and image de-noising.
Scikit-Learn provides the class KernelPCA in sklearn.decomposition, which can be used to perform Kernel PCA.
Output:
[[-0.3149893  -0.17944928]
 [-0.46965347 -0.0475298 ]
 [-0.62541667  0.22697909]
 [ 0.3149893  -0.17944928]
 [ 0.46965347 -0.0475298 ]
 [ 0.62541667  0.22697909]]
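As a rough sketch of how KernelPCA is used, the example below applies an RBF kernel to a two-moons dataset, which has the kind of non-linear structure that plain PCA cannot unfold. The dataset and the gamma value are assumptions chosen for illustration, so this code does not reproduce the values printed above.

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

# Hypothetical non-linear data: two interleaving half circles.
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# RBF kernel with an assumed gamma; other built-in options include
# "linear", "poly", "sigmoid" and "cosine".
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)

print(X_kpca.shape)   # (200, 2)
```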
KernelPCA is unsupervised, so there is no obvious performance measure for selecting the best kernel. However, dimensionality reduction is usually a preprocessing step for a supervised learning algorithm, so we can build a pipeline, use GridSearchCV to search for the hyperparameters (kernel and gamma) that give the best classification accuracy, and then use those values.
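The idea can be sketched as follows. The dataset, the choice of LogisticRegression as the downstream classifier, the step names, and the grid values are all assumptions made for illustration. Any supervised estimator could take the classifier's place.

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical labelled data.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Step 1 reduces dimensionality, step 2 classifies the result.
clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step.
# gamma is ignored by the linear kernel, so those candidates are redundant.
param_grid = {
    "kpca__kernel": ["linear", "rbf"],
    "kpca__gamma": [0.1, 1.0, 10.0],
}

# Score every combination by cross-validated classification accuracy.
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

The kernel and gamma found in best_params_ are the ones that helped the downstream classifier most, which is the supervised stand-in for the missing unsupervised quality measure.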