In this article, we will learn about PCA (Principal Component Analysis) in Python with scikit-learn. Let’s start our learning step by step.
- When a dataset has many input attributes, it is difficult to visualize. This is related to a well-known problem in machine learning, the ‘curse of dimensionality’.
- Basically, it refers to the fact that a large number of attributes in a dataset can adversely affect both the accuracy and the training time of a machine learning model.
- Principal Component Analysis (PCA) is one way to address this issue. It is used for data visualization and to reduce training time, and it can sometimes improve accuracy.
How does PCA work?
- PCA is an unsupervised pre-processing step that is usually carried out before applying an ML algorithm. It is based on an “orthogonal linear transformation”, a mathematical technique that projects the data onto a new coordinate system. Each new axis is a linear combination of the original attributes. The direction along which the data varies the most is called the first principal component and becomes the first coordinate.
- Similarly, the direction with the second-highest variance, orthogonal to the first, is called the second principal component, and so on. In short, the complete dataset can be expressed in terms of principal components. Often the first few principal components explain most of the variance, but how many are needed depends on the data.
- Principal component analysis, or PCA, thus converts data from a high-dimensional space to a low-dimensional space by keeping only the leading principal components, which capture the maximum information (variance) about the dataset. Note that these components are new features built from all the original attributes, not a subset of them.
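To make the idea concrete, here is a minimal NumPy sketch of one common way to compute principal components: an eigendecomposition of the covariance matrix. The toy data and all variable names are assumptions for illustration, and scikit-learn's own implementation may use a different numerical route.

```python
import numpy as np

# Toy data (an assumption for illustration): 200 samples, 5 correlated attributes.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# Center each attribute so variance is measured around the mean.
centered = data - data.mean(axis=0)

# Covariance matrix of the attributes, shape (5, 5).
cov = np.cov(centered, rowvar=False)

# Eigenvectors are the orthogonal directions, eigenvalues the variances
# along them. eigh returns them in ascending order, so sort descending.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the two leading directions and project the data onto them.
k = 2
components = eigenvectors[:, :k].T       # shape (2, 5)
projected = centered @ components.T      # shape (200, 2)

# Fraction of the total variance captured by each kept component.
explained_ratio = eigenvalues[:k] / eigenvalues.sum()
print(projected.shape)
print(explained_ratio)
```

Each row of `components` mixes all five original attributes, which is why a principal component is a new feature and not one of the original columns.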
- Before applying PCA in scikit-learn, standardize/normalize the data. PCA is sensitive to the scale of the attributes, so an attribute with a large range would otherwise dominate the components.
- PCA is imported from sklearn.decomposition. We need to select the required number of principal components.
- Usually, n_components is set to 2 or 3 for visualization, but the right value depends on the data.
- The standardized attributes are passed to the fit and transform methods (or to fit_transform, which does both in one call).
- The principal components can be inspected using components_, while the fraction of variance explained by each principal component is available in explained_variance_ratio_.
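The workflow described in the points above can be sketched as follows. The random data and the variable names are hypothetical and only serve to show the calls in order.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 6 attributes on very different scales.
rng = np.random.default_rng(42)
features = rng.normal(size=(100, 6)) * np.array([1, 10, 100, 1, 10, 100])

# Standardize so every attribute has zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)

# Keep two principal components; fit_transform fits and projects in one call.
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

print(reduced.shape)                   # (100, 2)
print(pca.components_.shape)           # (2, 6)
print(pca.explained_variance_ratio_)   # one fraction per component
```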
1. Import all the libraries
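A minimal set of imports that would cover the steps in this walkthrough might look like the following. The exact list is an assumption based on what the later steps use.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
```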
2. Loading Data
Load the breast cancer dataset using load_breast_cancer from sklearn.datasets. The dataset has 569 data items with 30 input attributes. There are two output classes: benign and malignant. With 30 input features, it is not possible to plot this data directly.
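A short sketch of the loading step, assuming the variable name `data` for the loaded dataset:

```python
from sklearn.datasets import load_breast_cancer

# load_breast_cancer returns a Bunch with the features, targets and names.
data = load_breast_cancer()

print(data.data.shape)      # (569, 30)
print(data.target_names)    # ['malignant' 'benign']
```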
3. Apply PCA
- Standardize the dataset prior to PCA.
- Import PCA from sklearn.decomposition.
- Choose the number of principal components.
Let us set it to 3. After executing this code, we see that the dimensions of x are (569, 3) while the dimensions of the original data are (569, 30). Thus, PCA has reduced the number of dimensions from 30 to 3. If we chose n_components=2, the data would be reduced to 2 dimensions.
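This step might be written as below. The names `principal` and `x` follow the text, while `scaled` is an assumed name for the standardized data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Standardize the 30 attributes before PCA.
scaled = StandardScaler().fit_transform(data.data)

# Reduce to three principal components.
principal = PCA(n_components=3)
x = principal.fit_transform(scaled)

print(data.data.shape, x.shape)   # (569, 30) (569, 3)
```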
4. Check Components
principal.components_ is an array in which the number of rows equals the number of principal components and the number of columns equals the number of features in the original data. Here there are three rows because n_components was set to 3, and each row has 30 columns, one per original feature.
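As a rough sketch, the shape of the components array can be checked like this. The extra step that lists the most heavily weighted features of the first component is an illustrative addition, not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
scaled = StandardScaler().fit_transform(data.data)
principal = PCA(n_components=3).fit(scaled)

# One row per principal component, one column per original feature.
print(principal.components_.shape)   # (3, 30)

# Features with the largest absolute weight in the first component.
top = np.argsort(np.abs(principal.components_[0]))[::-1][:3]
print(data.feature_names[top])
```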
5. Plot the components (Visualization)
Plot the principal components for better data visualization. Although we chose n_components=3, we plot both a 2D graph, using the first two principal components, and a 3D graph, using all three. The colors show the two output classes of the original dataset: benign and malignant. The plots show a clear separation between the two output classes.
In the plots, x[:,0] is the first principal component. Similarly, x[:,1] and x[:,2] are the second and third principal components.
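One possible way to draw both plots with matplotlib is sketched below. The figure layout, marker size and axis labels are assumptions, not the original plotting code.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
scaled = StandardScaler().fit_transform(data.data)
x = PCA(n_components=3).fit_transform(scaled)

fig = plt.figure(figsize=(12, 5))
ax2d = fig.add_subplot(1, 2, 1)
ax3d = fig.add_subplot(1, 2, 2, projection="3d")

# One scatter call per class so the legend can name the classes.
for label, name in enumerate(data.target_names):
    mask = data.target == label
    ax2d.scatter(x[mask, 0], x[mask, 1], s=10, label=name)
    ax3d.scatter(x[mask, 0], x[mask, 1], x[mask, 2], s=10, label=name)

ax2d.set_xlabel("PC 1")
ax2d.set_ylabel("PC 2")
ax2d.legend()

ax3d.set_xlabel("PC 1")
ax3d.set_ylabel("PC 2")
ax3d.set_zlabel("PC 3")
ax3d.legend()
```

Calling plt.show() afterwards displays the figure in an interactive session, and fig.savefig can be used to write it to a file.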
6. Calculate variance ratio
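explained_variance_ratio_ shows what fraction of the total variance is explained by each principal component.
array([0.44272026, 0.18971182, 0.09393163])
Here the three components explain about 44.3%, 19.0% and 9.4% of the variance respectively, which is about 72.6% in total.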
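The variance ratios can be checked with a short sketch like the one below. The cumulative sum is an extra step added here to show how much variance the components explain together. The printed values should be close to the array shown above, although the last digits may vary slightly between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
scaled = StandardScaler().fit_transform(data.data)
principal = PCA(n_components=3).fit(scaled)

# Fraction of the total variance explained by each component.
ratio = principal.explained_variance_ratio_
print(ratio)

# Running total: variance explained by the first 1, 2 and 3 components.
print(np.cumsum(ratio))
```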