
Model Selection with Probabilistic PCA and Factor Analysis (FA) in Scikit Learn

Last Updated : 30 Jun, 2023

In the field of machine learning, model selection plays a vital role in finding the most suitable algorithm for a given dataset. When dealing with dimensionality reduction tasks, methods such as Principal Component Analysis (PCA) and Factor Analysis (FA) are commonly employed. However, probabilistic PCA assumes isotropic (homoscedastic) noise, and when that assumption does not hold, for instance when each feature carries a different noise level, Factor Analysis can be a more appropriate alternative. In this article, we will explore how to perform model selection using Probabilistic PCA and Factor Analysis in Scikit-Learn, a popular Python library for machine learning.

Concepts related to the topic:

  1. Probabilistic PCA (PPCA): PPCA extends traditional PCA by incorporating a probabilistic framework. It assumes that the observed data is generated by linearly mapping low-dimensional latent variables into the observation space, followed by the addition of isotropic Gaussian noise (the same variance in every direction). PPCA estimates the latent variables and the noise parameter using maximum likelihood estimation, providing a probabilistic interpretation of the low-dimensional representation.
  2. Factor Analysis (FA): FA assumes a generative model in which the observed variables are linear combinations of the latent variables, plus Gaussian noise whose variance may differ from feature to feature. The goal is to estimate the latent variables and the loading matrix that represents the linear relationships between the observed and latent variables. FA also provides a probabilistic interpretation of the dimensionality reduction process. The two generative models are summarized below.
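
Both models share the same linear-Gaussian generative form and differ only in the noise covariance. In the usual textbook notation (introduced here for illustration, not taken from the article's code), an observed vector x is generated from a latent vector z as:

x = Wz + mu + epsilon, with z ~ N(0, I)

  • PPCA: epsilon ~ N(0, sigma^2 I), a single noise variance shared by all features (isotropic noise).
  • FA: epsilon ~ N(0, Psi) with Psi diagonal, a separate noise variance for each feature.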

Homoscedastic Noise

Homoscedastic noise is noise whose variance is constant across all values of the independent variables: every feature and every sample is perturbed with the same noise level. For example, a set of sensors that all share the same measurement-error variance produces homoscedastic noise.

1. Import the necessary libraries and create the Homoscedastic Noise dataset

Python3




import numpy as np

n_samples = 250
n_features = 30
mean = 0
sigma = 5

# Seed a reproducible random number generator
rng = np.random.RandomState(23)

# Generate homoscedastic noise: drawn from the same distribution
# (equal variance) for every sample and every feature
homo_noise = sigma * rng.rand(n_samples, n_features)

# Homoscedastic-noise dataset: Gaussian data plus the noise term
X_homoscedastic = rng.normal(mean, sigma, (n_samples, n_features)) + homo_noise

# Print the shape of the generated dataset
print("Homoscedastic Noise Dataset Shape:", X_homoscedastic.shape)


Output:

Homoscedastic Noise Dataset Shape: (250, 30)

2. Fit PCA and Factor Analysis and compute the cross-validation scores

Python3




from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import cross_val_score

pca = PCA(svd_solver="full")
fa = FactorAnalysis()

def compute_score(X, model, n_components):
    # Mean cross-validated score for each candidate
    # number of components
    score = []
    for n in n_components:
        model.n_components = n
        score.append(np.mean(cross_val_score(model, X)))
    return score

# Candidate numbers of components
n_components = [0, 5, 10, 15, 20, 25, 30]

pca_scores = compute_score(X_homoscedastic, pca, n_components)
fa_scores = compute_score(X_homoscedastic, fa, n_components)
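
Both PCA and FactorAnalysis implement a score method that returns the average log-likelihood of samples under the fitted Gaussian model, so cross_val_score uses that held-out log-likelihood by default. The n_components value that maximizes this curve is the model-selection choice.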


3. Plot the PCA and FA scores against the number of components

Python3




import matplotlib.pyplot as plt

plt.plot(n_components, pca_scores, "b", label="PCA scores")
plt.plot(n_components, fa_scores, "r", label="FA scores")
plt.xlabel("Number of components")
plt.ylabel("CV scores")
plt.legend()
plt.title("Homoscedastic Noise")
plt.show()


Output:

[Figure: Cross-validation scores of PCA and FA versus the number of components on the homoscedastic-noise dataset]
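
Because homoscedastic noise satisfies PPCA's isotropic-noise assumption, the PCA and FA likelihood curves are typically close to each other on this dataset.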

Heteroscedastic Noise

Heteroscedastic noise is noise whose variance differs across values of the independent variables, for example when each feature is measured by an instrument with a different error level. It often arises from complex relationships between variables and non-linear patterns.

1. Import the necessary libraries and create the Heteroscedastic Noise dataset

Python3




# Generate a dataset with heteroscedastic noise
n_samples = 1000
n_features = 30
mean = 0
sigma = 2.5

# Seed a reproducible random number generator
rng = np.random.RandomState(23)

# A different noise scale for each feature
sigmas = sigma * rng.rand(n_features)

# Heteroscedastic noise: the variance varies from feature to feature
hetero_noise = sigmas * rng.normal(mean, sigma, (n_samples, n_features))
X_heteroscedastic = rng.normal(mean, sigma, (n_samples, n_features)) + hetero_noise

print("Heteroscedastic Noise Dataset Shape:", X_heteroscedastic.shape)


Output:

Heteroscedastic Noise Dataset Shape: (1000, 30)

2. Fit PCA and Factor Analysis and compute the cross-validation scores

Python3




pca = PCA(svd_solver="full")
fa = FactorAnalysis()

# Reuse the compute_score helper defined in the previous section

# Candidate numbers of components
n_components = [0, 5, 10, 15, 20, 25, 30]

pca_scores = compute_score(X_heteroscedastic, pca, n_components)
fa_scores = compute_score(X_heteroscedastic, fa, n_components)


3. Plot the PCA and FA scores against the number of components

Python3




import matplotlib.pyplot as plt

plt.plot(n_components, pca_scores, "g", label="PCA scores")
plt.plot(n_components, fa_scores, "r", label="FA scores")
plt.xlabel("Number of components")
plt.ylabel("CV scores")
plt.legend()
plt.title("Heteroscedastic Noise")
plt.show()


Output:

[Figure: Cross-validation scores of PCA and FA versus the number of components on the heteroscedastic-noise dataset]
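
Here the per-feature noise scales violate PPCA's isotropic-noise assumption, so Factor Analysis, which fits a separate noise variance for each feature, typically attains the higher cross-validated log-likelihood.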

Example:

To illustrate the process of model selection with Probabilistic PCA and Factor Analysis (FA) using Scikit-learn, let’s consider an example where we apply these techniques to the Digits dataset. We will use the GridSearchCV class to perform model selection and find the best parameters for both PCA and FA. The code snippet provided below demonstrates how to load the dataset, define the parameter grid, fit the models, and access the best models and their parameters. Additionally, it showcases the transformation of the data using the best models. The corresponding output highlights the best model parameters and the transformed data obtained from both PCA and Factor Analysis.

Python3




import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
 
# Load the digits dataset (each sample is a flattened 8x8 image,
# i.e. a vector of 64 pixel features)
X = datasets.load_digits().data
n_samples, n_features = X.shape
 
# Define the parameter grid
param_grid = {'n_components': [2, 5, 10]}
 
# Perform model selection using GridSearchCV
ppcamodel = GridSearchCV(PCA(), param_grid=param_grid)
famodel = GridSearchCV(FactorAnalysis(), param_grid=param_grid)
 
# Fit the models
ppcamodel.fit(X)
famodel.fit(X)
 
# Access the best model and its parameters
best_ppca_model = ppcamodel.best_estimator_
best_ppca_params = ppcamodel.best_params_
 
best_fa_model = famodel.best_estimator_
best_fa_params = famodel.best_params_
 
# Apply the best model for dimensionality reduction
X_ppca = best_ppca_model.transform(X)
X_fa = best_fa_model.transform(X)
 
# Print the best models and their parameters
print("Best PCA Model:")
print(best_ppca_model)
print("Best PCA Parameters:")
print(best_ppca_params)
 
print("\nBest Factor Analysis Model:")
print(best_fa_model)
print("Best Factor Analysis Parameters:")
print(best_fa_params)
 
# Print the transformed data
print("\nTransformed Data using PCA:")
print(X_ppca)
print("\nTransformed Data using Factor Analysis:")
print(X_fa)


Output:

Best PCA Model:
PCA(n_components=10)
Best PCA Parameters:
{'n_components': 10}

Best Factor Analysis Model:
FactorAnalysis(n_components=10)
Best Factor Analysis Parameters:
{'n_components': 10}

Transformed Data using PCA:
[[ -1.25946749  21.27488252  -9.46305634 ...   2.55462354  -0.58278883
    3.62919484]
 [  7.95760992 -20.76870158   4.43950645 ...  -4.6158487    3.58974259
   -1.07981018]
 [  6.99191503  -9.95598027   2.95855308 ... -16.41785644   0.71701599
    4.25521831]
 ...
 [ 10.80128272  -6.960248     5.59955483 ...  -7.4183565   -3.96726241
  -13.06151415]
 [ -4.87210282  12.42395632 -10.17086414 ...  -4.36248613   3.93943916
  -13.15159048]
 [ -0.3443907    6.36555315  10.77370724 ...   0.66827285  -4.11461914
  -12.56011197]]

Transformed Data using Factor Analysis:
[[-0.13967445 -0.34673074  0.5564195  ... -0.83954767  0.09716367
   0.34834512]
 [-0.87463488 -0.21243303 -0.4980917  ...  0.00387598 -0.26655999
   0.79413425]
 [-1.07614501  0.64196322 -0.27097307 ... -1.32789623 -0.91709769
  -1.66106236]
 ...
 [-0.70284004 -0.07191784 -0.69943904 ... -0.62932638 -1.31438831
   1.22479397]
 [-0.33269469 -0.0346382   1.36587899 ... -0.87243003 -0.0784538
   0.63416391]
 [ 0.60585414  0.83341048 -0.34026351 ...  0.1495371  -0.94304955
   0.74673146]]

The output demonstrates the results of the model selection process using Probabilistic PCA and Factor Analysis in Scikit-learn. It includes the best model and its parameters, as well as the transformed data obtained from the best models.

The output begins by displaying the best PCA model and its parameters. In this example, the best PCA model has n_components=10, indicating that it reduces the dimensionality of the input data to 10 components. Similarly, the best Factor Analysis model and its parameters are shown, where n_components=10 denotes the number of components retained in the transformed data. Note that the grid only contained the candidate values 2, 5, and 10, so "best" here means the best of those candidates under the models' cross-validated log-likelihood.

Following the model information, the transformed data using PCA and Factor Analysis is presented. The transformed data represents the original input data projected onto the lower-dimensional space determined by the selected models. The PCA-transformed data is displayed as an array of shape (n_samples, n_components), while the Factor Analysis-transformed data is also shown in a similar format.
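
One practical follow-up is to compare the two selected models directly on the same cross-validated log-likelihood that GridSearchCV optimized. Below is a minimal sketch, assuming X, best_ppca_model, and best_fa_model from the snippet above are still in scope:

Python3

import numpy as np
from sklearn.model_selection import cross_val_score

# Mean held-out log-likelihood of each selected model
ppca_ll = np.mean(cross_val_score(best_ppca_model, X))
fa_ll = np.mean(cross_val_score(best_fa_model, X))

print("PPCA mean CV log-likelihood:", ppca_ll)
print("FA mean CV log-likelihood:", fa_ll)

Whichever model scores higher explains the held-out data better under its own Gaussian model, which is a reasonable tie-breaker when both grids select the same n_components.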

Comparison and Choosing the Right Method:

Probabilistic PCA and Factor Analysis are both popular methods for dimensionality reduction, but they have distinct characteristics that may influence the choice between them. Here are some points to consider when deciding which method to use:

  • Objective: Probabilistic PCA aims to find a low-dimensional representation of the data that maximizes the likelihood of the observed data. It assumes a probabilistic generative model where the observed data is assumed to be generated from a lower-dimensional latent space. On the other hand, Factor Analysis assumes a linear relationship between the observed variables and the latent factors, with an added noise term. The objective of Factor Analysis is to estimate the latent factors that underlie the observed data.
  • Assumptions: Probabilistic PCA assumes that the observed data follows a Gaussian distribution, while Factor Analysis assumes that the observed data is a linear combination of the latent factors and an additional noise term. Therefore, if the underlying data distribution deviates from these assumptions, the results may be affected.
  • Dimensionality Reduction: Both methods perform linear dimensionality reduction; neither captures non-linear relationships. The practical difference lies in the noise model: Probabilistic PCA assumes a single noise variance shared by all features (isotropic noise), whereas Factor Analysis fits a separate noise variance for every feature and can therefore handle heteroscedastic noise. A short sketch contrasting the two fitted noise models follows this list.
  • Interpretability: Factor Analysis provides a more interpretable representation since it explicitly estimates the relationship between the observed variables and the latent factors. The latent factors can be interpreted as underlying factors influencing the observed data. In contrast, Probabilistic PCA focuses on finding the low-dimensional representation without explicitly interpreting the latent factors.
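
To make the noise-model difference concrete, the following sketch (an illustration, not part of the original example) fits both models to the heteroscedastic dataset generated earlier and inspects their noise_variance_ attributes: scikit-learn's PCA estimates a single scalar noise variance, while FactorAnalysis estimates one per feature.

Python3

from sklearn.decomposition import PCA, FactorAnalysis

# Fit both models with the same number of components
pca = PCA(n_components=10, svd_solver="full").fit(X_heteroscedastic)
fa = FactorAnalysis(n_components=10).fit(X_heteroscedastic)

# PPCA: one shared (isotropic) noise variance, a scalar
print("PPCA noise variance:", pca.noise_variance_)

# FA: one noise variance per feature, a length-30 vector here
print("FA noise variances:", fa.noise_variance_)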

Comparison table between Probabilistic PCA and Factor Analysis:

| S.No. | Feature | Probabilistic PCA | Factor Analysis |
|-------|---------|-------------------|-----------------|
| 1. | Objective | Maximizes the likelihood of the observed data | Estimates the latent factors underlying the observed data |
| 2. | Assumptions | Gaussian observed data with isotropic noise | A linear relationship between observed variables and latent factors, plus per-feature Gaussian noise |
| 3. | Dimensionality reduction | Linear projection onto principal directions | Linear projection onto latent factors |
| 4. | Interpretability | Less interpretable; focuses on the low-dimensional representation | More interpretable; explicitly estimates the relationship between observed variables and latent factors |
| 5. | Data distribution | Assumes a Gaussian distribution | Assumes a linear combination of latent factors plus a noise term |
| 6. | Noise model | A single noise variance shared by all features (homoscedastic) | A separate noise variance for each feature (handles heteroscedastic noise) |
| 7. | Data type | Suitable for high-dimensional data | Suitable for understanding the underlying factors in observed data |
| 8. | Performance | Good for dimensionality reduction | Good for interpreting relationships and understanding underlying factors |

In general, Probabilistic PCA is suitable when the noise level can be assumed to be roughly the same across features and when interpretability of the latent factors is not the primary concern. Factor Analysis, on the other hand, is preferred when the noise level varies across features or when interpretability and understanding the underlying factors driving the observed data are important.

The choice between Probabilistic PCA and Factor Analysis therefore depends on the characteristics of the dataset and the goals of the analysis. It is recommended to experiment with both methods and evaluate their performance in terms of cross-validated log-likelihood and the interpretability of the results.

Conclusion:

In this article, we explored the utilization of Probabilistic PCA and Factor Analysis in Scikit-Learn for model selection in dimensionality reduction tasks. By leveraging Scikit-Learn’s GridSearchCV, we efficiently evaluated various parameter combinations and identified the best models based on the specified scoring metric.

Both Probabilistic PCA and Factor Analysis offer valuable techniques for dimensionality reduction, each with its own strengths. Probabilistic PCA excels on high-dimensional datasets whose noise level is roughly uniform across features, while Factor Analysis provides interpretable representations by uncovering latent factors and accommodates feature-specific noise levels.

The choice between Probabilistic PCA and Factor Analysis depends on the specific characteristics of the dataset and the objectives of the analysis. Probabilistic PCA is suitable when the noise is close to homoscedastic, whereas Factor Analysis is preferable when the noise is heteroscedastic or when interpretability and understanding underlying factors are paramount.

By applying these techniques, researchers and practitioners can effectively reduce the dimensionality of datasets, leading to improved performance in subsequent machine learning tasks. Dimensionality reduction not only reduces computational complexity but also eliminates noise and irrelevant features, ultimately enhancing model accuracy.

In summary, Probabilistic PCA and Factor Analysis serve as powerful tools for dimensionality reduction in Scikit-Learn. Understanding their strengths and characteristics enables us to select the most appropriate approach for our specific dataset and analysis goals. Incorporating model selection techniques such as GridSearchCV further allows us to fine-tune parameters and identify the optimal models. By harnessing these techniques, we can extract insights from high-dimensional data and enhance the efficiency and accuracy of our machine learning workflows.


