Partial Least Squares (PLS) Canonical

Last Updated : 23 Nov, 2023

In the realm of machine learning, it’s essential to have a diverse toolkit to solve various complex problems. Partial Least Squares (PLS) Canonical, a technique rooted in both regression and dimensionality reduction, has gained significant traction in recent years. This method, which finds patterns in data by projecting it onto a lower-dimensional space, has been successfully implemented in the Scikit-learn (Sklearn) library, offering a robust solution for predictive modelling and analysis. In this article, we’ll delve into the world of PLS Canonical, exploring its principles, applications, and implementation within Sklearn.

Understanding PLS Canonical

At its core, PLS Canonical is a multivariate statistical technique used for modelling the relationship between two sets of variables, commonly referred to as X and Y. Unlike traditional regression methods, it performs dimensionality reduction as part of the modelling itself, and it works exceptionally well when the predictor variables (features) are numerous and potentially correlated, making it a strong choice for dealing with multicollinearity and high-dimensional data.

The primary goal of PLS Canonical is to find the latent variables (also known as components) within X and Y that maximize the covariance between them. These latent variables serve as a condensed representation of the original data, capturing the essential information needed for modelling while minimizing noise. The basic idea is to project the data into a lower-dimensional space, where the new variables (latent variables) are linear combinations of the original variables.
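To make this concrete, the following minimal NumPy sketch shows a single latent variable formed as a linear combination of the original columns. The data matrix and the weight vector here are arbitrary values chosen purely for illustration; in PLS Canonical the weights are learned from the data.

Python

import numpy as np

# Toy data: 5 samples, 3 features (arbitrary values for illustration)
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 1.0],
              [3.0, 1.0, 2.0],
              [4.0, 3.0, 5.0],
              [5.0, 5.0, 4.0]])

# An arbitrary unit-norm weight vector (in PLS it would be learned)
w = np.array([0.5, 0.5, 0.5])
w = w / np.linalg.norm(w)

# The latent variable (score) is a linear combination of the columns of X
t = X @ w
print(t)  # one score per sample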

Features of PLS Canonical

  • Handling Multicollinearity: PLS Canonical addresses multicollinearity by identifying latent variables that explain the maximum covariance between the predictor variables (X) and the target variables (Y). By doing so, it creates a set of uncorrelated components, which are used for modelling.
  • Preventing Overfitting: PLS Canonical inherently performs dimensionality reduction, reducing the risk of overfitting. By extracting the most informative components, it captures the essential patterns in the data while avoiding the noise, leading to a more generalized model.
  • Improved Generalization: With PLS Canonical, the model is built on a smaller set of components rather than the original high-dimensional space. This compact representation allows the model to generalize better to unseen data, especially in cases where the original features are redundant or irrelevant.

Workings of PLS-Canonical

PLS-Canonical is a multivariate statistical method used to identify and quantify the relationships between two sets of variables. It is closely related to canonical correlation analysis (CCA), a classical method for finding linear relationships between two sets of variables; the key difference is that CCA maximizes the correlation between the latent variables, whereas PLS-Canonical maximizes their covariance.

PLS-Canonical works by constructing two latent variable spaces, one for each set of variables. The latent variable spaces are constructed in such a way that they maximize the covariance between the two spaces. Once the latent variable spaces have been constructed, PLS-Canonical calculates the canonical correlations between the two spaces. The canonical correlations are measures of the strength and direction of the relationships between the two sets of variables.
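As a small illustration of this idea, the sketch below fits scikit-learn's PLSCanonical on two synthetic blocks of variables and then measures the correlation between each pair of latent scores. The synthetic data and the choice of two components are assumptions made only for this demonstration.

Python

import numpy as np
from sklearn.cross_decomposition import PLSCanonical

rng = np.random.RandomState(0)

# Two synthetic, related blocks of variables (assumed data for the demo)
X = rng.normal(size=(200, 4))
Y = X @ rng.normal(size=(4, 3)) + 0.5 * rng.normal(size=(200, 3))

# Build the two latent spaces and project each block onto them
pls = PLSCanonical(n_components=2)
x_scores, y_scores = pls.fit_transform(X, Y)

# Correlation between each pair of latent variables
for k in range(2):
    r = np.corrcoef(x_scores[:, k], y_scores[:, k])[0, 1]
    print(f"component {k + 1}: score correlation = {r:.3f}")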

Steps involved in PLS Canonical

  1. Standardization: Before starting the analysis, it is common practice to standardize the predictor and response variables to have zero mean and unit variance to ensure that variables with different scales do not dominate the analysis.
  2. Initialization: PLS Canonical starts by initializing the weight vectors w_x and w_y. Typically, these vectors are initialized randomly or set to the first principal components of X and Y.
  3. Iterative Process: PLS Canonical then iteratively finds the latent variables and updates the weight vectors to maximize the covariance between X and Y. The latent variables, denoted t and u, are linear combinations of the original predictor variables (X) and response variables (Y); the number of latent variables to extract is typically determined through cross-validation.

In each iteration, the following steps are performed:

Calculation of Latent Variables

The latent variable in X, denoted t = Xw_x, is obtained by projecting X onto the weight vector w_x. Similarly, the latent variable in Y, denoted u = Yw_y, is obtained by projecting Y onto the weight vector w_y.

In other words, the weight vectors define the directions along which X and Y are projected to form the latent variables.

Calculation of Weights

The weight vectors w_x and w_y are updated to maximize the covariance between t and u. This is done by finding the directions in X and Y that maximize this covariance. The updated weight vectors are denoted as w_x' and w_y', where:

w_x' = \frac{X'u}{u'u}

w_y' = \frac{Y't}{t't}

The numerator of each equation is the covariance between the data matrix and the latent variable of the other block (X with u, and Y with t), while the denominator is a normalization factor. X' and Y' denote the transposes of X and Y.

Deflation

After obtaining the updated weight vectors, the latent variables are deflated, i.e., the contributions of the linear combinations t and u are subtracted from X and Y, respectively, so that subsequent components focus on the remaining covariance.

Iteration and Final Model

The above steps are repeated for a specified number of iterations or until convergence is achieved. After the desired number of latent variables is obtained, the final PLS Canonical model is built using the calculated latent variables and weight vectors.
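The following NumPy sketch ties the steps above together for a single component. It is an illustrative implementation of the described loop, not scikit-learn's internal code; the initialization, convergence tolerance, and iteration cap are assumptions chosen for demonstration.

Python

import numpy as np

def pls_canonical_one_component(X, Y, max_iter=500, tol=1e-06):
    # Illustrative NIPALS-style loop for one PLS Canonical component.
    # max_iter and tol are assumed values for demonstration only.
    u = Y[:, 0].copy()              # initialize u from the first column of Y
    w_x = np.zeros(X.shape[1])

    for _ in range(max_iter):
        # Weight update: direction in X that covaries with u
        w_x_new = X.T @ u / (u @ u)
        w_x_new /= np.linalg.norm(w_x_new)
        t = X @ w_x_new             # latent variable (score) in X

        # Weight update: direction in Y that covaries with t
        w_y = Y.T @ t / (t @ t)
        w_y /= np.linalg.norm(w_y)
        u = Y @ w_y                 # latent variable (score) in Y

        # Stop when the X-weights no longer change
        if np.linalg.norm(w_x_new - w_x) < tol:
            w_x = w_x_new
            break
        w_x = w_x_new

    # Deflation: subtract the extracted component from each block
    p = X.T @ t / (t @ t)           # X loadings
    q = Y.T @ u / (u @ u)           # Y loadings
    return t, u, w_x, w_y, X - np.outer(t, p), Y - np.outer(u, q)

# Tiny usage example on synthetic, centred data
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5))
Y = X[:, :2] + 0.1 * rng.normal(size=(50, 2))
t, u, w_x, w_y, X_d, Y_d = pls_canonical_one_component(X, Y)
print("score correlation:", np.corrcoef(t, u)[0, 1])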

Applications of PLS Canonical

  • Predictive Modeling: PLS Canonical is widely used in fields like chemometrics, bioinformatics, and social sciences for building predictive models. Its ability to handle datasets with numerous variables and limited samples makes it valuable in scenarios where accurate predictions are crucial.
  • Feature Selection and Extraction: In high-dimensional datasets, identifying relevant features is a challenge. PLS Canonical aids in feature selection by highlighting the variables that contribute significantly to the relationship between X and Y. Additionally, it can be employed for feature extraction, creating composite variables that encapsulate the original features’ information.
  • Data Fusion: PLS Canonical is employed in data fusion tasks, where information from multiple sources is integrated to enhance modeling accuracy. By combining diverse datasets, it allows for a more comprehensive analysis and better-informed decision-making.

Implementing PLS Canonical in Sklearn

Sklearn, a powerful and versatile machine learning library in Python, provides an implementation of PLS Canonical through the PLSCanonical class in the sklearn.cross_decomposition module.

Step 1: Importing the Necessary Modules

Python

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.cross_decomposition import PLSCanonical
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

Step 2: Dataset Loading and Splitting

Let’s load the diabetes dataset from sklearn and then split the data into training and testing sets using train_test_split, with 80% of the data used for training and 20% for testing. random_state ensures reproducibility.

Python

# Load the diabetes dataset as a DataFrame
df = load_diabetes(as_frame=True).frame
X = df.drop('target', axis=1)   # predictor variables
y = df['target']                # response variable

# 80/20 train-test split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

Step 3: Standardizing the Variables

Python

# Standardize features to zero mean and unit variance;
# fit the scaler on the training set only to avoid data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Step 4: Creating the PLS Canonical Model

  • The PLSCanonical class is used to create the PLS Canonical model.
  • The n_components parameter determines the number of latent components to extract; here we extract a single component.

Python

# Create a PLS Canonical model that extracts one latent component
pls_canonical = PLSCanonical(n_components=1)
pls_canonical.fit(X_train_scaled, y_train)

Step 5: Transforming the Data

Python

# Project the scaled features onto the learned latent component
X_train_transformed = pls_canonical.transform(X_train_scaled)
X_test_transformed = pls_canonical.transform(X_test_scaled)

Step 6: Building and Training a Model

Python

# Fit a linear regression on the PLS component and predict on the test set
model = LinearRegression()
model.fit(X_train_transformed, y_train)
predictions = model.predict(X_test_transformed)

Step 7: Evaluation Metrics

Mean Squared Error (MSE) measures the average squared difference between the actual and predicted values.
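For n test samples with true values y_i and predictions \hat{y}_i, it is computed as:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2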

Python

mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)

                    

Output:

Mean Squared Error: 3185.0242475575124

An MSE of about 3185 means that, on average, the squared difference between the predicted and actual target values is about 3185. Lower MSE values indicate better regression performance.
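Because MSE is expressed in squared target units, its square root (RMSE) is often easier to interpret, since it is back on the original scale of the target. The snippet below assumes the mse variable computed in the previous step.

Python

import numpy as np

# RMSE is in the same units as the target variable
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)  # about 56.4 for the MSE above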

Conclusion

Partial Least Squares (PLS) Canonical, with its ability to handle high-dimensional and correlated data, is a valuable addition to any data scientist’s toolkit. Its implementation in Scikit-learn simplifies the process of building predictive models, performing feature selection, and integrating diverse datasets. By understanding the principles and applications of PLS Canonical, data scientists can unlock new avenues for solving complex problems and gaining deeper insights from their data.


