
Develop Multioutput Regression Model using Python

Multioutput regression is a specialized form of supervised machine learning that predicts multiple target variables simultaneously. While traditional regression predicts a single numerical value (the target variable) from a set of input features, multioutput regression extends this idea to predict several numerical values at once, a valuable technique in real-world applications with multiple dependent variables or complex relationships to model. In this article, we implement scikit-learn's multioutput regressor and compare its performance with traditional tree-based and ensemble models.

What is Multioutput Regression?

Multioutput regression, or multi-target regression, tackles problems where we need to predict more than one continuous target variable; it is an extension of traditional regression, which focuses on predicting a single target. In multioutput regression, each target variable is treated as a separate regression problem, and the goal is to build a model that generates accurate predictions for all of them simultaneously.
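
A quick way to see what a multioutput target looks like is to generate a synthetic dataset. This is a minimal sketch, assuming an illustrative problem size (not the article's dataset), using scikit-learn's make_regression with n_targets=3:

from sklearn.datasets import make_regression

# A synthetic problem with 3 continuous targets per sample (sizes are illustrative)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, n_targets=3, random_state=42)
print(X_demo.shape)  # (100, 5): 100 samples, 5 input features
print(y_demo.shape)  # (100, 3): 3 target values per sample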



How to solve Multioutput Regression

Every problem has its solution, and multioutput regression has several, listed below (a short sketch contrasting a natively multioutput model with a wrapped single-output one follows the list):

  1. Multioutput Regressor: scikit-learn's MultiOutputRegressor is a meta-estimator designed specifically for multioutput regression tasks. We discuss its key features and implementation later.
  2. Multioutput Linear Regression: This extends simple linear regression to handle multiple target variables, predicting each target as a linear combination of the input features. However, such a model cannot capture complex, non-linear relationships among features.
  3. Decision Trees and Random Forests: Tree-based models can be used for multioutput regression because they capture non-linear relationships between features and targets. We will compare these two with the multioutput regressor.
  4. Neural Networks: Deep learning models can be adapted to multioutput regression by using one output neuron per target variable, at the cost of greater complexity, memory consumption and training expense.
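
To make the distinction concrete, here is a minimal sketch (the model choices and data are illustrative): LinearRegression accepts a 2-D target array natively, while a strictly single-output estimator such as SVR must be wrapped in MultiOutputRegressor:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

X_demo, y_demo = make_regression(n_samples=100, n_features=5, n_targets=3, random_state=0)

# LinearRegression accepts a 2-D target array natively
linear = LinearRegression().fit(X_demo, y_demo)
print(linear.predict(X_demo[:2]).shape)   # (2, 3)

# SVR is strictly single-output, so we wrap it: one SVR is fitted per target
wrapped = MultiOutputRegressor(SVR()).fit(X_demo, y_demo)
print(wrapped.predict(X_demo[:2]).shape)  # (2, 3)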

Parameters of Multioutput Regression

An extension of conventional machine learning models to accommodate multioutput regression tasks, the MultiOutputRegressor wrapper class in scikit-learn comes in very handy when you wish to predict several continuous target variables at once. The MultiOutputRegressor class takes the following parameters:

  1. estimator: A base regressor object implementing fit and predict. One clone of this estimator is fitted per target variable.
  2. n_jobs: The number of jobs to run in parallel when fitting the per-target estimators. None means 1, and -1 uses all available processors.

Multioutput Regressor Model

The Multioutput Regressor in scikit-learn is a wrapper, or meta-estimator, which allows us to extend single-output regression models to multioutput regression. It is a convenient way to tackle tasks where we need to predict multiple target variables simultaneously using one or more base regression models. Some of its key features are discussed below, followed by a short construction sketch:

  1. Extension of single-output models: MultiOutputRegressor can wrap any regressor that supports single-output regression, extending it to multioutput regression tasks.
  2. Parallel prediction: MultiOutputRegressor treats each target variable as an independent single-output regression problem and fits one base regressor per target, which allows the per-target models to be fitted and used in parallel.
  3. Consistency: It ensures that the base regressors are fitted independently and that their predictions are combined into a single multioutput format.
  4. Flexibility: We can use a wide range of base regression models as the internal regressors, such as linear regression, decision trees, support vector machines and random forests.
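
A minimal construction sketch, assuming an illustrative base estimator and n_jobs value (not the configuration used later in the article):

from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

# estimator: the single-output base model, cloned once per target
# n_jobs=-1: fit the per-target clones in parallel on all available cores
model = MultiOutputRegressor(estimator=Ridge(alpha=1.0), n_jobs=-1)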

Implementation of Multioutput Regression

In this implementation, we are going to explore the use of scikit-learn for multioutput prediction. The sklearn.multioutput module applies to a variety of real-world tasks, including multi-target regression (the focus here) and multi-label classification, and it is especially helpful when each data point has several related target values.

Importing required libraries




import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

This code imports NumPy, Matplotlib, Pandas, Seaborn and the scikit-learn modules needed for the rest of the article: datasets for loading the example data, train_test_split for splitting it, MultiOutputRegressor as the wrapper model, ElasticNet, DecisionTreeRegressor and RandomForestRegressor as the regressors to compare, and mean_squared_error and mean_absolute_error as the evaluation metrics.

Dataset loading and Splitting




# Load the Linnerud dataset
linnerud = datasets.load_linnerud()
X, y = linnerud.data, linnerud.target
 
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

This code loads the Linnerud dataset using scikit-learn's datasets module and separates the input data (X, three exercise measurements) from the target variables (y, three physiological measurements). It then splits the dataset into training and testing sets in an 80-20 ratio, with random_state=42 to ensure reproducibility. This makes it possible to train and assess the machine learning models.
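
Before modeling, it is worth confirming the multioutput shape of the data. A minimal check, run right after the loading code above (the printed values reflect the standard Linnerud dataset):

print(linnerud.feature_names)  # ['Chins', 'Situps', 'Jumps']
print(linnerud.target_names)   # ['Weight', 'Waist', 'Pulse']
print(X.shape, y.shape)        # (20, 3) (20, 3): 20 samples, 3 features, 3 targets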

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an essential phase of data analysis that builds a thorough understanding of a dataset's properties. It entails highlighting key characteristics, finding patterns and spotting trends, and it aids in analyzing data distributions, detecting missing values and identifying outliers. Through charts such as scatter plots and histograms, EDA guides feature selection and offers insights into relationships within the data. Altogether, EDA improves data-driven decision making by informing the later phases of data preparation and model construction.

Correlation Matrix

Visualizing the correlation matrix will help us understand the relationships between the different features in the dataset.




# Create a DataFrame for the Linnerud dataset
df = pd.DataFrame(data=X, columns=linnerud.feature_names)
# Calculate the correlation matrix
correlation_matrix = df.corr()
 
# Plot a heatmap of the correlation matrix
plt.figure(figsize=(4, 3))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

Output:

Correlation matrix for Linnerud Dataset

These lines of code build a DataFrame from the Linnerud features and then compute their correlation matrix, which is plotted as a heatmap using the sns.heatmap function. The heatmap shows the pairwise relationships visually, with the value annotated in each cell and the color denoting the correlation's strength and direction. This aids in determining how the dataset's variables are connected.

Distribution of the Target Variables

This dataset has three target variables: Weight, Waist and Pulse. Visualizing their distributions will help us understand their behavior.




# Build a DataFrame from the targets (y, not X) so the columns match target_names
df = pd.DataFrame(data=y, columns=linnerud.target_names)
# Plot the distribution of the target variables
plt.figure(figsize=(10, 4))
plt.subplot(1, 3, 1)
sns.histplot(df['Weight'], kde=True, color='green')
plt.title('Distribution of Weight')
 
plt.subplot(1, 3, 2)
sns.histplot(df['Waist'], kde=True, color='green')
plt.title('Distribution of Waist')
 
plt.subplot(1, 3, 3)
sns.histplot(df['Pulse'], kde=True, color='green')
plt.title('Distribution of Pulse')
 
plt.tight_layout()
plt.show()

Output:

Distribution of the target variable

This code creates a DataFrame named df from the three Linnerud target variables, Weight, Waist and Pulse, and then builds a figure with three side-by-side subplots, one histogram per variable. The kde=True option overlays a kernel density estimate on each histogram to improve the visualization, all drawn in green and titled after the variable shown ("Distribution of Weight", "Distribution of Waist" and "Distribution of Pulse"). The plt.tight_layout() call keeps the subplots from overlapping, and plt.show() renders the full figure, giving further insight into how the target variables are distributed.

Model training

Now we will train scikit-learn's Multioutput Regressor model. As discussed earlier, we will also train a traditional tree-based model (Decision Tree) and an ensemble model (Random Forest) for the performance comparison later in this article.




# Create and train the multioutput regression model (ElasticNet)
multioutput_model = MultiOutputRegressor(
    ElasticNet(alpha=0.5, l1_ratio=0.5), n_jobs=5)
multioutput_model.fit(X_train, y_train)
 
# Create and train the decision tree regressor model
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)
 
# Create and train the random forest regressor model
forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
forest_model.fit(X_train, y_train)

In this code, we create and train three different models for the Linnerud dataset. First, multioutput_model wraps ElasticNet regression, configured with alpha=0.5 and l1_ratio=0.5, in a MultiOutputRegressor so that one ElasticNet is fitted per target variable; setting n_jobs=5 allows up to five of these per-target fits to run in parallel (with only three targets, at most three jobs are actually used). Second, tree_model is a DecisionTreeRegressor, a single tree-based model that learns the relationships between input features and target variables and supports multioutput targets natively. Finally, forest_model is a RandomForestRegressor of 100 decision trees, an ensemble approach that likewise handles multiple targets natively. All three models are trained on the training data and are ready to make predictions on the testing set, enabling a comparison of their performance in predicting the target variables.

The MultiOutputRegressor is a wrapper class offered by scikit-learn that extends standard single-output machine learning models to multioutput problems. Paired here with ElasticNet, it clones the base estimator once per target variable, so problems that require simultaneous prediction of several continuous variables can be solved with an otherwise single-output model.
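
One convenient consequence of this design is that the fitted per-target models remain accessible through the wrapper's estimators_ attribute. A minimal inspection sketch, run after the training code above:

# One fitted ElasticNet per target: Weight, Waist, Pulse
for name, est in zip(linnerud.target_names, multioutput_model.estimators_):
    print(name, est.coef_)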

Model evaluation

Now we will evaluate all the regression models in terms of MSE and MAE, two commonly used regression performance metrics.




# Make predictions
multioutput_pred = multioutput_model.predict(X_test)
tree_pred = tree_model.predict(X_test)
forest_pred = forest_model.predict(X_test)
 
# Calculate performance metrics for multioutput model
multioutput_mse = mean_squared_error(y_test, multioutput_pred)
multioutput_mae = mean_absolute_error(y_test, multioutput_pred)
 
 
# Calculate performance metrics for decision tree model
tree_mse = mean_squared_error(y_test, tree_pred)
tree_mae = mean_absolute_error(y_test, tree_pred)
 
# Calculate performance metrics for random forest model
forest_mse = mean_squared_error(y_test, forest_pred)
forest_mae = mean_absolute_error(y_test, forest_pred)
 
# Print the performance metrics
print("Multioutput Model - Mean Squared Error:", multioutput_mse)
print("Multioutput Model - Mean Absolute Error:", multioutput_mae)
print("Decision Tree Model - Mean Squared Error:", tree_mse)
print("Decision Tree Model - Mean Absolute Error:", tree_mae)
print("Random Forest Model - Mean Squared Error:", forest_mse)
print("Random Forest Model - Mean Absolute Error:", forest_mae)

Output:

Multioutput Model - Mean Squared Error: 236.22543973611653
Multioutput Model - Mean Absolute Error: 10.015359327324276
Decision Tree Model - Mean Squared Error: 371.9166666666667
Decision Tree Model - Mean Absolute Error: 12.083333333333334
Random Forest Model - Mean Squared Error: 242.75831666666667
Random Forest Model - Mean Absolute Error: 10.656666666666666

This code uses the test data to assess the performance of the three regression models: the multioutput ElasticNet, the Decision Tree and the Random Forest. Each model generates predictions, which are scored with Mean Squared Error (MSE) and Mean Absolute Error (MAE), two popular regression metrics; lower values indicate a better fit and more accurate predictions. Because the code computes and prints both metrics for each model, their effectiveness on this multioutput regression problem can be compared directly, and the best model for the job chosen accordingly. Here the wrapped ElasticNet achieves the lowest MSE and MAE, narrowly ahead of the Random Forest, with the single Decision Tree trailing both.
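
Both metrics above average the error across the three targets (scikit-learn's default, multioutput='uniform_average'). Passing multioutput='raw_values' yields one score per target instead, as in this minimal sketch, run after the evaluation code above:

from sklearn.metrics import mean_squared_error

# One MSE per target column: Weight, Waist, Pulse
per_target_mse = mean_squared_error(y_test, multioutput_pred, multioutput='raw_values')
for name, mse in zip(linnerud.target_names, per_target_mse):
    print(f"{name}: MSE = {mse:.2f}")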

Performance comparison visualization

Now we will visualize how the models performed and which one outperforms the others.




# Create a comparative visualization
plt.figure(figsize=(10, 4))
models = ['Multioutput', 'Decision Tree', 'Random Forest']
mse_scores = [multioutput_mse, tree_mse, forest_mse]
mae_scores = [multioutput_mae, tree_mae, forest_mae]
 
# Plot Mean Squared Error (MSE)
plt.subplot(1, 2, 1)
plt.bar(models, mse_scores, color=['blue', 'green', 'purple'])
plt.xlabel('Models')
plt.ylabel('Mean Squared Error (MSE)')
plt.title('Comparative MSE Scores')
 
# Plot Mean Absolute Error (MAE)
plt.subplot(1, 2, 2)
plt.bar(models, mae_scores, color=['blue', 'green', 'purple'])
plt.xlabel('Models')
plt.ylabel('Mean Absolute Error (MAE)')
plt.title('Comparative MAE Scores')
 
plt.tight_layout()
plt.show()

Output:

Comparative MSE and MAE scores for the three models

Using the three regression models (Multioutput ElasticNet, Decision Tree and Random Forest), this code generates a comparative visualization of their performance. Two bar charts are produced side by side: the left chart shows each model's Mean Squared Error (MSE) and the right chart its Mean Absolute Error (MAE). The models are labeled on the x-axis, with differently colored bars representing the respective scores. Since lower MSE and MAE values indicate greater accuracy, this representation makes it possible to compare the models' predictive performance directly.

Conclusion

We can conclude that solving a multioutput regression task is computationally costly, but at the same time very important for real-world problem solving. scikit-learn's Multioutput Regressor performs well here even in comparison with tree-based models, making it a convenient tool for cutting through the complexity of multioutput regression tasks. However, to achieve better model performance we would need further hyper-parameter tuning or deeper data pre-processing; a minimal tuning sketch follows.
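
As a starting point for that tuning, here is a minimal sketch continuing the session above (the grid values are illustrative, not tuned recommendations). Parameters of the wrapped base estimator are addressed through the estimator__ prefix that MultiOutputRegressor exposes:

from sklearn.model_selection import GridSearchCV

# 'estimator__' routes these values to the wrapped ElasticNet
param_grid = {
    'estimator__alpha': [0.1, 0.5, 1.0],     # regularization strength
    'estimator__l1_ratio': [0.2, 0.5, 0.8],  # balance between L1 and L2 penalties
}
search = GridSearchCV(
    MultiOutputRegressor(ElasticNet()),
    param_grid,
    scoring='neg_mean_squared_error',  # averaged across the three targets
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)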

