
Develop Multioutput Regression Model using Python

Multioutput regression is a specialized form of supervised machine learning that predicts multiple target variables simultaneously. While traditional regression predicts a single numerical value (the target variable) from a set of input features, multioutput regression extends this idea to predict several numerical values at once, a valuable technique in real-world applications with multiple dependent variables or complex relationships to model. In this article, we implement scikit-learn's multioutput regressor and compare its performance with traditional tree-based and ensemble models.

What is Multioutput Regression?

Multioutput regression, or multi-target regression, tackles problems where we need to predict more than one continuous target variable; it is an extension of traditional regression, which focuses on predicting a single target. In multioutput regression, each target variable is treated as a separate regression problem, and the goal is to build a model that generates accurate predictions for all of them simultaneously.
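
A quick way to see what a multioutput target looks like is to generate a synthetic dataset. This is a minimal sketch, assuming an illustrative problem size (not the article's dataset), using scikit-learn's make_regression with n_targets=3:

from sklearn.datasets import make_regression

# A synthetic problem with 3 continuous targets per sample (sizes are illustrative)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, n_targets=3, random_state=42)
print(X_demo.shape)  # (100, 5): 100 samples, 5 input features
print(y_demo.shape)  # (100, 3): 3 target values per sample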



How to solve Multioutput Regression

Every problem has its solution, and multioutput regression has several, listed below (a short sketch contrasting a natively multioutput model with a wrapped single-output one follows the list):

  1. Multioutput Regressor: scikit-learn's MultiOutputRegressor is a meta-estimator designed specifically for multioutput regression tasks. We discuss its key features and implementation later.
  2. Multioutput Linear Regression: This extends simple linear regression to handle multiple target variables, predicting each target as a linear combination of the input features. However, such a model cannot capture complex, non-linear relationships among features.
  3. Decision Trees and Random Forests: Tree-based models can be used for multioutput regression because they capture non-linear relationships between features and targets. We will compare these two with the multioutput regressor.
  4. Neural Networks: Deep learning models can be adapted to multioutput regression by using one output neuron per target variable, at the cost of greater complexity, memory consumption and training expense.
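
To make the distinction concrete, here is a minimal sketch (the model choices and data are illustrative): LinearRegression accepts a 2-D target array natively, while a strictly single-output estimator such as SVR must be wrapped in MultiOutputRegressor:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

X_demo, y_demo = make_regression(n_samples=100, n_features=5, n_targets=3, random_state=0)

# LinearRegression accepts a 2-D target array natively
linear = LinearRegression().fit(X_demo, y_demo)
print(linear.predict(X_demo[:2]).shape)   # (2, 3)

# SVR is strictly single-output, so we wrap it: one SVR is fitted per target
wrapped = MultiOutputRegressor(SVR()).fit(X_demo, y_demo)
print(wrapped.predict(X_demo[:2]).shape)  # (2, 3)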

Parameters of Multioutput Regression

An extension of conventional machine learning models to accommodate multioutput regression tasks, the MultiOutputRegressor wrapper class in scikit-learn comes in very handy when you wish to predict several continuous target variables at once. The MultiOutputRegressor class takes the following parameters:

  1. estimator: A base regressor object implementing fit and predict. One clone of this estimator is fitted per target variable.
  2. n_jobs: The number of jobs to run in parallel when fitting the per-target estimators. None means 1, and -1 uses all available processors.

Multioutput Regressor Model

The Multioutput Regressor in scikit-learn is a wrapper, or meta-estimator, which allows us to extend single-output regression models to multioutput regression. It is a convenient way to tackle tasks where we need to predict multiple target variables simultaneously using one or more base regression models. Some of its key features are discussed below, followed by a short construction sketch:

  1. Extension of single-output models: MultiOutputRegressor can wrap any regressor that supports single-output regression, extending it to multioutput regression tasks.
  2. Parallel prediction: MultiOutputRegressor treats each target variable as an independent single-output regression problem and fits one base regressor per target, which allows the per-target models to be fitted and used in parallel.
  3. Consistency: It ensures that the base regressors are fitted independently and that their predictions are combined into a single multioutput format.
  4. Flexibility: We can use a wide range of base regression models as the internal regressors, such as linear regression, decision trees, support vector machines and random forests.
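
A minimal construction sketch, assuming an illustrative base estimator and n_jobs value (not the configuration used later in the article):

from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

# estimator: the single-output base model, cloned once per target
# n_jobs=-1: fit the per-target clones in parallel on all available cores
model = MultiOutputRegressor(estimator=Ridge(alpha=1.0), n_jobs=-1)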

Implementation of Multioutput Regression

In this implementation, we are going to explore the use of scikit-learn for multioutput prediction. The sklearn.multioutput module applies to a variety of real-world tasks, including multi-target regression (the focus here) and multi-label classification, and it is especially helpful when each data point has several related target values.

Importing required libraries




import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

This code imports NumPy, Matplotlib, Pandas, Seaborn and the scikit-learn modules needed for the rest of the article: datasets for loading the example data, train_test_split for splitting it, MultiOutputRegressor as the wrapper model, ElasticNet, DecisionTreeRegressor and RandomForestRegressor as the regressors to compare, and mean_squared_error and mean_absolute_error as the evaluation metrics.

Dataset loading and Splitting




# Load the Linnerud dataset
linnerud = datasets.load_linnerud()
X, y = linnerud.data, linnerud.target
 
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

This code loads the Linnerud dataset using scikit-learn's datasets module and separates the input data (X, three exercise measurements) from the target variables (y, three physiological measurements). It then splits the dataset into training and testing sets in an 80-20 ratio, with random_state=42 to ensure reproducibility. This makes it possible to train and assess the machine learning models.
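
Before modeling, it is worth confirming the multioutput shape of the data. A minimal check, run right after the loading code above (the printed values reflect the standard Linnerud dataset):

print(linnerud.feature_names)  # ['Chins', 'Situps', 'Jumps']
print(linnerud.target_names)   # ['Weight', 'Waist', 'Pulse']
print(X.shape, y.shape)        # (20, 3) (20, 3): 20 samples, 3 features, 3 targets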

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an essential phase of data analysis that builds a thorough understanding of a dataset's properties. It entails highlighting key characteristics, finding patterns and spotting trends, and it aids in analyzing data distributions, detecting missing values and identifying outliers. Through charts such as scatter plots and histograms, EDA guides feature selection and offers insights into relationships within the data. Altogether, EDA improves data-driven decision making by informing the later phases of data preparation and model construction.

Correlation Matrix

Visualizing the correlation matrix will help us understand the relationships between the different features in the dataset.




# Create a DataFrame for the Linnerud dataset
df = pd.DataFrame(data=X, columns=linnerud.feature_names)
# Calculate the correlation matrix
correlation_matrix = df.corr()
 
# Plot a heatmap of the correlation matrix
plt.figure(figsize=(4, 3))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

Output:

Correlation matrix for Linnerud Dataset

These lines of code build a DataFrame from the Linnerud features and then compute their correlation matrix, which is plotted as a heatmap using the sns.heatmap function. The heatmap shows the pairwise relationships visually, with the value annotated in each cell and the color denoting the correlation's strength and direction. This aids in determining how the dataset's variables are connected.

Distribution of the Target Variables

This dataset has three target variables: Weight, Waist and Pulse. Visualizing their distributions will help us understand their behavior.




# Build a DataFrame from the targets (y, not X) so the columns match target_names
df = pd.DataFrame(data=y, columns=linnerud.target_names)
# Plot the distribution of the target variables
plt.figure(figsize=(10, 4))
plt.subplot(1, 3, 1)
sns.histplot(df['Weight'], kde=True, color='green')
plt.title('Distribution of Weight')
 
plt.subplot(1, 3, 2)
sns.histplot(df['Waist'], kde=True, color='green')
plt.title('Distribution of Waist')
 
plt.subplot(1, 3, 3)
sns.histplot(df['Pulse'], kde=True, color='green')
plt.title('Distribution of Pulse')
 
plt.tight_layout()
plt.show()

Output:

Distribution of the target variable

This code creates a DataFrame named df from the three Linnerud target variables, Weight, Waist and Pulse, and then builds a figure with three side-by-side subplots, one histogram per variable. The kde=True option overlays a kernel density estimate on each histogram to improve the visualization, all drawn in green and titled after the variable shown ("Distribution of Weight", "Distribution of Waist" and "Distribution of Pulse"). The plt.tight_layout() call keeps the subplots from overlapping, and plt.show() renders the full figure, giving further insight into how the target variables are distributed.

Model training

Now we will train scikit-learn's Multioutput Regressor model. As discussed earlier, we will also train a traditional tree-based model (Decision Tree) and an ensemble model (Random Forest) for the performance comparison later in this article.




# Create and train the multioutput regression model (ElasticNet)
multioutput_model = MultiOutputRegressor(
    ElasticNet(alpha=0.5, l1_ratio=0.5), n_jobs=5)
multioutput_model.fit(X_train, y_train)
 
# Create and train the decision tree regressor model
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)
 
# Create and train the random forest regressor model
forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
forest_model.fit(X_train, y_train)

In this code, we create and train three different models for the Linnerud dataset. First, multioutput_model wraps ElasticNet regression, configured with alpha=0.5 and l1_ratio=0.5, in a MultiOutputRegressor so that one ElasticNet is fitted per target variable; setting n_jobs=5 allows up to five of these per-target fits to run in parallel (with only three targets, at most three jobs are actually used). Second, tree_model is a DecisionTreeRegressor, a single tree-based model that learns the relationships between input features and target variables and supports multioutput targets natively. Finally, forest_model is a RandomForestRegressor of 100 decision trees, an ensemble approach that likewise handles multiple targets natively. All three models are trained on the training data and are ready to make predictions on the testing set, enabling a comparison of their performance in predicting the target variables.

The MultiOutputRegressor is a wrapper class offered by scikit-learn that extends standard single-output machine learning models to multioutput problems. Paired here with ElasticNet, it clones the base estimator once per target variable, so problems that require simultaneous prediction of several continuous variables can be solved with an otherwise single-output model.
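
One convenient consequence of this design is that the fitted per-target models remain accessible through the wrapper's estimators_ attribute. A minimal inspection sketch, run after the training code above:

# One fitted ElasticNet per target: Weight, Waist, Pulse
for name, est in zip(linnerud.target_names, multioutput_model.estimators_):
    print(name, est.coef_)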

Model evaluation

Now we will evaluate all the regression models in terms of MSE and MAE, two commonly used regression performance metrics.




# Make predictions
multioutput_pred = multioutput_model.predict(X_test)
tree_pred = tree_model.predict(X_test)
forest_pred = forest_model.predict(X_test)
 
# Calculate performance metrics for multioutput model
multioutput_mse = mean_squared_error(y_test, multioutput_pred)
multioutput_mae = mean_absolute_error(y_test, multioutput_pred)
 
 
# Calculate performance metrics for decision tree model
tree_mse = mean_squared_error(y_test, tree_pred)
tree_mae = mean_absolute_error(y_test, tree_pred)
 
# Calculate performance metrics for random forest model
forest_mse = mean_squared_error(y_test, forest_pred)
forest_mae = mean_absolute_error(y_test, forest_pred)
 
# Print the performance metrics
print("Multioutput Model - Mean Squared Error:", multioutput_mse)
print("Multioutput Model - Mean Absolute Error:", multioutput_mae)
print("Decision Tree Model - Mean Squared Error:", tree_mse)
print("Decision Tree Model - Mean Absolute Error:", tree_mae)
print("Random Forest Model - Mean Squared Error:", forest_mse)
print("Random Forest Model - Mean Absolute Error:", forest_mae)

Output:

Multioutput Model - Mean Squared Error: 236.22543973611653
Multioutput Model - Mean Absolute Error: 10.015359327324276
Decision Tree Model - Mean Squared Error: 371.9166666666667
Decision Tree Model - Mean Absolute Error: 12.083333333333334
Random Forest Model - Mean Squared Error: 242.75831666666667
Random Forest Model - Mean Absolute Error: 10.656666666666666

This code uses the test data to assess the performance of the three regression models: the multioutput ElasticNet, the Decision Tree and the Random Forest. Each model generates predictions, which are scored with Mean Squared Error (MSE) and Mean Absolute Error (MAE), two popular regression metrics; lower values indicate a better fit and more accurate predictions. Because the code computes and prints both metrics for each model, their effectiveness on this multioutput regression problem can be compared directly, and the best model for the job chosen accordingly. Here the wrapped ElasticNet achieves the lowest MSE and MAE, narrowly ahead of the Random Forest, with the single Decision Tree trailing both.
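
Both metrics above average the error across the three targets (scikit-learn's default, multioutput='uniform_average'). Passing multioutput='raw_values' yields one score per target instead, as in this minimal sketch, run after the evaluation code above:

from sklearn.metrics import mean_squared_error

# One MSE per target column: Weight, Waist, Pulse
per_target_mse = mean_squared_error(y_test, multioutput_pred, multioutput='raw_values')
for name, mse in zip(linnerud.target_names, per_target_mse):
    print(f"{name}: MSE = {mse:.2f}")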

Performance comparison visualization

Now we will visualize how the models performed and which one outperforms the others.




# Create a comparative visualization
plt.figure(figsize=(10, 4))
models = ['Multioutput', 'Decision Tree', 'Random Forest']
mse_scores = [multioutput_mse, tree_mse, forest_mse]
mae_scores = [multioutput_mae, tree_mae, forest_mae]
 
# Plot Mean Squared Error (MSE)
plt.subplot(1, 2, 1)
plt.bar(models, mse_scores, color=['blue', 'green', 'purple'])
plt.xlabel('Models')
plt.ylabel('Mean Squared Error (MSE)')
plt.title('Comparative MSE Scores')
 
# Plot Mean Absolute Error (MAE)
plt.subplot(1, 2, 2)
plt.bar(models, mae_scores, color=['blue', 'green', 'purple'])
plt.xlabel('Models')
plt.ylabel('Mean Absolute Error (MAE)')
plt.title('Comparative MAE Scores')
 
plt.tight_layout()
plt.show()

Output:

Comparative MSE and MAE scores for the three models

Using the three regression models (Multioutput ElasticNet, Decision Tree and Random Forest), this code generates a comparative visualization of their performance. Two bar charts are produced side by side: the left chart shows each model's Mean Squared Error (MSE) and the right chart its Mean Absolute Error (MAE). The models are labeled on the x-axis, with differently colored bars representing the respective scores. Since lower MSE and MAE values indicate greater accuracy, this representation makes it possible to compare the models' predictive performance directly.

Conclusion

We can conclude that solving a multioutput regression task is computationally costly, but at the same time very important for real-world problem solving. scikit-learn's Multioutput Regressor performs well here even in comparison with tree-based models, making it a convenient tool for cutting through the complexity of multioutput regression tasks. However, to achieve better model performance we would need further hyper-parameter tuning or deeper data pre-processing; a minimal tuning sketch follows.
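
As a starting point for that tuning, here is a minimal sketch continuing the session above (the grid values are illustrative, not tuned recommendations). Parameters of the wrapped base estimator are addressed through the estimator__ prefix that MultiOutputRegressor exposes:

from sklearn.model_selection import GridSearchCV

# 'estimator__' routes these values to the wrapped ElasticNet
param_grid = {
    'estimator__alpha': [0.1, 0.5, 1.0],     # regularization strength
    'estimator__l1_ratio': [0.2, 0.5, 0.8],  # balance between L1 and L2 penalties
}
search = GridSearchCV(
    MultiOutputRegressor(ElasticNet()),
    param_grid,
    scoring='neg_mean_squared_error',  # averaged across the three targets
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)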

