
HuberRegressor vs Ridge on Dataset with Strong Outliers in Scikit Learn

Last Updated : 21 Mar, 2023

Regression is a commonly used machine learning technique for predicting continuous outputs. In some datasets, outliers can have a significant impact on the results. Two common algorithms for handling such datasets are the Huber Regressor and Ridge Regression. This article explores how the two algorithms differ in the presence of outliers and when to use one over the other, with examples.

Huber Regressor

Huber Regressor is a robust regression algorithm that is less sensitive to outliers than ordinary linear regression. It uses the Huber loss function, which combines the properties of Mean Squared Error (MSE) and Mean Absolute Error (MAE): for small residuals it behaves like MSE, while for large residuals it behaves like MAE. Because large residuals are penalized only linearly, the Huber loss gives more reliable fits than ordinary least squares when outliers are present.
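
To make this concrete, here is a minimal NumPy sketch of the Huber loss itself (the huber_loss function below is illustrative only and is not part of scikit-learn's API; the 1.35 threshold mirrors HuberRegressor's default epsilon):

import numpy as np

def huber_loss(residuals, epsilon=1.35):
    # Quadratic for |r| <= epsilon, linear beyond it, so a single
    # outlier contributes far less than it would under squared error
    quadratic = 0.5 * residuals ** 2
    linear = epsilon * (np.abs(residuals) - 0.5 * epsilon)
    return np.where(np.abs(residuals) <= epsilon, quadratic, linear)

print(huber_loss(np.array([0.5, 10.0])))  # [0.125, 12.58875]; squared error (0.5*r**2) would give [0.125, 50.0]

Note how the loss on the residual of 10 grows only linearly (about 12.6) rather than quadratically (50), which is exactly what limits the influence of outliers.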

The algorithm is appropriate when the data contains outliers that would bias an ordinary linear regression fit, in other words, when a few extreme observations would otherwise dominate the result.

Ridge Regression

Ridge Regression is a linear regression model that uses L2 regularization to prevent overfitting. A penalty term proportional to the squared magnitude of the coefficients is added to the least-squares cost function, shrinking the coefficients toward zero. This yields a simpler model with lower variance at the cost of some additional bias.
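
Concretely, ridge minimizes ||y - Xw||^2 + alpha * ||w||^2, which has the closed-form solution w = (X^T X + alpha*I)^(-1) X^T y. A minimal sketch of that solution (illustrative only; it omits the intercept, which scikit-learn's Ridge fits without penalizing):

import numpy as np

def ridge_fit(X, y, alpha=1.0):
    # Closed-form ridge solution: w = (X^T X + alpha * I)^(-1) X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

The larger alpha is, the more strongly the coefficients are shrunk toward zero.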

Ridge Regression is useful when the dataset has a large number of features and is prone to overfitting. It is also appropriate when outliers have only a minor influence on the result.

When to use which algorithm

The choice of algorithm depends on the type of outliers in the dataset and the impact they have on the results. If the outliers pull the fit strongly away from the bulk of the data, the Huber Regressor is preferred. If the outliers have only a minor impact and the dataset has a large number of features, Ridge Regression is preferred.

Examples

Example 1: Outliers have a larger impact on the result

Consider a dataset of house prices in which a few properties are significantly more expensive than the rest. If ordinary linear regression is used, the fit would be pulled toward the expensive properties, reducing the overall accuracy of the model. In such a scenario, the Huber Regressor can be used, since it is less sensitive to outliers.

Example 2: Dataset with a high number of features

Consider a dataset where the prices of multiple products are recorded, with a large number of features such as product type, brand, and manufacturing location. If ordinary linear regression is used, the model may overfit the training data and generalize poorly to unseen data. In such a scenario, Ridge Regression can be used to reduce the model's variance.

Example 3: Outliers have a smaller impact on the result

Consider a dataset where the sales of multiple products are recorded, and a few products have significantly higher sales than the rest. If these outliers have only a small impact on the fit, Ridge Regression is a reasonable choice, since the main concern is reducing the model's variance rather than robustness to outliers.

Functions used:

The scikit-learn library provides the ‘HuberRegressor’ and ‘Ridge’ classes for implementing the Huber Regressor and Ridge Regression, respectively. The ‘fit’ method fits the model to the training data. In the snippets below, ‘regressor’ is an instance of either ‘HuberRegressor’ or ‘Ridge’, ‘X_train’ is the training data, and ‘y_train’ is the target variable.
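
For example, the regressor can be created as follows (the alpha value shown is simply scikit-learn's default):

from sklearn.linear_model import HuberRegressor, Ridge

regressor = HuberRegressor()  # or: regressor = Ridge(alpha=1.0)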

regressor.fit(X_train, y_train)

The ‘predict’ method is used to make predictions on new data, where ‘X_test’ is the test data and ‘y_pred’ is the predicted target variable.

y_pred = regressor.predict(X_test)

The scikit-learn library also provides the ‘mean_squared_error’ and ‘mean_absolute_error’ functions for calculating the mean squared error and mean absolute error, where ‘y_test’ is the actual target variable and ‘y_pred’ is the predicted one.

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

Example: visualizing the difference between the HuberRegressor and Ridge regression models in scikit-learn on a dataset with strong outliers:

Python3

# import modules
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor, Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error
  
# Generate a sample dataset
X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=0)
# Inject strong outliers by pushing every target above 200 up to 250
y[y > 200] = 250
  
# Split the data into training and testing sets
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]
  
# Fit a HuberRegressor to the training data
# (epsilon=1.35 is the default; values closer to 1 are more robust to outliers)
huber_regressor = HuberRegressor(epsilon=1.35)
huber_regressor.fit(X_train, y_train)
  
# Fit a Ridge Regression to the training data
# (alpha controls the strength of the L2 penalty)
ridge_regressor = Ridge(alpha=0.1)
ridge_regressor.fit(X_train, y_train)
  
# Make predictions on the test data
y_pred_huber = huber_regressor.predict(X_test)
y_pred_ridge = ridge_regressor.predict(X_test)
  
# Calculate the mean squared error and mean absolute error
mse_huber = mean_squared_error(y_test, y_pred_huber)
mae_huber = mean_absolute_error(y_test, y_pred_huber)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
  
print("Huber Regressor MSE:", mse_huber)
print("Huber Regressor MAE:", mae_huber)
print("Ridge Regression MSE:", mse_ridge)
print("Ridge Regression MAE:", mae_ridge)
  
# Plot the actual values and predictions
# (sort by X so the prediction lines are drawn cleanly)
order = X_test[:, 0].argsort()
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test[order], y_pred_huber[order], color='blue', label='Huber Regressor')
plt.plot(X_test[order], y_pred_ridge[order], color='green', label='Ridge Regression')
plt.legend()
plt.show()


Output: the printed MSE and MAE for each model, followed by a scatter plot of the test points with the two prediction lines.

This code generates a sample dataset with strong outliers, splits it into training and testing sets, fits a HuberRegressor and a Ridge model to the training data, and makes predictions on the test data. It then plots the actual values and the predictions, with the Huber Regressor predictions in blue and the Ridge Regression predictions in green. Running the code shows the difference between the two algorithms, and you can change each algorithm's parameters to see how they affect the results.
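
For instance, a quick way to see the effect of the Huber Regressor's epsilon parameter is to refit the model over a few values and compare the errors (a minimal sketch reusing the variables from the code above; the particular epsilon values are arbitrary):

for eps in [1.1, 1.35, 2.0, 5.0]:
    model = HuberRegressor(epsilon=eps).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"epsilon={eps}: MSE={mean_squared_error(y_test, y_pred):.2f}")

Larger epsilon values treat more residuals quadratically, so the fit behaves more and more like ordinary least squares.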

 

Example: using the Huber Regressor and Ridge Regression on the California Housing dataset (the Boston Housing dataset was removed from scikit-learn in version 1.2):

Python3

# import modules
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import HuberRegressor, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
  
# Load the California Housing dataset
# (load_boston was removed in scikit-learn 1.2)
housing = fetch_california_housing()
X, y = housing.data, housing.target
  
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  
# Fit a HuberRegressor to the training data
huber_regressor = HuberRegressor(epsilon=1.35)
huber_regressor.fit(X_train, y_train)
  
# Fit a Ridge Regression to the training data
ridge_regressor = Ridge(alpha=0.1)
ridge_regressor.fit(X_train, y_train)
  
# Make predictions on the test data
y_pred_huber = huber_regressor.predict(X_test)
y_pred_ridge = ridge_regressor.predict(X_test)
  
# Plot the actual values and predictions
plt.scatter(y_test, y_pred_huber, color='blue', label='Huber Regressor')
plt.scatter(y_test, y_pred_ridge, color='green', label='Ridge Regression')
plt.plot([0, 5], [0, 5], 'k--')  # diagonal reference line: perfect predictions
plt.xlabel('Actual Value')
plt.ylabel('Predicted Value')
plt.legend()
plt.show()
  
# Calculate the mean squared error
mse_huber = mean_squared_error(y_test, y_pred_huber)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
  
# Print the mean squared error
print("Mean Squared Error - Huber Regressor: {:.2f}".format(mse_huber))
print("Mean Squared Error - Ridge Regression: {:.2f}".format(mse_ridge))


Output: a scatter plot of predicted vs. actual values for both models, with a dashed diagonal marking perfect predictions, followed by the printed MSE for each model.

Conclusion:

Huber Regressor and Ridge Regression are two commonly used algorithms for handling datasets with outliers. The choice between them depends on the type of outliers in the dataset and the impact they have on the result. The Huber Regressor is useful when outliers have a large impact on the fit, while Ridge Regression is useful when the dataset has a large number of features or when outliers have only a minor impact.


