
How to Mitigate Overfitting by Creating Ensembles

Overfitting is a common problem in machine learning: a model fits the training data so closely that it performs poorly on new, unseen data. Ensembles are a useful way to reduce overfitting. By combining predictions from several models, they improve robustness and generalization. This tutorial shows how to set up ensembles in Scikit-Learn to deal with overfitting.

What is Overfitting?

An overfitted model has learned the training data too closely, capturing noise and incidental patterns that do not carry over to new, unseen data. Because such a model cannot generalize beyond the training set, it typically shows high accuracy on the training data but noticeably worse performance on new datasets.
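The gap between training and test performance can be made visible with a small experiment. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` with deliberately noisy labels (the dataset and its parameters are illustrative choices, not part of the original tutorial). An unrestricted decision tree memorizes the training set, noise included, and then scores lower on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data: flip_y randomly reassigns a fraction of the labels (noise).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# With no depth limit, the tree keeps splitting until it fits every training sample.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large difference between the two numbers is the usual symptom of overfitting.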



Why Should We Mitigate Overfitting?

Overfitting is a key issue in machine learning because it undermines the model’s capacity to generalize to new data. Mitigating it is therefore crucial for any model that will be applied beyond its training set.

What are Ensembles?

Ensemble learning is a machine learning technique in which the predictions of several predictors, such as classifiers or regressors, are aggregated to produce results that are often better than those of any individual predictor. The group of predictors is called an ensemble, and the procedure used to build and combine them is called an ensemble method. The fundamental idea is to bring together a number of weak learners so that, combined, they form a strong learner.
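The weak-to-strong idea can be illustrated with a short calculation. This is a sketch under idealized assumptions: a binary task, an odd number of models, each model correct with the same probability `p`, and errors that are fully independent (real models are usually correlated, so actual gains are smaller). The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that more than half of n independent models are correct.

    Idealized assumption: every model is correct with probability p,
    independently of the others (binomial distribution).
    """
    threshold = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(threshold, n_models + 1)
    )

# Each model alone is only 60% accurate; the majority vote improves with more models.
for n in (1, 5, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```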



Types of Ensembles

Step-by-Step Guide to Applying Different Ensemble Methods

Importing necessary libraries




from keras.layers import Dense, Dropout
from keras.models import Sequential
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

Loading and Splitting the dataset




# Load dataset
data = load_iris()
X, y = data.data, data.target
 
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
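The split above is purely random. As an optional variant (an assumption of this sketch, not what the tutorial uses), passing `stratify=y` keeps the class proportions equal in the training and test sets, which can make accuracy on a small test set less dependent on the luck of the draw. Scores obtained with this split may differ from the ones reported later in this tutorial.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y preserves the class proportions of y in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train class counts:", np.bincount(y_train))
print("test class counts: ", np.bincount(y_test))
```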

Implementing Various Ensemble Methods

  1. Bagging with Random Forests: Wraps RandomForestClassifier in a BaggingClassifier, so each of the 10 bagged estimators is itself a random forest of decision trees trained on a bootstrap sample of the training data.
  2. Boosting algorithms: Trains AdaBoostClassifier with DecisionTreeClassifier as the base estimator, GradientBoostingClassifier, and XGBClassifier (XGBoost).
  3. Stacking: Uses StackingClassifier to combine predictions from RandomForestClassifier, SVC, and LogisticRegression, with LogisticRegression as the final estimator.
  4. Dropout: Trains a Keras Sequential neural network with a dropout layer to reduce overfitting. Dropout is a regularization technique rather than an explicit ensemble, but randomly dropping units during training is commonly viewed as implicitly training many sub-networks.
  5. Voting: Combines predictions from RandomForestClassifier, SVC, and LogisticRegression using hard voting.
  6. Ensemble of Diverse Models: Trains an SVC and a DecisionTreeClassifier as two structurally different models. The code below fits them separately and does not combine their predictions.




# Bagging with Random Forests
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; the old name was removed in 1.4
bagging_model = BaggingClassifier(estimator=RandomForestClassifier(), n_estimators=10)
bagging_model.fit(X_train, y_train)
bagging_predictions = bagging_model.predict(X_test)
 
# Boosting algorithms
# The same `estimator` rename applies to AdaBoostClassifier
adaboost_model = AdaBoostClassifier(estimator=DecisionTreeClassifier(), n_estimators=50)
adaboost_model.fit(X_train, y_train)
adaboost_predictions = adaboost_model.predict(X_test)
 
gradient_boost_model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42)
gradient_boost_model.fit(X_train, y_train)
gradient_boost_predictions = gradient_boost_model.predict(X_test)
 
xgboost_model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
xgboost_model.fit(X_train, y_train)
xgboost_predictions = xgboost_model.predict(X_test)
 
# Stacking
base_models = [('rf', RandomForestClassifier()), ('svc', SVC()), ('lr', LogisticRegression())]
stacking_model = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression())
stacking_model.fit(X_train, y_train)
stacking_predictions = stacking_model.predict(X_test)
 
# Dropout
dropout_model = Sequential([
    Dense(128, input_dim=4, activation='relu'),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(3, activation='softmax')
])
dropout_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
dropout_model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)
_, dropout_accuracy = dropout_model.evaluate(X_test, y_test)
 
# Voting
voting_model = VotingClassifier(estimators=base_models, voting='hard')
voting_model.fit(X_train, y_train)
voting_predictions = voting_model.predict(X_test)
 
# Ensemble of Diverse Models
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)
 
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
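The SVC and the decision tree above are fitted separately, so their predictions are never actually combined. One possible way to turn them into an ensemble is sketched below, assuming soft voting through `VotingClassifier` (with only two voters, hard voting would often tie). The names `diverse_model`, `diverse_predictions`, and `diverse_accuracy`, as well as `max_depth=3`, are choices made for this sketch and not part of the original tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Soft voting averages predicted class probabilities,
# so SVC must be created with probability=True.
diverse_model = VotingClassifier(
    estimators=[
        ("svc", SVC(probability=True, random_state=42)),
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=42)),
    ],
    voting="soft",
)
diverse_model.fit(X_train, y_train)
diverse_predictions = diverse_model.predict(X_test)

diverse_accuracy = accuracy_score(y_test, diverse_predictions)
print("Diverse Ensemble Accuracy:", diverse_accuracy)
```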

Comparing Accuracy




# Evaluate models
print("Bagging Accuracy:", accuracy_score(y_test, bagging_predictions))
print("AdaBoost Accuracy:", accuracy_score(y_test, adaboost_predictions))
print("Gradient Boosting Accuracy:", accuracy_score(y_test, gradient_boost_predictions))
print("XGBoost Accuracy:", accuracy_score(y_test, xgboost_predictions))
print("Stacking Accuracy:", accuracy_score(y_test, stacking_predictions))
print("Dropout Accuracy:", dropout_accuracy)
print("Voting Accuracy:", accuracy_score(y_test, voting_predictions))

Output:

Bagging Accuracy: 1.0
AdaBoost Accuracy: 1.0
Gradient Boosting Accuracy: 0.9666666666666667
XGBoost Accuracy: 1.0
Stacking Accuracy: 1.0
Dropout Accuracy: 0.9666666388511658
Voting Accuracy: 1.0
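The scores above are all measured on the test set. Since overfitting shows up as a gap between training and test accuracy, it can help to report both. The following is a minimal sketch of such a check for a few of the models; the helper `generalization_gap` is a hypothetical name introduced here, and the fixed `random_state` values are assumptions made for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def generalization_gap(model):
    """Fit the model and return (train accuracy, test accuracy, difference)."""
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    return train_acc, test_acc, train_acc - test_acc

models = {
    "single_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

gaps = {name: generalization_gap(model) for name, model in models.items()}
for name, (train_acc, test_acc, gap) in gaps.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

A gap close to zero suggests the model generalizes about as well as it fits; a large positive gap points to overfitting.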

When to Use Which Ensemble Method?

The best ensemble approach depends on the nature of the problem, the properties of the data, and the computing resources available. Finding the right ensemble strategy for a given task requires experimentation and cross-validation.
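A cross-validated comparison of candidate ensembles might look like the sketch below. It assumes 5-fold stratified cross-validation on Iris and a hand-picked set of candidates; the dictionary names `candidates` and `cv_scores` and all hyperparameters are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Shuffled, stratified folds so every fold contains all three classes.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
    "hard_voting": VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(random_state=42)),
            ("svc", SVC()),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="hard",
    ),
}

# cross_val_score returns one accuracy value per fold.
cv_scores = {
    name: cross_val_score(model, X, y, cv=cv)
    for name, model in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Averaging over several folds gives a steadier estimate than a single train/test split, which matters on a dataset as small as Iris.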

Ensemble method: When to use?

  1. Bagging: Works well when the base model is complex and prone to overfitting (Random Forests, for example, bag deep decision trees). It performs well when the base model has high variance.
  2. Boosting: Useful when the base model is weak and there is room for improvement. Boosting is helpful in reducing bias and can handle high-dimensional data effectively.
  3. Stacking: Works well when different models capture different aspects of the data and there is enough data to train several models plus the final estimator.
  4. Dropout: An effective way to reduce overfitting in neural networks. It is widely used in deep learning.
  5. Voting: A quick and easy way to combine different models. Hard voting is appropriate when a simple majority of the predicted labels can be trusted.
  6. Ensemble of Diverse Models: Suggested for mixing models with different strengths and weaknesses. It is helpful when working with complex and heterogeneous datasets.
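To make the bagging entry above more concrete, here is a hand-rolled sketch of what bagging does internally: draw bootstrap samples, fit one high-variance model per sample, and take a majority vote. It is an illustration under stated assumptions (25 trees, fixed seeds, NumPy for sampling), not a replacement for `BaggingClassifier`; the names `trees`, `votes`, and `bagged_predictions` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(0)
n_train = len(X_train)

trees = []
for _ in range(25):
    # Bootstrap sample: draw n_train row indices with replacement.
    idx = rng.integers(0, n_train, size=n_train)
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# votes has shape (n_trees, n_test): one row of predicted labels per tree.
votes = np.stack([tree.predict(X_test) for tree in trees])

# Majority vote per test sample (Iris has three classes: 0, 1, 2).
bagged_predictions = np.array(
    [np.bincount(column, minlength=3).argmax() for column in votes.T]
)

bagged_accuracy = accuracy_score(y_test, bagged_predictions)
print("Hand-rolled bagging accuracy:", bagged_accuracy)
```

Because each tree sees a different resample of the data, their individual errors differ, and the vote tends to average them out.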

Conclusion

Overfitting can be reduced by combining machine learning models into ensembles, for example through bagging (as in random forests), boosting, stacking, or voting. By drawing on the strengths of each individual model, the ensemble approach improves robustness and generalization. This tutorial walked through a Scikit-Learn implementation that trains several ensembles and evaluates them on a held-out test set. On the Iris dataset, the ensembles reach high test accuracy, although a test set of only 30 samples gives a coarse estimate of how well they generalize.

