
# Ensemble Methods in Python

• Last Updated : 15 Sep, 2021

An ensemble is a group of elements viewed as a whole rather than individually. An ensemble method trains multiple models and combines their predictions to solve a single problem, which helps improve the robustness and generalizability of the final model. In this article, we discuss some ensemble methods along with their implementation in Python. For this, we choose a dataset from the UCI repository.

### Basic ensemble methods

1. Averaging method: It is mainly used for regression problems. The method consists of building multiple models independently and returning the average of their predictions. In general, the combined output is better than an individual output because averaging reduces variance.

In the example below, three regression models (linear regression, xgboost, and random forest) are trained and their predictions are averaged. The final prediction output is pred_final.

## Python3

```python
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression

# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")

# getting target data from the dataframe
target = df["target"]

# getting train data from the dataframe (drop the target column)
train = df.drop(columns="target")

# splitting the data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)

# initializing all the model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()

# training all the models on the training dataset
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)

# predicting the output on the validation dataset
pred_1 = model_1.predict(X_test)
pred_2 = model_2.predict(X_test)
pred_3 = model_3.predict(X_test)

# final prediction after averaging the predictions of all 3 models
pred_final = (pred_1 + pred_2 + pred_3) / 3.0

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
```

Output:

`4560`
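The variance-reduction claim behind averaging can be checked with a small simulation. The sketch below is only an illustration under a strong assumption: each "model" is the true value plus independent Gaussian noise, and the names `noisy_prediction` and `true_value` are made up for this example. Real models trained on the same data make correlated errors, so the reduction is usually smaller in practice.

```python
import random
import statistics

random.seed(0)
true_value = 10.0   # hypothetical quantity every model tries to predict
n_trials = 5000


def noisy_prediction():
    # one model's prediction: the true value plus independent unit-variance noise
    return true_value + random.gauss(0.0, 1.0)


# predictions of a single model across many trials
single = [noisy_prediction() for _ in range(n_trials)]

# average of three independent models across many trials
averaged = [
    statistics.mean([noisy_prediction() for _ in range(3)])
    for _ in range(n_trials)
]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)

# with independent errors, the averaged variance is roughly one third
print(var_single, var_averaged)
```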

2. Max voting: It is mainly used for classification problems. The method consists of building multiple models independently and collecting their individual outputs, each called a 'vote'. The class with the maximum number of votes is returned as the output.

In the example below, three classification models (logistic regression, xgboost, and random forest) are combined using the sklearn VotingClassifier. The combined model is trained, and the class with the maximum number of votes is returned as the output. The final prediction output is pred_final. Please note that this is a classification problem, not a regression problem, so the loss is not comparable with the values reported for the other ensemble methods.

## Python

```python
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# importing machine learning models for prediction
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# importing voting classifier
from sklearn.ensemble import VotingClassifier

# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")

# getting target data from the dataframe
target = df["Weekday"]

# getting train data from the dataframe (drop the target column)
train = df.drop(columns="Weekday")

# splitting the data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)

# initializing all the model objects with default parameters
model_1 = LogisticRegression()
model_2 = XGBClassifier()
model_3 = RandomForestClassifier()

# making the final model using voting classifier
final_model = VotingClassifier(
    estimators=[('lr', model_1), ('xgb', model_2), ('rf', model_3)],
    voting='hard')

# training all the models on the train dataset
final_model.fit(X_train, y_train)

# predicting the output on the test dataset
pred_final = final_model.predict(X_test)

# printing log loss between actual and predicted value
# note: log_loss is designed for predicted probabilities; hard voting
# returns class labels only, so treat this value with caution
print(log_loss(y_test, pred_final))
```

Output:

`231`
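To make the voting rule itself concrete, here is a minimal sketch of hard (max) voting in plain Python. The labels and the helper name `max_vote` are invented for illustration; this is not how VotingClassifier is implemented internally, only the same idea on a toy input.

```python
from collections import Counter

# hypothetical class predictions of three models for four samples
votes_per_model = [
    ["cat", "dog", "dog", "cat"],  # model 1
    ["cat", "cat", "dog", "dog"],  # model 2
    ["dog", "cat", "dog", "cat"],  # model 3
]


def max_vote(votes_per_model):
    """Return, for every sample, the label predicted by the most models."""
    final = []
    # zip(*...) regroups the predictions sample by sample
    for sample_votes in zip(*votes_per_model):
        label, _count = Counter(sample_votes).most_common(1)[0]
        final.append(label)
    return final


pred_final = max_vote(votes_per_model)
print(pred_final)
```

With an odd number of models and two classes there is always a strict majority; with more classes or an even number of models, a tie-breaking rule has to be chosen.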

Let’s have a look at some more advanced ensemble methods.

Ensemble methods are extensively used in classical machine learning. Examples of algorithms that use bagging are random forest and the bagging meta-estimator, and examples of algorithms that use boosting are GBM, XGBoost, AdaBoost, etc.

Developers of machine learning models are strongly encouraged to consider ensemble methods, which are used extensively in competitions and research papers.

1. Stacking: It is an ensemble method that combines multiple models (classification or regression) via a meta-model (meta-classifier or meta-regressor). The base models are trained on the complete dataset, and the meta-model is then trained on the predictions returned by the base models, which it uses as features. The base models in stacking are typically of different types. The meta-model learns how to combine the base-model predictions to achieve the best accuracy.

Algorithm:

1. Split the train dataset into n parts.
2. A base model (say linear regression) is fitted on n-1 parts and predictions are made for the nth part. This is done for each of the n parts of the train set.
3. The base model is then fitted on the whole train dataset.
4. This model is used to predict the test dataset.
5. Steps 2 to 4 are repeated for another base model, which results in another set of predictions for the train and test datasets.
6. The predictions on the train dataset are used as features to build the new model.
7. This final model is used to make the predictions on the test dataset.
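The steps above can be sketched by hand with scikit-learn's KFold. This is a minimal illustration on synthetic data, not the implementation used by any particular stacking library; the helper name `stack_features` and the choice of base models are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic data standing in for a real train/test split
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)


def stack_features(model, X_train, y_train, X_test, n_folds=4):
    """Steps 1-4 for one base model: out-of-fold train predictions
    plus test predictions from a fit on the whole train set."""
    oof = np.zeros(len(X_train))
    for fit_idx, hold_idx in KFold(n_splits=n_folds).split(X_train):
        # fit on n-1 parts, predict the held-out part
        model.fit(X_train[fit_idx], y_train[fit_idx])
        oof[hold_idx] = model.predict(X_train[hold_idx])
    # refit on the whole train set and predict the test set
    model.fit(X_train, y_train)
    return oof, model.predict(X_test)


# step 5: repeat for every base model
base_models = [LinearRegression(), DecisionTreeRegressor(random_state=0)]
columns = [stack_features(m, X_train, y_train, X_test) for m in base_models]
s_train = np.column_stack([c[0] for c in columns])
s_test = np.column_stack([c[1] for c in columns])

# steps 6-7: the second-level model learns from the stacked predictions
meta_model = LinearRegression().fit(s_train, y_train)
pred_final = meta_model.predict(s_test)
print(s_train.shape, s_test.shape)
```

Because every row of `s_train` is predicted by a model that never saw that row during fitting, the second-level model is trained on features that resemble what it will receive at test time.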

Stacking is a bit different from the basic ensembling methods because it has first-level and second-level models. The stacking features are first extracted by training all the first-level models on the dataset. A second-level model is then trained on the train stacking features, and this model predicts the final output from the test stacking features.

## Python3

```python
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression

# importing stacking lib
from vecstack import stacking

# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")

# getting target data from the dataframe
target = df["target"]

# getting train data from the dataframe (drop the target column)
train = df.drop(columns="target")

# splitting the data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)

# initializing all the base model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()

# putting all base model objects in one list
all_models = [model_1, model_2, model_3]

# computing the stack features (argument order: X_train, y_train, X_test)
s_train, s_test = stacking(all_models, X_train, y_train, X_test,
                           regression=True, n_folds=4)

# initializing the second-level model as a fresh estimator
final_model = LinearRegression()

# fitting the second level model with stack features
final_model = final_model.fit(s_train, y_train)

# predicting the final output from the test stack features
pred_final = final_model.predict(s_test)

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
```

Output:

`4510 `

2. Blending: It is similar to the stacking method explained above, but rather than using the whole dataset for training the base-models, a validation dataset is kept separate to make predictions.

Algorithm:

1. Split the training dataset into train, test and validation datasets.
2. Fit all the base models using the train dataset.
3. Make predictions on the validation and test datasets.
4. These predictions are used as features (meta-features) to build a second-level model.
5. This model is used to make the final predictions on the test dataset and its meta-features.

## Python3

```python
# importing utility modules
import pandas as pd
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression

# importing train test split
from sklearn.model_selection import train_test_split

# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")

# getting target data from the dataframe
target = df["target"]

# getting train data from the dataframe (drop the target column)
train = df.drop(columns="target")

# ratios for the train, validation and test split
train_ratio = 0.70
validation_ratio = 0.20
test_ratio = 0.10

# performing train test split
x_train, x_test, y_train, y_test = train_test_split(
    train, target, test_size=1 - train_ratio)

# performing test validation split
x_val, x_test, y_val, y_test = train_test_split(
    x_test, y_test, test_size=test_ratio / (test_ratio + validation_ratio))

# initializing all the base model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()

# training all the models on the train dataset

# training first model
model_1.fit(x_train, y_train)
val_pred_1 = model_1.predict(x_val)
test_pred_1 = model_1.predict(x_test)

# converting to dataframe (keeping the row index so that concat aligns rows)
val_pred_1 = pd.DataFrame({"pred_1": val_pred_1}, index=x_val.index)
test_pred_1 = pd.DataFrame({"pred_1": test_pred_1}, index=x_test.index)

# training second model
model_2.fit(x_train, y_train)
val_pred_2 = model_2.predict(x_val)
test_pred_2 = model_2.predict(x_test)

# converting to dataframe
val_pred_2 = pd.DataFrame({"pred_2": val_pred_2}, index=x_val.index)
test_pred_2 = pd.DataFrame({"pred_2": test_pred_2}, index=x_test.index)

# training third model (predictions must come from model_3, not model_1)
model_3.fit(x_train, y_train)
val_pred_3 = model_3.predict(x_val)
test_pred_3 = model_3.predict(x_test)

# converting to dataframe
val_pred_3 = pd.DataFrame({"pred_3": val_pred_3}, index=x_val.index)
test_pred_3 = pd.DataFrame({"pred_3": test_pred_3}, index=x_test.index)

# concatenating validation dataset along with all the predicted validation data (meta features)
df_val = pd.concat([x_val, val_pred_1, val_pred_2, val_pred_3], axis=1)
df_test = pd.concat([x_test, test_pred_1, test_pred_2, test_pred_3], axis=1)

# making the final model using the meta features
final_model = LinearRegression()
final_model.fit(df_val, y_val)

# getting the final output
final_pred = final_model.predict(df_test)

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, final_pred))
```

Output:

`4790 `

3. Bagging: It is also known as bootstrap aggregating. Base models are trained on bags so that each one sees a fair sample of the whole dataset. A bag is a sample drawn from the dataset with replacement, so that each bag has the same size as the whole dataset. The final output is formed by combining the outputs of all the base models.

Algorithm:

1. Create multiple datasets from the train dataset by selecting observations with replacement.
2. Run a base model on each of the created datasets independently.
3. Combine the predictions of all the base models to reach the final output.
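The three steps can be written out by hand in a few lines. The sketch below uses synthetic data and a decision tree as the base model, both chosen only for illustration; it mirrors the idea of bagging rather than the internals of any library class.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic data standing in for a real train/test split
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_bags = 10
bag_preds = []

for _ in range(n_bags):
    # step 1: draw row indices with replacement; the bag has the same
    # size as the train set, so some rows repeat and some are left out
    idx = rng.integers(0, len(X_train), size=len(X_train))

    # step 2: fit an independent base model on this bag
    model = DecisionTreeRegressor(random_state=0)
    model.fit(X_train[idx], y_train[idx])
    bag_preds.append(model.predict(X_test))

# step 3: combine the predictions (a plain average for regression)
pred_final = np.mean(bag_preds, axis=0)
print(pred_final.shape)
```

For classification, the combination step would typically be a majority vote instead of an average.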

Bagging normally uses only one base model (XGBoost Regressor used in the code below).

## Python

```python
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
import xgboost as xgb

# importing bagging module
from sklearn.ensemble import BaggingRegressor

# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")

# getting target data from the dataframe
target = df["target"]

# getting train data from the dataframe (drop the target column)
train = df.drop(columns="target")

# splitting the data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)

# initializing the bagging model using XGBoost as base model with default parameters
# (passed positionally: the keyword was renamed from base_estimator
# to estimator in scikit-learn 1.2)
model = BaggingRegressor(xgb.XGBRegressor())

# training model
model.fit(X_train, y_train)

# predicting the output on the test dataset
pred = model.predict(X_test)

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred))
```

Output:

`4666 `

4. Boosting: Boosting is a sequential method: it aims to prevent a wrong base model from affecting the final output. Instead of combining independently trained base models, the method builds each new model so that it depends on the previous one. A new model tries to correct the errors made by its predecessor. Each of these models is called a weak learner. The final model (aka strong learner) is formed by taking the weighted mean of all the weak learners.

Algorithm:

1. Take a subset of the train dataset.
2. Initialize all data points with the same weight.
3. Train a base model on that dataset.
4. Use this model to make predictions on the whole dataset.
5. Calculate errors using the predicted values and actual values.
6. Assign higher weight to incorrectly predicted data points.
7. Train another model in such a way that the errors made by the previous model are mitigated/corrected.
8. Similarly, create multiple models, each successive model correcting the errors of the previous model.
9. The final model (strong learner) is the weighted mean of all the previous models (weak learners).
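The weighting scheme in these steps corresponds most closely to AdaBoost-style boosting for classification. The sketch below is a simplified, hand-written version on synthetic data: it uses decision stumps as weak learners and, as an assumption for brevity, reweights the whole train set instead of sampling a subset. It is not what GradientBoostingRegressor does internally, which fits each new model to the residual errors rather than reweighting points.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic binary problem with labels recoded to -1 / +1
X, y01 = make_classification(n_samples=300, n_features=6, random_state=0)
y = np.where(y01 == 1, 1, -1)

n_rounds = 10
# every data point starts with the same weight
weights = np.full(len(X), 1.0 / len(X))
learners, alphas = [], []

for _ in range(n_rounds):
    # weak learner: a depth-1 tree fitted on the weighted data
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    miss = stump.predict(X) != y

    # weighted error of this learner, clipped to avoid division by zero
    err = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)
    # learners with lower error get a larger say in the final vote
    alpha = 0.5 * np.log((1 - err) / err)

    # raise the weight of misclassified points, lower the rest
    weights *= np.exp(np.where(miss, alpha, -alpha))
    weights /= weights.sum()

    learners.append(stump)
    alphas.append(alpha)

# strong learner: sign of the alpha-weighted sum of weak predictions
scores = sum(a * m.predict(X) for a, m in zip(alphas, learners))
pred_final = np.where(scores >= 0, 1, -1)
print((pred_final == y).mean())
```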

## Python3

```python
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
from sklearn.ensemble import GradientBoostingRegressor

# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")

# getting target data from the dataframe
target = df["target"]

# getting train data from the dataframe (drop the target column)
train = df.drop(columns="target")

# splitting the data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)

# initializing the boosting module with default parameters
model = GradientBoostingRegressor()

# training the model on the train dataset
model.fit(X_train, y_train)

# predicting the output on the test dataset
pred_final = model.predict(X_test)

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
```

Output:

`4789 `

Note: scikit-learn provides several modules and classes for ensemble methods. Please note that the error reported for a method on one dataset does not make that method superior to another. The article aims to give a brief introduction to ensemble methods, not to compare them. The programmer must choose a method that suits the data.
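As one illustration of those built-in classes, the following sketch uses scikit-learn's VotingRegressor (averaging) and StackingRegressor (stacking) on synthetic data. The dataset, the base models, and their settings are assumptions made for this example; the printed errors say nothing about which method is better in general.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (RandomForestRegressor, StackingRegressor,
                              VotingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic data standing in for a real train/test split
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# named base models; both ensemble classes clone them before fitting
base_models = [
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
]

# averaging: the prediction is the mean of the base-model predictions
averaging = VotingRegressor(estimators=base_models)

# stacking: a second-level model is trained on cross-validated predictions
stacking = StackingRegressor(
    estimators=base_models, final_estimator=LinearRegression(), cv=4)

scores = {}
for name, model in [("averaging", averaging), ("stacking", stacking)]:
    model.fit(X_train, y_train)
    scores[name] = mean_squared_error(y_test, model.predict(X_test))

print(scores)
```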
