Implementing the AdaBoost Algorithm From Scratch
AdaBoost belongs to the class of ensemble machine learning models. The literal meaning of the word ‘ensemble’ gives a good intuition for how these models work: an ensemble combines several different models into a single, more accurate meta-model. This meta-model typically predicts more accurately than any of its individual component models. The working of ensemble models is covered in the article Ensemble Classifier | Data Mining.
The AdaBoost algorithm falls under ensemble boosting techniques. As discussed, it combines multiple models to produce more accurate results, and it does so in two phases:
- Multiple weak learners are trained on the training data.
- These models are combined into a meta-model, which aims to correct the errors made by the individual weak learners.
Note: For more information, refer to Boosting ensemble models.
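To make the two phases concrete, here is a minimal sketch of discrete AdaBoost for a binary problem. It is an illustration under stated assumptions, not the implementation used in the rest of this article: the helper names `adaboost_fit` and `adaboost_predict`, the synthetic data, and the choice of 20 rounds are all hypothetical, and the weak learners are depth-1 decision trees (stumps) from scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=20):
    """Phase 1: train weak learners one after another (labels must be -1 or +1)."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # Weighted error of this stump, clipped to keep the log finite
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this stump
        # Increase the weight of misclassified samples, decrease the rest
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas


def adaboost_predict(X, stumps, alphas):
    """Phase 2: combine the weak learners through a weighted vote."""
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(scores >= 0, 1, -1)


# Synthetic two-class data, used only for this illustration
X_demo, y01 = make_classification(n_samples=200, n_features=4, random_state=0)
y_demo = 2 * y01 - 1  # map labels {0, 1} to {-1, +1}

stumps, alphas = adaboost_fit(X_demo, y_demo)
ensemble_acc = (adaboost_predict(X_demo, stumps, alphas) == y_demo).mean()
print("Training accuracy of the combined model:", ensemble_acc)
```

Each round reweights the samples so that the next stump concentrates on the points the previous ones got wrong; the final prediction is the sign of the weighted vote.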
In this article, we will walk through a practical implementation of an AdaBoost classifier on a dataset.
In this problem, we are given a dataset containing 3 species of flowers along with four features of each flower: sepal length, sepal width, petal length, and petal width. Our task is to classify each flower into one of these species. The dataset can be downloaded from here
Let’s begin by importing the libraries that we will need for our classification task:
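The original import list is not reproduced here, so the following is a plausible minimal set for the steps in this article, assuming pandas and scikit-learn are installed:

```python
import pandas as pd  # loading and inspecting the CSV
from sklearn.ensemble import AdaBoostClassifier  # the boosting model
from sklearn.metrics import accuracy_score  # evaluation on the validation set
from sklearn.model_selection import train_test_split  # train/validation split
```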
After importing the libraries, we load our dataset using the pandas read_csv method:
We can see that our dataset contains 150 rows and 6 columns. Let us take a look at the actual content of the dataset using the head() method:
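The loading and inspection steps above could be sketched as follows. The file name `Iris.csv` and the feature column names are assumptions, since the original path and header are not shown; only the `Id` and `Species` columns are confirmed by the text. The fallback branch rebuilds an equivalent frame from scikit-learn's bundled copy of the data so that the sketch runs even without the download.

```python
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_iris

CSV_PATH = Path("Iris.csv")  # hypothetical file name; adjust to your download

if CSV_PATH.exists():
    data = pd.read_csv(CSV_PATH)
else:
    # Fallback: build a frame with the same layout from scikit-learn's copy.
    # The four feature column names below are assumed, not taken from the article.
    iris = load_iris()
    data = pd.DataFrame(
        iris.data,
        columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
    )
    data.insert(0, "Id", range(1, len(data) + 1))
    data["Species"] = ["Iris-" + name for name in iris.target_names[iris.target]]

print(data.shape)   # number of rows and columns
print(data.head())  # first five rows
```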
The first column is the Id column, which carries no information about the flowers, so we will drop it. The Species column is our target and tells us the species to which each flower belongs.
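A minimal sketch of this step is shown below. To keep it self-contained, it first builds a stand-in for the loaded frame from scikit-learn's bundled Iris data; the feature column names are assumptions, while `Id` and `Species` follow the text.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for the frame loaded from the CSV (feature column names are assumed)
iris = load_iris()
data = pd.DataFrame(
    iris.data,
    columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
)
data.insert(0, "Id", range(1, len(data) + 1))
data["Species"] = ["Iris-" + name for name in iris.target_names[iris.target]]

# Drop the identifier column, then separate the features from the target
data = data.drop(columns="Id")
X = data.drop(columns="Species")
y = data["Species"]

print(f"Shape of X is {X.shape} and shape of y is {y.shape}")
print(f"Number of unique species in dataset are: {y.nunique()}")
print(y.value_counts())
```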
Shape of X is (150, 4) and shape of y is (150,)
Number of unique species in dataset are: 3
Iris-virginica     50
Iris-setosa        50
Iris-versicolor    50
Name: Species, dtype: int64
Looking more closely at the dataset, the counts above show that the flowers are distributed across 3 classes. Since we have 150 samples and each species has 50 of them, there is no class imbalance.
Now we will split the dataset into training and validation sets; the validation set is 25% of the total dataset.
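A split like this might be written with scikit-learn's `train_test_split`. In the sketch below, `load_iris(return_X_y=True)` stands in for the `X` and `y` prepared earlier, and `random_state=0` is an arbitrary seed chosen for reproducibility; the seed used originally is not known.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in for the X and y prepared from the CSV
X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples for validation (the seed is an assumption)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_val.shape)  # (112, 4) (38, 4)
```

With 150 samples, a 25% hold-out rounds up to 38 validation samples, leaving 112 for training. Passing `stratify=y` would additionally keep the class proportions equal in both sets, although the text does not say whether that was done.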
After creating the training and validation sets, we build our AdaBoost classifier and fit it on the training set.
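Building and fitting the model could look like the sketch below. The hyperparameters used originally are not shown, so `n_estimators=50` and `learning_rate=1.0` here are simply scikit-learn's defaults written out explicitly, and both `random_state` values are assumptions. The data setup again uses scikit-learn's bundled Iris data as a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Stand-in data and split (seed is an assumption)
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Boosted ensemble; by default each weak learner is a depth-1 decision tree
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
model.fit(X_train, y_train)

print("Weak learners fitted:", len(model.estimators_))
```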
Now that the model is fitted on the training set, we check its accuracy on the validation set.
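The evaluation step can be sketched with `accuracy_score`. As before, the data, split seed, and model settings are stand-ins, so the number printed may differ from the one reported in this article.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data, split, and model (seeds and settings are assumptions)
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
model.fit(X_train, y_train)

# Fraction of validation samples whose species is predicted correctly
y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print("The accuracy of the model on validation set is", accuracy)
```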
The accuracy of the model on validation set is 0.9210526315789473
As we can see, the model reaches an accuracy of about 92% on the validation set (35 of 38 samples), which is quite good given that we did no hyperparameter tuning or feature engineering.