We often encounter datasets that contain categorical features. To fit such datasets into a boosting model, we typically apply encoding techniques such as One-Hot Encoding or Label Encoding. One-Hot Encoding, however, creates a sparse matrix that can sometimes lead to overfitting. CatBoost addresses this issue by handling categorical features automatically.
What is CatBoost
CatBoost, short for Categorical Boosting, is an open-source boosting library developed by Yandex. It is designed for regression and classification problems with a very large number of independent features.
CatBoost is a variant of gradient boosting that can handle both categorical and numerical features. It does not require feature-encoding techniques such as One-Hot Encoding or Label Encoding to convert categorical features into numerical ones. It also uses an algorithm called symmetric weighted quantile sketch (SWQS) that automatically handles missing values in the dataset, reducing overfitting and improving the overall performance of the model.
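As a quick, minimal sketch of this behaviour (the toy columns below are invented for illustration and are not part of this article's project), you only need to tell CatBoost which columns are categorical via the cat_features argument; no One-Hot or Label Encoding is applied:
Python3
import pandas as pd
from catboost import CatBoostClassifier

# Toy data with a raw string column; no manual encoding is applied
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue"],  # categorical
    "size": [1.0, 2.5, 3.1, 2.2, 1.7, 2.9],                     # numerical
    "label": [0, 1, 1, 0, 0, 1],
})

model = CatBoostClassifier(iterations=10, verbose=False)
# cat_features names the columns CatBoost should treat as categorical
model.fit(df[["color", "size"]], df["label"], cat_features=["color"])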
Features of CatBoost
- Built-in handling of categorical features – CatBoost can consume categorical features directly, without any feature encoding.
- Built-in handling of missing values – unlike many other models, CatBoost can easily handle missing values in the dataset.
- Automatic feature scaling – CatBoost internally brings all columns to a common scale, whereas with other models we often need to preprocess the columns ourselves.
- Built-in cross-validation – CatBoost ships a cv() utility for evaluating hyperparameter choices by cross-validation (see the sketch after this list).
- Regularization – CatBoost supports both L1 and L2 regularization methods to reduce overfitting.
- It can be used in both the Python and R languages.
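For the built-in cross-validation mentioned above, here is a minimal sketch of catboost's cv() utility; the synthetic data and parameter values are made up purely for illustration:
Python3
import numpy as np
from catboost import Pool, cv

# Small synthetic binary-classification dataset for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

params = {"loss_function": "Logloss", "iterations": 50}
# cv() returns a DataFrame of per-iteration train/test metrics across folds
cv_results = cv(Pool(X, y), params, fold_count=5)
print(cv_results["test-Logloss-mean"].min())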
CatBoost Comparison Results with Other Boosting Algorithms
Scores are logloss values (lower is better); the percentages in parentheses show each result's change relative to Tuned CatBoost on the same dataset.

| Dataset | Default CatBoost | Tuned CatBoost | Default LightGBM | Tuned LightGBM | Default XGBoost | Tuned XGBoost |
| --- | --- | --- | --- | --- | --- | --- |
| Adult | 0.272978 (±0.0004) (+1.20%) | 0.269741 (±0.0001) | 0.287165 (±0.0000) (+6.46%) | 0.276018 (±0.0003) (+2.33%) | 0.280087 (±0.0000) (+3.84%) | 0.275423 (±0.0002) (+2.11%) |
| Amazon | 0.138114 (±0.0004) (+0.29%) | 0.137720 (±0.0005) | 0.167159 (±0.0000) (+21.38%) | 0.163600 (±0.0002) (+18.79%) | 0.165365 (±0.0000) (+20.07%) | 0.163271 (±0.0001) (+18.55%) |
| Appet | 0.071382 (±0.0002) (-0.18%) | 0.071511 (±0.0001) | 0.074823 (±0.0000) (+4.63%) | 0.071795 (±0.0001) (+0.40%) | 0.074659 (±0.0000) (+4.40%) | 0.071760 (±0.0000) (+0.35%) |
| Click | 0.391116 (±0.0001) (+0.05%) | 0.390902 (±0.0001) | 0.397491 (±0.0000) (+1.69%) | 0.396328 (±0.0001) (+1.39%) | 0.397638 (±0.0000) (+1.72%) | 0.396242 (±0.0000) (+1.37%) |
| Internet | 0.220206 (±0.0005) (+5.49%) | 0.208748 (±0.0011) | 0.236269 (±0.0000) (+13.18%) | 0.223154 (±0.0005) (+6.90%) | 0.234678 (±0.0000) (+12.42%) | 0.225323 (±0.0002) (+7.94%) |
| Kdd98 | 0.194794 (±0.0001) (+0.06%) | 0.194668 (±0.0001) | 0.198369 (±0.0000) (+1.90%) | 0.195759 (±0.0001) (+0.56%) | 0.197949 (±0.0000) (+1.69%) | 0.195677 (±0.0000) (+0.52%) |
| Kddchurn | 0.231935 (±0.0004) (+0.28%) | 0.231289 (±0.0002) | 0.235649 (±0.0000) (+1.88%) | 0.232049 (±0.0001) (+0.33%) | 0.233693 (±0.0000) (+1.04%) | 0.233123 (±0.0001) (+0.79%) |
| Kick | 0.284912 (±0.0003) (+0.04%) | 0.284793 (±0.0002) | 0.298774 (±0.0000) (+4.91%) | 0.295660 (±0.0000) (+3.82%) | 0.298161 (±0.0000) (+4.69%) | 0.294647 (±0.0000) (+3.46%) |
| Upsel | 0.166742 (±0.0002) (+0.37%) | 0.166128 (±0.0002) | 0.171071 (±0.0000) (+2.98%) | 0.166818 (±0.0000) (+0.42%) | 0.168732 (±0.0000) (+1.57%) | 0.166322 (±0.0001) (+0.12%) |
CatBoost Installation
CatBoost is an open-source library that does not come pre-installed with Python, so before using CatBoost we must install it on our local system.
For installing CatBoost in Python
pip install catboost
For installing CatBoost in R: the package is not distributed on CRAN, so it is installed from an official GitHub release archive (the <version> and platform in the URL below are placeholders; see the CatBoost documentation for the exact asset name for your system):
install.packages('devtools')
devtools::install_url('https://github.com/catboost/catboost/releases/download/<version>/catboost-R-Windows-<version>.tgz')
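Back in Python, a quick way to confirm the installation succeeded is to import the package and print its version:
Python3
import catboost

# Should print the installed CatBoost version, e.g. 1.x
print(catboost.__version__)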
Python Implementation of CatBoost
We will use Python to apply CatBoost to a machine-learning problem. The dataset for the project can be found here. In this problem, we are given a dataset containing 3 species of flowers along with features of these flowers, such as sepal length, sepal width, petal length, and petal width, and we have to classify each flower into one of these species.
Importing libraries For CatBoost
After installing CatBoost on our local system, we will import it along with the other Python libraries needed for this project.
Python3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
import warnings

# Suppress warning messages for cleaner output
warnings.filterwarnings("ignore")
Reading And Describing The Dataset
After importing the libraries, we will load our dataset using the pandas read_csv method:
Python3
# Load the Iris dataset and check its dimensions
data = pd.read_csv("Iris.csv")
print(data.shape)
Output:
(150, 6)
Our dataset has 150 rows and 6 columns. Let’s explore the dataset’s contents using the head() method, as shown below.
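The preview below can be produced with:
Python3
# Show the first five rows of the dataset
print(data.head())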
Output:

|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Dropping ID Column and Separating Target Variable from The Dataset
The first column is the Id column, which has no relevance to the flowers, so we will drop it using the drop() function. The Species column is our target feature and tells us the species each flower belongs to. We will separate it using pandas iloc slicing.
Python3
# Drop the Id column, then separate features (X) from the target (y)
data = data.drop('Id', axis=1)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
print("Shape of X is %s and shape of y is %s" % (X.shape, y.shape))
Output:
Shape of X is (150, 4) and shape of y is (150,)
Unique Values in Our Dependent Variable
Since this is a classification task, we may want to determine the total number of unique categories in our dependent variable.
Python3
# Count the number of distinct species in the target column
total_classes = y.nunique()
print("Number of unique species in dataset are:", total_classes)
Output:
Number of unique species in dataset are: 3
There are 3 unique classes in our dependent variable. We may want to see the count of each class to check whether our dataset is balanced.
Python3
# Count how many samples belong to each species
distribution = y.value_counts()
print(distribution)
Output:
Iris-virginica 50
Iris-setosa 50
Iris-versicolor 50
Name: Species, dtype: int64
Digging a little deeper, we can see in the above output that our dataset contains 3 classes into which the flowers are distributed, and since we have 150 samples, all three species have an equal number of samples. So we have no class imbalance.
Splitting The Dataset
Now, we will split the dataset for training and validation purposes; the validation set is 25% of the total dataset. To divide the dataset into training and testing sets, we will use the train_test_split method from sklearn's model_selection module.
Python3
# Hold out 25% of the data for validation
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.25, random_state=28)
Applying CatBoost to The Data
Python3
# Hyperparameters: learning rate, tree depth, L2 regularization, boosting rounds
params = {'learning_rate': 0.1, 'depth': 6,
          'l2_leaf_reg': 3, 'iterations': 100}
model = CatBoostClassifier(**params)
model.fit(X_train, Y_train)
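Optionally, fit() can also monitor the validation split during training and stop early once the validation metric stops improving; here is a minimal sketch reusing the params and split from above:
Python3
model = CatBoostClassifier(**params)
model.fit(
    X_train, Y_train,
    eval_set=(X_val, Y_val),   # validation data monitored during training
    early_stopping_rounds=20,  # stop if no improvement for 20 iterations
    verbose=False,             # silence per-iteration logging
)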
Accuracy Of the CatBoost Model
Python3
# predict() returns a 2-D array, so flatten it before comparing with the labels
y_pred = model.predict(X_val).flatten()
accuracy = (y_pred == np.array(Y_val)).mean()
print("Validation Accuracy:", accuracy)
Because the Iris species are almost perfectly separable, the printed validation accuracy should be close to 1.0 (the exact value depends on the random train/validation split).