CatBoost Parameters and Hyperparameters

CatBoost is a popular open-source library for gradient boosting on decision trees. Developed by Yandex, it can be applied to a range of machine-learning problems, including classification, regression, and ranking. Compared to other boosting libraries, CatBoost has a number of benefits:

It can handle categorical features automatically, without the need for encoding or preprocessing.
It can reduce overfitting by using a novel gradient-boosting scheme and regularization techniques.
It can achieve high performance and scalability by using efficient implementations for CPU and GPU.

In this post, we focus on CatBoost’s parameters and hyperparameters, the settings that control how the algorithm trains and how well it performs. We describe them, explain how they affect the model, and show how to tune them for the best results.

CatBoost Parameters

Strictly speaking, a model’s parameters are the internal values it learns during training, such as the split points and leaf values of a decision tree. CatBoost’s documentation also uses the word “parameters” for the training settings you pass to the model, and those are the ones you can adjust to customize training. Let’s examine several important CatBoost settings and what they do:

iterations: This parameter specifies the number of boosting iterations (trees) to be used during training.
learning_rate: It controls the step size at each iteration while moving toward a minimum of the loss function.
depth: Determines the maximum depth of the individual decision trees in the ensemble.
l2_leaf_reg: The coefficient of the L2 regularization term in the cost function. It helps prevent overfitting by penalizing large leaf values.
cat_features: A list of column indices (or names) indicating which features are categorical. CatBoost encodes categorical features internally, so you do not need to one-hot encode them yourself, but you generally need to tell it which columns are categorical through this parameter.
loss_function: Specifies the loss function to be optimized during training. For regression tasks you might use 'RMSE'; for binary classification 'Logloss' is common, and for multi-class classification 'MultiClass'.
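To make the roles of iterations, learning_rate, and l2_leaf_reg more concrete, here is a minimal toy sketch of gradient boosting for squared error using one-split trees. It is not CatBoost's actual algorithm: the helper names fit_stump and boost, the data, and the simplified leaf formula (sum of residuals divided by leaf size plus the L2 coefficient) are assumptions chosen for illustration.

```python
# Toy gradient boosting for squared error with one-split "stumps".
# Illustrative only: CatBoost's real trees, split scoring and leaf
# estimation are more elaborate than this sketch.

def fit_stump(xs, residuals, l2_leaf_reg):
    """Pick the threshold whose two leaves best fit the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        # Regularized leaf value: the L2 term shrinks leaves toward zero.
        left_value = sum(left) / (len(left) + l2_leaf_reg)
        right_value = sum(right) / (len(right) + l2_leaf_reg)
        error = sum((r - left_value) ** 2 for r in left) + sum(
            (r - right_value) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def boost(xs, ys, iterations, learning_rate, l2_leaf_reg):
    """Return the training mean squared error after each iteration."""
    predictions = [0.0] * len(xs)
    history = []
    for _ in range(iterations):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals, l2_leaf_reg)
        for i, x in enumerate(xs):
            step = left_value if x <= threshold else right_value
            # learning_rate scales how far each new tree moves the prediction.
            predictions[i] += learning_rate * step
        mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(xs)
        history.append(mse)
    return history


# Made-up one-dimensional data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

fast = boost(xs, ys, iterations=20, learning_rate=0.5, l2_leaf_reg=0.0)
slow = boost(xs, ys, iterations=20, learning_rate=0.05, l2_leaf_reg=0.0)

print(f"learning_rate=0.5,  final training MSE: {fast[-1]:.4f}")
print(f"learning_rate=0.05, final training MSE: {slow[-1]:.4f}")
```

With the same number of iterations, the smaller learning rate leaves a larger training error, which is why a low learning_rate is usually paired with more iterations.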

CatBoost Hyperparameters

Hyperparameters are the settings that you, as a machine learning practitioner, must provide before training. They control aspects of training such as decision tree depth and learning rate, and choosing them well is essential for the model to perform well.

CatBoost offers a flexible interface for configuring and fine-tuning hyperparameters, which can be grouped into several categories:

Common hyperparameters: These are the basic hyperparameters that are applicable to any machine learning problem, such as the loss function, the learning rate, or the random seed.
Bootstrap hyperparameters: These are the hyperparameters that control the sampling of the data for each tree, such as the bootstrap type or the subsample rate.
Tree structure hyperparameters: These are the hyperparameters that control the shape and size of each tree, such as the depth, the number of leaves, or the minimum samples in a leaf.
Feature importance hyperparameters: These are the hyperparameters that control how features are selected and split for each tree, such as the feature border type or the random strength.
Regularization hyperparameters: These are the hyperparameters that control how much complexity is penalized in the model, such as the L2 regularization or the leaf estimation method.
Overfitting detector hyperparameters: These are the hyperparameters that control how training is stopped when overfitting occurs, such as the eval metric or the use_best_model option.
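As a quick orientation aid, the categories above can be written down as a lookup table. The parameter names below are CatBoost training parameters, but the grouping itself and the helper names group_of and split_by_group are assumptions made for this sketch, and the values in the usage example are placeholders rather than recommendations.

```python
# Category -> CatBoost parameter names. The grouping mirrors the list
# above; it is an orientation aid, not an official CatBoost taxonomy.
HYPERPARAMETER_GROUPS = {
    "common": ["loss_function", "learning_rate", "random_seed"],
    "bootstrap": ["bootstrap_type", "subsample", "bagging_temperature"],
    "tree_structure": ["depth", "max_leaves", "min_data_in_leaf", "grow_policy"],
    # Labelled "feature importance" in the list above; these settings
    # govern feature quantization and the randomness of split scoring.
    "feature_importance": ["feature_border_type", "border_count", "random_strength"],
    "regularization": ["l2_leaf_reg", "leaf_estimation_method"],
    "overfitting_detector": ["eval_metric", "use_best_model", "od_type", "od_wait"],
}


def group_of(name):
    """Return the category a parameter belongs to, or None if unlisted."""
    for group, names in HYPERPARAMETER_GROUPS.items():
        if name in names:
            return group
    return None


def split_by_group(params):
    """Reorganize a flat params dict by category, e.g. to review a config."""
    grouped = {}
    for name, value in params.items():
        grouped.setdefault(group_of(name), {})[name] = value
    return grouped


# Placeholder values, for illustration only.
config = {"depth": 6, "learning_rate": 0.1, "l2_leaf_reg": 3}
print(split_by_group(config))
```

Some of these settings only apply in combination with others (for example, max_leaves depends on the chosen grow_policy), so check the CatBoost documentation before combining them.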

Some of the common hyperparameters used for tuning are as follows:

Learning rate: This setting scales the gradient step. The smaller the value, the more iterations are needed and the longer training takes overall.
Tree depth: The depth hyperparameter sets the maximum depth of each decision tree in the ensemble. Deeper trees can capture more complicated patterns, but they may overfit if the depth is set too high.
Bagging temperature: It controls how aggressively training samples are randomly reweighted under the Bayesian bootstrap. At 0, every sample gets a weight of 1 (no randomness); at 1, the weights are drawn from an exponential distribution; higher values make the weights more uneven. A moderate amount of randomness may help generalization.
Border count: It sets the number of splits considered for numerical features, which affects model complexity and training efficiency. Lower values speed up training but can limit what the model is able to express, while higher values capture finer patterns at a higher computational cost.
L2 regularization: It adds a penalty term to the loss function during training to discourage large leaf values and encourage a simpler model, which helps prevent overfitting. It is controlled by the l2_leaf_reg hyperparameter (reg_lambda is accepted as an alias); higher values impose stronger regularization.
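The effect of bagging temperature is easier to see with numbers. The sketch below follows the weighting rule described in the CatBoost documentation for the Bayesian bootstrap, where a sample's weight is (-log u) raised to the power of the temperature, with u drawn uniformly from (0, 1]. The function names, the seed, and the sample count are assumptions made for illustration, and this is not CatBoost's internal code.

```python
import math
import random


def bayesian_bootstrap_weights(n_samples, bagging_temperature, seed=0):
    """Draw one random weight per sample: (-log u) ** bagging_temperature."""
    rng = random.Random(seed)
    weights = []
    for _ in range(n_samples):
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        weights.append((-math.log(u)) ** bagging_temperature)
    return weights


def spread(weights):
    """Gap between the largest and smallest weight."""
    return max(weights) - min(weights)


for temperature in (0.0, 1.0, 10.0):
    weights = bayesian_bootstrap_weights(1000, temperature)
    print(f"bagging_temperature={temperature}: weight spread = {spread(weights):.3g}")
```

At temperature 0 every weight is exactly 1, so all samples count equally. As the temperature grows, a few samples receive very large weights and many receive weights near zero, which is what "more aggressive bagging" means in practice.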

Hyperparameter Tuning

Hyperparameter tuning is the process of selecting the best set of hyperparameters for a given problem and dataset. A variety of techniques and tools are available for it, including grid search, random search, and Bayesian optimization (for example, with Optuna). The general steps are:

Define a search space: This is a range or a list of possible values for each hyperparameter that you want to tune.
Define an objective function: This is a function that evaluates how well a model performs on a validation set with a given set of hyperparameters.
Define a search strategy: This is a method that decides how to explore the search space and find the optimal set of hyperparameters.
Run the search: This is where you execute the search strategy and collect the results.
Analyze the results: This is where you compare and visualize the performance of different sets of hyperparameters and choose the best one.
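The five steps above can be sketched with a plain grid search. To keep the example self-contained, the objective function here is a hypothetical stand-in with a known minimum; in a real project it would train a CatBoost model with the given hyperparameters and return its validation loss. The candidate values are also assumptions.

```python
import itertools

# 1. Search space: candidate values for each hyperparameter.
search_space = {
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "depth": [2, 4, 6, 8],
}


# 2. Objective: a hypothetical stand-in for "train a model and return
#    its validation loss". Its minimum is at learning_rate=0.1, depth=4.
def objective(params):
    return (params["learning_rate"] - 0.1) ** 2 * 100 + (params["depth"] - 4) ** 2 * 0.01


# 3. Search strategy: exhaustive grid over every combination.
def grid(space):
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


# 4. Run the search and collect (loss, params) pairs.
results = [(objective(params), params) for params in grid(search_space)]

# 5. Analyze: sort by loss and pick the best combination.
results.sort(key=lambda item: item[0])
best_loss, best_params = results[0]
print("Best parameters:", best_params)
print("Best loss:", best_loss)
```

Swapping step 3 for random sampling or a Bayesian optimizer changes only how candidates are proposed; the search space, objective, and analysis steps stay the same.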

Implementation

CatBoost is a gradient boosting library known for its handling of categorical features and its strong out-of-the-box performance. In this guide, we walk through implementing a CatBoost model for multi-class classification using the Iris dataset from the UCI Machine Learning Repository.

Understanding the Iris Dataset

The Iris dataset is a standard machine-learning dataset containing measurements of flowers from three iris species: Setosa, Versicolor, and Virginica. It makes a good case study for multi-class classification.

Installing Packages

!pip install catboost

Importing Necessary Libraries

Let’s begin by importing the Python libraries we’ll need for this project:

Python

# Import libraries
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

Loading and Preprocessing the Dataset

Load the Iris dataset into a Pandas DataFrame. All four Iris features are numeric, so no categorical encoding is needed here; with other datasets, check for missing values and make sure categorical variables are handled appropriately.

Python

# Load the Iris dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
data = pd.read_csv(url, names=column_names)

# Split the data into features and target variable
X = data.drop('class', axis=1)
y = data['class']

This code loads the Iris dataset from the given URL and assigns names to its columns. It then separates the data into the features (X), obtained by dropping the ‘class’ column, and the target variable (y).

Splitting Data into Training and Testing Sets

Split your data into training and testing sets to evaluate the model’s performance:

Python

#splitting data into train and test sets 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here, Scikit-Learn’s train_test_split function divides the dataset into training and testing sets. Setting test_size=0.2 reserves 20% of the data for testing, and fixing random_state makes the split reproducible. The function returns X_train and X_test for the features and y_train and y_test for the target values.

Building and Training the CatBoost Model

Initialize and train the CatBoost model:

Python

# Initialize the CatBoost classifier

model = CatBoostClassifier(iterations=500, learning_rate=0.1, depth=6, loss_function='MultiClass', random_state=42)
 
# Train the model on the training data
model.fit(X_train, y_train)

Output:

0:    learn: 0.9959553    total: 57.1ms    remaining: 28.5s
1:    learn: 0.8977899    total: 57.9ms    remaining: 14.4s
2:    learn: 0.8160208    total: 58.6ms    remaining: 9.7s
3:    learn: 0.7441595    total: 61.6ms    remaining: 7.64s
4:    learn: 0.6808458    total: 70.6ms    remaining: 6.99s
5:    learn: 0.6365295    total: 71.9ms    remaining: 5.92s
6:    learn: 0.5888323    total: 73ms    remaining: 5.14s
7:    learn: 0.5414839    total: 73.8ms    remaining: 4.54s
8:    learn: 0.5021250    total: 75.2ms    remaining: 4.1s
9:    learn: 0.4721895    total: 81.4ms    remaining: 3.99s
...
491:    learn: 0.0071053    total: 156ms    remaining: 2.53ms
492:    learn: 0.0070795    total: 156ms    remaining: 2.21ms
493:    learn: 0.0070619    total: 156ms    remaining: 1.9ms
494:    learn: 0.0070507    total: 156ms    remaining: 1.58ms
495:    learn: 0.0070360    total: 157ms    remaining: 1.26ms
496:    learn: 0.0070230    total: 157ms    remaining: 946us
497:    learn: 0.0070069    total: 157ms    remaining: 630us
498:    learn: 0.0069942    total: 157ms    remaining: 314us
499:    learn: 0.0069780    total: 157ms    remaining: 0us

This code sets up a CatBoost classifier with the hyperparameters iterations, learning_rate, depth, and loss_function, plus a fixed random seed. The model is then trained on the training data (X_train and y_train) to learn a multi-class classifier.

Evaluating Model Performance

Evaluate the model’s performance on the test data:

Python

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Output:

Accuracy: 100.00%

In this code, the trained CatBoost classifier (model) makes predictions on the test data (X_test). The classification accuracy is then computed by comparing the true labels (y_test) with the predicted labels (y_pred), and the result is printed as a percentage.

A model that reaches 100% accuracy deserves a second look. Iris is a small, well-separated dataset and this test set holds only 30 samples, so a perfect score is plausible, but it can also point to a problem with the model or with the way it is being evaluated. Here are some typical explanations for such high accuracy, along with ways to investigate further:

Data Leakage: Ensure that you have properly split your dataset into training and testing sets. Data leakage can occur if your model has seen some or all of the test data during training.
Overfitting: The model may have memorized the training data and be unable to generalize to new, unseen data. You can combat overfitting by simplifying the model (for example, by decreasing the tree depth) or by adding more training data.
Evaluation Metric: Make sure you are using an appropriate evaluation metric for your problem. Accuracy is not always the best metric for classification tasks, particularly if the classes are imbalanced.
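To illustrate the last point, the sketch below compares accuracy with per-class recall on a deliberately imbalanced, made-up label set. The labels and the helper functions are assumptions for illustration; the Iris classes themselves are balanced.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_per_class(y_true, y_pred):
    """For each class, the fraction of its samples that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up data: 9 "common" samples, 1 "rare" sample, and a model
# that always predicts the majority class.
y_true = ["common"] * 9 + ["rare"]
y_pred = ["common"] * 10

print("Accuracy:", accuracy(y_true, y_pred))
print("Recall per class:", recall_per_class(y_true, y_pred))
```

The accuracy looks high even though the model never identifies the rare class, which is why per-class metrics are worth checking whenever classes are imbalanced.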

To check this, we turn to cross-validation and hyperparameter tuning. Cross-validation gives a more reliable assessment of your model’s performance than a single train-test split, and it makes it easier to determine whether the high accuracy is consistent across various subsets of the data.
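Before using a library routine, it can help to see what k-fold cross-validation does to the data. The following is a minimal sketch of the index bookkeeping only, with no model involved; the function name k_fold_indices and the seed are assumptions, and real implementations may shuffle or stratify differently.

```python
import random


def k_fold_indices(n_samples, fold_count, seed=42):
    """Yield (training_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into fold_count roughly equal folds.
    folds = [indices[i::fold_count] for i in range(fold_count)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# 150 samples (the size of the Iris dataset) split into 5 folds.
splits = list(k_fold_indices(n_samples=150, fold_count=5))
for fold_number, (training, validation) in enumerate(splits, start=1):
    print(f"Fold {fold_number}: {len(training)} training, {len(validation)} validation samples")
```

Every sample is used for validation exactly once and for training in the remaining folds, so the averaged score depends much less on one lucky or unlucky split.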

Hyperparameter Tuning

Consider using cross-validation to assess your model’s robustness:

Python

from catboost import CatBoostClassifier, Pool, cv

# Create a CatBoost Pool from the full dataset
catboost_pool = Pool(X, label=y)

# Define the parameters for the CatBoost model
params = {
    'iterations': 1000,
    'learning_rate': 0.01,
    'depth': 3,
    'loss_function': 'MultiClass',
    'random_state': 42,
}

# Perform cross-validation using the cv function from CatBoost
cv_results, cv_model = cv(
    pool=catboost_pool,
    params=params,
    fold_count=5,        # number of cross-validation folds
    verbose=False,       # suppress per-iteration logging
    return_models=True,  # also return the model trained on each fold
)

Output:

Training on fold [0/5]

bestTest = 0.1903599557
bestIteration = 723

Training on fold [1/5]

bestTest = 0.2019080832
bestIteration = 540

Training on fold [2/5]

bestTest = 0.09307095973
bestIteration = 983

Training on fold [3/5]

bestTest = 0.1257137299
bestIteration = 893

Training on fold [4/5]

bestTest = 0.09728240085
bestIteration = 996

Print the Result:

Python3

print(cv_results.head())

Output:

   iterations  test-MultiClass-mean  test-MultiClass-std  \
0           0              1.086702             0.001203   
1           1              1.074234             0.001518   
2           2              1.060712             0.001777   
3           3              1.050879             0.002378   
4           4              1.039454             0.001931   

   train-MultiClass-mean  train-MultiClass-std  
0               1.086469              0.000294  
1               1.074242              0.001409  
2               1.060602              0.001765  
3               1.050635              0.001235  
4               1.039139              0.001284

The code above applies cross-validation to a CatBoost model. It first builds a CatBoost Pool, a data structure that holds the features and labels efficiently. The params dictionary sets the number of iterations, the learning rate, the tree depth, the loss function (“MultiClass” for multi-class classification), and a random seed for reproducibility. The cv function then runs 5-fold cross-validation (fold_count=5) with per-iteration logging turned off (verbose=False) and, because return_models=True, returns the model trained on each fold along with the results table. The next snippet lists the test metric columns in the results and reads the mean test loss at the final iteration from the first of them. Note that MultiClass is a loss, not an accuracy, so lower values are better.

Python3

# Check the available metric names in the cross-validation results
available_metrics = [metric for metric in cv_results.columns
                     if metric.startswith('test-')]
print("Available Metrics:", available_metrics)

# Take the mean test loss (the first metric) at the final iteration.
# MultiClass is a loss value, not a percentage, so print it as-is.
mean_loss = cv_results[available_metrics[0]].iloc[-1]
print(f"Mean Loss: {mean_loss:.4f}")

Output:

Available Metrics: ['test-MultiClass-mean', 'test-MultiClass-std']
Mean Loss: 0.1460
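The mean and std columns in the results table summarize the per-fold loss at each iteration, and the last iteration is not necessarily the best one. The sketch below reproduces that summary on made-up numbers and picks the iteration with the lowest mean loss. The loss values are invented, and whether CatBoost uses the sample or population formula for its std column is not checked here.

```python
import statistics

# Made-up validation losses: one list per fold, one value per iteration.
fold_losses = [
    [1.00, 0.60, 0.40, 0.35, 0.37],
    [1.02, 0.63, 0.42, 0.36, 0.36],
    [0.98, 0.58, 0.38, 0.33, 0.38],
]

# zip(*fold_losses) regroups the values by iteration across folds.
mean_per_iteration = [statistics.mean(values) for values in zip(*fold_losses)]
std_per_iteration = [statistics.stdev(values) for values in zip(*fold_losses)]

# The best iteration is the one with the lowest mean validation loss.
best_iteration = min(range(len(mean_per_iteration)), key=mean_per_iteration.__getitem__)

print("Best iteration:", best_iteration)
print(f"Mean loss there: {mean_per_iteration[best_iteration]:.4f}")
print(f"Mean loss at the last iteration: {mean_per_iteration[-1]:.4f}")
```

On the real results table, the pandas call cv_results['test-MultiClass-mean'].idxmin() gives the row with the lowest mean test loss in the same way.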

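Evaluate the Accuracy of Each Model

Let’s evaluate the accuracy of the model obtained from each fold on the test set. Keep in mind that the pool above was built from the full dataset (X and y), so each fold’s model was trained on data that includes most of the X_test samples; these accuracies are therefore optimistic rather than a clean held-out estimate.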

Python3

def Accuracy_Score(cv_model, X_test, y_test):
    """Return the test accuracy of each fold's model as a percentage string."""
    score = {}
    for i, model in enumerate(cv_model):
        # Make predictions on the test data
        y_pred = model.predict(X_test, prediction_type='Class')
        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        score[i + 1] = str(accuracy * 100) + '%'
    return score

Accuracy_Score(cv_model, X_test, y_test)

Output:

{1: '100.0%', 2: '100.0%', 3: '100.0%', 4: '100.0%', 5: '100.0%'}
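One caveat about these scores: the cross-validation pool was built from the full dataset, so the fold models have already seen most of the test samples during training. The sketch below counts that overlap using random index splits that mimic the sizes used in this tutorial (150 samples, a 20% test set, 5 folds). The actual indices chosen by train_test_split and cv will differ, so this only illustrates the bookkeeping.

```python
import random

rng = random.Random(42)
all_indices = list(range(150))

# Hold out 20% of the indices as a "test set".
rng.shuffle(all_indices)
test_indices = set(all_indices[:30])

# Independently, run 5-fold CV over the FULL dataset.
rng.shuffle(all_indices)
folds = [set(all_indices[i::5]) for i in range(5)]

seen_per_fold = []
for fold_number, validation_fold in enumerate(folds, start=1):
    fold_training = set(all_indices) - validation_fold
    # Test samples that this fold's model was trained on.
    seen = len(test_indices & fold_training)
    seen_per_fold.append(seen)
    print(f"Fold {fold_number}: model trained on {seen} of {len(test_indices)} test samples")
```

A cleaner setup would build the pool from the training split only, for example Pool(X_train, label=y_train), so that X_test stays unseen by every fold model.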

Further Improvements

To further enhance your model, consider:

Feature engineering to create informative features.
Exploring different evaluation metrics for classification tasks, especially if dealing with imbalanced data.
Regularization techniques to prevent overfitting.
Visualizations to gain insights into your data and model’s predictions.

Conclusion

This article covered CatBoost’s parameters and hyperparameters, their effects on the model’s performance, and techniques for tuning them. We also built, cross-validated, and evaluated a CatBoost model for multi-class classification, covering data preparation, model training, and evaluation. By following this tutorial and exploring the further improvements above, you can build reliable and accurate machine learning models for a variety of classification tasks.
