
CatBoost Parameters and Hyperparameters

CatBoost is a popular open-source library for gradient boosting on decision trees. It was created by Yandex and can be applied to a range of machine-learning problems, including classification, regression, and ranking. Compared to other boosting libraries, CatBoost is known for handling categorical features effectively and for strong out-of-the-box performance.

In this post, we concentrate on CatBoost's parameters and hyperparameters, the values that govern how the algorithm operates and performs. We describe them, explain how they affect the model, and show how to tune them for the best results.




CatBoost Parameters

Parameters are the internal values that a model learns during training. For instance, the split points and leaf values in a decision tree are parameters. You do not set these values directly; they are determined by the training data together with the settings you choose before training. CatBoost exposes many such settings for customizing the training process, and they are covered in the next section.

CatBoost Hyperparameters

Hyperparameters are the settings that you, as a machine learning practitioner, must provide before training. They control aspects of the training process such as the depth of the decision trees and the learning rate. Choosing suitable hyperparameters is essential for the model to perform well.
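To make the distinction concrete, here is a minimal sketch that fits a one-split regression tree (a "stump") to toy data. It is not CatBoost code, and the function name `fit_stump` and the data are made up for illustration. The tree depth is fixed by us in advance (a hyperparameter), while the split point and the two leaf values are learned from the data (parameters).

```python
# Toy illustration, not CatBoost: a depth-1 regression tree fit to 1-D data.
# The depth (one split) is a hyperparameter that we choose before training.
# The split point and the two leaf values are parameters learned from the data.

def fit_stump(xs, ys):
    """Return (split, left_value, right_value) minimizing squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        # Candidate split halfway between two neighbouring x values
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= split]
        right = [y for x, y in pairs if x > split]
        if not left or not right:
            continue
        # Each leaf predicts the mean target of the samples that reach it
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((y - left_value) ** 2 for y in left)
                 + sum((y - right_value) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, split, left_value, right_value)
    return best[1:]

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
split, left_value, right_value = fit_stump(xs, ys)
print(split, left_value, right_value)  # 6.5 1.0 5.0
```

Changing the data changes the learned split and leaf values, but the decision to use a single split stays whatever we set it to.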



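CatBoost offers a flexible interface for configuring and fine-tuning hyperparameters, which fall into several broad categories.

Common hyperparameters used for tuning include the number of boosting iterations (iterations), the learning rate (learning_rate), and the tree depth (depth), all of which appear in the implementation later in this article.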
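As a rough sketch, a set of commonly tuned hyperparameters can be collected in a plain dictionary. The names `iterations`, `learning_rate`, `depth`, `loss_function`, and the random seed appear in this article's implementation; `l2_leaf_reg`, `border_count`, and `verbose` are additional CatBoost options that the article does not otherwise use, described here as I understand them. The values are illustrative, not tuned recommendations.

```python
# Illustrative values only; none of these are tuned for a particular dataset.
common_params = {
    "iterations": 500,         # number of boosting rounds (trees)
    "learning_rate": 0.1,      # step size applied to each tree's contribution
    "depth": 6,                # depth of each tree
    "l2_leaf_reg": 3.0,        # L2 regularization strength on leaf values
    "border_count": 254,       # number of split candidates for numeric features
    "loss_function": "MultiClass",
    "random_seed": 42,         # fixes randomness for reproducibility
    "verbose": False,          # suppress per-iteration logging
}

for name, value in common_params.items():
    print(f"{name} = {value}")
```

With CatBoost installed, a dictionary like this can be unpacked into the constructor, for example `CatBoostClassifier(**common_params)`.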

Hyperparameter Tuning

Hyperparameter tuning is the process of selecting the best set of hyperparameters for a given problem and dataset. A variety of techniques and tools are available for it, including grid search, random search, Bayesian optimization, and Optuna. In general, tuning means defining a search space, training and scoring a model for each candidate configuration, and keeping the configuration with the best validation score.
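The general procedure can be sketched as a small grid search in plain Python. This is an illustration of the idea rather than any library's implementation: `grid_search` and `toy_validation_loss` are hypothetical names, and the toy scoring function stands in for "train a model with these settings and return its validation loss".

```python
from itertools import product

def grid_search(search_space, evaluate):
    """Try every combination in search_space; return (best_params, best_score)."""
    names = list(search_space)
    best_params, best_score = None, float("inf")
    for values in product(*(search_space[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # lower is better (a loss)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for training a model and measuring validation loss
def toy_validation_loss(params):
    return (params["depth"] - 4) ** 2 + abs(params["learning_rate"] - 0.1)

search_space = {"depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(search_space, toy_validation_loss)
print(best_params, best_score)  # {'depth': 4, 'learning_rate': 0.1} 0.0
```

Random search and Bayesian optimization follow the same loop but choose which candidates to evaluate differently, which matters when the grid is too large to enumerate.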

Implementation

CatBoost is a gradient boosting library known for its effectiveness in handling categorical features and its strong out-of-the-box performance. In this section, we walk through implementing a CatBoost model for multi-class classification using the Iris dataset. The dataset is available here: Iris

Understanding the Iris Dataset

The Iris dataset is a standard dataset in machine learning. It describes iris flowers from three species (Setosa, Versicolor, and Virginica) using four measurements: sepal length, sepal width, petal length, and petal width. It makes a good case study for multi-class classification.

Installing Packages

!pip install catboost

Importing Necessary Libraries

Let’s begin by importing the Python libraries we’ll need for this project:




#importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

Loading and Preprocessing the Dataset

Load the Iris dataset into a Pandas DataFrame and preprocess it. Ensure there are no missing values and that categorical variables are appropriately encoded.




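# Load the Iris dataset
# NOTE: url must hold the location (URL or local path) of a headerless CSV
# copy of the Iris data. The value below is a placeholder, not a real address.
url = "path/to/iris.data"
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
data = pd.read_csv(url, names=column_names)

# Split the data into features and target variable
X = data.drop('class', axis=1)
y = data['class']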

This code names the columns of the Iris dataset and loads it from the location stored in url. It then separates the data into the features (X), obtained by dropping the 'class' column, and the target variable (y), which is the 'class' column itself.

Splitting Data into Training and Testing Sets

Split your data into training and testing sets to evaluate the model’s performance:




#splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here, the train_test_split function from Scikit-Learn divides the dataset into training and testing sets. Setting test_size=0.2 reserves 20% of the data for testing, and fixing random_state makes the split reproducible. The function returns X_train and X_test for the features and y_train and y_test for the target values.
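To see why a fixed seed makes the split reproducible, here is a simplified sketch of a seeded shuffle-and-split in plain Python. It is not the Scikit-Learn implementation and will not produce the same partition as train_test_split; the function name `simple_train_test_split` is made up for illustration.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then cut off a test portion."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # same seed gives the same order
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(150))                    # stand-in for 150 samples
train_a, test_a = simple_train_test_split(rows)
train_b, test_b = simple_train_test_split(rows)

print(len(train_a), len(test_a))           # 120 30
print(test_a == test_b)                    # True: identical split on both runs
```

Passing a different seed gives a different partition, which is one reason a single split can give a misleadingly good or bad score.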

Building and Training the CatBoost Model

Initialize and train the CatBoost model:




# Initialize the CatBoost classifier
model = CatBoostClassifier(iterations=500, learning_rate=0.1, depth=6, loss_function='MultiClass', random_state=42)
 
# Train the model on the training data
model.fit(X_train, y_train)

Output:

0:    learn: 0.9959553    total: 57.1ms    remaining: 28.5s
1:    learn: 0.8977899    total: 57.9ms    remaining: 14.4s
2:    learn: 0.8160208    total: 58.6ms    remaining: 9.7s
3:    learn: 0.7441595    total: 61.6ms    remaining: 7.64s
4:    learn: 0.6808458    total: 70.6ms    remaining: 6.99s
5:    learn: 0.6365295    total: 71.9ms    remaining: 5.92s
6:    learn: 0.5888323    total: 73ms    remaining: 5.14s
7:    learn: 0.5414839    total: 73.8ms    remaining: 4.54s
8:    learn: 0.5021250    total: 75.2ms    remaining: 4.1s
9:    learn: 0.4721895    total: 81.4ms    remaining: 3.99s
...
491:    learn: 0.0071053    total: 156ms    remaining: 2.53ms
492:    learn: 0.0070795    total: 156ms    remaining: 2.21ms
493:    learn: 0.0070619    total: 156ms    remaining: 1.9ms
494:    learn: 0.0070507    total: 156ms    remaining: 1.58ms
495:    learn: 0.0070360    total: 157ms    remaining: 1.26ms
496:    learn: 0.0070230    total: 157ms    remaining: 946us
497:    learn: 0.0070069    total: 157ms    remaining: 630us
498:    learn: 0.0069942    total: 157ms    remaining: 314us
499:    learn: 0.0069780    total: 157ms    remaining: 0us

This code sets up a CatBoost classifier with the hyperparameters iterations, learning_rate, depth, and loss_function, plus a fixed random_state for reproducibility. The model is then trained on the training data (X_train and y_train) to learn a multiclass classifier. In the output, each line reports the iteration number, the training loss (learn), the elapsed time, and the estimated time remaining.
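The learn column in the log is the training loss. Assuming the MultiClass loss is the usual softmax cross-entropy (my understanding of how CatBoost defines it, reported as a positive value), the quantity can be sketched in plain Python as follows. The function names and the example scores are illustrative.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    peak = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multiclass_log_loss(score_rows, true_classes):
    """Mean negative log-probability assigned to the true class."""
    losses = [-math.log(softmax(scores)[c])
              for scores, c in zip(score_rows, true_classes)]
    return sum(losses) / len(losses)

# A model with no information: equal scores for all three classes
uninformed = multiclass_log_loss([[0.0, 0.0, 0.0]] * 3, [0, 1, 2])

# A model that strongly favours the correct class each time
confident = multiclass_log_loss(
    [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]], [0, 1, 2])

print(round(uninformed, 4))                 # 1.0986, which is ln(3)
print(confident < uninformed)               # True
```

Under that assumption, a loss near ln(3) ≈ 1.0986 corresponds to guessing equally among three classes, and the value falls toward zero as the model assigns more probability to the correct class, which matches the shape of the training log.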

Evaluating Model Performance

Evaluate the model’s performance on the test data:




# Make predictions on the test data
y_pred = model.predict(X_test)
 
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Output:

Accuracy: 100.00%

In this code, the trained CatBoost classifier (model) makes predictions on the test data (X_test). The accuracy is then computed by comparing the true labels (y_test) with the predicted labels (y_pred) and printed as a percentage.
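Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand, along with a count of each (true, predicted) pair, using made-up labels rather than the model's actual output.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Hypothetical labels for illustration only
y_true_demo = ["setosa", "versicolor", "virginica", "virginica"]
y_pred_demo = ["setosa", "versicolor", "versicolor", "virginica"]

demo_accuracy = accuracy(y_true_demo, y_pred_demo)
demo_counts = confusion_counts(y_true_demo, y_pred_demo)
print(f"Accuracy: {demo_accuracy * 100:.2f}%")   # Accuracy: 75.00%
print(demo_counts[("virginica", "versicolor")])   # 1
```

The pair counts show which classes get confused with each other, something a single accuracy figure hides.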

A machine learning model that achieves 100% accuracy is unusual on real-world data and may point to a problem with the model or with the way it is evaluated. Here the test set is small (20% of the data), so a perfect score on a single split is weak evidence of how well the model generalizes.

Therefore, we turn to cross-validation and hyperparameter tuning. Cross-validation gives a more reliable assessment of the model's performance than a single train-test split, and it makes it easier to determine whether the high accuracy is consistent across different subsets of the data.
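The idea behind k-fold cross-validation can be sketched in a few lines of plain Python: every sample is used for validation exactly once and for training in the remaining folds. This sketch uses contiguous, unshuffled folds for simplicity; library implementations, including CatBoost's, may shuffle or stratify the data, so treat it as a conceptual illustration. The name `k_fold_indices` is made up.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(150, 5))
for number, (train, validation) in enumerate(folds, start=1):
    print(f"fold {number}: train={len(train)} validation={len(validation)}")
```

With 150 samples and five folds, each model trains on 120 samples and is validated on the remaining 30.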

Hyperparameter Tuning

Consider using cross-validation to assess your model’s robustness:




from catboost import CatBoostClassifier, Pool, cv

# Create a CatBoost Pool
catboost_pool = Pool(X, label=y)

# Define the parameters for the CatBoost model
params = {
    'iterations': 1000,
    'learning_rate': 0.01,
    'depth': 3,
    'loss_function': 'MultiClass',
    'random_state': 42,
}

# Perform cross-validation using the cv function from CatBoost
cv_results, cv_model = cv(
    pool=catboost_pool,
    params=params,
    # Specify the number of folds for cross-validation
    fold_count=5,
    # Suppress per-iteration logging during training
    verbose=False,
    # Also return the model trained on each fold
    return_models=True
)

Output:

Training on fold [0/5]

bestTest = 0.1903599557
bestIteration = 723

Training on fold [1/5]

bestTest = 0.2019080832
bestIteration = 540

Training on fold [2/5]

bestTest = 0.09307095973
bestIteration = 983

Training on fold [3/5]

bestTest = 0.1257137299
bestIteration = 893

Training on fold [4/5]

bestTest = 0.09728240085
bestIteration = 996
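The per-fold bestTest values above can be summarized with a mean and a standard deviation to see how much the result depends on the fold. The snippet below uses only the five numbers printed in the output; the variable names are illustrative.

```python
from statistics import mean, stdev

# bestTest values copied from the five folds in the output above
fold_best_losses = [
    0.1903599557,
    0.2019080832,
    0.09307095973,
    0.1257137299,
    0.09728240085,
]

average_loss = mean(fold_best_losses)
loss_spread = stdev(fold_best_losses)      # sample standard deviation

print(f"mean best loss: {average_loss:.4f}")   # mean best loss: 0.1417
print(f"std across folds: {loss_spread:.4f}")
print(f"worst / best fold: {max(fold_best_losses) / min(fold_best_losses):.2f}")
```

The best fold's loss is less than half that of the worst fold, which illustrates why a single split can be misleading on a dataset this small.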

Print the Result:




print(cv_results.head())

Output:

   iterations  test-MultiClass-mean  test-MultiClass-std  train-MultiClass-mean  train-MultiClass-std
0           0              1.086702             0.001203               1.086469              0.000294
1           1              1.074234             0.001518               1.074242              0.001409
2           2              1.060712             0.001777               1.060602              0.001765
3           3              1.050879             0.002378               1.050635              0.001235
4           4              1.039454             0.001931               1.039139              0.001284

The code applies cross-validation to a CatBoost model. It begins by constructing a CatBoost Pool, a data structure that holds the features and labels. The params dictionary sets the number of iterations, the learning rate, the tree depth, the loss function ("MultiClass" for multiclass classification), and a random seed for repeatability. The cv function then performs the cross-validation with five folds (fold_count=5), with per-iteration logging turned off (verbose=False) and the model from each fold returned (return_models=True). The next snippet lists the test metric columns in the results and reads the first of them, the mean test loss, at the final iteration. This value is the model's average loss across the folds.




# Check the available metric names in the cross-validation results
available_metrics = [metric for metric in cv_results.columns
                     if metric.startswith('test-')]
print("Available Metrics:", available_metrics)

# Take the first test metric (the mean MultiClass loss) at the final iteration
# MultiClass is a log loss, not a percentage, so print it as a plain number
mean_loss = cv_results[available_metrics[0]].iloc[-1]

print(f"Mean Loss: {mean_loss:.4f}")

Output:

Available Metrics: ['test-MultiClass-mean', 'test-MultiClass-std']
Mean Loss: 0.1460
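The reported value corresponds to a mean log loss of about 0.146, which is not a percentage of errors. One way to interpret it, assuming the loss is the mean negative log-probability of the true class, is to convert it back into a probability: exp(-loss) is the geometric mean of the probability the model assigns to the correct class. The function name below is made up for illustration.

```python
import math

def geometric_mean_true_class_probability(mean_log_loss):
    """Convert a mean log loss into the geometric-mean true-class probability."""
    return math.exp(-mean_log_loss)

final_mean_loss = 0.1460   # final mean test loss from the output above
typical_probability = geometric_mean_true_class_probability(final_mean_loss)
print(f"{typical_probability:.3f}")   # 0.864
```

Read this way, the cross-validated model assigns roughly 86% probability to the correct class on average (in the geometric sense), which is a different statement from being 86% accurate.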

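Evaluate the accuracy of each model

Let's evaluate the accuracy of the model obtained from each fold on the held-out test set. Note that the Pool passed to cv was built from the full dataset (X and y), so each fold's model may have been trained on rows that also appear in X_test. These accuracies are therefore likely to be optimistic.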




def fold_accuracies(cv_model, X_test, y_test):
    """Return the test accuracy of each fold's model, keyed by fold number."""
    score = {}
    for i, model in enumerate(cv_model):
        # Make class predictions on the test data
        y_pred = model.predict(X_test, prediction_type='Class')
        # Calculate accuracy and store it as a percentage string
        accuracy = accuracy_score(y_test, y_pred)
        score[i + 1] = str(accuracy * 100) + '%'
    return score

fold_accuracies(cv_model, X_test, y_test)

Output:

{1: '100.0%', 2: '100.0%', 3: '100.0%', 4: '100.0%', 5: '100.0%'}

Further Improvements

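To further enhance your model, consider searching over the hyperparameters with one of the techniques mentioned earlier, such as grid search, random search, Bayesian optimization, or Optuna.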

Conclusion

This article has covered CatBoost's parameters and hyperparameters, their effect on model performance, and techniques for tuning them. Building a CatBoost model for multi-class classification involves data preparation, model training, evaluation, and hyperparameter tuning. By following this tutorial and exploring further enhancements, you can build reliable and accurate models for a variety of classification tasks.

