CatBoost Cross-Validation and Hyperparameter Tuning

Last Updated : 11 Nov, 2023

CatBoost is a powerful gradient-boosting algorithm that is popular for how effectively it handles categorical features in both classification and regression tasks. To get the most out of CatBoost, it is essential to fine-tune its hyperparameters, and cross-validation is the standard way to do so. Cross-validation allows data scientists and machine learning practitioners to rigorously assess a model's performance under different parameter configurations and select the best hyperparameters. In this article, we discuss how to tune the hyperparameters of CatBoost using cross-validation.

What is CatBoost

CatBoost, or Categorical Boosting, is a machine learning algorithm developed by Yandex, a Russian multinational IT company. It is built on the gradient boosting framework and is designed to handle categorical features more effectively than traditional gradient boosting implementations. CatBoost combines techniques such as ordered boosting, oblivious (symmetric) trees, and built-in categorical encoding to achieve strong performance with minimal hyperparameter tuning. For real-world datasets, however, hyperparameter tuning is still needed to balance training cost and predictive accuracy. In this article, we tune its hyperparameters using cross-validation.
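
To make the categorical handling concrete, here is a minimal, self-contained sketch (the column names and toy data are invented purely for illustration and are not taken from the dataset used later): CatBoost can consume a raw string column directly, with no manual one-hot or label encoding.

Python3

import pandas as pd
from catboost import CatBoostClassifier

# Toy data with a raw string column -- purely illustrative
data = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green', 'blue', 'green'],
    'size':  [1.2, 3.4, 2.2, 0.9, 3.1, 1.8],
    'label': [1, 0, 1, 0, 0, 1]
})

# Declaring 'color' in cat_features lets CatBoost encode it internally
model = CatBoostClassifier(iterations=50, cat_features=['color'], verbose=False)
model.fit(data[['color', 'size']], data['label'])
print(model.predict(data[['color', 'size']]))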

What is Cross-Validation

Cross-validation is a fundamental technique used in machine learning to assess a model's performance while mitigating the risk of overfitting, and to estimate how well the model is likely to generalize to unseen data. The process divides the dataset into multiple subsets or folds, trains the model on the training folds, and evaluates its performance on the remaining validation fold, rotating the role of each fold. Two common variants are k-fold cross-validation and stratified k-fold cross-validation; the stratified version, which preserves the class distribution in every fold, is used in this article (a short sketch of stratified splitting follows the list below). Some key benefits of cross-validation are listed below:

  • Robust Performance Assessment: Cross-validation provides a more reliable estimate of a model's performance because it measures the model's ability to generalize across different data subsets, which helps to detect issues like overfitting.
  • Hyperparameter Tuning: Cross-validation is central to hyperparameter tuning. By evaluating model performance across many parameter combinations, data scientists can identify the hyperparameters that yield the best and most reliable predictions.
  • Effective Use of Data: Cross-validation ensures that every observation is used for both training and validation, which maximizes the utility of the available data.
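
As a quick illustration of the idea (using a synthetic dataset generated only for this sketch), stratified k-fold splitting preserves the class ratio in every fold:

Python3

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced toy data -- for illustration only
X_demo, y_demo = make_classification(n_samples=100, weights=[0.8, 0.2], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo)):
    # Each fold keeps roughly the same class ratio as the full dataset
    print(f"Fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows, "
          f"positives in validation = {y_demo[val_idx].sum()}")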

Why Perform Hyperparameter Tuning

Hyperparameter tuning is the process of systematically searching for the best hyperparameter values for a machine learning model. Its key benefits are listed below:

  1. Improved Model Performance: The right set of hyperparameters can significantly enhance a model's performance, leading to better accuracy and better generalization to new data.
  2. Reduction of Overfitting: Carefully chosen hyperparameters help prevent overfitting, where the model fits the training data too closely and then performs poorly on unseen data.
  3. Increased Robustness: Tuned hyperparameters make a model more resilient to variations in the data and to different problem scenarios, and a well-chosen configuration can also reduce computational cost and training time.
  4. Enhanced Interpretability: Some hyperparameters (tree depth, for example) influence how easy the model is to interpret, and tuning them can make the model's output more understandable and actionable.

Implementation of Cross-validation for Hyperparameter Tuning in CatBoost

Installing required module

First, we need to install the CatBoost module in our runtime.

!pip install catboost

Importing required libraries

Now we will import all the required Python libraries, such as Pandas, Matplotlib, Seaborn, scikit-learn, tabulate, and CatBoost.

Python3




import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from tabulate import tabulate
import seaborn as sns
import matplotlib.pyplot as plt


Loading Dataset

Now we will load the Titanic dataset, select the relevant features for this implementation, and create a list of categorical features that will be fed to the model later on.

Python3




# Load the Titanic dataset
# NOTE: the original snippet uses `url` without defining it; the public
# Data Science Dojo mirror of the Titanic CSV is assumed here
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Select relevant features and target variable
X = df[['Pclass', 'Sex', 'Age', 'Fare']]
y = df['Survived']

# Define categorical features of the dataset
cat_features = ['Pclass', 'Sex']  # List of categorical features


This code loads the Titanic dataset from the specified URL into a DataFrame named df. It then selects the features 'Pclass', 'Sex', 'Age', and 'Fare' and assigns them to the variable X, while the target variable 'Survived' goes into y. Finally, 'Pclass' and 'Sex' are declared as categorical features in cat_features, so that CatBoost can handle them natively rather than requiring manual encoding.
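
One detail worth checking before training: the Titanic 'Age' column typically contains missing values. CatBoost handles NaN in numeric features out of the box, so no imputation is strictly required, but a quick check keeps this explicit (this is a small optional sketch, not part of the original walkthrough):

Python3

# Count missing values in the selected feature columns
print(X.isna().sum())

# CatBoost treats NaN in the numeric columns ('Age', 'Fare') natively.
# If you prefer explicit handling, median imputation is one common option:
# X = X.assign(Age=X['Age'].fillna(X['Age'].median()))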

Exploratory Data Analysis

Exploratory Data Analysis (EDA) helps us gain deeper insights into the dataset. It is a critical first step in any analysis: it summarizes the main characteristics of the data, often using visual methods, and involves uncovering patterns, understanding underlying structure, identifying anomalies, and testing assumptions.

Correlation Matrix

Visualizing the correlation matrix helps us understand how the numeric features of the dataset are correlated with each other.

Python3




# Visualize correlation matrix
correlation_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(6, 4))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()


Output:


Correlation Matrix

This code uses the corr method to compute the correlation matrix of the numeric features in the Titanic dataset. A heatmap is then drawn with the Seaborn library to show the correlations: annot=True prints the correlation values on the heatmap, and the 'coolwarm' color map is used for the visualization.

Cross-Validation Settings

Python3




# Define a range of hyperparameter values to search through
iterations_values = [100, 200, 300]
depth_values = [6, 8, 10]
learning_rate_values = [0.1, 0.05, 0.01]
 
best_score = 0  # Initialize the best score
best_params = {}  # Initialize the best hyperparameters
 
# Define cross-validation settings
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


Here we define the grid of values to search and initialize stratified k-fold cross-validation. We will tune three hyperparameters of the CatBoost model, which are discussed below:

  • iterations: This parameter is used to specify the number of boosting iterations which corresponds to the number of decision trees to be built. It controls the complexity of the model.
  • depth: It defines the depth of each decision tree in the ensemble. A higher depth allows the model to capture more complex relationships present in the data. But it may lead to overfitting if it is set too high.
  • learning_rate: The learning rate determines the step size for gradient descent during model training. A smaller learning rate can help prevent overfitting but may require more iterations to converge.

This code defines the range of hyperparameter values to search. Three lists are created: iterations_values with [100, 200, 300], depth_values with [6, 8, 10], and learning_rate_values with [0.1, 0.05, 0.01]. These lists hold the candidate values for each hyperparameter. To keep track of the best hyperparameter values discovered during the search, the code initializes best_score to 0 and best_params as an empty dictionary.

It also configures cross-validation with StratifiedKFold using 5 splits, shuffling, and a random state of 42. These folds will be used to assess how well the model performs with each combination of hyperparameters during the tuning process.
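
As a side note, the Pool class imported earlier also enables CatBoost's built-in cv utility, which runs the fold loop internally for a single parameter setting. A rough sketch of that alternative is shown below (the exact names of the returned result columns can vary between CatBoost versions, so treat the final print as indicative):

Python3

# Alias the import so it does not shadow the StratifiedKFold object named `cv`
from catboost import Pool, cv as catboost_cv

# Wrap the features, target and categorical feature list in a Pool
pool = Pool(X, y, cat_features=cat_features)

params = {
    'iterations': 200,
    'depth': 6,
    'learning_rate': 0.1,
    'loss_function': 'Logloss',
    'verbose': False
}

# 5-fold stratified cross-validation handled entirely by CatBoost
cv_results = catboost_cv(pool, params, fold_count=5, stratified=True,
                         shuffle=True, partition_random_seed=42)

# The returned DataFrame typically holds per-iteration train/test metric
# means and standard deviations; the last row summarizes the final model
print(cv_results.tail(1))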

Tuning Loop

Python3




# Initialize a list to store tuning progress
tuning_progress = []
 
# Perform hyperparameter tuning with cross-validation
for iterations in iterations_values:
    for depth in depth_values:
        for learning_rate in learning_rate_values:
            # Create a CatBoost model with the current hyperparameters
            model = CatBoostClassifier(iterations=iterations, depth=depth,
                                       learning_rate=learning_rate, cat_features=cat_features, verbose=False)
 
            # Perform cross-validation and get the mean F1 score
            f1_scores = []
            for train_index, val_index in cv.split(X, y):
                X_train, X_val = X.iloc[train_index], X.iloc[val_index]
                y_train, y_val = y.iloc[train_index], y.iloc[val_index]
                model.fit(X_train, y_train)
                y_pred = model.predict(X_val)
                f1 = f1_score(y_val, y_pred)
                f1_scores.append(f1)
 
            mean_f1 = sum(f1_scores) / len(f1_scores)
 
            # Update the best hyperparameters if a better score is found
            if mean_f1 > best_score:
                best_score = mean_f1
                best_params = {
                    'iterations': iterations,
                    'depth': depth,
                    'learning_rate': learning_rate
                }
 
            # Append the progress to the list
            tuning_progress.append({
                'Iterations': iterations,
                'Depth': depth,
                'Learning Rate': learning_rate,
                'F1 Score': mean_f1
            })


  • This code uses cross-validation to find the best hyperparameter combination for a CatBoostClassifier. A step-by-step description follows.
  • tuning_progress is first created as an empty list to record the progress of the search, including the F1 score for every combination of hyperparameters. A nested loop then iterates over the grid: iterations over [100, 200, 300], depth over [6, 8, 10], and learning_rate over [0.1, 0.05, 0.01].
  • Inside the nested loops, a CatBoostClassifier is built with the current hyperparameters. It receives the number of iterations, the tree depth, the learning rate, and the categorical features defined in cat_features. To suppress training output, the model is set to verbose=False.
  • Cross-validation with the F1 score is used to assess the model's performance. The previously defined StratifiedKFold splits the data into training and validation sets; for each fold, the model is trained on the training data (X_train and y_train) and then used to predict the validation data (X_val). Each fold's F1 score is computed and stored in the f1_scores list.
  • The mean F1 score is obtained by averaging the scores across the folds; it summarizes how well the model performs with the current hyperparameters.
  • If the mean F1 score is higher than the best_score observed so far, best_params and best_score are updated accordingly.
  • The tuning_progress list records the current hyperparameters (iterations, depth, and learning rate) together with the corresponding mean F1 score, tracking the progress of the search. A more compact alternative using scikit-learn's GridSearchCV is sketched after this list.
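
Because CatBoostClassifier follows the scikit-learn estimator interface, the same grid search can be written more compactly with GridSearchCV. This is only a rough equivalent of the manual loops above (it was not used to produce the results shown below), but it illustrates the pattern:

Python3

from sklearn.model_selection import GridSearchCV

param_grid = {
    'iterations': [100, 200, 300],
    'depth': [6, 8, 10],
    'learning_rate': [0.1, 0.05, 0.01]
}

# cat_features and verbosity are fixed on the estimator itself
grid = GridSearchCV(
    estimator=CatBoostClassifier(cat_features=cat_features, verbose=False),
    param_grid=param_grid,
    scoring='f1',
    cv=cv,      # the StratifiedKFold object defined earlier
    n_jobs=1
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)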

Visualization of Tuning Progress

Now we will display the tuning progress and extract the best hyperparameter values.

Python3




# Print the tuning progress in a table
print(tabulate(tuning_progress, headers='keys', tablefmt='pretty'))
 
# Print the best hyperparameters and F1 score
print("Best Hyperparameters:", best_params)


Output:

+------------+-------+---------------+--------------------+
| Iterations | Depth | Learning Rate |      F1 Score      |
+------------+-------+---------------+--------------------+
|    100     |   6   |      0.1      | 0.7322687453324533 |
|    100     |   6   |     0.05      | 0.7182202485927607 |
|    100     |   6   |     0.01      | 0.7158252814552029 |
|    100     |   8   |      0.1      | 0.740413070492519  |
|    100     |   8   |     0.05      | 0.7273177220983926 |
|    100     |   8   |     0.01      | 0.7130408178567857 |
|    100     |  10   |      0.1      | 0.7421390513453284 |
|    100     |  10   |     0.05      | 0.7227720134780492 |
|    100     |  10   |     0.01      | 0.714975371850372  |
|    200     |   6   |      0.1      | 0.7691377296011834 |
|    200     |   6   |     0.05      | 0.7455641270757373 |
|    200     |   6   |     0.01      | 0.7152601973003904 |
|    200     |   8   |      0.1      | 0.7721211161834263 |
|    200     |   8   |     0.05      | 0.7562464661771585 |
|    200     |   8   |     0.01      | 0.726428330128534  |
|    200     |  10   |      0.1      | 0.782297131444335  |
|    200     |  10   |     0.05      | 0.7702156025135478 |
|    200     |  10   |     0.01      | 0.723850994293308  |
|    300     |   6   |      0.1      | 0.7669845192385327 |
|    300     |   6   |     0.05      | 0.7658800713486457 |
|    300     |   6   |     0.01      | 0.721959083713663  |
|    300     |   8   |      0.1      | 0.7743234942561392 |
|    300     |   8   |     0.05      | 0.7687311053890516 |
|    300     |   8   |     0.01      | 0.7276006304501964 |
|    300     |  10   |      0.1      | 0.7742531262139627 |
|    300     |  10   |     0.05      | 0.7655710411482259 |
|    300     |  10   |     0.01      | 0.7364337989615153 |
+------------+-------+---------------+--------------------+
Best Hyperparameters: {'iterations': 200, 'depth': 10, 'learning_rate': 0.1}

This code uses the tabulate function to display the hyperparameter tuning progress as a table. The table lists each combination of hyperparameters together with the corresponding mean F1 score, which helps to visualize the tuning process.

The optimal hyperparameters (best_params) are then printed. They represent the CatBoost configuration that achieved the highest mean F1 score during tuning, and this information is essential for understanding the model's behavior and guiding the final model fit.

Evaluation of the best model

We have already extracted the best parameters. Now we will feed them to the model and check its performance.

Python3




# Train the model on best parameters
best_model = CatBoostClassifier(**best_params, cat_features=cat_features, verbose=False)
best_model.fit(X, y)
 
# Make predictions
y_pred = best_model.predict(X)
 
# Calculate accuracy and F1 score for best model
accuracy = accuracy_score(y, y_pred)
f1 = f1_score(y, y_pred)
 
print(" Accuracy:", accuracy)
print(" F1 Score:", f1)


Output:

 Accuracy: 0.9267192784667418
F1 Score: 0.8995363214837713

This code builds the final CatBoost model with the hyperparameters (best_params) obtained during the tuning procedure. The model is trained silently (verbose=False) with the specified categorical features, fitted on the complete dataset, and then used to make predictions on that same dataset.

The accuracy and F1 score of this best model are then computed and printed. Accuracy measures the proportion of correctly predicted instances, while the F1 score balances precision and recall, giving a more rounded view of classification performance.
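
Because these metrics are computed on the same data the model was trained on, they are optimistic. A less biased check, reusing the train_test_split imported at the start, could look like the sketch below (this is an extra sanity check rather than part of the original run, so its numbers will differ from the output above):

Python3

# Hold out 20% of the data for an unbiased evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

holdout_model = CatBoostClassifier(**best_params, cat_features=cat_features, verbose=False)
holdout_model.fit(X_train, y_train)

y_test_pred = holdout_model.predict(X_test)
print("Hold-out Accuracy:", accuracy_score(y_test, y_test_pred))
print("Hold-out F1 Score:", f1_score(y_test, y_test_pred))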

Conclusion

We can conclude that hyperparameter tuning is essential for getting the best performance out of a model. Here, after tuning, our model achieved a notable 92.67% accuracy and an 89.95% F1 score on the training data. There is still room for improvement: we could include more hyperparameters in the search (for example, l2_leaf_reg or border_count), and for large real-world datasets it is worth searching over a broader grid to find a near-optimal configuration.


