
CatBoost Regularization Parameters

CatBoost, developed by Yandex, is a powerful open-source gradient boosting library designed to tackle categorical feature handling and deliver high-performance machine learning models. It stands out for its ability to handle categorical variables natively, without requiring extensive preprocessing. This feature simplifies the workflow and preserves valuable information, making it an attractive choice for real-world applications.

What is Regularization?

Regularization is a technique used in machine learning to prevent models from fitting the training data too closely. Overfitting occurs when a model learns the noise in the training data, leading to poor generalization to unseen data. Regularization parameters act as constraints on the model's complexity, discouraging it from memorizing that noise.



CatBoost Regularization Parameters

CatBoost offers several regularization parameters, each designed to control a specific aspect of model complexity. Let’s explore some of the most commonly used CatBoost regularization parameters:

1. L2 Regularization (reg_lambda)

L2 regularization, also known as ridge regularization, adds a penalty term to the loss function based on the L2 norm of the model's weights. This discourages the model from assigning too much importance to any one feature. The reg_lambda parameter (an alias for CatBoost's l2_leaf_reg) controls the strength of this regularization; higher values lead to stronger regularization.



L(θ) = L₀(θ) + λ‖θ‖₂²

Here, L₀(θ) is the original loss function, λ is the regularization strength, and ‖θ‖₂² is the squared L2 norm of the model parameters.
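As a minimal sketch (using names defined later in this article's implementation), the penalty strength can be set when constructing the model; reg_lambda is accepted by CatBoost as an alias for its l2_leaf_reg parameter:

from catboost import CatBoostClassifier

# Larger reg_lambda -> stronger penalty on leaf values -> more conservative model
model_l2 = CatBoostClassifier(reg_lambda=10, iterations=100, verbose=False)
# model_l2.fit(X_train, y_train)  # X_train, y_train are created in the implementation below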

2. L1 Regularization (reg_alpha)

L1 regularization, also known as lasso regularization, adds a penalty term based on the L1 norm of the model’s weights. It encourages feature selection by pushing some weights to exactly zero. The reg_alpha parameter controls the strength of this regularization.

L(θ) = L₀(θ) + λ‖θ‖₁

Here, L₀(θ) is the original loss function, λ is the regularization strength, and ‖θ‖₁ is the L1 norm of the model parameters.

3. Max Depth (max_depth)

The max_depth parameter (an alias for CatBoost's depth parameter) controls the maximum depth of trees in the CatBoost ensemble. Limiting tree depth is a form of regularization, as it prevents the model from building overly complex trees that fit noise in the data.


T(x) = Σₖ fₖ · 1[x ∈ Rₖ], with at most 2^d leaf regions for a tree of depth d

Here, T(x) is the tree's prediction, d is the depth, Rₖ are the regions defined by the decision nodes, and fₖ are the values associated with each region.
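For illustration, a short sketch of depth as a regularizer (6 is CatBoost's default; smaller values give simpler trees):

from catboost import CatBoostClassifier

# Shallower trees are a structural form of regularization
model_shallow = CatBoostClassifier(depth=4, iterations=100, verbose=False)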

4. Min Child Samples (min_child_samples)

This parameter sets the minimum number of samples required to split a node. In CatBoost it is exposed as min_data_in_leaf (with min_child_samples as an alias). Increasing it can prevent overfitting by ensuring that a node is split only when it has a minimum amount of data.

A node is split only if the number of samples it contains, n_node, satisfies n_node ≥ n_min. Here, n_min is the specified minimum number of samples.
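A minimal sketch of this constraint in CatBoost, where min_data_in_leaf takes effect under the Depthwise or Lossguide tree-growing policies:

from catboost import CatBoostClassifier

# min_data_in_leaf applies only with non-symmetric grow policies
model_min = CatBoostClassifier(
    grow_policy='Depthwise',
    min_data_in_leaf=20,   # every leaf must contain at least 20 samples
    iterations=100,
    verbose=False
)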

5. Colsample Bylevel (colsample_bylevel) and Colsample Bytree (colsample_bytree)

These parameters control the fraction of features considered when building each level of a tree (colsample_bylevel) and each tree in the ensemble (colsample_bytree). Reducing these values adds regularization by making the model less sensitive to any individual feature. Note that in CatBoost's Python API, colsample_bylevel is an alias for the rsm parameter described next; per-tree column sampling via colsample_bytree is more commonly associated with libraries such as XGBoost and LightGBM.

6. rsm (Random Selection Rate)

It specifies the fraction of features to be randomly chosen for each tree. Introducing randomness in feature selection is a form of regularization. It prevents the model from relying too heavily on specific features, enhancing generalization by making the model more robust to variations in the dataset.

The random selection rate, denoted by p, specifies the fraction of features to be randomly chosen for each tree:

m′ = p × m

Here, m is the total number of features, and m′ is the number of features randomly selected for a particular tree.
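For instance, with p = 0.8 and m = 8 features (as in the diabetes dataset used below), roughly m′ = 0.8 × 8 ≈ 6 features are sampled at a time.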

7. leaf_estimation_method

This parameter determines the method used to calculate values in leaves. Setting it to ‘Newton’ enables the use of Newton-Raphson’s method for leaf value calculation, which can provide better generalization and regularization.

fₖ = −gₖ / (Hₖ + λ)

Here, fₖ is the leaf value, gₖ is the sum of first-order gradients over the leaf, Hₖ is the corresponding second-order (Hessian) sum, and λ is the regularization term.
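A short sketch of enabling Newton leaf estimation:

from catboost import CatBoostClassifier

# Use second-order (Newton-Raphson) updates when computing leaf values
model_newton = CatBoostClassifier(leaf_estimation_method='Newton', iterations=100, verbose=False)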

Choosing the Right Regularization Parameters

Choosing the appropriate regularization parameters for your CatBoost model requires a balance between bias and variance. If the model is too complex (low regularization), it might fit the noise in the training data, leading to poor generalization. On the other hand, if the model is too simple (high regularization), it might underfit and fail to capture important patterns in the data.

A common practice is to use techniques like cross-validation and grid search to find the optimal combination of hyperparameters, including regularization parameters. Cross-validation helps in estimating how well a model will generalize to an independent dataset, and grid search exhaustively tries all possible hyperparameter combinations to find the best set of parameters.
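As a sketch of this workflow, CatBoost provides a built-in grid_search method that cross-validates parameter combinations (the grid values below are illustrative, and X_train/y_train come from the implementation that follows):

from catboost import CatBoostClassifier

param_grid = {
    'depth': [4, 6, 8],
    'l2_leaf_reg': [1, 3, 9],
    'rsm': [0.6, 0.8, 1.0]
}

search_model = CatBoostClassifier(iterations=100, verbose=False)
# Cross-validated search over the grid; uncomment once X_train/y_train exist
# results = search_model.grid_search(param_grid, X=X_train, y=y_train, cv=3)
# results['params'] then holds the best combination found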

Implementation of Regularization Parameters in CatBoost

Let’s implement CatBoost with various regularization parameters in Python.

Importing Libraries

import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Dataset Loading and Splitting

We load the dataset from a CSV file for diabetes prediction. The dataset is split into 8 features (BMI, insulin level, age, etc.) and the target variable (Outcome: whether the patient has diabetes or not). We further split the data into training and testing sets using train_test_split, with 80% of the data used for training and 20% for testing. random_state ensures reproducibility.

# Load the dataset
df = pd.read_csv('diabetes.csv')
 
# Separate features and target variable
X = df.drop('Outcome', axis=1)
y = df['Outcome']
 
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Creating CatBoost Pools

# Creating a CatBoost Pool for training and testing data
train_pool = Pool(data=X_train, label=y_train)
test_pool = Pool(data=X_test, label=y_test)


CatBoost operates on a data structure called Pool. Here, we create train_pool and test_pool to efficiently handle the training and testing data.
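The diabetes dataset has no categorical columns, so none are declared here. For datasets that do, Pool accepts a cat_features argument so CatBoost can encode those columns natively (the column names below are hypothetical):

# Hypothetical example: telling CatBoost which columns are categorical
# cat_pool = Pool(data=X_train, label=y_train, cat_features=['gender', 'smoker'])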

Defining CatBoost Parameters

# Defining CatBoost parameters with regularization
params = {
    'depth': 6,                    # Depth of the trees
    'learning_rate': 0.1,          # Learning rate of the model
    'l2_leaf_reg': 3,              # L2 regularization term on leaf values
    'rsm': 0.8,                    # Random selection rate for features
    'iterations': 100,             # Number of boosting iterations
    'loss_function': 'MultiClass', # Also handles this binary target; 'Logloss' is the usual binary choice
    'eval_metric': 'Accuracy',     # Evaluation metric
    'random_seed': 42              # Random seed for reproducibility
}


We define a dictionary params containing the CatBoost parameters, including the regularization-related settings depth, l2_leaf_reg, and rsm discussed above.

Training the Model

# Training the CatBoost model
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=test_pool, verbose=50)


A CatBoostClassifier is instantiated with the specified parameters, and it’s trained using the fit() method. The train_pool is used as the training data, and eval_set is set to test_pool for validation during training. verbose=50 specifies that training progress will be printed every 50 iterations.
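Alongside the explicit penalties, CatBoost's overfitting detector can serve as an additional safeguard; a hedged sketch using the early_stopping_rounds argument of fit():

# Stop training once the eval-set metric fails to improve for 20 iterations
model_es = CatBoostClassifier(**params)
model_es.fit(train_pool, eval_set=test_pool, early_stopping_rounds=20, verbose=50)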

Model Evaluation

# Making predictions on the test data
predictions = model.predict(test_pool)
 
# Calculating accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: {:.2f}%".format(accuracy * 100))


Output:

Accuracy: 78.57%

The trained model is used to make predictions on the test data. The predictions are then compared with the actual labels, and accuracy is calculated using accuracy_score().

Implementation of CatBoost without using Regularization Parameters

# CatBoost model without regularization
model_no_reg = CatBoostClassifier()
model_no_reg.fit(X_train, y_train)
accuracy_no_reg = accuracy_score(y_test, model_no_reg.predict(X_test))
print("Accuracy without regularization: {:.2f}%".format(accuracy_no_reg*100))


Output:

Accuracy without regularization: 75.32%

In this case, accuracy is 75.32%, meaning that 75.32% of the test samples were classified correctly, lower than the 78.57% obtained with the regularized CatBoost model.

Advantages of Regularization in CatBoost

Regularization techniques in CatBoost offer several advantages that contribute to the development of accurate and robust predictive models:

- They prevent overfitting by constraining model complexity, so the model learns patterns rather than noise.
- They improve generalization to unseen data by helping balance bias and variance.
- L1 regularization can push some weights to exactly zero, encouraging implicit feature selection.
- Randomized feature selection (rsm) makes the model less dependent on any single feature and more robust to variations in the dataset.

Conclusion

CatBoost’s regularization parameters are essential tools for preventing overfitting and building more robust machine learning models. By striking the right balance between bias and variance, you can create models that are not only accurate on the training data but also perform well on unseen data, making them valuable tools for real-world applications.

