
CatBoost Tree Parameters

Last Updated : 20 Oct, 2023

CatBoost is a popular gradient-boosting library known for its effectiveness in machine-learning competitions. It is particularly well-suited for tabular data and has several parameters that can be tuned to improve model performance. In this article, we will focus on CatBoost’s tree-related parameters and explore how they influence the model’s behaviour.

CatBoost

CatBoost, short for Categorical Boosting, is a gradient-boosting algorithm developed by Yandex. It is designed to handle categorical features effectively without the need for extensive preprocessing, and it is known for its robustness, speed, and competitive performance across a wide range of machine-learning tasks.

Working of CatBoost

CatBoost is a supervised machine learning technique that uses decision trees for both regression and classification. As its name implies, its two key characteristics are gradient boosting (the "Boost") and native handling of categorical data (the "Cat"). Gradient boosting is a method in which several decision trees are built iteratively, each successive tree improving on the output of the previous ones and yielding progressively better results. CatBoost refines the original gradient-boosting technique for a faster implementation.

CatBoost addresses a drawback of existing decision-tree-based approaches by eliminating the need to preprocess the data, converting categorical string variables into numerical values, one-hot encodings, and so on. It can consume a mix of categorical and non-categorical explanatory variables directly; the preprocessing is part of the algorithm. To encode categorical features, CatBoost employs a technique known as ordered encoding: the categorical value in each row is replaced with a statistic computed from the target values of the rows preceding that data point. Another distinctive feature of CatBoost is its use of symmetric trees, in which every decision node at a given depth level employs the identical split condition.
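
To make this concrete, here is a minimal sketch (with hypothetical toy data) of passing a categorical column to CatBoost directly via the cat_features argument, with no manual encoding step:

Python

from catboost import CatBoostClassifier, Pool

# Hypothetical toy data: one categorical column ("color") and one numeric column
X = [["red", 1.0], ["blue", 2.5], ["green", 0.3], ["blue", 1.7]]
y = [0, 1, 0, 1]

# Declaring column 0 as categorical lets CatBoost apply its ordered
# encoding internally -- no one-hot or label encoding is required
train_pool = Pool(data=X, label=y, cat_features=[0])

model = CatBoostClassifier(iterations=10, verbose=0)
model.fit(train_pool)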

Tree Parameters in CatBoost

CatBoost provides a variety of parameters that allow you to control the behavior of decision trees. These parameters influence the depth of trees, regularization, and other aspects of the boosting process. Let’s explore some of the most important tree-related parameters:

1. depth (alias: max_depth)

The `depth` parameter in gradient boosting algorithms, including CatBoost, plays a crucial role in controlling the complexity of individual decision trees within the ensemble. It determines the maximum depth that each tree can grow to during the training process. A deeper tree can capture more intricate and detailed patterns in the training data, potentially leading to a better fit. However, it also increases the risk of overfitting, where the model becomes too specialized to the training data and performs poorly on unseen data.

When setting the `depth` parameter, you need to strike a balance between model complexity and generalization. If you set it too high, the model may fit noise in the data, making it less effective for predictions on new data. Conversely, if you set it too low, the model might not capture essential patterns, resulting in underfitting. Therefore, it’s essential to experiment with different `depth` values based on the complexity of your dataset and use techniques like cross-validation to find the optimal depth that achieves the best trade-off between model complexity and generalization performance.

Python

from catboost import CatBoostClassifier
 
# Create a CatBoostClassifier with a custom depth
model = CatBoostClassifier(iterations=500, depth=8)
model.get_params()


Output:

{'iterations': 500, 'depth': 8}
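
To act on the tuning advice above, here is a minimal sketch (using the Iris dataset and CatBoost's scikit-learn-compatible interface; the candidate depths and cv=3 are illustrative) that compares several depth values with cross-validation:

Python

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from catboost import CatBoostClassifier

X, y = load_iris(return_X_y=True)

# Compare mean cross-validated accuracy across several depths;
# the best value is dataset-dependent
for depth in [2, 4, 6, 8]:
    model = CatBoostClassifier(iterations=100, depth=depth, verbose=0)
    scores = cross_val_score(model, X, y, cv=3)
    print(f"depth={depth}: mean accuracy = {scores.mean():.4f}")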

2. learning_rate

The `learning_rate` is a critical hyperparameter in gradient boosting algorithms, including CatBoost, as it controls the step size taken during each iteration of the training process. This step size influences how quickly or slowly the model converges to the optimal solution while minimizing the loss function. A lower learning rate implies smaller steps, which can result in more precise convergence and better performance. However, it also makes the training process slower, as the algorithm takes smaller steps to find the optimal solution.

Choosing an appropriate learning rate is essential: a learning rate that is too high might cause the model to overshoot the minimum of the loss function and fail to converge, while one that is too low can lead to extremely slow training or getting stuck in suboptimal solutions. It is common practice to experiment with different learning rates and monitor the training process, using techniques such as learning-rate schedules or early stopping, to strike the right balance between training speed and convergence quality.

Python

from catboost import CatBoostClassifier
 
# Create a CatBoostClassifier with a custom learning rate
model = CatBoostClassifier(iterations=500, depth=8, learning_rate=0.1)
model.get_params()


Output:

{'iterations': 500, 'learning_rate': 0.1, 'depth': 8}
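
To illustrate the early-stopping approach mentioned above, here is a minimal sketch (Iris data; the learning rate, iteration budget, and patience are illustrative) that pairs a small learning rate with a large iteration budget and lets a validation set stop training once the metric stops improving:

Python

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# A smaller learning rate usually needs more iterations; early stopping
# halts training once the validation metric stops improving
model = CatBoostClassifier(iterations=1000, learning_rate=0.03, verbose=0)
model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50)
print("Trees actually built:", model.tree_count_)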

3. l2_leaf_reg

The `l2_leaf_reg` parameter in CatBoost is responsible for controlling L2 regularization specifically applied to the leaf values of the decision trees within the ensemble. Regularization is a crucial technique used in machine learning to prevent overfitting, which occurs when a model fits the training data too closely and captures noise rather than general patterns.

In the context of CatBoost, L2 regularization for leaf values adds a penalty term to the loss function during training. This penalty term grows with the magnitude of the leaf values, so increasing the `l2_leaf_reg` value applies stronger regularization, effectively discouraging the trees from producing large, overconfident leaf predictions.

When you set a higher `l2_leaf_reg`, you introduce a stronger regularization effect, which can help prevent the model from fitting the training data too closely. This can be especially useful when dealing with noisy or small datasets, as it reduces the risk of the model memorizing noise and producing poor generalization to new, unseen data.

However, it’s essential to strike a balance when tuning this parameter. While stronger regularization can prevent overfitting, setting it too high might result in underfitting, where the model becomes too simple to capture essential patterns in the data. Therefore, it’s advisable to experiment with different `l2_leaf_reg` values and use techniques like cross-validation to find the optimal regularization strength for your specific dataset and problem.

Python

from catboost import CatBoostClassifier
 
# Create a CatBoostClassifier with a custom l2_leaf_reg
model = CatBoostClassifier(iterations=500,
                           depth=8,
                           l2_leaf_reg=5,
                           learning_rate=0.1)
model.get_params()


Output:

{'iterations': 500, 'learning_rate': 0.1, 'depth': 8, 'l2_leaf_reg': 5}
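
As a minimal sketch of the experimentation suggested above (Iris data; the candidate values are illustrative), you can sweep several l2_leaf_reg values and compare accuracy on a held-out set:

Python

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Compare held-out accuracy for several regularization strengths
for reg in [1, 3, 5, 10]:
    model = CatBoostClassifier(iterations=100, l2_leaf_reg=reg, verbose=0)
    model.fit(X_train, y_train)
    print(f"l2_leaf_reg={reg}: validation accuracy = {model.score(X_val, y_val):.4f}")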

4. verbose

The verbose parameter in CatBoost determines how much logging information is displayed during the training process. It plays a crucial role in controlling the amount of feedback and progress updates you receive while training a CatBoost model. The verbose parameter accepts boolean or integer values, which CatBoost interprets as follows:

  • verbose=0 (or False) means no logging during training. You won't see progress updates, and the training process will be silent.
  • verbose=1 (or True) prints progress information for every boosting iteration, such as the current training loss, the elapsed time, and the estimated remaining time.
  • An integer greater than 1 sets the logging period: for example, verbose=10 prints a progress line once every 10 iterations, which keeps the log compact during long training runs.

The choice of the verbose value depends on your preference and the specific needs of your training process. If you want to closely monitor the progress and performance of your model during training, print every iteration. For large-scale training processes, or when you simply want to train the model without too much distraction, use a larger logging period or set verbose to 0 for a completely silent training experience. Adjusting the verbose parameter lets you strike the right balance between information and simplicity during model training.

Here’s a code example with different verbose settings:

Python

import numpy as np
from catboost import CatBoostClassifier, Pool
 
# Sample data
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
y = np.array([0, 1, 0])
 
# Create a CatBoost Pool for efficient data handling
train_pool = Pool(data=X, label=y, cat_features=[])
 
# Define different verbose settings
verbose_settings = [0, 1, 2, 3]
 
# Train CatBoost models with different verbose settings
for verbose_value in verbose_settings:
    model = CatBoostClassifier(iterations=10,
                               depth=8,
                               l2_leaf_reg=5,
                               learning_rate=0.1,
                               verbose=verbose_value)
    model.fit(train_pool)
 
    print(f"Verbose Setting {verbose_value}:")
    print(f"Number of Trees: {model.tree_count_}")
    print(f"Best Iteration: {model.best_iteration_}")


Output:

Verbose Setting 0:
Number of Trees: 10
Best Iteration: None
0: learn: 0.6883950 total: 130us remaining: 1.17ms
1: learn: 0.6836849 total: 179us remaining: 719us
2: learn: 0.6821274 total: 213us remaining: 498us
3: learn: 0.6774739 total: 260us remaining: 390us
4: learn: 0.6728686 total: 294us remaining: 294us
5: learn: 0.6683083 total: 334us remaining: 223us
6: learn: 0.6637918 total: 370us remaining: 158us
7: learn: 0.6593127 total: 414us remaining: 103us
8: learn: 0.6548787 total: 453us remaining: 50us
9: learn: 0.6504865 total: 485us remaining: 0us
Verbose Setting 1:
Number of Trees: 10
Best Iteration: None
0: learn: 0.6883950 total: 120us remaining: 1.08ms
2: learn: 0.6821274 total: 203us remaining: 475us
4: learn: 0.6728686 total: 272us remaining: 272us
6: learn: 0.6637918 total: 352us remaining: 150us
8: learn: 0.6548787 total: 417us remaining: 46us
9: learn: 0.6504865 total: 452us remaining: 0us
Verbose Setting 2:
Number of Trees: 10
Best Iteration: None
0: learn: 0.6883950 total: 83us remaining: 751us
3: learn: 0.6774739 total: 199us remaining: 299us
6: learn: 0.6637918 total: 306us remaining: 131us
9: learn: 0.6504865 total: 429us remaining: 0us
Verbose Setting 3:
Number of Trees: 10
Best Iteration: None

In this code, we're using a small sample dataset for simplicity. We then create a CatBoost Pool to handle the data efficiently, iterate through different verbose settings (0, 1, 2, 3), and train a CatBoost model with each setting. Note that each model's training log is printed while fit runs, so in the combined output the log lines appear between the summary lines of consecutive settings.

  1. verbose=0: no output from the training process itself.
  2. verbose=1: progress information is printed for every iteration.
  3. verbose=2: progress information is printed every second iteration.
  4. verbose=3: progress information is printed every third iteration.

After training each model, we print the number of trees (model.tree_count_) and the best iteration (model.best_iteration_); the latter is None here because no evaluation set was supplied during training.

5. loss_function

The loss_function parameter in CatBoost is a crucial parameter that allows you to specify the loss function to be used during training. The choice of the loss function is a fundamental decision because it determines how the model’s performance is measured and optimized during the training process.

CatBoost supports a variety of loss functions tailored for different types of machine learning tasks. Some commonly used loss functions in CatBoost include:

  1. Logloss (Cross-Entropy Loss): This is the default loss function for binary classification tasks. It measures the dissimilarity between the predicted probabilities and the actual binary class labels. (For multiclass problems, CatBoost uses the MultiClass loss by default.)
  2. RMSE (Root Mean Square Error): This is the default loss function for regression tasks. It measures the average squared difference between the predicted continuous values and the actual target values. It is commonly used in regression problems when the target variable is continuous.
  3. MAE (Mean Absolute Error): Another loss function for regression tasks, MAE measures the average absolute difference between the predicted values and the actual target values. It is robust to outliers and provides a more interpretable measure of error.
  4. Poisson Loss: Suitable for count data and regression tasks where the target variable follows a Poisson distribution.
  5. Quantile Loss: Useful for quantile regression, where you want to predict specific quantiles of the target distribution rather than a single point estimate.

The choice of the loss function depends on the nature of your machine learning problem. For classification, you would typically use ‘Logloss,’ while for regression, ‘RMSE’ or ‘MAE’ are common choices. However, the flexibility to specify different loss functions makes CatBoost adaptable to a wide range of tasks, including those with specialized requirements.

Python

from catboost import CatBoostRegressor
 
# Create a CatBoostRegressor with a custom loss function ('MAE' is a regression loss)
model = CatBoostRegressor(iterations=500, loss_function='MAE')
model.get_params()


Output:

{'iterations': 500, 'loss_function': 'MAE'}
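
For completeness, here is a minimal sketch of the quantile loss mentioned above, using hypothetical synthetic regression data; CatBoost encodes the target quantile directly in the loss-function string:

Python

import numpy as np
from catboost import CatBoostRegressor

# Hypothetical synthetic regression data
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X.sum(axis=1) + rng.normal(scale=0.1, size=200)

# Predict the 90th percentile of the target rather than a single point estimate
model = CatBoostRegressor(iterations=200, loss_function='Quantile:alpha=0.9', verbose=0)
model.fit(X, y)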

6. custom_metric

The custom_metric parameter in CatBoost is a powerful tool that enables you to define and track additional evaluation metrics during the model training process. These custom metrics go beyond the primary loss function and provide valuable insights into the model’s performance from various angles. Here’s how it works:

  • Specify Metric Names: To use custom metrics, you pass a list of metric names as strings to the custom_metric parameter. These metric names correspond to the evaluation criteria you want to track. For example, if you’re working on a classification problem, you might want to track metrics like “AUC” (Area Under the ROC Curve) or “F1 Score” in addition to the default “Logloss” metric.
  • Calculation and Reporting: CatBoost will automatically calculate and report the specified custom metrics during the training process. It evaluates these metrics on both the training and validation datasets, providing insights into how well the model is performing with respect to your chosen criteria.
  • Monitoring Model Performance: By tracking custom metrics, you can closely monitor specific aspects of your model’s performance that are most relevant to your problem domain. This can be especially useful when you have domain-specific requirements or when you want to optimize the model for a particular aspect of performance.

Python

from catboost import CatBoostClassifier
 
# Create a CatBoostClassifier with custom evaluation metrics
model = CatBoostClassifier(iterations=500, custom_metric=['Accuracy', 'AUC'])
model.get_params()


Output:

{'iterations': 500, 'custom_metric': ['Accuracy', 'AUC']}
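
To read the tracked metrics back rather than just the configured parameters, here is a minimal sketch (Iris data reduced to a binary task, and assuming the default eval-set name 'validation' in the results dictionary) that fits with a validation set and inspects the per-iteration metric values via get_evals_result():

Python

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

X, y = load_iris(return_X_y=True)
y = (y == 0).astype(int)  # reduce to a binary task, as in the example later on
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

model = CatBoostClassifier(iterations=100, custom_metric=['Accuracy', 'AUC'], verbose=0)
model.fit(X_train, y_train, eval_set=(X_val, y_val))

# get_evals_result() holds the tracked metrics for every iteration
evals = model.get_evals_result()
print("Final validation Accuracy:", evals['validation']['Accuracy'][-1])
print("Final validation AUC:", evals['validation']['AUC'][-1])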

7. random_seed

The random_seed parameter in CatBoost is a crucial tool for ensuring the reproducibility of your machine learning experiments. When you set a specific random seed value, you’re essentially fixing the initial conditions of the random processes used in CatBoost. Here’s how it works:

  • Reproducibility: Machine learning models often involve elements of randomness, such as random initialization of weights or data shuffling. Without setting a random seed, different runs of your model might yield slightly different results due to these random factors. By setting random_seed to a specific value (an integer), you ensure that these random processes start from the same initial state in every run.
  • Consistency: When you need to compare model performance, debug issues, or share your work with others, having consistent results across different runs is essential. By using the same random seed, you can achieve this consistency and make your experiments more transparent and reliable.

Python

from catboost import CatBoostClassifier
 
# Create a CatBoostClassifier with a specific random seed (e.g., 42)
model = CatBoostClassifier(iterations=500, random_seed=42)
model.get_params()


Output:

{'iterations': 500, 'random_seed': 42}
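
As a minimal sketch of the reproducibility point (Iris data; identical results assume the same machine, thread count, and CatBoost version), two runs with the same seed should produce identical predictions:

Python

import numpy as np
from sklearn.datasets import load_iris
from catboost import CatBoostClassifier

X, y = load_iris(return_X_y=True)

# Train two models with the same seed and compare their predictions
predictions = []
for _ in range(2):
    model = CatBoostClassifier(iterations=50, random_seed=42, verbose=0)
    model.fit(X, y)
    predictions.append(model.predict(X))

print("Runs identical:", np.array_equal(predictions[0], predictions[1]))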

Implementation with CatBoostClassifier Using Various Parameters on the Iris Dataset

Import the Required Libraries

Python

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score, classification_report
from catboost import CatBoostClassifier, Pool


Here we import the required libraries: NumPy and pandas for data handling; scikit-learn utilities for loading the dataset, splitting it, and computing classification metrics; and the CatBoost classes CatBoostClassifier and Pool.

CatBoostClassifier: a gradient-boosting estimator designed specifically for classification tasks. It is part of the CatBoost library, whose name is short for "Categorical Boosting". CatBoost is well known for its strong out-of-the-box performance and ease of use, and it works especially well with categorical features.
Pool: CatBoost's data structure for handling data efficiently during both training and evaluation. It supports custom feature names and categorical features and is built to work with large datasets.

Load the Iris Dataset and Split it into Training and Testing Datasets

Python

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
 
# Convert the target variable to binary classification (class 0 and class 1)
y = (y == 0).astype(int)
 
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


This code first loads the Iris dataset, which consists of features (X) and target labels (y). Next, it converts the target to a binary classification problem, encoding class 0 as 1 and all other classes as 0. Finally, the dataset is split into training and testing sets so the model can be evaluated.

Create CatBoost Pools for Efficient Data Handling

Python

# Create CatBoost Pools for efficient data handling
train_pool = Pool(data=X_train, label=y_train, cat_features=[], feature_names=iris.feature_names)
test_pool = Pool(data=X_test, label=y_test, cat_features=[], feature_names=iris.feature_names)


These lines create CatBoost Pools for efficient data processing in the CatBoost classifier. The training and testing data (X_train and X_test) are wrapped, together with their labels (y_train and y_test), an empty list of categorical features (cat_features), and the feature names from the Iris dataset, into a format tailored to CatBoost. This enables efficient training and processing.

Defining CatBoost Parameters

Python

# Define CatBoost parameters
params = {
    'iterations': 100,
    'depth': 6,
    'learning_rate': 0.1,
    'loss_function': 'Logloss',            # Classification task
    'custom_metric': ['Accuracy', 'AUC'],  # Additional metrics to track
    'verbose': 10,                         # Print training progress every 10 iterations
    'random_seed': 42                      # Set a random seed for reproducibility
}


These lines define the CatBoost classifier's settings: the number of boosting iterations, the depth of the trees in the ensemble, the learning rate, the classification loss function (Logloss), and extra metrics to monitor during training (Accuracy and AUC). The verbose parameter controls how often progress is printed, and setting random_seed guarantees reproducibility of the results.

Train and Evaluate the CatBoost Model

Python

# Train the CatBoost classifier
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=test_pool)
 
# Make predictions on the test set
y_pred = model.predict(test_pool)
 
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
logloss = log_loss(y_test, model.predict_proba(test_pool)[:, 1])
roc_auc = roc_auc_score(y_test, model.predict_proba(test_pool)[:, 1])
 
# Print evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Log Loss: {logloss:.4f}")
print(f"AUC: {roc_auc:.4f}")


Output:

0:    learn: 0.6333595    test: 0.6326569    best: 0.6326569 (0)    total: 5.68ms    remaining: 563ms
10: learn: 0.2973689 test: 0.2938201 best: 0.2938201 (10) total: 9.87ms remaining: 79.9ms
20: learn: 0.1637735 test: 0.1591490 best: 0.1591490 (20) total: 13.7ms remaining: 51.5ms
30: learn: 0.1051307 test: 0.1011177 best: 0.1011177 (30) total: 17.7ms remaining: 39.5ms
40: learn: 0.0715529 test: 0.0695287 best: 0.0695287 (40) total: 21.5ms remaining: 31ms
50: learn: 0.0533052 test: 0.0515575 best: 0.0515575 (50) total: 25.1ms remaining: 24.1ms
60: learn: 0.0416665 test: 0.0404120 best: 0.0404120 (60) total: 28.6ms remaining: 18.3ms
70: learn: 0.0342899 test: 0.0332187 best: 0.0332187 (70) total: 33.8ms remaining: 13.8ms
80: learn: 0.0294652 test: 0.0286255 best: 0.0286255 (80) total: 37.4ms remaining: 8.78ms
90: learn: 0.0256959 test: 0.0250120 best: 0.0250120 (90) total: 41.2ms remaining: 4.07ms
99: learn: 0.0230690 test: 0.0225294 best: 0.0225294 (99) total: 45.1ms remaining: 0us
bestTest = 0.02252943945
bestIteration = 99
Accuracy: 1.0000
Log Loss: 0.0225
AUC: 1.0000

This code trains a CatBoost classifier with the given settings and assesses its performance on the test dataset. It makes predictions on the test set and computes three evaluation measures: accuracy, log loss, and area under the ROC curve (AUC). Together these metrics capture the classifier's overall accuracy, the quality of its probability estimates, and its ability to discriminate between classes, giving a thorough assessment of the model's classification performance.

Classification Report

Python

# Generate a classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)


Output:

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

This call compares the predicted labels (y_pred) with the actual labels (y_test) to produce a classification report for assessing the model's performance. The report, printed to the terminal, contains precision, recall, F1-score, and support for each class.

Conclusion

In conclusion, CatBoost is a powerful gradient-boosting library that offers a wide range of parameters to fine-tune and optimize your machine learning models. Understanding and appropriately configuring these parameters can significantly impact the performance, interpretability, and efficiency of your CatBoost models.

In this article, we explored some key parameters, including `depth`, `learning_rate`, `l2_leaf_reg`, `verbose`, `loss_function`, `custom_metric`, and `random_seed`, and provided code samples to illustrate how to use them effectively in your CatBoostClassifier models.

CatBoost's ease of use and excellent out-of-the-box performance make it a popular choice for a variety of machine learning tasks. By mastering these parameters and experimenting with different combinations, you can unlock the full potential of CatBoost and build high-performing, reliable models for your data-driven projects.


