CatBoost – ML

  • Difficulty Level : Expert
  • Last Updated : 22 Jan, 2021

Gradient Boosting is an ensemble machine learning algorithm that is typically used for solving classification and regression problems. It is easy to use, works well with heterogeneous and even relatively small data, and essentially builds a strong learner from an ensemble of many weak learners.

CatBoost, or Categorical Boosting, is an open-source boosting library developed by Yandex. In addition to regression and classification, CatBoost can be used for ranking, recommendation systems, forecasting and even personal assistants.

Gradient Boosting takes an additive form: it iteratively builds a sequence of approximations F^t in a greedy manner, given a loss function \mathcal{L}(y_i, F^t). Note that the loss function has two inputs: the expected output y_i of the ith example and the current model F^t that estimates it. Assuming we have already constructed F^{t-1}, we improve our estimates of y_i by choosing F^t = F^{t-1} + \alpha \cdot h^t, where \alpha is a step size and the base predictor h^t is chosen from a family of functions H so as to minimize the expected loss, that is, h^t = \underset{h \in H}{\arg\min}\; \mathbb{E}\,\mathcal{L}(y, F^{t-1} + h). The minimization is approached using a Taylor approximation, i.e. a least-squares fit to the negative gradients: h^t = \underset{h \in H}{\arg\min}\; \mathbb{E}\left(-\frac{\partial \mathcal{L}(y, F^{t-1})}{\partial F^{t-1}} - h\right)^2 \approx \underset{h \in H}{\arg\min}\; \frac{1}{n}\sum_{k=1}^{n}\left(-\frac{\partial \mathcal{L}(y_k, F^{t-1}(x_k))}{\partial F^{t-1}(x_k)} - h(x_k)\right)^2. CatBoost makes refinements to this gradient boosting technique.
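As a concrete illustration of the additive update, here is a minimal least-squares boosting loop in Python, with shallow scikit-learn regression trees standing in for the base predictors h. This is only a sketch of the generic scheme, not CatBoost's implementation; the names n_rounds and step_size are illustrative, and X and y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, step_size=0.1, max_depth=3):
    """Greedy additive boosting for squared loss: F^t = F^{t-1} + alpha * h^t."""
    X, y = np.asarray(X), np.asarray(y, dtype=float)
    F = np.full(len(y), y.mean())          # F^0: constant initial prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - F                  # negative gradient of 1/2 * (y - F)^2 w.r.t. F
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F += step_size * h.predict(X)      # F^t = F^{t-1} + alpha * h^t
        trees.append(h)
    return trees, y.mean()

def predict(trees, base, X, step_size=0.1):
    """Evaluate the additive ensemble on new data."""
    return base + step_size * sum(t.predict(X) for t in trees)
```

For squared loss the negative gradient is simply the residual y - F, which is why each round fits a tree to the residuals of the previous model.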

Let there be a dataset \mathcal{D} with n samples, where each sample is a vector x_k of m features together with a real-valued target y_k: \mathcal{D} = \{(x_k, y_k)\}_{k=1,\dots,n}, \; x_k \in \mathbb{R}^m, \; y_k \in \mathbb{R}.

Handling categorical features:

Datasets often contain categorical features, and there are various techniques for handling them in boosted trees. Unlike most gradient boosting implementations, which require purely numeric input, CatBoost handles categorical features automatically. The most common technique is one-hot encoding, but it becomes infeasible for features with many distinct values. To tackle this, categorical values are replaced by target statistics (an estimate of the expected target value for each category). Target statistics can be calculated in different ways: Greedy, Holdout, Leave-one-out and Ordered. CatBoost uses Ordered target statistics.

The greedy approach takes the average target value within each category. But it suffers from target leakage: the target y_k is used to compute the encoding \hat{x}_k^i of the very example it belongs to, and that encoded feature is then used to predict y_k. The Holdout method reduces this by partitioning the training set and computing the statistics on one part only, but that significantly reduces the amount of data used effectively for training. Leave-one-out excludes only the current sample and is not very effective either. Ordered target statistics are inspired by online learning algorithms, which receive the training examples sequentially in time. CatBoost introduces an artificial time, that is, a random permutation \sigma of the training examples, and each example's encoding relies only on the training examples encountered in its "past" (samples that occur before it in the artificial time), thereby avoiding target leakage.
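For intuition, the leakage in the greedy statistic can be seen in a few lines of pandas. The tiny DataFrame and its column names below are made up purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B"],
    "target": [1,    0,   1,   1,   0],
})

# Greedy target statistic: each row is encoded with the mean target of its
# category computed over the *whole* dataset, so the row's own target y_k
# contributes to its own encoding -- this is the source of target leakage.
greedy = df.groupby("city")["target"].transform("mean")
print(greedy.tolist())   # [0.5, 0.5, 0.667, 0.667, 0.667] (approximately)
```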

Mathematically, the target estimate for the ith categorical feature of the kth example of \mathcal{D} can be written as,

\hat{x}_k^i = \frac{\sum_{x_j \in \mathcal{D}_k} \mathbb{1}_{\{x_j^i = x_k^i\}} \cdot y_j \;+\; a\,p}{\sum_{x_j \in \mathcal{D}_k} \mathbb{1}_{\{x_j^i = x_k^i\}} \;+\; a}, \quad \text{where } \mathcal{D}_k = \{x_j : \sigma(j) < \sigma(k)\} \text{ and } a > 0

The indicator function \mathbb{1}_{\{x_j^i = x_k^i\}} takes the value 1 when the ith component of the input vector x_j equals the ith component of the input vector x_k. The set \mathcal{D}_k contains only the examples that precede the kth example in the random permutation \sigma, so the sums run over the kth example's "past". The parameter a > 0 and the prior p smooth the estimate, keeping it well defined and stable for rare categories. The condition \sigma(j) < \sigma(k) ensures that y_k itself is excluded when computing the encoding \hat{x}_k^i, while all of the available past of each example is still used to compute its target statistic and encode its categorical features.
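A minimal NumPy sketch of the ordered statistic above for a single categorical column follows. It takes the prior p as the global target mean and a = 1, which are common choices for illustration rather than CatBoost's internal defaults.

```python
import numpy as np

def ordered_target_statistic(cat_column, y, a=1.0, seed=0):
    """Encode one categorical feature with ordered target statistics."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    rng = np.random.default_rng(seed)
    sigma = rng.permutation(n)              # artificial time: a random permutation
    p = y.mean()                            # prior
    sums, counts = {}, {}                   # running statistics of the "past"
    encoded = np.empty(n)
    for k in sigma:                         # visit examples in the order given by sigma
        c = cat_column[k]
        # only examples with sigma(j) < sigma(k) have been accumulated so far,
        # so y_k never influences its own encoding
        encoded[k] = (sums.get(c, 0.0) + a * p) / (counts.get(c, 0) + a)
        sums[c] = sums.get(c, 0.0) + y[k]
        counts[c] = counts.get(c, 0) + 1
    return encoded
```

Each example is encoded before its own target is added to the running sums, which is exactly the \sigma(j) < \sigma(k) condition in the formula.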

Ordered boosting

Gradient boosting algorithms often have a tendency to overfit. Because the ensemble is built iteratively, with every base learner fitted using gradients computed on the same dataset the model was trained on, the generalization capability of the model suffers.

When we use ordered target statistics to encode categorical variables, the gradients of the loss function \mathcal{L} with respect to F^{t-1} are also random variables, because the random permutation \sigma that chooses the elements of \mathcal{D}_k determines the encodings and thereby influences the value of F^{t-1}. Therefore, the distribution of the gradients \frac{\partial \mathcal{L}(y, F^{t-1})}{\partial F^{t-1}} computed on a training example can be shifted, conditioned on the particular encoding used for x^i_k, relative to the gradient distribution on unseen data. This conditional shift biases the estimate of h^t, and the bias negatively affects the metrics obtained when the resulting model is evaluated on data not used at training time, i.e. its generalization capability. This problem is called prediction shift.

CatBoost introduces ordered boosting to avoid this problem. In ordered boosting, a random permutation of the training examples is drawn and n different supporting models M_1, \dots, M_n are maintained, where the model M_i is trained using only the first i samples of the permutation. At each step, the residual for a given sample is obtained from the supporting model that has never seen that sample. Maintaining all of these models exactly is not feasible: the data is finite and the memory and training cost of keeping n separate models would be too high.
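The conceptual scheme can be sketched for squared loss and a single permutation as follows, with scikit-learn trees as placeholder base learners. This is not CatBoost's actual implementation, only an illustration of the idea; X and y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ordered_boosting(X, y, n_rounds=10, step_size=0.5, seed=0):
    """Conceptual ordered boosting for squared loss with one permutation.

    preds[i] holds the predictions of supporting model M_i, which is only
    ever fitted on the first i examples of the permutation sigma, so the
    residual of an example is always computed by a model that never saw it.
    """
    X, y = np.asarray(X), np.asarray(y, dtype=float)
    n = len(y)
    rng = np.random.default_rng(seed)
    sigma = rng.permutation(n)                    # artificial time
    pos = np.empty(n, dtype=int)
    pos[sigma] = np.arange(n)                     # pos[k] = position of example k in sigma
    preds = np.zeros((n + 1, n))                  # predictions of M_0 .. M_n on all examples
    for _ in range(n_rounds):
        # residual of example k comes from M_{pos[k]}, trained only on earlier examples
        residuals = y - preds[pos, np.arange(n)]
        for i in range(1, n + 1):
            seen = sigma[:i]                      # the only examples M_i may use
            tree = DecisionTreeRegressor(max_depth=2).fit(X[seen], residuals[seen])
            preds[i] += step_size * tree.predict(X)
    return preds[n]                               # predictions of the final model M_n
```

Training n separate models per boosting round is exactly the cost referred to above, which is why CatBoost's practical variant shares one tree structure across the supporting models instead of running this scheme literally.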

So for practical purposes CatBoost uses a variant of this scheme. In this variant, one tree structure (the sequence of splitting features) is shared by all of the supporting models: CatBoost uses the same permutation-based subsets \mathcal{D}_k that determined the ordered target statistics as the data for choosing the structure of the decision tree h^t, and uses the complete dataset \mathcal{D} when evaluating whether h^t is the tree that minimizes the expected loss. It also uses multiple permutations \sigma_1, \dots, \sigma_s to compute several sets of residuals that it can use to fit h^t and build F^t, while maintaining the guarantee that the target y_k of an example is never used to compute the gradient estimate for that same example. This reduces the variance in the estimates of the gradients (the rate of change of the loss function) and avoids prediction shift.

CatBoost advantages

  • CatBoost builds oblivious decision trees (balanced binary trees in which the same feature and split condition are used at every node of a given level), restricting each level to a single split, which helps to decrease prediction time.
  • It handles categorical features effectively via ordered target statistics.
  • It is easy to use from packages in R and Python (see the usage sketch after this list).
  • It performs well with its default parameters, reducing the time needed for parameter tuning.
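A minimal usage sketch in Python, assuming the catboost package is installed; the toy data, column names and parameter values are illustrative only.

```python
import pandas as pd
from catboost import CatBoostClassifier

# toy dataset with one categorical and one numeric feature
X = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C"],
    "age":  [23, 45, 31, 35, 52, 40],
})
y = [1, 0, 1, 1, 0, 0]

model = CatBoostClassifier(iterations=100, depth=4, verbose=0)
model.fit(X, y, cat_features=["city"])    # categorical columns are handled natively
print(model.predict(X))
```

Note that the categorical column is passed as-is via cat_features, with no manual encoding step.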


