
Cross-validation and Hyperparameter tuning of LightGBM Model

Last Updated : 19 Oct, 2023

Machine learning models have become essential for solving challenging real-world problems in a variety of industries, including finance, healthcare, and marketing. Among the many machine learning algorithms, gradient boosting techniques have become incredibly popular due to their remarkable predictive performance, and LightGBM (Light Gradient Boosting Machine) is one such technique that many data scientists and machine learning practitioners now turn to first because of its speed and effectiveness.

This post examines LightGBM with an emphasis on cross-validation and hyperparameter tuning, using code examples throughout to clarify the ideas covered.

Understanding LightGBM

LightGBM is a gradient-boosting framework developed by Microsoft that uses a tree-based learning algorithm. It is specifically designed to be efficient and can handle large datasets with millions of records and features. Some of its key advantages include:

  • Speed: LightGBM is incredibly fast and efficient, making it suitable for both training and prediction tasks.
  • High Performance: It often outperforms other gradient-boosting algorithms in terms of predictive accuracy.
  • Memory Efficiency: LightGBM uses a histogram-based approach for splitting nodes in trees, which reduces memory consumption.
  • Parallel and GPU Support: It can take advantage of multi-core processors and GPUs for even faster training.
  • Built-in Regularization: It includes built-in L1 and L2 regularization to prevent overfitting.
  • Wide Range of Applications: LightGBM can be used for both classification and regression tasks.
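
Before turning to cross-validation, here is a minimal quickstart sketch of LightGBM's scikit-learn-style API on the Iris dataset; the hyperparameter values are illustrative choices, not tuned recommendations.

Python3

import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a small benchmark dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a LightGBM classifier with illustrative settings
clf = lgb.LGBMClassifier(num_leaves=31, learning_rate=0.05, n_estimators=100)
clf.fit(X_train, y_train)

# Evaluate on the held-out split
print('Accuracy:', accuracy_score(y_test, clf.predict(X_test)))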

Cross-Validation

Cross-validation is a machine learning technique used to evaluate a model's performance and make sure that it is not unduly dependent on a particular training-test split of the data. The dataset is divided into several subsets, the model is trained and tested on various combinations of these subsets, and the results are averaged to obtain a more reliable estimate of the model's performance.

There are several cross-validation methods. The most popular ones are:

K-Fold Cross-Validation

K-Fold Cross-Validation is an essential method for assessing and optimizing model performance in machine learning. It systematically partitions a dataset into 'K' subsets, known as "folds." In each of the 'K' iterations, one fold serves as the validation set while the remaining 'K-1' folds form the training data, so every fold is used as the test set exactly once. The per-fold results are then averaged (or otherwise combined) for a reliable evaluation of the model's performance.

K-Fold Cross-Validation has a number of benefits. It produces more accurate performance estimates by maximizing the use of data for both training and validation, and because it assesses the model on many data subsets, it helps identify problems like overfitting. It can, however, be computationally demanding, especially when dealing with big datasets or high values of 'K'. In spite of this, K-Fold Cross-Validation is still widely used in model evaluation to make sure machine learning models generalize well to new data.

Implementation of K-Fold Cross-Validation

Python3




import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
 
# Load the Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
 
# Number of folds
n_splits = 5
 
# Create a KFold cross-validator
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
 
# Initialize a list to store model performance metrics
metrics = []
 
# Define LightGBM hyperparameters
params = {
    'objective': 'multiclass',
    'num_class': 3,  # Number of classes in the Iris dataset
    'metric': 'multi_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
}
 
# Perform K-Fold Cross-Validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
 
    # Create LightGBM datasets for training and testing
    train_data = lgb.Dataset(X_train, label=y_train)
    test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)
 
    # Train a LightGBM model
    num_round = 100
    bst = lgb.train(params, train_data, num_round)
 
    # Make predictions on the test set
    y_pred = bst.predict(X_test)
 
    # Get the class with the highest predicted probability as the predicted label
    y_pred_labels = np.argmax(y_pred, axis=1)
 
    # Calculate accuracy and store it in the metrics list
    accuracy = accuracy_score(y_test, y_pred_labels)
    metrics.append(accuracy)
 
# Calculate the average accuracy across all folds
average_accuracy = np.mean(metrics)
print(f'Average Accuracy: {average_accuracy:.4f}')


Output:

Average Accuracy: 0.9600

Using the LightGBM machine learning framework and K-Fold cross-validation, the code above evaluates a multiclass classification model's performance on the Iris dataset. The dataset is first loaded and split into feature variables (X) and target labels (y). The KFold cross-validator is applied with a predetermined number of folds, with shuffling enabled for a robust evaluation. For each fold, the model is trained on the training subset and makes predictions on the test subset; the class with the highest predicted probability is chosen as the predicted label. The accuracy metric is then computed for every fold, and the average accuracy over all folds provides an overall indicator of the model's classification performance on the Iris dataset.
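
LightGBM also ships a built-in cross-validation helper, lgb.cv, which runs the fold loop internally. Below is a minimal sketch of the same 5-fold setup; note that the metric key names in the returned dictionary vary slightly between LightGBM versions, so the example simply iterates over whatever keys are present.

Python3

import lightgbm as lgb
from sklearn.datasets import load_iris

# Build a single LightGBM Dataset from the full Iris data
X, y = load_iris(return_X_y=True)
train_data = lgb.Dataset(X, label=y)

params = {
    'objective': 'multiclass',
    'num_class': 3,
    'metric': 'multi_logloss',
    'learning_rate': 0.05,
}

# lgb.cv performs the 5-fold loop internally and returns per-round metrics
cv_results = lgb.cv(params, train_data, num_boost_round=100,
                    nfold=5, stratified=True, seed=42)

# Print the metric value at the final boosting round for each returned key
for name, values in cv_results.items():
    print(f'{name}: {values[-1]:.4f}')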

Stratified K-Fold Cross-Validation

Stratified K-Fold Cross-Validation is a variant of K-Fold used for classification problems. In order to lessen evaluation bias, it makes sure that each fold has a distribution of class labels that is close to that of the overall dataset.

Because every fold preserves roughly the same class proportions as the dataset as a whole, it is especially helpful in classification problems where unbalanced class distributions might otherwise result in biased model assessment. Maintaining class proportions in every fold yields a more accurate estimate of a model's performance, which makes this a reliable method both for evaluating a model's generalization capacity and for determining optimal hyperparameters. The short sketch below illustrates how the folds preserve class balance.
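
As a quick illustration (a minimal sketch on the Iris dataset, which has exactly 50 samples per class), the snippet below prints the class counts in each validation fold; with StratifiedKFold, every fold keeps the same 10/10/10 balance, which a plain KFold would not guarantee.

Python3

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each validation fold keeps the dataset's 1:1:1 class ratio (10 per class)
for fold, (_, val_index) in enumerate(skf.split(X, y)):
    print(f'Fold {fold}: class counts = {np.bincount(y[val_index])}')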

Implementation of Stratified K-Fold Cross-Validation

Python3




import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
 
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
 
# Define hyperparameters for LightGBM
params = {
    'objective': 'multiclass',   # For multi-class classification
    'metric': 'multi_logloss',   # Logarithmic loss for multiclass
    'boosting_type': 'gbdt',
    'num_class': 3,              # Number of classes in the Iris dataset
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
 
# Number of folds for stratified cross-validation
num_folds = 5
 
# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
 
# Initialize an empty list to store cross-validation scores
cv_scores = []
 
# Perform stratified k-fold cross-validation
for train_index, val_index in skf.split(X, y):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
     
    train_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)
     
    # Train LightGBM model with early stopping on the validation fold
    model = lgb.train(params, train_data, num_boost_round=1000,
                      valid_sets=[val_data],
                      callbacks=[lgb.early_stopping(stopping_rounds=50)])
     
    # Make predictions on the validation set
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
     
    # Convert predicted probabilities to class predictions
    val_pred_classes = np.argmax(val_pred, axis=1)
     
    # Calculate accuracy and store it in the list
    accuracy = accuracy_score(y_val, val_pred_classes)
    cv_scores.append(accuracy)
 
# Calculate the mean and standard deviation of accuracy across folds
mean_accuracy = np.mean(cv_scores)
std_accuracy = np.std(cv_scores)
 
print(f'Mean Accuracy: {mean_accuracy:.4f}')
print(f'Std Accuracy: {std_accuracy:.4f}')


Output:

Mean Accuracy: 0.9667
Std Accuracy: 0.0298

This code uses the gradient boosting framework LightGBM to illustrate Stratified K-Fold Cross-Validation. First, the widely used Iris benchmark dataset for classification is loaded, and the model's multiclass hyperparameters are defined. The StratifiedKFold approach divides the data into five subsets while preserving the class distribution in each. Inside the cross-validation loop, a LightGBM model is trained on the training folds with early stopping on the validation fold to avoid overfitting, and accuracy is calculated from predictions on the validation set. Finally, the code gathers and presents the mean and standard deviation of the accuracy scores across the folds, giving a thorough assessment of the model's performance and capacity for generalization on the Iris dataset.

Leave-One-Out Cross-Validation (LOOCV)

Leave-One-Out Cross-Validation (LOOCV) is a resampling technique used to evaluate how well machine learning models perform. It is the exhaustive extreme of K-Fold in which 'K' equals the number of data points: in each round, one data point is reserved as the validation set and the remaining data is used for training, so there are as many rounds as there are data points. Because all available data is used for both training and validation, LOOCV provides a thorough assessment of a model's generalization, and the final performance score is calculated as the average of the individual validation results. Its main drawback is computational cost, which can be high, particularly for large datasets.

Implementation of Leave-One-Out Cross-Validation (LOOCV)

Here’s how to use Python to implement LOOCV with LightGBM:

Python3




import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut
 
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
 
# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()
 
# Initialize an empty list to store cross-validation scores
cv_scores = []
 
# Perform Leave-One-Out Cross-Validation
for train_index, val_index in loo.split(X):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
     
    # Create and configure a LightGBM dataset for training
    train_data = lgb.Dataset(X_train, label=y_train)
     
    # Define hyperparameters for LightGBM
    params = {
        'objective': 'multiclass',
        'num_class': 3,
        'boosting_type': 'gbdt',
        'num_leaves': 5,
        'learning_rate': 0.05,
    }
     
    # Train LightGBM model
    model = lgb.train(params, train_data, num_boost_round=100)
     
    # Make predictions on the held-out sample (no early stopping was used,
    # so all boosting rounds are included)
    val_pred = model.predict(X_val)
     
    # Get the predicted class (index of the highest probability)
    val_pred_class = np.argmax(val_pred, axis=1)
     
    # Calculate accuracy and store it in the list
    accuracy = accuracy_score(y_val, val_pred_class)
    cv_scores.append(accuracy)
 
# Calculate the mean and standard deviation of accuracy across folds
mean_accuracy = np.mean(cv_scores)
std_accuracy = np.std(cv_scores)
 
print(f'Mean Accuracy: {mean_accuracy:.4f}')
print(f'Std Accuracy: {std_accuracy:.4f}')


Output:

Mean Accuracy: 0.9533
Std Accuracy: 0.2109

This code sample illustrates Leave-One-Out Cross-Validation (LOOCV) with LightGBM and the Iris dataset. LOOCV iterates through the data, using each data point in turn as the validation set and the remaining points to train a model with a small set of fixed hyperparameters for LightGBM's multiclass objective. Accuracy is determined for each iteration (necessarily 0 or 1, since each validation set holds a single sample, which explains the large standard deviation), and the mean and standard deviation are computed across all iterations. By testing the model on each data point separately, this method ensures that every observation contributes to the evaluation, and the final mean accuracy gives an idea of the model's predictive ability and consistency in classifying samples from the Iris dataset.
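
For a more compact version of the same procedure, scikit-learn's cross_val_score can drive the LOOCV loop against LightGBM's scikit-learn wrapper. A minimal sketch follows (the hyperparameter values mirror the manual loop above; n_jobs=-1 runs the 150 fits in parallel, which helps with LOOCV's cost):

Python3

import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# The scikit-learn wrapper lets cross_val_score handle the fold loop
clf = lgb.LGBMClassifier(num_leaves=5, learning_rate=0.05, n_estimators=100)

# One fit per data point; n_jobs=-1 parallelizes the fits
scores = cross_val_score(clf, X, y, cv=LeaveOneOut(),
                         scoring='accuracy', n_jobs=-1)
print(f'Mean Accuracy: {scores.mean():.4f}')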

LightGBM’s Hyperparameter Tuning

Optimizing the performance of a LightGBM model requires careful consideration of its hyperparameters. LightGBM offers a variety of hyperparameters that can be fine-tuned for a given dataset, and the aim of hyperparameter tuning is to select the settings that yield the best model performance, usually assessed with evaluation metrics such as accuracy, AUC, or log loss.

Two popular methods for hyperparameter tuning are grid search and random search. In grid search, you specify a set of candidate values for each hyperparameter in advance, and the model's performance is assessed for every possible combination. Random search, on the other hand, randomly samples hyperparameters from predetermined ranges, which is more effective when the search space is large. A brief random-search sketch is shown below, after which the rest of this section implements grid search.
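
As a sketch of the random-search alternative (the sampling distributions below are illustrative choices, not recommendations), scikit-learn's RandomizedSearchCV evaluates a fixed number of randomly sampled configurations instead of trying them all:

Python3

import lightgbm as lgb
from scipy.stats import randint, uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Sample hyperparameters at random from these (illustrative) distributions
param_distributions = {
    'num_leaves': randint(5, 64),
    'learning_rate': uniform(0.01, 0.2),
    'n_estimators': randint(50, 300),
}

random_search = RandomizedSearchCV(
    estimator=lgb.LGBMClassifier(objective='multiclass', num_class=3),
    param_distributions=param_distributions,
    n_iter=20,        # evaluate 20 random configurations, not a full grid
    scoring='accuracy',
    cv=5,
    random_state=42,
)
random_search.fit(X, y)
print('Best parameters:', random_search.best_params_)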

Implementing Hyperparameter Tuning with LightGBM

Let’s see how to perform hyperparameter tuning with LightGBM.

Import the required libraries

Python3




import lightgbm as lgb
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


This code imports GridSearchCV from scikit-learn for hyperparameter tuning alongside LightGBM, a gradient boosting framework. The steps that follow load the Iris dataset, split the data into training and test sets, and then use grid search to find the optimal hyperparameters. The model's performance is assessed with the accuracy metric.

Loading Dataset and Splitting Data

Python3




# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
 
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


With 80% of the data utilized for training and 20% put aside for testing, this code snippet loads the Iris dataset and divides it into training and testing sets. In order to guarantee reproducibility, the random seed is fixed using the random_state parameter.

Defining Parameters

Python3




# Define a range of values for the hyperparameters to search through
param_grid = {
    'num_leaves': [5, 20, 31],
    'learning_rate': [0.05, 0.1, 0.2],
    'n_estimators': [50, 100, 150]
}


This code defines a grid of candidate values for the hyperparameters to search. It specifies three values each for 'num_leaves', 'learning_rate', and 'n_estimators', giving 3 × 3 × 3 = 27 combinations; with 5-fold cross-validation, the grid search below therefore fits the model 135 times.

Model Development

Python3




 
# Initialize the LightGBM classifier
lgb_classifier = lgb.LGBMClassifier(objective='multiclass', num_class=3, boosting_type='gbdt')
 
# Initialize GridSearchCV for hyperparameters
grid_search = GridSearchCV(estimator=lgb_classifier, param_grid=param_grid,
                           scoring='accuracy', cv=5)
 
# Fit the model to the training data to search for the best hyperparameters
grid_search.fit(X_train, y_train)
 
# Get the best hyperparameters and their values
best_params = grid_search.best_params_
best_hyperparameters = list(best_params.keys())    # hyperparameter names
best_values = list(best_params.values())           # corresponding best values


This block performs hyperparameter tuning with GridSearchCV and a LightGBM classifier. It initializes the LightGBM classifier and GridSearchCV, providing the estimator, the parameter grid to search across (param_grid), the scoring metric ('accuracy'), and the number of cross-validation folds (cv=5). The grid_search then fits the model to the training set to search for the best hyperparameters.

After the search is finished, grid_search.best_params_ holds a dictionary in which the keys are the hyperparameter names and the values are the best value found for each. The best_hyperparameters and best_values lists then contain those names and values, respectively.

Training the model

Python3




# Train a LightGBM model with the best hyperparameters
best_model = lgb.LGBMClassifier(**best_params)
best_model.fit(X_train, y_train)


This code trains a LightGBM model with the optimal hyperparameters found through tuning. A new LightGBM classifier, best_model, is initialized by unpacking best_params (**best_params), and is then fitted to the training set (X_train and y_train), producing a model trained with the optimal parameters.

Prediction and Evaluation

Python3




# Make predictions on the test set using the best model
y_pred = best_model.predict(X_test)
 
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
 
print('Best hyperparameters:', best_hyperparameters)
print('Best values:', best_values)
print(f'Accuracy with best hyperparameters: {accuracy:.4f}')


Output:

Best hyperparameters: ['learning_rate', 'n_estimators', 'num_leaves']
Best values: [0.05, 50, 5]
Accuracy with best hyperparameters: 1.0000

This code makes predictions on the test set (X_test) using the best model trained with the optimized hyperparameters. The accuracy of these predictions is computed and printed, along with the hyperparameter names (best_hyperparameters) and values (best_values) that produced it. Note that a perfect score is not unusual here: the Iris test split contains only 30 samples.

Conclusion

LightGBM is a powerful gradient boosting framework with great prediction accuracy, efficiency, and speed. Cross-validation is crucial for evaluating model performance, and hyperparameter tuning can help determine the ideal model configuration. To deploy a LightGBM-based application, you would additionally serialize the trained model (a brief sketch follows) and expose it for prediction, for example through a web API. With the concepts and code examples in this tutorial, you are prepared to use LightGBM in your machine learning projects from development to deployment.
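
As a hedged sketch of the serialization step mentioned above (the file names are illustrative, and best_model is the fitted classifier from the tuning section), the scikit-learn wrapper can be persisted with joblib, or the underlying booster saved in LightGBM's native text format:

Python3

import joblib
import lightgbm as lgb

# 'best_model' is the fitted LGBMClassifier from the tuning section above

# Persist the full scikit-learn wrapper (hyperparameters included)
joblib.dump(best_model, 'lgbm_best_model.joblib')
loaded_model = joblib.load('lgbm_best_model.joblib')

# Alternatively, save the underlying booster in LightGBM's text format
best_model.booster_.save_model('lgbm_best_model.txt')
booster = lgb.Booster(model_file='lgbm_best_model.txt')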


