
Regression using LightGBM

In this article, we will learn about one of the state-of-the-art machine learning models: LightGBM, or Light Gradient Boosting Machine. Gradient boosting has been refined repeatedly, producing models such as XGBoost (eXtreme Gradient Boosting), but with LightGBM we can achieve similar or better results with far less computation and train our model on even bigger datasets in less time. Let’s see what LightGBM is and how we can perform regression using it.

What is LightGBM?

LightGBM, or ‘Light Gradient Boosting Machine’, is an open-source, high-performance gradient boosting framework designed for efficient and scalable machine learning tasks. It is specially tailored for speed and accuracy, making it a popular choice for structured (tabular) data across diverse domains.



Key characteristics of LightGBM include its ability to handle large datasets with millions of rows and columns, support for parallel and distributed computing, and optimized gradient-boosting algorithms. LightGBM is known for its excellent speed and low memory consumption, thanks to histogram-based techniques and leaf-wise tree growth.

How Does LightGBM Work?

LightGBM grows its decision trees leaf-wise, which means that at each step only the single leaf with the largest gain from splitting is split further. Leaf-wise trees can overfit, especially on smaller datasets, so overfitting is usually controlled by limiting the tree depth. LightGBM also buckets continuous feature values into discrete bins using a histogram of their distribution; instead of evaluating every data point, it iterates over the bins to compute the gain and find splits, an optimization that also benefits sparse datasets. Another element of LightGBM is Exclusive Feature Bundling, in which mutually exclusive features are combined to reduce dimensionality and speed up processing.
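To make the tree-growth controls concrete, the sketch below shows the main LightGBM parameters that govern leaf-wise growth and histogram binning. The values are purely illustrative defaults, not tuned recommendations, and would simply be merged into the parameter dictionary passed to training later in this article.

# Illustrative sketch only: parameters controlling leaf-wise growth and
# histogram binning (values are defaults, not tuned recommendations).
tree_growth_params = {
    'num_leaves': 31,        # cap on the number of leaves per tree (leaf-wise growth)
    'max_depth': -1,         # -1 means unlimited depth; set a positive value to curb overfitting
    'min_data_in_leaf': 20,  # minimum number of samples a leaf must contain
    'max_bin': 255,          # number of histogram bins used to bucket feature values
}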



LightGBM also provides an alternative algorithm for sampling the dataset: GOSS (Gradient-based One-Side Sampling). GOSS gives more weight to data points with larger gradients when computing the gain, so instances that the model has not yet learned well contribute more to training. To maintain accuracy, data points with small gradients are not all discarded; some are randomly kept while the rest are dropped. At the same sampling rate, this approach generally outperforms uniform random sampling.
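As a rough illustration of how GOSS would be switched on, the hypothetical parameter dictionary below uses the LightGBM 3.x convention of selecting it through the boosting type; 'top_rate' and 'other_rate' control how many large-gradient and small-gradient rows are kept (the values shown are LightGBM’s defaults).

# Hypothetical sketch: enabling GOSS sampling (LightGBM 3.x style).
goss_params = {
    'objective': 'regression',
    'boosting_type': 'goss',  # Gradient-based One-Side Sampling instead of plain GBDT
    'top_rate': 0.2,          # keep the 20% of rows with the largest gradients
    'other_rate': 0.1,        # randomly sample 10% of the remaining rows
}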

Implementation of LightGBM

In this article, we will use a medical insurance cost dataset (‘medical_cost.csv’) to perform a regression task with the LightGBM algorithm. But before we can use the LightGBM model, we first have to install the lightgbm library using the command below:

Installing Libraries

!pip install lightgbm==3.3.5

Importing Libraries and Dataset

Python libraries make it very easy for us to handle the data and perform typical and complex tasks with a single line of code.




#importing libraries 
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import lightgbm as lgb
  
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
  
import warnings
warnings.filterwarnings('ignore')

Now let’s use the pandas dataframe to load the dataset and then show the first five rows of it.

Loading Dataset and Retrieving Information




# Read the data from a CSV file named 'medical_cost.csv' into a DataFrame 'df'.
df = pd.read_csv('medical_cost.csv')
  
# Display the first few rows of the DataFrame to provide a data preview.
df.head()

Output:

   Id  age     sex     bmi  children smoker     region      charges
0   1   19  female  27.900         0    yes  southwest  16884.92400
1   2   18    male  33.770         1     no  southeast   1725.55230
2   3   28    male  33.000         3     no  southeast   4449.46200
3   4   33    male  22.705         0     no  northwest  21984.47061
4   5   32    male  28.880         0     no  northwest   3866.85520

Here the code reads the contents of the CSV file and loads them into a DataFrame called “df” using the pandas package. The first few rows of the DataFrame are then displayed using the ‘head()’ method to provide a short preview of the data.




#shape of the dataframe
df.shape

Output:

(1338, 8)

Here, the ‘df.shape’ attribute gives the dimensions of the DataFrame as (rows, columns), confirming 1,338 rows and 8 columns.




df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Id        1338 non-null   int64
 1   age       1338 non-null   int64
 2   sex       1338 non-null   object
 3   bmi       1338 non-null   float64
 4   children  1338 non-null   int64
 5   smoker    1338 non-null   object
 6   region    1338 non-null   object
 7   charges   1338 non-null   float64
dtypes: float64(2), int64(3), object(3)
memory usage: 83.8+ KB

By using the df.info() function we can see each column’s name and data type along with its count of non-null values, which tells us whether any values are missing; here every column has 1338 non-null entries, so there are no missing values.
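If you prefer the missing-value counts as a standalone summary rather than reading them off df.info(), a quick check like the one below works; for this dataset every column should report zero missing values.

# Count missing values in each column (expected to be all zeros here).
print(df.isnull().sum())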




#describing the data
print(df.describe())

Output:

                Id          age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000  1338.000000   1338.000000
mean    669.500000    39.207025    30.663397     1.094918  13270.422265
std     386.391641    14.049960     6.098187     1.205493  12110.011237
min       1.000000    18.000000    15.960000     0.000000   1121.873900
25%     335.250000    27.000000    26.296250     0.000000   4740.287150
50%     669.500000    39.000000    30.400000     1.000000   9382.033000
75%    1003.750000    51.000000    34.693750     2.000000  16639.912515
max    1338.000000    64.000000    53.130000     5.000000  63770.428010

The df.describe() function gives a statistical summary of the DataFrame df. It reports key statistics such as the count, mean, standard deviation, minimum, quartiles and maximum of each numerical column, providing a preliminary understanding of the data’s central tendencies and spread.

Exploratory Data Analysis

EDA is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations. While performing EDA on this dataset we will look at how the independent features relate to one another and to the target.

The charges levied by the medical insurance company can be better understood by calculating the average charge for each category of every categorical column in the dataset.

Categorical Column Mean Analysis




# Define a list of columns to analyze
sample = ['sex', 'children', 'smoker', 'region']
  
# Iterate through the specified columns
for i, col in enumerate(sample):
    # Group the DataFrame by the current column and calculate the mean of 'charges'
    grouped_data = df[[col, 'charges']].groupby(col).mean()
      
    # Print the mean charges for each group within the current column
    print(grouped_data)
      
    # Print a blank line to separate the output for different columns
    print()

Output:

             charges
sex
female  12569.578844
male    13956.751178

               charges
children
0         12365.975602
1         12731.171832
2         15073.563734
3         15355.318367
4         13850.656311
5          8786.035247

            charges
smoker
no      8434.268298
yes    32050.231832

                charges
region
northeast  13406.384516
northwest  12417.575374
southeast  14735.411438
southwest  12346.937377

Here, the code examines the medical expense data in the DataFrame ‘df’, focusing on the categorical columns ‘sex’, ‘children’, ‘smoker’ and ‘region’. It groups the data by each of these columns in turn and calculates the mean medical charges within every category, and the blank line printed between groups keeps the output readable. The results offer useful insight into how these categorical variables affect the typical medical expenses: for instance, smokers are charged far more on average (about 32,050) than non-smokers (about 8,434), and average charges also vary noticeably across regions.
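As an optional extension of the same idea, pairing each group mean with the number of rows in that group makes the averages easier to judge, since categories with very few policyholders (for example, families with five children) produce less reliable means. This is just a sketch using the same DataFrame.

# Optional sketch: mean charges alongside the row count of each group.
for col in ['sex', 'children', 'smoker', 'region']:
    summary = df.groupby(col)['charges'].agg(['mean', 'count'])
    print(summary, end='\n\n')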

Categorical Column Visualization




# Set the figure size for the subplots
plt.subplots(figsize=(15, 15))
  
# Define the list of columns to visualize
sample = ['sex', 'children', 'smoker', 'region']
  
# Loop through the specified columns
for i, col in enumerate(sample):
    # Create subplots in a 3x2 grid
    plt.subplot(3, 2, i + 1)
      
    # Create a countplot for the current column
    sb.countplot(data=df, x=col)
      
# Adjust subplot layout for better presentation
plt.tight_layout()
  
# Display the subplots
plt.show()

Output:

This code creates a grid of subplots with a predetermined figure size (3 rows and 2 columns). It then iterates over a list of categorical columns (‘sex’, ‘children’, ‘smoker’ and ‘region’) and, for each one, draws a countplot showing how its values are distributed in the DataFrame ‘df’. The ‘plt.tight_layout()’ call improves the spacing of the subplots, and ‘plt.show()’ displays them, giving a visual summary of how the categorical values are distributed in the dataset.

Distribution Analysis of Numeric Columns




# Set the figure size for the subplots
plt.subplots(figsize=(15, 15))
  
# Define the list of numeric columns to visualize
sample = ['age', 'bmi', 'charges']
  
# Loop through the specified columns
for i, col in enumerate(sample):
    # Create subplots in a 3x2 grid
    plt.subplot(3, 2, i + 1)
      
    # Create a histogram with a KDE curve for the current numeric column
    # (histplot replaces the deprecated distplot)
    sb.histplot(df[col], kde=True)
      
# Adjust subplot layout for better presentation
plt.tight_layout()
  
# Display the subplots
plt.show()

Output:

Here, this code creates a grid of subplots with a predetermined figure size (3 rows and 2 columns). Then, after iterating through a number of numerical columns (including “age,” “bmi,” and “charges”), it generates a distribution plot (histogram) for each column that shows how values within that particular column of the DataFrame (abbreviated “df”) are distributed. The subplots are then displayed, giving visual insights into the distribution of the data for these numerical aspects. The ‘plt.tight_layout()’ function is used to improve the layout of the subplots for better presentation.

Data Preprocessing

Data preprocessing, which involves preparing raw data for analysis and modeling, is an essential stage in the pipeline for data analysis and machine learning. It plays a crucial part in raising the accuracy and dependability of the data, which eventually improves the effectiveness of machine learning models. Let’s see how to perform it:

Log Transformation and Distribution Plot




# Apply the natural logarithm transformation to the 'charges' column
df['charges'] = np.log1p(df['charges'])
  
# Create a distribution plot for the transformed 'charges' column
# (histplot with a KDE curve replaces the deprecated distplot)
sb.histplot(df['charges'], kde=True)
  
# Display the distribution plot
plt.show()

Output:

The ‘np.log1p’ function is used in this code to apply a natural logarithm transformation (log(1 + x)) to the ‘charges’ column of the DataFrame (‘df’). The age and bmi data are roughly normally distributed, but charges is heavily right-skewed, so the logarithmic transformation is applied to reduce the skew and bring its distribution closer to normal. A distribution plot of the transformed ‘charges’ column, drawn with ‘sb.histplot’, then shows how the values are distributed after the transformation.
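To double-check that the transformation reduced the skew, you can look at the skewness statistic of the transformed column; values near zero indicate a roughly symmetric distribution. This is an optional check, and np.expm1 inverts the transform whenever charges are needed back on the original dollar scale.

# Optional check: skewness of the log-transformed target.
print("Skewness after log1p:", df['charges'].skew())

# np.expm1 undoes the log1p transform (back to the original dollar scale).
charges_in_dollars = np.expm1(df['charges'])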

Encoding Binary Categorical Columns




# Mapping Categorical to Numerical Values
  
# Map 'sex' column values ('male' to 0, 'female' to 1)
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
  
# Map 'smoker' column values ('no' to 0, 'yes' to 1)
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})
  
# Display the DataFrame's first few rows to show the transformations
df.head()

Output:

   Id  age  sex     bmi  children  smoker     region   charges
0   1   19    1  27.900         0       1  southwest  9.734236
1   2   18    0  33.770         1       0  southeast  7.453882
2   3   28    0  33.000         3       0  southeast  8.400763
3   4   33    0  22.705         0       0  northwest  9.998137
4   5   32    0  28.880         0       0  northwest  8.260455

This code performs categorical-to-numerical mapping for the ‘sex’ and ‘smoker’ columns, making the data suitable for machine learning algorithms that require numerical input. It also displays the initial rows of the DataFrame to illustrate the changes.

One-Hot Encoding the ‘region’ Column




# Perform one-hot encoding on the 'region' column
temp = pd.get_dummies(df['region']).astype('int')
  
# Concatenate the one-hot encoded columns with the original DataFrame
df = pd.concat([df, temp], axis=1)

This code applies one-hot encoding to the ‘region’ column, turning the categorical region values into binary columns, one per distinct region. The resulting one-hot encoded columns are then concatenated with the original DataFrame, expanding the dataset with a binary feature for each region.
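As an aside, the two mapping steps above and this one-hot step could also be collapsed into a single pd.get_dummies call over all three categorical columns. The snippet below is an illustrative alternative rather than the route taken in this article, and 'raw_df' is a hypothetical name for the DataFrame as it looked right after pd.read_csv.

# Illustrative alternative (not the path used above): encode every categorical
# column of the freshly loaded DataFrame in one call. drop_first=True avoids
# perfectly collinear dummy columns.
encoded_df = pd.get_dummies(raw_df, columns=['sex', 'smoker', 'region'],
                            drop_first=True)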




# Remove 'Id' and 'region' columns from the DataFrame
df.drop(['Id', 'region'], inplace=True, axis=1)
  
# Display the updated DataFrame
print(df.head())

Output:

   age  sex     bmi  children  smoker   charges  northeast  northwest  \
0   19    1  27.900         0       1  9.734236          0          0
1   18    0  33.770         1       0  7.453882          0          0
2   28    0  33.000         3       0  8.400763          0          0
3   33    0  22.705         0       0  9.998137          0          1
4   32    0  28.880         0       0  8.260455          0          1

   southeast  southwest
0          0          1
1          1          0
2          1          0
3          0          0
4          0          0

The ‘region’ column was one-hot encoded rather than label encoded because it has more than two categories; assigning it ordinal numbers would have implied an ordering or preference among regions that does not exist. With that done, the uninformative ‘Id’ column and the original ‘region’ column are dropped, leaving a fully numeric DataFrame.

Splitting Data




# Define the features
features = df.drop('charges', axis=1)
  
# Define the target variable as 'charges'
target = df['charges']
  
# Split the data into training and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(features, target,
                                                  random_state=2023,
                                                  test_size=0.25)
  
# Display the shapes of the training and validation sets
X_train.shape, X_val.shape

Output:

((1003, 11), (335, 11))

To evaluate the model’s performance as training progresses, we split the dataset in a 75:25 ratio; the resulting training and validation splits will be used to build the LightGBM datasets and train the model.

Feature Scaling




# Standardize Features
  
# Use StandardScaler to scale the training and validation data
scaler = StandardScaler()
#Fit the StandardScaler to the training data
scaler.fit(X_train)
# transform both the training and validation data
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)

This code fits the StandardScaler to the training data to calculate the mean and standard deviation and then transforms both the training and validation data using these calculated values to ensure consistent scaling between the two datasets.

Dataset Preparation




# Create a LightGBM dataset for training with features X_train and labels Y_train
train_data = lgb.Dataset(X_train, label=Y_train)
  
# Create a LightGBM dataset for testing with features X_val and labels Y_val,
# and specify the reference dataset as train_data for consistent evaluation
test_data = lgb.Dataset(X_val, label=Y_val, reference=train_data)

Now, using the training and validation splits, we create LightGBM dataset objects with lgb.Dataset. These objects package the features and labels in the format LightGBM uses internally for training and evaluation, and passing the training set as the reference for the validation set ensures both are binned consistently.
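It is also worth knowing that lgb.Dataset can consume pandas categorical columns directly through the categorical_feature argument, which would let LightGBM handle ‘sex’, ‘smoker’ and ‘region’ natively without manual mapping or one-hot encoding. The snippet below is only a sketch of that alternative workflow; 'raw_df' again stands for a hypothetical copy of the DataFrame as originally loaded, with its string columns intact.

# Sketch of LightGBM's native categorical handling (alternative workflow).
for col in ['sex', 'smoker', 'region']:
    raw_df[col] = raw_df[col].astype('category')

native_train = lgb.Dataset(raw_df.drop(['Id', 'charges'], axis=1),
                           label=raw_df['charges'],
                           categorical_feature=['sex', 'smoker', 'region'])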

Regression Model using LightGBM

Now one last thing remains: defining the parameters that will be passed to the model for the training process.




# Define a dictionary of parameters for configuring the LightGBM regression model.
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
}

Let’s take a quick look at the parameters being passed to the model: ‘objective’ is set to ‘regression’ because this is a regression task; ‘metric’ is ‘rmse’, so the root mean squared error is tracked during training; ‘boosting_type’ is ‘gbdt’, the standard gradient-boosted decision tree algorithm; ‘num_leaves’ caps each tree at 31 leaves; a ‘learning_rate’ of 0.05 controls the contribution of each boosting round; and a ‘feature_fraction’ of 0.9 makes LightGBM randomly select 90% of the features when building each tree.

Let’s train the model for up to 100 boosting rounds on the training data, passing the validation data as well so we can watch the model’s performance on unseen data while training goes on. This helps us keep a check on the training progress.




# Set the number of rounds and train the model with early stopping
num_round = 100
bst = lgb.train(params, train_data, num_round, valid_sets=[
                test_data], early_stopping_rounds=10)

Output:

[70]    valid_0's rmse: 0.387332
[71]    valid_0's rmse: 0.387193
[72]    valid_0's rmse: 0.387407
[73]    valid_0's rmse: 0.387696
[74]    valid_0's rmse: 0.388172
[75]    valid_0's rmse: 0.388142
[76]    valid_0's rmse: 0.388688
Early stopping, best iteration is:
[66]    valid_0's rmse: 0.386691

This code snippet trains a LightGBM model using the supplied parameters (params) and training data (train_data), with the number of boosting rounds set by num_round. Early stopping is used: the model monitors its performance on the validation dataset (test_data) and terminates training if no improvement is seen for 10 consecutive rounds, which stops training near its optimal point and helps prevent overfitting. Here the best validation RMSE is 0.386691, reached at iteration 66, which is a good score given that the target is the log-transformed charges.
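Note that newer LightGBM releases (4.x) remove the early_stopping_rounds keyword from lgb.train in favour of callbacks. The call below is an equivalent sketch using the callback API, which already works on the 3.3.x version pinned above.

# Equivalent training call using callbacks (required form in LightGBM 4.x,
# also available in 3.3.x).
bst = lgb.train(
    params,
    train_data,
    num_boost_round=100,
    valid_sets=[test_data],
    callbacks=[
        lgb.early_stopping(stopping_rounds=10),  # stop after 10 rounds without improvement
        lgb.log_evaluation(period=10),           # log the validation metric every 10 rounds
    ],
)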

Prediction and Evaluation of Model




# Import necessary libraries for calculating mean squared error and using the LightGBM regressor.
from sklearn.metrics import mean_squared_error as mse
from lightgbm import LGBMRegressor
  
# Create an instance of the LightGBM Regressor with the RMSE metric.
model = LGBMRegressor(metric='rmse')
  
# Train the model using the training data.
model.fit(X_train, Y_train)
  
# Make predictions on the training and validation data.
y_train = model.predict(X_train)
y_val = model.predict(X_val)

Here, the scikit-learn style LGBMRegressor API from the lightgbm library is used for the regression model. The model is trained on the provided training data and predictions are made on both the training and validation datasets.
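If you would rather not refit a separate LGBMRegressor, predictions can also be made straight from the early-stopped Booster trained earlier (bst), restricted to its best iteration; the lines below are an optional sketch of that route.

# Optional: predict from the early-stopped Booster using its best iteration.
booster_val_pred = bst.predict(X_val, num_iteration=bst.best_iteration)
print("Booster validation RMSE:", np.sqrt(mse(Y_val, booster_val_pred)))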

Validation of the Model




# Calculate and print the Root Mean Squared Error (RMSE) for training and validation predictions.
print("Training RMSE: ", np.sqrt(mse(Y_train, y_train)))
print("Validation RMSE: ", np.sqrt(mse(Y_val, y_val)))

Output:

Training RMSE:  0.2331835443343122
Validation RMSE: 0.40587871797893876

Here, this code computes and displays the RMSE, a measure of prediction accuracy, for both the training and validation datasets. It assesses how well the LightGBM regression model fits the data, with lower RMSE values indicating a better fit; the gap between the training RMSE (about 0.233) and the validation RMSE (about 0.406) suggests a degree of overfitting to the training set.
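LightGBM also exposes per-feature importance scores, which support the interpretability point raised in the conclusion below; the sketch here plots them for the fitted LGBMRegressor, supplying the column names manually because the StandardScaler converted X_train into a plain NumPy array.

# Plot feature importances of the fitted model, labelled with the original
# column names (the scaled X_train no longer carries them).
importances = pd.Series(model.feature_importances_, index=features.columns)
importances.sort_values().plot(kind='barh', title='LightGBM feature importance')
plt.tight_layout()
plt.show()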

Conclusion

In conclusion, the utilization of LightGBM for regression tasks presents a robust and efficient approach to predictive modelling. Beginning with data preparation and feature engineering, the process proceeds to training a LightGBM regressor configured with specific hyperparameters and evaluation metrics. The model’s ability to learn efficiently from the training data, make predictions, and provide interpretability through feature importance scores is a valuable asset. Evaluation metrics like RMSE help gauge prediction accuracy. LightGBM’s speed, scalability and strong predictive performance make it a compelling choice, particularly when handling large datasets and aiming for high-quality regression results in real-world applications.

