
Regression using LightGBM

Last Updated : 29 Apr, 2024

In this article, we will learn about one of the state-of-the-art machine learning models: LightGBM, or Light Gradient Boosting Machine. XGBoost (eXtreme Gradient Boosting) improved on classic gradient boosting for better performance, but with LightGBM we can achieve similar or better results with far less computation and train on even bigger datasets in less time. Let's see what LightGBM is and how we can perform regression with it.

What is LightGBM?

LightGBM, or 'Light Gradient Boosting Machine', is an open-source, high-performance gradient boosting framework designed for efficient and scalable machine learning tasks. It is tailored for speed and accuracy, making it a popular choice for structured (tabular) data in diverse domains.

Key characteristics of LightGBM include its ability to handle large datasets with millions of rows and columns, support for parallel and distributed computing, and optimized gradient-boosting algorithms. LightGBM is known for its excellent speed and low memory consumption, thanks to histogram-based techniques and leaf-wise tree growth.

How LightGBM Works?

LightGBM grows its decision trees leaf-wise, which means that at each step only one leaf is split, chosen according to the gain it provides. With smaller datasets, leaf-wise trees can overfit; this can be mitigated by limiting the tree depth. LightGBM also buckets feature values into bins using a histogram of their distribution. Instead of iterating over every data point, it iterates over the bins to compute the gain and find split points, an optimization that also benefits sparse datasets. Another element of LightGBM is Exclusive Feature Bundling (EFB), in which mutually exclusive features are combined to reduce dimensionality and speed up processing.
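To see how these ideas surface in practice, the sketch below (illustrative values, not tuned for any particular dataset) uses standard LightGBM parameters: num_leaves and max_depth govern leaf-wise growth, max_bin controls histogram bucketing, and enable_bundle toggles Exclusive Feature Bundling.

Python3

import lightgbm as lgb

# Illustrative (untuned) parameters that map to the ideas above
params_sketch = {
    'objective': 'regression',
    'num_leaves': 31,       # cap on the number of leaves per tree (leaf-wise growth)
    'max_depth': -1,        # -1 means no depth limit; set a positive value to curb overfitting
    'max_bin': 255,         # number of histogram bins used to bucket feature values
    'enable_bundle': True,  # Exclusive Feature Bundling (EFB), enabled by default
}
# lgb.train(params_sketch, train_data) would then grow trees leaf-wise
# over the binned (histogram) representation of the features.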

LightGBM also offers GOSS (Gradient-based One-Side Sampling), an algorithm for sampling the dataset. GOSS gives more weight to data points with larger gradients when computing the gain, so instances that the model has not yet learned well contribute more. To preserve accuracy, data points with smaller gradients are randomly dropped while a fraction of them is kept. At the same sampling rate, this approach generally outperforms uniform random sampling.
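In LightGBM 3.x, GOSS can be enabled simply by choosing it as the boosting type; top_rate and other_rate control how many large-gradient instances are always kept and how many small-gradient ones are sampled at random. The values below are a minimal sketch, not tuned settings.

Python3

# Minimal sketch: enabling GOSS in LightGBM 3.x
goss_params = {
    'objective': 'regression',
    'boosting_type': 'goss',  # Gradient-based One-Side Sampling
    'top_rate': 0.2,          # fraction of the largest-gradient instances always kept
    'other_rate': 0.1,        # fraction of the remaining instances sampled at random
}
# Note: in LightGBM >= 4.0, GOSS is selected with data_sample_strategy='goss' instead.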

Implementation of LightGBM

In this article, we will use a medical insurance cost dataset to perform a regression task with the LightGBM algorithm. Before using the model, we first need to install the LightGBM library with the command below:

Installing Libraries

!pip install lightgbm==3.3.5

Importing Libraries and Dataset

Python libraries make it very easy for us to handle the data and perform typical and complex tasks with a single line of code.

  • Pandas – This library helps to load the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
  • Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
  • Matplotlib/Seaborn – These libraries are used to draw visualizations.
  • Sklearn – This module contains multiple libraries having pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.

Python3




#importing libraries 
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import lightgbm as lgb
  
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
  
import warnings
warnings.filterwarnings('ignore')


Now let’s use the pandas dataframe to load the dataset and then show the first five rows of it.

Loading Dataset and Retrieving Information

Python3




# Read the data from a CSV file named 'medical_cost.csv' into a DataFrame 'df'.
df = pd.read_csv('medical_cost.csv')
  
# Display the first few rows of the DataFrame to provide a data preview.
df.head()


Output:

   Id  age     sex     bmi  children smoker     region      charges
0   1   19  female  27.900         0    yes  southwest  16884.92400
1   2   18    male  33.770         1     no  southeast   1725.55230
2   3   28    male  33.000         3     no  southeast   4449.46200
3   4   33    male  22.705         0     no  northwest  21984.47061
4   5   32    male  28.880         0     no  northwest   3866.85520

Here the code reads the contents of the CSV file and loads them into a DataFrame called “df” using the pandas package. The first few rows of the DataFrame are then displayed using the ‘head()’ method to provide a short preview of the data.

Python3




#shape of the dataframe
df.shape


Output:

(1338, 8)

Here, the ‘df.shape’ command prints the dimensions of the dataframe.

Python3




df.info()


Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Id        1338 non-null   int64
 1   age       1338 non-null   int64
 2   sex       1338 non-null   object
 3   bmi       1338 non-null   float64
 4   children  1338 non-null   int64
 5   smoker    1338 non-null   object
 6   region    1338 non-null   object
 7   charges   1338 non-null   float64
dtypes: float64(2), int64(3), object(3)
memory usage: 83.8+ KB

By using the df.info() function, we can see each column's data type along with the number of non-null values it contains.

Python3




#describing the data
print(df.describe())


Output:

                Id          age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000  1338.000000   1338.000000
mean    669.500000    39.207025    30.663397     1.094918  13270.422265
std     386.391641    14.049960     6.098187     1.205493  12110.011237
min       1.000000    18.000000    15.960000     0.000000   1121.873900
25%     335.250000    27.000000    26.296250     0.000000   4740.287150
50%     669.500000    39.000000    30.400000     1.000000   9382.033000
75%    1003.750000    51.000000    34.693750     2.000000  16639.912515
max    1338.000000    64.000000    53.130000     5.000000  63770.428010

The DataFrame df is described statistically via the df.describe() function. In order to provide a preliminary understanding of the data’s central tendencies and distribution, it includes important statistics such as count, mean, standard deviation, minimum, and maximum values for each numerical column.

Exploratory Data Analysis

EDA is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations. While performing EDA on this dataset, we will look at how the independent features relate to one another and to the target.

The charges billed by the medical insurance company can be better understood by calculating the average charge for each category of every categorical column in the dataset.

Categorical Column Mean Analysis

Python3




# Define a list of columns to analyze
sample = ['sex', 'children', 'smoker', 'region']
  
# Iterate through the specified columns
for i, col in enumerate(sample):
    # Group the DataFrame by the current column and calculate the mean of 'charges'
    grouped_data = df[[col, 'charges']].groupby(col).mean()
      
    # Print the mean charges for each group within the current column
    print(grouped_data)
      
    # Print a blank line to separate the output for different columns
    print()


Output:

             charges
sex
female  12569.578844
male    13956.751178

               charges
children
0         12365.975602
1         12731.171832
2         15073.563734
3         15355.318367
4         13850.656311
5          8786.035247

             charges
smoker
no       8434.268298
yes     32050.231832

                charges
region
northeast  13406.384516
northwest  12417.575374
southeast  14735.411438
southwest  12346.937377

This code examines the medical expense data in the DataFrame 'df', focusing on the categorical columns 'sex', 'children', 'smoker', and 'region'. It groups the data by each of these columns and computes the mean medical charges within each category, which gives insight into how these variables affect the typical medical expense. The blank line printed after each column simply separates the outputs for readability. From these results we can draw a few conclusions.

  • Men are charged more than women for medical insurance.
  • There is no clear trend between the number of children and the charges.
  • Smokers pay visibly higher charges than non-smokers because of the health problems smoking brings.
  • Medical charges are highest in the southeast region and lowest in the southwest, but the difference is not that large.

Categorical Column Visualization

Python3




# Set the figure size for the subplots
plt.subplots(figsize=(15, 15))
  
# Define the list of columns to visualize
sample = ['sex', 'children', 'smoker', 'region']
  
# Loop through the specified columns
for i, col in enumerate(sample):
    # Create subplots in a 3x2 grid
    plt.subplot(3, 2, i + 1)
      
    # Create a countplot for the current column
    sb.countplot(data=df, x=col)
      
# Adjust subplot layout for better presentation
plt.tight_layout()
  
# Display the subplots
plt.show()


Output:

[Figure: count plots of the categorical columns sex, children, smoker, and region]

This code creates a grid of subplots with a predetermined figure size (a 3x2 grid). It then iterates over the list of categorical columns ('sex', 'children', 'smoker', and 'region') and, for each column, generates a count plot showing the distribution of its values in the DataFrame 'df'. The 'plt.tight_layout()' call improves the subplot layout, and 'plt.show()' displays the plots, giving a visual summary of the categorical data distributions. From the output we can conclude the points below:

  • The dataset has roughly equal numbers of males and females, and a similar balance across the four regions from which the data was collected.
  • There are fewer smokers than non-smokers in the dataset.
  • The number of people decreases as the number of children goes from 0 to 5.

Distribution Analysis of Numeric Columns

Python3




# Set the figure size for the subplots
plt.subplots(figsize=(15, 15))
  
# Define the list of numeric columns to visualize
sample = ['age', 'bmi', 'charges']
  
# Loop through the specified columns
for i, col in enumerate(sample):
    # Create subplots in a 3x2 grid
    plt.subplot(3, 2, i + 1)
      
    # Create a distribution plot for the current numeric column
    sb.distplot(df[col])
      
# Adjust subplot layout for better presentation
plt.tight_layout()
  
# Display the subplots
plt.show()


Output:

[Figure: distribution plots of the numeric columns age, bmi, and charges]

Here, this code creates a grid of subplots with a predetermined figure size (a 3x2 grid). It iterates through the numerical columns 'age', 'bmi', and 'charges' and generates a distribution plot (histogram) for each, showing how the values in that column of the DataFrame 'df' are distributed. The subplots are then displayed, giving visual insight into the distribution of these numerical features; 'plt.tight_layout()' again tidies up the layout.

Data Preprocessing

Data preprocessing, which involves preparing raw data for analysis and modeling, is an essential stage in the pipeline for data analysis and machine learning. It plays a crucial part in raising the accuracy and dependability of the data, which eventually improves the effectiveness of machine learning models. Let’s see how to perform it:

Log Transformation and Distribution Plot

Python3




# Apply the natural logarithm transformation to the 'charges' column
df['charges'] = np.log1p(df['charges'])
  
# Create a distribution plot for the transformed 'charges' column
sb.distplot(df['charges'])
  
# Display the distribution plot
plt.show()


Output:

[Figure: distribution plot of the log-transformed charges column]

The age and BMI data are roughly normally distributed, but the charges are right-skewed, so we apply a logarithmic transformation to bring the values closer to a normal distribution. The 'np.log1p' function applies the natural logarithm of (1 + x) to the 'charges' column of the DataFrame 'df', which reduces the skewness of the distribution. The distribution plot created with 'sb.distplot' then shows how the transformed 'charges' values are spread after the transformation.
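One practical consequence worth noting: the model will now learn and predict charges on the log scale, so predictions must be passed through np.expm1 (the inverse of np.log1p) to get values back in the original currency units. A minimal sketch using the first row of the dataset:

Python3

# Sketch: mapping a log-scale value back to the original charge units
log_pred = 9.734236                  # the transformed charge of the first row
charge_pred = np.expm1(log_pred)     # inverse of np.log1p
print(charge_pred)                   # ~16884.92, the original charge of that row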

Mapping Binary Categorical Columns

Python3




# Mapping Categorical to Numerical Values
  
# Map 'sex' column values ('male' to 0, 'female' to 1)
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
  
# Map 'smoker' column values ('no' to 0, 'yes' to 1)
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})
  
# Display the DataFrame's first few rows to show the transformations
df.head()


Output:

   Id  age  sex     bmi  children  smoker     region   charges
0   1   19    1  27.900         0       1  southwest  9.734236
1   2   18    0  33.770         1       0  southeast  7.453882
2   3   28    0  33.000         3       0  southeast  8.400763
3   4   33    0  22.705         0       0  northwest  9.998137
4   5   32    0  28.880         0       0  northwest  8.260455

This code performs categorical-to-numerical mapping for the ‘sex’ and ‘smoker’ columns, making the data suitable for machine learning algorithms that require numerical input. It also displays the initial rows of the DataFrame to illustrate the changes.

One-Hot Encoding the 'region' Column

Python3




# Perform one-hot encoding on the 'region' column
temp = pd.get_dummies(df['region']).astype('int')
  
# Concatenate the one-hot encoded columns with the original DataFrame
df = pd.concat([df, temp], axis=1)


This code applies one-hot encoding to the 'region' column, turning the categorical region values into binary columns, one per region. The resulting one-hot encoded columns are concatenated with the original DataFrame, expanding the dataset with a binary feature for each region.

Python3




# Remove 'Id' and 'region' columns from the DataFrame
df.drop(['Id', 'region'], inplace=True, axis=1)
  
# Display the updated DataFrame
print(df.head())


Output:

   age  sex     bmi  children  smoker   charges  northeast  northwest  southeast  southwest
0   19    1  27.900         0       1  9.734236          0          0          0          1
1   18    0  33.770         1       0  7.453882          0          0          1          0
2   28    0  33.000         3       0  8.400763          0          0          1          0
3   33    0  22.705         0       0  9.998137          0          1          0          0
4   32    0  28.880         0       0  8.260455          0          1          0          0

Because the 'region' column has more than two categories, ordinal (label) encoding would have implied an ordering between regions that does not exist, which is why we one-hot encoded it instead. After encoding, the original 'region' column is redundant, and the 'Id' column carries no predictive information, so both are dropped.

Splitting Data

Python3




# Define the features
features = df.drop('charges', axis=1)
  
# Define the target variable as 'charges'
target = df['charges']
  
# Split the data into training and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(features, target,
                                                  random_state=2023,
                                                  test_size=0.25)
  
# Display the shapes of the training and validation sets
X_train.shape, X_val.shape


Output:

((1003, 11), (335, 11))

To evaluate the model's performance as training goes on, we split the dataset in a 75:25 ratio; these splits will then be used to create the LightGBM datasets and train the model.

Feature scaling

Python3




# Standardize features

# Use StandardScaler to scale the training and validation data
scaler = StandardScaler()

# Fit the StandardScaler to the training data
scaler.fit(X_train)

# Transform both the training and validation data
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)


This code fits the StandardScaler to the training data to calculate the mean and standard deviation and then transforms both the training and validation data using these calculated values to ensure consistent scaling between the two datasets.

Dataset Preparation

Python3




# Create a LightGBM dataset for training with features X_train and labels Y_train
train_data = lgb.Dataset(X_train, label=Y_train)
  
# Create a LightGBM dataset for testing with features X_val and labels Y_val,
# and specify the reference dataset as train_data for consistent evaluation
test_data = lgb.Dataset(X_val, label=Y_val, reference=train_data)


Now let's wrap the training and validation splits in lgb.Dataset objects. This prepares the features and labels in the format LightGBM expects for training and evaluation, and passing the training dataset as the reference for the validation dataset keeps the feature binning consistent between the two.

Regression Model using LightGBM

One last thing remains before training: we need to define the parameters that will be passed to the model and control the training process.

Python3




# Define a dictionary of parameters for configuring the LightGBM regression model.
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
}


Let's take a quick look at the parameters that have been passed to the model.

  • objective – This defines what type of task the model is trained for. Here 'regression' is passed for a regression task; for classification we would instead pass 'binary' or 'multiclass' (see the sketch after this list).
  • metric – The metric the model optimizes and reports. It also lets us track the model's performance on the validation data (if passed) as training progresses.
  • boosting_type – The boosting method LightGBM uses to train the model, for example 'gbdt' (Gradient Boosted Decision Trees, the default), 'rf' (random forest), or 'dart' (Dropouts meet Multiple Additive Regression Trees).
  • num_leaves – The maximum number of leaf nodes in a tree; the default value is 31.
  • learning_rate – The usual learning-rate hyperparameter that controls how quickly the model learns.
  • feature_fraction – The fraction of features randomly selected when building each tree. Setting it to 0.9 means only 90% of the features are considered per tree, which helps reduce overfitting.
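For comparison, here is a hedged sketch of how the same dictionary might look for a binary classification task ('binary' and 'multiclass' are LightGBM's actual objective names; the metric and values are only illustrative and are not used elsewhere in this article):

Python3

# Illustrative parameter set for a binary classification task (not used in this article)
clf_params = {
    'objective': 'binary',     # or 'multiclass' (with num_class set) for multi-class tasks
    'metric': 'auc',           # e.g. 'auc' or 'binary_logloss'
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
}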

Let's train the model for 100 boosting rounds on the training data, passing the validation data as well so we can monitor the model's performance on unseen data while training progresses. This helps us keep a check on the training progress.

Python3




# Set the number of rounds and train the model with early stopping
# (early_stopping_rounds works in LightGBM 3.x; in LightGBM >= 4.0,
#  pass callbacks=[lgb.early_stopping(10)] instead)
num_round = 100
bst = lgb.train(params, train_data, num_round,
                valid_sets=[test_data],
                early_stopping_rounds=10)


Output:

[70]    valid_0's rmse: 0.387332
[71]    valid_0's rmse: 0.387193
[72]    valid_0's rmse: 0.387407
[73]    valid_0's rmse: 0.387696
[74]    valid_0's rmse: 0.388172
[75]    valid_0's rmse: 0.388142
[76]    valid_0's rmse: 0.388688
Early stopping, best iteration is:
[66]    valid_0's rmse: 0.386691

This code snippet trains a LightGBM model using the supplied parameters (params) and training data (train_data) for the given number of boosting rounds (num_round). Early stopping is used: the model monitors its performance on the validation dataset (test_data) and terminates training if no improvement is seen for 10 rounds. This stops training once it reaches its optimal point and helps prevent overfitting. Here the best Root Mean Square Error on the validation data is 0.386691, a good score given that the target is on the log scale.
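Since early stopping found the best model at iteration 66, predictions from the Booster can explicitly use that iteration via its best_iteration attribute. A minimal sketch, reusing the variables defined above:

Python3

# Predict on the validation features using the best iteration found by early stopping
val_pred = bst.predict(X_val, num_iteration=bst.best_iteration)
print("Validation RMSE (best iteration):",
      np.sqrt(np.mean((Y_val - val_pred) ** 2)))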

Prediction and Evaluation of Model

Python3




# Import necessary libraries for calculating mean squared error and using the LightGBM regressor.
from sklearn.metrics import mean_squared_error as mse
from lightgbm import LGBMRegressor
  
# Create an instance of the LightGBM Regressor with the RMSE metric.
model = LGBMRegressor(metric='rmse')
  
# Train the model using the training data.
model.fit(X_train, Y_train)
  
# Make predictions on the training and validation data.
y_train = model.predict(X_train)
y_val = model.predict(X_val)


Here, the LightGBM library's scikit-learn interface (LGBMRegressor) is used for regression modelling. The model is trained on the provided training data, and predictions are made on both the training and validation datasets.

Validation of the Model

Python3




# Calculate and print the Root Mean Squared Error (RMSE) for training and validation predictions.
print("Training RMSE: ", np.sqrt(mse(Y_train, y_train)))
print("Validation RMSE: ", np.sqrt(mse(Y_val, y_val)))


Output:

Training RMSE:  0.2331835443343122
Validation RMSE: 0.40587871797893876

Here, this code computes and displays the RMSE, a measure of prediction accuracy, for both the training and validation datasets. It assesses how well the LightGBM regression model performs on the data, with lower RMSE values indicating a better fit; the gap between training and validation RMSE also hints at some overfitting on the training set.
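Because the target was log-transformed earlier, these RMSE values are on the log scale. If an error in the original currency units is preferred, the targets and predictions can first be mapped back with np.expm1, as in this rough sketch reusing the variables above:

Python3

# Sketch: RMSE expressed in the original charge units (inverting the log1p transform)
train_rmse_usd = np.sqrt(mse(np.expm1(Y_train), np.expm1(y_train)))
val_rmse_usd = np.sqrt(mse(np.expm1(Y_val), np.expm1(y_val)))
print("Training RMSE (original units):  ", train_rmse_usd)
print("Validation RMSE (original units):", val_rmse_usd)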

Conclusion

In conclusion, using LightGBM for regression provides a robust and efficient approach to predictive modelling. Beginning with data preparation and feature engineering, the process proceeds to training a LightGBM regressor configured with specific hyperparameters and an evaluation metric. The model's ability to learn efficiently from the training data, make predictions, and offer interpretability through feature importance scores is a valuable asset. Evaluation metrics like RMSE help gauge prediction accuracy. LightGBM's speed, scalability, and strong predictive performance make it a compelling choice, particularly when handling large datasets and aiming for high-quality regression results in real-world applications.


