
Regression using LightGBM

In this article, we will learn about one of the state-of-the-art machine learning models: LightGBM, or Light Gradient Boosting Machine. Gradient boosting has been refined repeatedly, producing models such as XGBoost (eXtreme Gradient Boosting), but with LightGBM we can achieve similar or better results with far less computation and train our model on even bigger datasets in less time. Let’s see what LightGBM is and how we can perform regression using it.

What is LightGBM?

LightGBM, or ‘Light Gradient Boosting Machine’, is an open-source, high-performance gradient boosting framework designed for efficient and scalable machine learning tasks. It is specially tailored for speed and accuracy, making it a popular choice for structured (tabular) data across diverse domains.



Key characteristics of LightGBM include its ability to handle large datasets with millions of rows and columns, support for parallel and distributed computing, and optimized gradient-boosting algorithms. LightGBM is known for its excellent speed and low memory consumption, thanks to histogram-based techniques and leaf-wise tree growth.

How Does LightGBM Work?

LightGBM grows its decision trees leaf-wise, which means that at each step only the single leaf with the largest gain from splitting is split further. Leaf-wise trees can overfit, especially on smaller datasets, so overfitting is usually controlled by limiting the tree depth. LightGBM also buckets continuous feature values into discrete bins using a histogram of their distribution; instead of evaluating every data point, it iterates over the bins to compute the gain and find splits, an optimization that also benefits sparse datasets. Another element of LightGBM is Exclusive Feature Bundling, in which mutually exclusive features are combined to reduce dimensionality and speed up processing.
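To make the tree-growth controls concrete, the sketch below shows the main LightGBM parameters that govern leaf-wise growth and histogram binning. The values are purely illustrative defaults, not tuned recommendations, and would simply be merged into the parameter dictionary passed to training later in this article.

# Illustrative sketch only: parameters controlling leaf-wise growth and
# histogram binning (values are defaults, not tuned recommendations).
tree_growth_params = {
    'num_leaves': 31,        # cap on the number of leaves per tree (leaf-wise growth)
    'max_depth': -1,         # -1 means unlimited depth; set a positive value to curb overfitting
    'min_data_in_leaf': 20,  # minimum number of samples a leaf must contain
    'max_bin': 255,          # number of histogram bins used to bucket feature values
}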



LightGBM also provides an alternative algorithm for sampling the dataset: GOSS (Gradient-based One-Side Sampling). GOSS gives more weight to data points with larger gradients when computing the gain, so instances that the model has not yet learned well contribute more to training. To maintain accuracy, data points with small gradients are not all discarded; some are randomly kept while the rest are dropped. At the same sampling rate, this approach generally outperforms uniform random sampling.
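As a rough illustration of how GOSS would be switched on, the hypothetical parameter dictionary below uses the LightGBM 3.x convention of selecting it through the boosting type; 'top_rate' and 'other_rate' control how many large-gradient and small-gradient rows are kept (the values shown are LightGBM’s defaults).

# Hypothetical sketch: enabling GOSS sampling (LightGBM 3.x style).
goss_params = {
    'objective': 'regression',
    'boosting_type': 'goss',  # Gradient-based One-Side Sampling instead of plain GBDT
    'top_rate': 0.2,          # keep the 20% of rows with the largest gradients
    'other_rate': 0.1,        # randomly sample 10% of the remaining rows
}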

Implementation of LightGBM

In this article, we will use a medical insurance cost dataset (‘medical_cost.csv’) to perform a regression task with the LightGBM algorithm. But before we can use the LightGBM model, we first have to install the lightgbm library using the command below:

Installing Libraries

!pip install lightgbm==3.3.5

Importing Libraries and Dataset

Python libraries make it very easy for us to handle the data and perform typical and complex tasks with a single line of code.




#importing libraries 
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import lightgbm as lgb
  
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
  
import warnings
warnings.filterwarnings('ignore')

Now let’s use the pandas dataframe to load the dataset and then show the first five rows of it.

Loading Dataset and Retrieving Information




# Read the data from a CSV file named 'medical_cost.csv' into a DataFrame 'df'.
df = pd.read_csv('medical_cost.csv')
  
# Display the first few rows of the DataFrame to provide a data preview.
df.head()

Output:

   Id  age     sex     bmi  children smoker     region      charges
0   1   19  female  27.900         0    yes  southwest  16884.92400
1   2   18    male  33.770         1     no  southeast   1725.55230
2   3   28    male  33.000         3     no  southeast   4449.46200
3   4   33    male  22.705         0     no  northwest  21984.47061
4   5   32    male  28.880         0     no  northwest   3866.85520

Here the code reads the contents of the CSV file and loads them into a DataFrame called “df” using the pandas package. The first few rows of the DataFrame are then displayed using the ‘head()’ method to provide a short preview of the data.




#shape of the dataframe
df.shape

Output:

(1338, 8)

Here, the ‘df.shape’ attribute gives the dimensions of the DataFrame as (rows, columns), confirming 1,338 rows and 8 columns.




df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Id        1338 non-null   int64
 1   age       1338 non-null   int64
 2   sex       1338 non-null   object
 3   bmi       1338 non-null   float64
 4   children  1338 non-null   int64
 5   smoker    1338 non-null   object
 6   region    1338 non-null   object
 7   charges   1338 non-null   float64
dtypes: float64(2), int64(3), object(3)
memory usage: 83.8+ KB

By using the df.info() function we can see each column’s name and data type along with its count of non-null values, which tells us whether any values are missing; here every column has 1338 non-null entries, so there are no missing values.
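If you prefer the missing-value counts as a standalone summary rather than reading them off df.info(), a quick check like the one below works; for this dataset every column should report zero missing values.

# Count missing values in each column (expected to be all zeros here).
print(df.isnull().sum())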




#describing the data
print(df.describe())

Output:

                Id          age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000  1338.000000   1338.000000
mean    669.500000    39.207025    30.663397     1.094918  13270.422265
std     386.391641    14.049960     6.098187     1.205493  12110.011237
min       1.000000    18.000000    15.960000     0.000000   1121.873900
25%     335.250000    27.000000    26.296250     0.000000   4740.287150
50%     669.500000    39.000000    30.400000     1.000000   9382.033000
75%    1003.750000    51.000000    34.693750     2.000000  16639.912515
max    1338.000000    64.000000    53.130000     5.000000  63770.428010

The df.describe() function gives a statistical summary of the DataFrame df. It reports key statistics such as the count, mean, standard deviation, minimum, quartiles and maximum of each numerical column, providing a preliminary understanding of the data’s central tendencies and spread.

Exploratory Data Analysis

EDA is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations. While performing EDA on this dataset we will look at how the independent features relate to one another and to the target.

The charges levied by the medical insurance company can be better understood by calculating the average charge for each category of every categorical column in the dataset.

Categorical Column Mean Analysis




# Define a list of columns to analyze
sample = ['sex', 'children', 'smoker', 'region']
  
# Iterate through the specified columns
for i, col in enumerate(sample):
    # Group the DataFrame by the current column and calculate the mean of 'charges'
    grouped_data = df[[col, 'charges']].groupby(col).mean()
      
    # Print the mean charges for each group within the current column
    print(grouped_data)
      
    # Print a blank line to separate the output for different columns
    print()

Output:

             charges
sex
female  12569.578844
male    13956.751178

               charges
children
0         12365.975602
1         12731.171832
2         15073.563734
3         15355.318367
4         13850.656311
5          8786.035247

            charges
smoker
no      8434.268298
yes    32050.231832

                charges
region
northeast  13406.384516
northwest  12417.575374
southeast  14735.411438
southwest  12346.937377

Here, the code examines the medical expense data in the DataFrame ‘df’, focusing on the categorical columns ‘sex’, ‘children’, ‘smoker’ and ‘region’. It groups the data by each of these columns in turn and calculates the mean medical charges within every category, and the blank line printed between groups keeps the output readable. The results offer useful insight into how these categorical variables affect the typical medical expenses: for instance, smokers are charged far more on average (about 32,050) than non-smokers (about 8,434), and average charges also vary noticeably across regions.
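As an optional extension of the same idea, pairing each group mean with the number of rows in that group makes the averages easier to judge, since categories with very few policyholders (for example, families with five children) produce less reliable means. This is just a sketch using the same DataFrame.

# Optional sketch: mean charges alongside the row count of each group.
for col in ['sex', 'children', 'smoker', 'region']:
    summary = df.groupby(col)['charges'].agg(['mean', 'count'])
    print(summary, end='\n\n')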

Categorical Column Visualization




# Set the figure size for the subplots
plt.subplots(figsize=(15, 15))
  
# Define the list of columns to visualize
sample = ['sex', 'children', 'smoker', 'region']
  
# Loop through the specified columns
for i, col in enumerate(sample):
    # Create subplots in a 3x2 grid
    plt.subplot(3, 2, i + 1)
      
    # Create a countplot for the current column
    sb.countplot(data=df, x=col)
      
# Adjust subplot layout for better presentation
plt.tight_layout()
  
# Display the subplots
plt.show()

Output:

This code creates a grid of subplots with a predetermined figure size (3 rows and 2 columns). It then iterates over a list of categorical columns (‘sex’, ‘children’, ‘smoker’ and ‘region’) and, for each one, draws a countplot showing how its values are distributed in the DataFrame ‘df’. The ‘plt.tight_layout()’ call improves the spacing of the subplots, and ‘plt.show()’ displays them, giving a visual summary of how the categorical values are distributed in the dataset.

Distribution Analysis of Numeric Columns




# Set the figure size for the subplots
plt.subplots(figsize=(15, 15))
  
# Define the list of numeric columns to visualize
sample = ['age', 'bmi', 'charges']
  
# Loop through the specified columns
for i, col in enumerate(sample):
    # Create subplots in a 3x2 grid
    plt.subplot(3, 2, i + 1)
      
    # Create a histogram with a KDE curve for the current numeric column
    # (histplot replaces the deprecated distplot)
    sb.histplot(df[col], kde=True)
      
# Adjust subplot layout for better presentation
plt.tight_layout()
  
# Display the subplots
plt.show()

Output:

Here, this code creates a grid of subplots with a predetermined figure size (3 rows and 2 columns). Then, after iterating through a number of numerical columns (including “age,” “bmi,” and “charges”), it generates a distribution plot (histogram) for each column that shows how values within that particular column of the DataFrame (abbreviated “df”) are distributed. The subplots are then displayed, giving visual insights into the distribution of the data for these numerical aspects. The ‘plt.tight_layout()’ function is used to improve the layout of the subplots for better presentation.

Data Preprocessing

Data preprocessing, which involves preparing raw data for analysis and modeling, is an essential stage in the pipeline for data analysis and machine learning. It plays a crucial part in raising the accuracy and dependability of the data, which eventually improves the effectiveness of machine learning models. Let’s see how to perform it:

Log Transformation and Distribution Plot




# Apply the natural logarithm transformation to the 'charges' column
df['charges'] = np.log1p(df['charges'])
  
# Create a distribution plot for the transformed 'charges' column
# (histplot with a KDE curve replaces the deprecated distplot)
sb.histplot(df['charges'], kde=True)
  
# Display the distribution plot
plt.show()

Output:

The ‘np.log1p’ function is used in this code to apply a natural logarithm transformation (log(1 + x)) to the ‘charges’ column of the DataFrame (‘df’). The age and bmi data are roughly normally distributed, but charges is heavily right-skewed, so the logarithmic transformation is applied to reduce the skew and bring its distribution closer to normal. A distribution plot of the transformed ‘charges’ column, drawn with ‘sb.histplot’, then shows how the values are distributed after the transformation.
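To double-check that the transformation reduced the skew, you can look at the skewness statistic of the transformed column; values near zero indicate a roughly symmetric distribution. This is an optional check, and np.expm1 inverts the transform whenever charges are needed back on the original dollar scale.

# Optional check: skewness of the log-transformed target.
print("Skewness after log1p:", df['charges'].skew())

# np.expm1 undoes the log1p transform (back to the original dollar scale).
charges_in_dollars = np.expm1(df['charges'])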

Encoding Binary Categorical Columns




# Mapping Categorical to Numerical Values
  
# Map 'sex' column values ('male' to 0, 'female' to 1)
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
  
# Map 'smoker' column values ('no' to 0, 'yes' to 1)
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})
  
# Display the DataFrame's first few rows to show the transformations
df.head()

Output:

   Id  age  sex     bmi  children  smoker     region   charges
0   1   19    1  27.900         0       1  southwest  9.734236
1   2   18    0  33.770         1       0  southeast  7.453882
2   3   28    0  33.000         3       0  southeast  8.400763
3   4   33    0  22.705         0       0  northwest  9.998137
4   5   32    0  28.880         0       0  northwest  8.260455

This code performs categorical-to-numerical mapping for the ‘sex’ and ‘smoker’ columns, making the data suitable for machine learning algorithms that require numerical input. It also displays the initial rows of the DataFrame to illustrate the changes.

One-Hot Encoding the ‘region’ Column




# Perform one-hot encoding on the 'region' column
temp = pd.get_dummies(df['region']).astype('int')
  
# Concatenate the one-hot encoded columns with the original DataFrame
df = pd.concat([df, temp], axis=1)

This code applies one-hot encoding to the ‘region’ column, turning the categorical region values into binary columns, one per distinct region. The resulting one-hot encoded columns are then concatenated with the original DataFrame, expanding the dataset with a binary feature for each region.
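As an aside, the two mapping steps above and this one-hot step could also be collapsed into a single pd.get_dummies call over all three categorical columns. The snippet below is an illustrative alternative rather than the route taken in this article, and 'raw_df' is a hypothetical name for the DataFrame as it looked right after pd.read_csv.

# Illustrative alternative (not the path used above): encode every categorical
# column of the freshly loaded DataFrame in one call. drop_first=True avoids
# perfectly collinear dummy columns.
encoded_df = pd.get_dummies(raw_df, columns=['sex', 'smoker', 'region'],
                            drop_first=True)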




# Remove 'Id' and 'region' columns from the DataFrame
df.drop(['Id', 'region'], inplace=True, axis=1)
  
# Display the updated DataFrame
print(df.head())

Output:

   age  sex     bmi  children  smoker   charges  northeast  northwest  \
0   19    1  27.900         0       1  9.734236          0          0
1   18    0  33.770         1       0  7.453882          0          0
2   28    0  33.000         3       0  8.400763          0          0
3   33    0  22.705         0       0  9.998137          0          1
4   32    0  28.880         0       0  8.260455          0          1

   southeast  southwest
0          0          1
1          1          0
2          1          0
3          0          0
4          0          0

The ‘region’ column was one-hot encoded rather than label encoded because it has more than two categories; assigning it ordinal numbers would have implied an ordering or preference among regions that does not exist. With that done, the uninformative ‘Id’ column and the original ‘region’ column are dropped, leaving a fully numeric DataFrame.

Splitting Data




# Define the features
features = df.drop('charges', axis=1)
  
# Define the target variable as 'charges'
target = df['charges']
  
# Split the data into training and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(features, target,
                                                  random_state=2023,
                                                  test_size=0.25)
  
# Display the shapes of the training and validation sets
X_train.shape, X_val.shape

Output:

((1003, 11), (335, 11))

To evaluate the model’s performance as training progresses, we split the dataset in a 75:25 ratio; the resulting training and validation splits will be used to build the LightGBM datasets and train the model.

Feature Scaling




# Standardize Features
  
# Use StandardScaler to scale the training and validation data
scaler = StandardScaler()
#Fit the StandardScaler to the training data
scaler.fit(X_train)
# transform both the training and validation data
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)

This code fits the StandardScaler to the training data to calculate the mean and standard deviation and then transforms both the training and validation data using these calculated values to ensure consistent scaling between the two datasets.

Dataset Preparation




# Create a LightGBM dataset for training with features X_train and labels Y_train
train_data = lgb.Dataset(X_train, label=Y_train)
  
# Create a LightGBM dataset for testing with features X_val and labels Y_val,
# and specify the reference dataset as train_data for consistent evaluation
test_data = lgb.Dataset(X_val, label=Y_val, reference=train_data)

Now, using the training and validation splits, we create LightGBM dataset objects with lgb.Dataset. These objects package the features and labels in the format LightGBM uses internally for training and evaluation, and passing the training set as the reference for the validation set ensures both are binned consistently.
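It is also worth knowing that lgb.Dataset can consume pandas categorical columns directly through the categorical_feature argument, which would let LightGBM handle ‘sex’, ‘smoker’ and ‘region’ natively without manual mapping or one-hot encoding. The snippet below is only a sketch of that alternative workflow; 'raw_df' again stands for a hypothetical copy of the DataFrame as originally loaded, with its string columns intact.

# Sketch of LightGBM's native categorical handling (alternative workflow).
for col in ['sex', 'smoker', 'region']:
    raw_df[col] = raw_df[col].astype('category')

native_train = lgb.Dataset(raw_df.drop(['Id', 'charges'], axis=1),
                           label=raw_df['charges'],
                           categorical_feature=['sex', 'smoker', 'region'])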

Regression Model using LightGBM

Now one last thing remains: defining the parameters that will be passed to the model for the training process.




# Define a dictionary of parameters for configuring the LightGBM regression model.
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
}

Let’s take a quick look at the parameters being passed to the model: ‘objective’ is set to ‘regression’ because this is a regression task; ‘metric’ is ‘rmse’, so the root mean squared error is tracked during training; ‘boosting_type’ is ‘gbdt’, the standard gradient-boosted decision tree algorithm; ‘num_leaves’ caps each tree at 31 leaves; a ‘learning_rate’ of 0.05 controls the contribution of each boosting round; and a ‘feature_fraction’ of 0.9 makes LightGBM randomly select 90% of the features when building each tree.

Let’s train the model for up to 100 boosting rounds on the training data, passing the validation data as well so we can watch the model’s performance on unseen data while training goes on. This helps us keep a check on the training progress.




# Set the number of rounds and train the model with early stopping
num_round = 100
bst = lgb.train(params, train_data, num_round, valid_sets=[
                test_data], early_stopping_rounds=10)

Output:

[70]    valid_0's rmse: 0.387332
[71]    valid_0's rmse: 0.387193
[72]    valid_0's rmse: 0.387407
[73]    valid_0's rmse: 0.387696
[74]    valid_0's rmse: 0.388172
[75]    valid_0's rmse: 0.388142
[76]    valid_0's rmse: 0.388688
Early stopping, best iteration is:
[66]    valid_0's rmse: 0.386691

This code snippet trains a LightGBM model using the supplied parameters (params) and training data (train_data), with the number of boosting rounds set by num_round. Early stopping is used: the model monitors its performance on the validation dataset (test_data) and terminates training if no improvement is seen for 10 consecutive rounds, which stops training near its optimal point and helps prevent overfitting. Here the best validation RMSE is 0.386691, reached at iteration 66, which is a good score given that the target is the log-transformed charges.
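Note that newer LightGBM releases (4.x) remove the early_stopping_rounds keyword from lgb.train in favour of callbacks. The call below is an equivalent sketch using the callback API, which already works on the 3.3.x version pinned above.

# Equivalent training call using callbacks (required form in LightGBM 4.x,
# also available in 3.3.x).
bst = lgb.train(
    params,
    train_data,
    num_boost_round=100,
    valid_sets=[test_data],
    callbacks=[
        lgb.early_stopping(stopping_rounds=10),  # stop after 10 rounds without improvement
        lgb.log_evaluation(period=10),           # log the validation metric every 10 rounds
    ],
)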

Prediction and Evaluation of Model




# Import necessary libraries for calculating mean squared error and using the LightGBM regressor.
from sklearn.metrics import mean_squared_error as mse
from lightgbm import LGBMRegressor
  
# Create an instance of the LightGBM Regressor with the RMSE metric.
model = LGBMRegressor(metric='rmse')
  
# Train the model using the training data.
model.fit(X_train, Y_train)
  
# Make predictions on the training and validation data.
y_train = model.predict(X_train)
y_val = model.predict(X_val)

Here, the scikit-learn style LGBMRegressor API from the lightgbm library is used for the regression model. The model is trained on the provided training data and predictions are made on both the training and validation datasets.
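If you would rather not refit a separate LGBMRegressor, predictions can also be made straight from the early-stopped Booster trained earlier (bst), restricted to its best iteration; the lines below are an optional sketch of that route.

# Optional: predict from the early-stopped Booster using its best iteration.
booster_val_pred = bst.predict(X_val, num_iteration=bst.best_iteration)
print("Booster validation RMSE:", np.sqrt(mse(Y_val, booster_val_pred)))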

Validation of the Model




# Calculate and print the Root Mean Squared Error (RMSE) for training and validation predictions.
print("Training RMSE: ", np.sqrt(mse(Y_train, y_train)))
print("Validation RMSE: ", np.sqrt(mse(Y_val, y_val)))

Output:

Training RMSE:  0.2331835443343122
Validation RMSE: 0.40587871797893876

Here, this code computes and displays the RMSE, a measure of prediction accuracy, for both the training and validation datasets. It assesses how well the LightGBM regression model fits the data, with lower RMSE values indicating a better fit; the gap between the training RMSE (about 0.233) and the validation RMSE (about 0.406) suggests a degree of overfitting to the training set.
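LightGBM also exposes per-feature importance scores, which support the interpretability point raised in the conclusion below; the sketch here plots them for the fitted LGBMRegressor, supplying the column names manually because the StandardScaler converted X_train into a plain NumPy array.

# Plot feature importances of the fitted model, labelled with the original
# column names (the scaled X_train no longer carries them).
importances = pd.Series(model.feature_importances_, index=features.columns)
importances.sort_values().plot(kind='barh', title='LightGBM feature importance')
plt.tight_layout()
plt.show()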

Conclusion

In conclusion, the utilization of LightGBM for regression tasks presents a robust and efficient approach to predictive modelling. Beginning with data preparation and feature engineering, the process proceeds to training a LightGBM regressor configured with specific hyperparameters and evaluation metrics. The model’s ability to learn efficiently from the training data, make predictions, and provide interpretability through feature importance scores is a valuable asset. Evaluation metrics like RMSE help gauge prediction accuracy. LightGBM’s speed, scalability and strong predictive performance make it a compelling choice, particularly when handling large datasets and aiming for high-quality regression results in real-world applications.

