
Lasso Regression in R Programming

Last Updated : 19 Dec, 2023

Lasso regression is a regression algorithm that uses shrinkage to produce simple, sparse models (i.e., models with fewer parameters). In shrinkage, coefficient estimates are shrunk towards a central point, which for the lasso is zero. Lasso regression is a regularized regression algorithm that performs L1 regularization, adding a penalty equal to the absolute value of the magnitude of the coefficients.

“LASSO” stands for Least Absolute Shrinkage and Selection Operator. Lasso regression works well for models showing high levels of multicollinearity, or when you want to automate parts of model selection such as variable selection or parameter elimination. The lasso solution is a quadratic programming problem, best solved with statistical software such as R, MATLAB, etc. Its defining ability is to select predictors by shrinking some coefficients exactly to zero. The lasso estimate minimizes the following objective (a small R sketch after the definitions below evaluates it directly):

\text{minimize} \left( \frac{1}{2N} \sum_{i=1}^{N} (y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right)

Here:

  • N: the number of observations.
  • p: the number of predictors.
  • y_i: the response variable for the i-th observation.
  • x_{ij}: the value of the j-th predictor for the i-th observation.
  • β_0: the intercept term.
  • β_j: the coefficients of the predictors.
  • λ: the regularization parameter controlling the strength of the L1 penalty.
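
To make the objective concrete, here is a minimal R sketch that evaluates the formula directly on toy data (illustrative only; glmnet fits the model with a much more efficient coordinate-descent algorithm, and the data and coefficient values below are hypothetical):

R

# The lasso objective above, written as a plain R function
lasso_objective <- function(beta0, beta, X, y, lambda) {
  N <- length(y)
  resid <- y - beta0 - X %*% beta   # y_i - beta_0 - sum_j x_ij * beta_j
  sum(resid^2) / (2 * N) +          # (1 / 2N) * residual sum of squares
    lambda * sum(abs(beta))         # L1 penalty on the coefficient magnitudes
}

# Toy usage: 5 observations, 2 predictors
set.seed(1)
X <- matrix(rnorm(10), nrow = 5)
y <- rnorm(5)
lasso_objective(beta0 = 0, beta = c(0.5, -0.2), X = X, y = y, lambda = 0.1)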

The algorithm minimizes the residual sum of squares subject to a constraint on the sum of the absolute values of the coefficients. Some coefficients are shrunk exactly to zero, which removes the corresponding predictors and yields a sparse regression model. The tuning parameter lambda controls the strength of the L1 penalty; lambda is essentially the amount of shrinkage (a short glmnet sketch after this list illustrates the effect):

  • When lambda = 0, no parameters are eliminated; the estimate equals ordinary least squares.
  • As lambda increases, more and more coefficients are set to zero and eliminated, and bias increases.
  • When lambda = infinity, all coefficients are eliminated.
  • As lambda decreases, variance increases.
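
The following small sketch shows this shrinkage with glmnet on synthetic data (hypothetical, not the Big Mart dataset): as lambda grows, the coefficients of weak predictors are driven exactly to zero.

R

# Lasso coefficient shrinkage as lambda increases (synthetic data)
library(glmnet)

set.seed(42)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(100)  # only the first two predictors matter

fit <- glmnet(X, y, alpha = 1)  # alpha = 1 selects the lasso penalty

coef(fit, s = 0.01)  # small lambda: most coefficients remain non-zero
coef(fit, s = 1)     # large lambda: the noise predictors are shrunk exactly to zero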

Also, if an intercept is included in the model, it is not penalized. Now let’s implement lasso regression in the R programming language.

Implementation in R

The Dataset

Big Mart dataset consists of 1559 products across 10 stores in different cities. Certain attributes of each product and store have been defined. It consists of 12 features:

  • Item_Identifier: a unique product ID assigned to every distinct item.
  • Item_Weight: the weight of the product.
  • Item_Fat_Content: whether the product is low fat or not.
  • Item_Visibility: the percentage of the total display area of all products in a store allocated to the particular product.
  • Item_Type: the food category to which the item belongs.
  • Item_MRP: the Maximum Retail Price (list price) of the product.
  • Outlet_Identifier: a unique store ID, an alphanumeric string of length 6.
  • Outlet_Establishment_Year: the year in which the store was established.
  • Outlet_Size: the size of the store in terms of ground area covered.
  • Outlet_Location_Type: the size of the city in which the store is located.
  • Outlet_Type: whether the outlet is just a grocery store or some sort of supermarket.
  • Item_Outlet_Sales: sales of the product in the particular store (the target variable).

Load the required packages

R

# load packages
library(data.table)  # reading and manipulating data
library(dplyr)       # data manipulation and joining
library(glmnet)      # lasso (and ridge) regression
library(ggplot2)     # plotting
library(caret)       # modeling and preprocessing
library(xgboost)     # building XGBoost models
library(e1071)       # skewness
library(cowplot)     # combining multiple plots

                    

With all the required libraries loaded, we can now load our dataset.

Loading Dataset

R

# Loading data
train = fread("C:\\Users\\GFG19565\\Downloads\\Train.csv")
test = fread("C:\\Users\\GFG19565\\Downloads\\Test.csv")
 
# Structure
str(train)

                    

Output: 

Classes ‘data.table’ and 'data.frame':    8523 obs. of  12 variables:
$ Item_Identifier : chr "FDA15" "DRC01" "FDN15" "FDX07" ...
$ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content : chr "Low Fat" "Regular" "Low Fat" "Regular" ...
$ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
$ Item_Type : chr "Dairy" "Soft Drinks" "Meat" "Fruits and Vegetables" ...
$ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Identifier : chr "OUT049" "OUT018" "OUT049" "OUT010" ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
$ Outlet_Size : chr "Medium" "Medium" "Medium" "" ...
$ Outlet_Location_Type : chr "Tier 1" "Tier 3" "Tier 1" "Tier 3" ...
$ Outlet_Type : chr "Supermarket Type1" "Supermarket Type2" "Supermarket Type1" "Grocery Store"
$ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
- attr(*, ".internal.selfref")=<externalptr>

Performing EDA and Basic operations on dataset

R

# Combining datasets: first add Item_Outlet_Sales to test data
# so that train and test have the same columns
test[, Item_Outlet_Sales := NA]
 
combi = rbind(train, test)
 
# Missing Value Treatment
missing_index = which(is.na(combi$Item_Weight))
for(i in missing_index)
{
  item = combi$Item_Identifier[i]
  combi$Item_Weight[i] =
      mean(combi$Item_Weight[combi$Item_Identifier == item],
                                                 na.rm = T)
}
colSums(is.na(combi))

                    

Output:

Item_Identifier                 0
Item_Weight                     0
Item_Fat_Contentlow fat         0
Item_Fat_ContentLow Fat         0
Item_Fat_Contentreg             0
Item_Fat_ContentRegular         0
Item_Visibility                 0
Item_MRP                        0
Outlet_IdentifierOUT013         0
Outlet_IdentifierOUT017         0
Outlet_IdentifierOUT018         0
Outlet_IdentifierOUT019         0
Outlet_IdentifierOUT027         0
Outlet_IdentifierOUT035         0
Outlet_IdentifierOUT045         0
Outlet_IdentifierOUT046         0
Outlet_IdentifierOUT049         0
Outlet_TypeSupermarket Type1    0
Outlet_TypeSupermarket Type2    0
Outlet_TypeSupermarket Type3    0
Outlet_Size_num                 0
Outlet_Location_Type_num        0
Item_Outlet_Sales            5681
Here we combine the train and test datasets, impute the missing Item_Weight values with the mean weight of the same item, and then check the column-wise missing-value counts in the combined dataset. Only Item_Outlet_Sales still contains missing values: these are the 5681 test rows where we set it to NA ourselves.

Replacing 0 in Item_Visibility with mean

R

# Replacing 0 in Item_Visibility with mean
zero_index = which(combi$Item_Visibility == 0)
for(i in zero_index)
{
  item = combi$Item_Identifier[i]
  combi$Item_Visibility[i] =
      mean(combi$Item_Visibility[combi$Item_Identifier == item],
                                                     na.rm = T)
}

                    

This loop finds every row where Item_Visibility is zero and replaces the zero with the mean visibility recorded for the same Item_Identifier. An equivalent vectorized version is sketched below.
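
As an aside, the same replacement can be written without an explicit loop using data.table’s grouped assignment; a broadly equivalent sketch (the loop above updates values one at a time, so results can differ marginally for items with several zeros):

R

# Loop-free sketch: replace zero visibilities with the item's group mean
combi[, Item_Visibility := ifelse(Item_Visibility == 0,
                                  mean(Item_Visibility, na.rm = TRUE),
                                  Item_Visibility),
      by = Item_Identifier]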

Perform Encoding

R

combi[, Outlet_Size_num := ifelse(Outlet_Size == "Small", 0,
                                 ifelse(Outlet_Size == "Medium",
                                                        1, 2))]
  
combi[, Outlet_Location_Type_num :=
         ifelse(Outlet_Location_Type == "Tier 3", 0,
                ifelse(Outlet_Location_Type == "Tier 2", 1, 2))]
  
combi[, c("Outlet_Size", "Outlet_Location_Type") := NULL]
 
# One Hot Encoding
# To convert categorical in numerical
ohe_1 = dummyVars("~.", data = combi[, -c("Item_Identifier",
                                          "Outlet_Establishment_Year",
                                          "Item_Type")], fullRank = T)
ohe_df = data.table(predict(ohe_1, combi[, -c("Item_Identifier",
                                           "Outlet_Establishment_Year",
                                           "Item_Type")]))
  
combi = cbind(combi[, "Item_Identifier"], ohe_df)
 
head(combi)

                    

Output:

   Item_Identifier Item_Weight Item_Fat_Contentlow fat
1: FDA15 9.300 0
2: DRC01 5.920 0
3: FDN15 17.500 0
4: FDX07 19.200 0
5: NCD19 8.930 0
6: FDP36 10.395 0
Item_Fat_ContentLow Fat Item_Fat_Contentreg Item_Fat_ContentRegular
1: 1 0 0
2: 0 0 1
3: 1 0 0
4: 0 0 1
5: 1 0 0
6: 0 0 1
Item_Visibility Item_MRP Outlet_IdentifierOUT013
1: 0.016047301 249.8092 0
2: 0.019278216 48.2692 0
3: 0.016760075 141.6180 0
4: 0.017834338 182.0950 0
5: 0.009779829 53.8614 1
6: 0.057058689 51.4008 0
Outlet_IdentifierOUT017 Outlet_IdentifierOUT018
1: 0 0
2: 0 1
3: 0 0
4: 0 0
5: 0 0
6: 0 1
Outlet_IdentifierOUT019 Outlet_IdentifierOUT027
1: 0 0
2: 0 0
3: 0 0
4: 0 0
5: 0 0
6: 0 0
Outlet_IdentifierOUT035 Outlet_IdentifierOUT045
1: 0 0
2: 0 0
3: 0 0
4: 0 0
5: 0 0
6: 0 0
Outlet_IdentifierOUT046 Outlet_IdentifierOUT049
1: 0 1
2: 0 0
3: 0 1
4: 0 0
5: 0 0
6: 0 0
Outlet_TypeSupermarket Type1 Outlet_TypeSupermarket Type2
1: 1 0
2: 0 1
3: 1 0
4: 0 0
5: 1 0
6: 0 1
Outlet_TypeSupermarket Type3 Item_Outlet_Sales Outlet_Size_num
1: 0 3735.1380 1
2: 0 443.4228 1
3: 0 2097.2700 1
4: 0 732.3800 2
5: 0 994.7052 2
6: 0 556.6088 1
Outlet_Location_Type_num
1: 2
2: 0
3: 2
4: 0
5: 0
6: 0

This process helps convert categorical information about outlet size and location into a format that machine learning models can understand, where numbers represent different categories. The new numerical variables, “Outlet_Size_num” and “Outlet_Location_Type_num,” are then used in the subsequent analysis.

  • It transforms categorical variables in the dataset, named combi, into a format suitable for machine learning. It creates binary columns for each category using a process called one-hot encoding.
  • The transformation is applied to all columns except “Item_Identifier,” “Outlet_Establishment_Year,” and “Item_Type.” The resulting dataset, ohe_df, contains the newly created binary columns. Finally, the original “Item_Identifier” is combined with the one-hot encoded variables to produce the dataset used in the rest of the analysis. (A toy example of dummyVars follows this list.)
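
To see what dummyVars does in isolation, here is a toy example on a hypothetical three-row data frame (the toy data is made up; it is not part of the Big Mart dataset):

R

# Toy illustration of one-hot encoding with caret::dummyVars
library(caret)

toy <- data.frame(size = c("Small", "Medium", "Small"),
                  price = c(10, 20, 15),
                  stringsAsFactors = TRUE)
ohe <- dummyVars("~ .", data = toy, fullRank = TRUE)
predict(ohe, toy)
# returns price plus a single binary indicator for size;
# with fullRank = TRUE the first factor level is dropped as the reference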

R

# Reduce skewness
# price_per_unit_wt is assumed to be the price per unit of weight,
# computed here as a temporary vector so the feature set is unchanged
price_per_unit_wt = combi$Item_MRP / combi$Item_Weight

skewness(combi$Item_Visibility)
skewness(price_per_unit_wt)
  
# add 1 before taking the log to avoid log(0)
combi[, Item_Visibility := log(Item_Visibility + 1)]
  
# Scaling and Centering data
num_vars = which(sapply(combi, is.numeric)) # index of numeric features
num_vars_names = names(num_vars)
  
combi_numeric = combi[, setdiff(num_vars_names,
                              "Item_Outlet_Sales"),
                               with = F]
  
prep_num = preProcess(combi_numeric,
                      method=c("center", "scale"))
combi_numeric_norm = predict(prep_num, combi_numeric)
  
# removing numeric independent variables
combi[, setdiff(num_vars_names,
                "Item_Outlet_Sales") := NULL]
combi = cbind(combi, combi_numeric_norm)
  
# splitting data back to train and test
train = combi[1:nrow(train)]
test = combi[(nrow(train) + 1):nrow(combi)]
  
# Removing Item_Outlet_Sales
test[, Item_Outlet_Sales := NULL]

                    

In this code we reduce skewness in the numeric variables of combi. We compute the skewness of two quantities, Item_Visibility and the price per unit weight, and apply a log(x + 1) transformation to Item_Visibility to reduce its skew.

The data is then scaled and centered using z-score normalization. Numeric independent variables are separated for preprocessing, and the preProcess function is applied for centering and scaling. The transformed numeric features are integrated back into the dataset. Finally, the data is split into training and testing sets, and the target variable “Item_Outlet_Sales” is removed from the test set in preparation for machine learning analysis.
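
A quick, optional sanity check confirms that the normalization worked: after preProcess with center and scale, every numeric column should have mean approximately 0 and standard deviation approximately 1. A short sketch:

R

# Sanity-check the centering and scaling
sapply(combi_numeric_norm, mean)  # all approximately 0
sapply(combi_numeric_norm, sd)    # all approximately 1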

Create Lasso Regression Model

R

set.seed(123)
control = trainControl(method ="cv", number = 5)
Grid_la_reg = expand.grid(alpha = 1,
              lambda = seq(0.001, 0.1, by = 0.0002))
  
# Training lasso regression model
lasso_model = train(x = train[, -c("Item_Identifier",
                               "Item_Outlet_Sales")],
                    y = train$Item_Outlet_Sales,
                    method = "glmnet",
                    trControl = control,
                    tuneGrid = Grid_la_reg
                    )
lasso_model

                    

Output:

glmnet 

8523 samples
21 predictor

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 6819, 6819, 6819, 6817, 6818
Resampling results across tuning parameters:

lambda  RMSE      Rsquared   MAE
0.0010  1129.244  0.5623967  837.116
0.0012  1129.244  0.5623967  837.116
0.0014  1129.244  0.5623967  837.116
0.0016  1129.244  0.5623967  837.116
0.0018  1129.244  0.5623967  837.116
0.0020  1129.244  0.5623967  837.116
0.0022  1129.244  0.5623967  837.116
0.0024  1129.244  0.5623967  837.116
0.0026  1129.244  0.5623967  837.116
0.0028  1129.244  0.5623967  837.116
...

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.1.

These results come from applying the glmnet algorithm to a dataset with 8523 samples and 21 predictors. The model is tuned using 5-fold cross-validation, which gives a more robust estimate of its performance.

The table reports the root mean squared error (RMSE), R-squared, and mean absolute error (MAE) for each candidate value of the regularization parameter lambda; the goal is to find the lambda that minimizes prediction error. In the snippet shown, the results for lambda from 0.001 to 0.0028 are identical: RMSE, R-squared, and MAE all remain constant, suggesting that the model’s performance is stable across this range of lambda on this data.
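
The tuning results can also be inspected programmatically through caret’s standard accessors; a short sketch:

R

# Inspect the cross-validation results directly
plot(lasso_model)           # caret's profile plot of RMSE vs. lambda
head(lasso_model$results)   # one row per candidate lambda: RMSE, Rsquared, MAE
lasso_model$bestTune        # the alpha/lambda pair with the smallest RMSE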

Calculate mean validation score and summary

R

# Calculate the summary
summary(lasso_model)
# mean validation score
mean(lasso_model$resample$RMSE)

                    

Output:


            Length  Class       Mode
a0              68   -none-      numeric
beta          1428   dgCMatrix   S4
df              68   -none-      numeric
dim              2   -none-      numeric
lambda          68   -none-      numeric
dev.ratio       68   -none-      numeric
nulldev          1   -none-      numeric
npasses          1   -none-      numeric
jerr             1   -none-      numeric
offset           1   -none-      logical
call             5   -none-      call
nobs             1   -none-      numeric
lambdaOpt        1   -none-      numeric
xNames          21   -none-      character
problemType      1   -none-      character
tuneValue        2   data.frame  list
obsLevels        1   -none-      logical
param            0   -none-      list

mean(lasso_model$resample$RMSE)
[1] 1129.244

  • ‘a0’ is a numeric vector of length 68, the intercept at each of the 68 lambda values on the glmnet path.
  • ‘beta’ is a sparse matrix (class dgCMatrix) with 1428 entries (21 predictors × 68 lambda values), the coefficients at each lambda.
  • ‘df’ is a numeric vector of length 68, the degrees of freedom (number of non-zero coefficients) at each lambda value.
  • ‘dim’ is a numeric vector of length 2, the dimensions of the ‘beta’ matrix.
  • ‘lambda’ is a numeric vector of length 68, the regularization parameter values on the path.
  • ‘dev.ratio’ is a numeric vector of length 68, the fraction of (null) deviance explained at each lambda value.
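
From these components you can pull out the tuned lambda and count how many predictors the lasso actually kept; a short sketch (it assumes the intercept itself is non-zero):

R

# Tuned lambda and the number of retained (non-zero) predictors
best_lambda <- lasso_model$bestTune$lambda
coefs <- coef(lasso_model$finalModel, s = best_lambda)
sum(as.numeric(coefs) != 0) - 1  # subtract 1 to exclude the intercept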

Visualize the Lasso Model

R

# Extracting coefficients from the lasso model
lasso_coefs <- as.matrix(coef(lasso_model$finalModel, s = lasso_model$bestTune$lambda))
 
# Creating a data frame for plotting
lasso_coefs_df <- data.frame(
  Predictor = rownames(lasso_coefs),  # predictor names are stored as row names
  Coefficient = as.numeric(lasso_coefs)
)
 
# Filtering out zero coefficients
lasso_coefs_df <- lasso_coefs_df[lasso_coefs_df$Coefficient != 0, ]
 
# Plotting lasso coefficients
library(ggplot2)
lasso_plot <- ggplot(lasso_coefs_df, aes(x = reorder(Predictor, -Coefficient),
                                         y = Coefficient)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  coord_flip() +
  labs(title = "Lasso Regression Coefficients",
       x = "Predictor",
       y = "Coefficient") +
  theme_minimal()
 
# Displaying the plot
print(lasso_plot)

                    

Output:


[Figure: Lasso Regression Coefficients, a horizontal bar chart of the non-zero coefficients]


The lasso model coefficients are extracted using the coef function, considering the optimal lambda value determined during model tuning.

  • The coefficients are organized into a data frame for plotting, including the predictor names and their corresponding coefficients.
  • Coefficients that are precisely zero are filtered out, focusing on the relevant predictors with non-zero coefficients.
  • The non-zero coefficients are presented in a horizontal bar plot, with the predictors reordered by the magnitude of their coefficients.
  • The plot uses sky-blue bars with black borders and a minimal theme for clarity.

The resulting plot provides a visual representation of the lasso regression coefficients, aiding in the interpretation of predictor importance in the model.
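
As a final step, the tuned model can score the held-out test set; a short sketch (the test data no longer has an Item_Outlet_Sales column, so these are pure predictions with no ground truth to compare against):

R

# Predict sales for the test set with the tuned lasso model
test_preds <- predict(lasso_model,
                      newdata = test[, -c("Item_Identifier")])
head(test_preds)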

As the regularization parameter increases over the tuned range, the RMSE remains essentially constant.


