
Lasso Regression in R Programming

Last Updated : 19 Dec, 2023

Lasso regression is a regression algorithm that uses shrinkage to produce simple, sparse models (i.e., models with fewer parameters). In shrinkage, coefficient estimates are shrunk towards a central point, which for the lasso is zero. Lasso regression is a regularized regression algorithm that performs L1 regularization, adding a penalty equal to the absolute value of the magnitude of the coefficients.

“LASSO” stands for Least Absolute Shrinkage and Selection Operator. Lasso regression works well for models showing high levels of multicollinearity, or when you want to automate parts of model selection such as variable selection or parameter elimination. The lasso solution is a quadratic programming problem, best solved with statistical software such as R, MATLAB, etc. Its defining ability is to select predictors by shrinking some coefficients exactly to zero. The lasso estimate minimizes the following objective (a small R sketch after the definitions below evaluates it directly):

\text{minimize} \left( \frac{1}{2N} \sum_{i=1}^{N} (y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right)

Here:

  • N: the number of observations.
  • p: the number of predictors.
  • y_i: the response variable for the i-th observation.
  • x_{ij}: the value of the j-th predictor for the i-th observation.
  • β_0: the intercept term.
  • β_j: the coefficients of the predictors.
  • λ: the regularization parameter controlling the strength of the L1 penalty.
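
To make the objective concrete, here is a minimal R sketch that evaluates the formula directly on toy data (illustrative only; glmnet fits the model with a much more efficient coordinate-descent algorithm, and the data and coefficient values below are hypothetical):

R

# The lasso objective above, written as a plain R function
lasso_objective <- function(beta0, beta, X, y, lambda) {
  N <- length(y)
  resid <- y - beta0 - X %*% beta   # y_i - beta_0 - sum_j x_ij * beta_j
  sum(resid^2) / (2 * N) +          # (1 / 2N) * residual sum of squares
    lambda * sum(abs(beta))         # L1 penalty on the coefficient magnitudes
}

# Toy usage: 5 observations, 2 predictors
set.seed(1)
X <- matrix(rnorm(10), nrow = 5)
y <- rnorm(5)
lasso_objective(beta0 = 0, beta = c(0.5, -0.2), X = X, y = y, lambda = 0.1)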

The algorithm minimizes the residual sum of squares subject to a constraint on the sum of the absolute values of the coefficients. Some coefficients are shrunk exactly to zero, which removes the corresponding predictors and yields a sparse regression model. The tuning parameter lambda controls the strength of the L1 penalty; lambda is essentially the amount of shrinkage (a short glmnet sketch after this list illustrates the effect):

  • When lambda = 0, no parameters are eliminated; the estimate equals ordinary least squares.
  • As lambda increases, more and more coefficients are set to zero and eliminated, and bias increases.
  • When lambda = infinity, all coefficients are eliminated.
  • As lambda decreases, variance increases.
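
The following small sketch shows this shrinkage with glmnet on synthetic data (hypothetical, not the Big Mart dataset): as lambda grows, the coefficients of weak predictors are driven exactly to zero.

R

# Lasso coefficient shrinkage as lambda increases (synthetic data)
library(glmnet)

set.seed(42)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(100)  # only the first two predictors matter

fit <- glmnet(X, y, alpha = 1)  # alpha = 1 selects the lasso penalty

coef(fit, s = 0.01)  # small lambda: most coefficients remain non-zero
coef(fit, s = 1)     # large lambda: the noise predictors are shrunk exactly to zero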

Also, if an intercept is included in the model, it is not penalized. Now let’s implement lasso regression in the R programming language.

Implementation in R

The Dataset

Big Mart dataset consists of 1559 products across 10 stores in different cities. Certain attributes of each product and store have been defined. It consists of 12 features:

  • Item_Identifier: a unique product ID assigned to every distinct item.
  • Item_Weight: the weight of the product.
  • Item_Fat_Content: whether the product is low fat or not.
  • Item_Visibility: the percentage of the total display area of all products in a store allocated to the particular product.
  • Item_Type: the food category to which the item belongs.
  • Item_MRP: the Maximum Retail Price (list price) of the product.
  • Outlet_Identifier: a unique store ID, an alphanumeric string of length 6.
  • Outlet_Establishment_Year: the year in which the store was established.
  • Outlet_Size: the size of the store in terms of ground area covered.
  • Outlet_Location_Type: the size of the city in which the store is located.
  • Outlet_Type: whether the outlet is just a grocery store or some sort of supermarket.
  • Item_Outlet_Sales: sales of the product in the particular store (the target variable).

Load the required packages

R

# load packages
library(data.table)  # reading and manipulating data
library(dplyr)       # data manipulation and joining
library(glmnet)      # lasso (and ridge) regression
library(ggplot2)     # plotting
library(caret)       # modeling and preprocessing
library(xgboost)     # building XGBoost models
library(e1071)       # skewness
library(cowplot)     # combining multiple plots

                    

With all the required libraries loaded, we can now load our dataset.

Loading Dataset

R

# Loading data
train = fread("C:\\Users\\GFG19565\\Downloads\\Train.csv")
test = fread("C:\\Users\\GFG19565\\Downloads\\Test.csv")
 
# Structure
str(train)

                    

Output: 

Classes ‘data.table’ and 'data.frame':    8523 obs. of  12 variables:
$ Item_Identifier : chr "FDA15" "DRC01" "FDN15" "FDX07" ...
$ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content : chr "Low Fat" "Regular" "Low Fat" "Regular" ...
$ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
$ Item_Type : chr "Dairy" "Soft Drinks" "Meat" "Fruits and Vegetables" ...
$ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Identifier : chr "OUT049" "OUT018" "OUT049" "OUT010" ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
$ Outlet_Size : chr "Medium" "Medium" "Medium" "" ...
$ Outlet_Location_Type : chr "Tier 1" "Tier 3" "Tier 1" "Tier 3" ...
$ Outlet_Type : chr "Supermarket Type1" "Supermarket Type2" "Supermarket Type1" "Grocery Store"
$ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
- attr(*, ".internal.selfref")=<externalptr>

Performing EDA and Basic operations on dataset

R

# Combining datasets: first add Item_Outlet_Sales to test data
# so that train and test have the same columns
test[, Item_Outlet_Sales := NA]
 
combi = rbind(train, test)
 
# Missing Value Treatment
missing_index = which(is.na(combi$Item_Weight))
for(i in missing_index)
{
  item = combi$Item_Identifier[i]
  combi$Item_Weight[i] =
      mean(combi$Item_Weight[combi$Item_Identifier == item],
                                                 na.rm = T)
}
colSums(is.na(combi))

                    

Output:

Item_Identifier                 0
Item_Weight                     0
Item_Fat_Contentlow fat         0
Item_Fat_ContentLow Fat         0
Item_Fat_Contentreg             0
Item_Fat_ContentRegular         0
Item_Visibility                 0
Item_MRP                        0
Outlet_IdentifierOUT013         0
Outlet_IdentifierOUT017         0
Outlet_IdentifierOUT018         0
Outlet_IdentifierOUT019         0
Outlet_IdentifierOUT027         0
Outlet_IdentifierOUT035         0
Outlet_IdentifierOUT045         0
Outlet_IdentifierOUT046         0
Outlet_IdentifierOUT049         0
Outlet_TypeSupermarket Type1    0
Outlet_TypeSupermarket Type2    0
Outlet_TypeSupermarket Type3    0
Outlet_Size_num                 0
Outlet_Location_Type_num        0
Item_Outlet_Sales            5681
Here we combine the train and test datasets, impute the missing Item_Weight values with the mean weight of the same item, and then check the column-wise missing-value counts in the combined dataset. Only Item_Outlet_Sales still contains missing values: these are the 5681 test rows where we set it to NA ourselves.

Replacing 0 in Item_Visibility with mean

R

# Replacing 0 in Item_Visibility with mean
zero_index = which(combi$Item_Visibility == 0)
for(i in zero_index)
{
  item = combi$Item_Identifier[i]
  combi$Item_Visibility[i] =
      mean(combi$Item_Visibility[combi$Item_Identifier == item],
                                                     na.rm = T)
}

                    

This loop finds every row where Item_Visibility is zero and replaces the zero with the mean visibility recorded for the same Item_Identifier. An equivalent vectorized version is sketched below.
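
As an aside, the same replacement can be written without an explicit loop using data.table’s grouped assignment; a broadly equivalent sketch (the loop above updates values one at a time, so results can differ marginally for items with several zeros):

R

# Loop-free sketch: replace zero visibilities with the item's group mean
combi[, Item_Visibility := ifelse(Item_Visibility == 0,
                                  mean(Item_Visibility, na.rm = TRUE),
                                  Item_Visibility),
      by = Item_Identifier]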

Perform Encoding

R

combi[, Outlet_Size_num := ifelse(Outlet_Size == "Small", 0,
                                 ifelse(Outlet_Size == "Medium",
                                                        1, 2))]
  
combi[, Outlet_Location_Type_num :=
         ifelse(Outlet_Location_Type == "Tier 3", 0,
                ifelse(Outlet_Location_Type == "Tier 2", 1, 2))]
  
combi[, c("Outlet_Size", "Outlet_Location_Type") := NULL]
 
# One Hot Encoding
# To convert categorical in numerical
ohe_1 = dummyVars("~.", data = combi[, -c("Item_Identifier",
                                          "Outlet_Establishment_Year",
                                          "Item_Type")], fullRank = T)
ohe_df = data.table(predict(ohe_1, combi[, -c("Item_Identifier",
                                           "Outlet_Establishment_Year",
                                           "Item_Type")]))
  
combi = cbind(combi[, "Item_Identifier"], ohe_df)
 
head(combi)

                    

Output:

   Item_Identifier Item_Weight Item_Fat_Contentlow fat
1: FDA15 9.300 0
2: DRC01 5.920 0
3: FDN15 17.500 0
4: FDX07 19.200 0
5: NCD19 8.930 0
6: FDP36 10.395 0
Item_Fat_ContentLow Fat Item_Fat_Contentreg Item_Fat_ContentRegular
1: 1 0 0
2: 0 0 1
3: 1 0 0
4: 0 0 1
5: 1 0 0
6: 0 0 1
Item_Visibility Item_MRP Outlet_IdentifierOUT013
1: 0.016047301 249.8092 0
2: 0.019278216 48.2692 0
3: 0.016760075 141.6180 0
4: 0.017834338 182.0950 0
5: 0.009779829 53.8614 1
6: 0.057058689 51.4008 0
Outlet_IdentifierOUT017 Outlet_IdentifierOUT018
1: 0 0
2: 0 1
3: 0 0
4: 0 0
5: 0 0
6: 0 1
Outlet_IdentifierOUT019 Outlet_IdentifierOUT027
1: 0 0
2: 0 0
3: 0 0
4: 0 0
5: 0 0
6: 0 0
Outlet_IdentifierOUT035 Outlet_IdentifierOUT045
1: 0 0
2: 0 0
3: 0 0
4: 0 0
5: 0 0
6: 0 0
Outlet_IdentifierOUT046 Outlet_IdentifierOUT049
1: 0 1
2: 0 0
3: 0 1
4: 0 0
5: 0 0
6: 0 0
Outlet_TypeSupermarket Type1 Outlet_TypeSupermarket Type2
1: 1 0
2: 0 1
3: 1 0
4: 0 0
5: 1 0
6: 0 1
Outlet_TypeSupermarket Type3 Item_Outlet_Sales Outlet_Size_num
1: 0 3735.1380 1
2: 0 443.4228 1
3: 0 2097.2700 1
4: 0 732.3800 2
5: 0 994.7052 2
6: 0 556.6088 1
Outlet_Location_Type_num
1: 2
2: 0
3: 2
4: 0
5: 0
6: 0

This process helps convert categorical information about outlet size and location into a format that machine learning models can understand, where numbers represent different categories. The new numerical variables, “Outlet_Size_num” and “Outlet_Location_Type_num,” are then used in the subsequent analysis.

  • It transforms categorical variables in the dataset, named combi, into a format suitable for machine learning. It creates binary columns for each category using a process called one-hot encoding.
  • The transformation is applied to all columns except “Item_Identifier,” “Outlet_Establishment_Year,” and “Item_Type.” The resulting dataset, ohe_df, contains the newly created binary columns. Finally, the original “Item_Identifier” is combined with the one-hot encoded variables to produce the dataset used in the rest of the analysis. (A toy example of dummyVars follows this list.)
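
To see what dummyVars does in isolation, here is a toy example on a hypothetical three-row data frame (the toy data is made up; it is not part of the Big Mart dataset):

R

# Toy illustration of one-hot encoding with caret::dummyVars
library(caret)

toy <- data.frame(size = c("Small", "Medium", "Small"),
                  price = c(10, 20, 15),
                  stringsAsFactors = TRUE)
ohe <- dummyVars("~ .", data = toy, fullRank = TRUE)
predict(ohe, toy)
# returns price plus a single binary indicator for size;
# with fullRank = TRUE the first factor level is dropped as the reference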

R

# Reduce skewness
# price_per_unit_wt is assumed to be the price per unit of weight,
# computed here as a temporary vector so the feature set is unchanged
price_per_unit_wt = combi$Item_MRP / combi$Item_Weight

skewness(combi$Item_Visibility)
skewness(price_per_unit_wt)
  
# add 1 before taking the log to avoid log(0)
combi[, Item_Visibility := log(Item_Visibility + 1)]
  
# Scaling and Centering data
num_vars = which(sapply(combi, is.numeric)) # index of numeric features
num_vars_names = names(num_vars)
  
combi_numeric = combi[, setdiff(num_vars_names,
                              "Item_Outlet_Sales"),
                               with = F]
  
prep_num = preProcess(combi_numeric,
                      method=c("center", "scale"))
combi_numeric_norm = predict(prep_num, combi_numeric)
  
# removing numeric independent variables
combi[, setdiff(num_vars_names,
                "Item_Outlet_Sales") := NULL]
combi = cbind(combi, combi_numeric_norm)
  
# splitting data back to train and test
train = combi[1:nrow(train)]
test = combi[(nrow(train) + 1):nrow(combi)]
  
# Removing Item_Outlet_Sales
test[, Item_Outlet_Sales := NULL]

                    

In this code we reduce skewness in the numeric variables of combi. We compute the skewness of two quantities, Item_Visibility and the price per unit weight, and apply a log(x + 1) transformation to Item_Visibility to reduce its skew.

The data is then scaled and centered using z-score normalization. Numeric independent variables are separated for preprocessing, and the preProcess function is applied for centering and scaling. The transformed numeric features are integrated back into the dataset. Finally, the data is split into training and testing sets, and the target variable “Item_Outlet_Sales” is removed from the test set in preparation for machine learning analysis.
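
A quick, optional sanity check confirms that the normalization worked: after preProcess with center and scale, every numeric column should have mean approximately 0 and standard deviation approximately 1. A short sketch:

R

# Sanity-check the centering and scaling
sapply(combi_numeric_norm, mean)  # all approximately 0
sapply(combi_numeric_norm, sd)    # all approximately 1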

Create Lasso Regression Model

R

set.seed(123)
control = trainControl(method ="cv", number = 5)
Grid_la_reg = expand.grid(alpha = 1,
              lambda = seq(0.001, 0.1, by = 0.0002))
  
# Training lasso regression model
lasso_model = train(x = train[, -c("Item_Identifier",
                               "Item_Outlet_Sales")],
                    y = train$Item_Outlet_Sales,
                    method = "glmnet",
                    trControl = control,
                    tuneGrid = Grid_la_reg
                    )
lasso_model

                    

Output:

glmnet 

8523 samples
21 predictor

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 6819, 6819, 6819, 6817, 6818
Resampling results across tuning parameters:

lambda  RMSE      Rsquared   MAE
0.0010  1129.244  0.5623967  837.116
0.0012  1129.244  0.5623967  837.116
0.0014  1129.244  0.5623967  837.116
0.0016  1129.244  0.5623967  837.116
0.0018  1129.244  0.5623967  837.116
0.0020  1129.244  0.5623967  837.116
0.0022  1129.244  0.5623967  837.116
0.0024  1129.244  0.5623967  837.116
0.0026  1129.244  0.5623967  837.116
0.0028  1129.244  0.5623967  837.116
...

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.1.

These results come from applying the glmnet algorithm to a dataset with 8523 samples and 21 predictors. The model is tuned using 5-fold cross-validation, which gives a more robust estimate of its performance.

The table reports the root mean squared error (RMSE), R-squared, and mean absolute error (MAE) for each candidate value of the regularization parameter lambda; the goal is to find the lambda that minimizes prediction error. In the snippet shown, the results for lambda from 0.001 to 0.0028 are identical: RMSE, R-squared, and MAE all remain constant, suggesting that the model’s performance is stable across this range of lambda on this data.
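
The tuning results can also be inspected programmatically through caret’s standard accessors; a short sketch:

R

# Inspect the cross-validation results directly
plot(lasso_model)           # caret's profile plot of RMSE vs. lambda
head(lasso_model$results)   # one row per candidate lambda: RMSE, Rsquared, MAE
lasso_model$bestTune        # the alpha/lambda pair with the smallest RMSE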

Calculate mean validation score and summary

R

# Calculate the summary
summary(lasso_model)
# mean validation score
mean(lasso_model$resample$RMSE)

                    

Output:


            Length  Class       Mode
a0              68   -none-      numeric
beta          1428   dgCMatrix   S4
df              68   -none-      numeric
dim              2   -none-      numeric
lambda          68   -none-      numeric
dev.ratio       68   -none-      numeric
nulldev          1   -none-      numeric
npasses          1   -none-      numeric
jerr             1   -none-      numeric
offset           1   -none-      logical
call             5   -none-      call
nobs             1   -none-      numeric
lambdaOpt        1   -none-      numeric
xNames          21   -none-      character
problemType      1   -none-      character
tuneValue        2   data.frame  list
obsLevels        1   -none-      logical
param            0   -none-      list

mean(lasso_model$resample$RMSE)
[1] 1129.244

  • ‘a0’ is a numeric vector of length 68, the intercept at each of the 68 lambda values on the glmnet path.
  • ‘beta’ is a sparse matrix (class dgCMatrix) with 1428 entries (21 predictors × 68 lambda values), the coefficients at each lambda.
  • ‘df’ is a numeric vector of length 68, the degrees of freedom (number of non-zero coefficients) at each lambda value.
  • ‘dim’ is a numeric vector of length 2, the dimensions of the ‘beta’ matrix.
  • ‘lambda’ is a numeric vector of length 68, the regularization parameter values on the path.
  • ‘dev.ratio’ is a numeric vector of length 68, the fraction of (null) deviance explained at each lambda value.
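
From these components you can pull out the tuned lambda and count how many predictors the lasso actually kept; a short sketch (it assumes the intercept itself is non-zero):

R

# Tuned lambda and the number of retained (non-zero) predictors
best_lambda <- lasso_model$bestTune$lambda
coefs <- coef(lasso_model$finalModel, s = best_lambda)
sum(as.numeric(coefs) != 0) - 1  # subtract 1 to exclude the intercept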

Visualize the Lasso Model

R

# Extracting coefficients from the lasso model
lasso_coefs <- as.matrix(coef(lasso_model$finalModel, s = lasso_model$bestTune$lambda))
 
# Creating a data frame for plotting
lasso_coefs_df <- data.frame(
  Predictor = rownames(lasso_coefs),  # predictor names are stored as row names
  Coefficient = as.numeric(lasso_coefs)
)
 
# Filtering out zero coefficients
lasso_coefs_df <- lasso_coefs_df[lasso_coefs_df$Coefficient != 0, ]
 
# Plotting lasso coefficients
library(ggplot2)
lasso_plot <- ggplot(lasso_coefs_df, aes(x = reorder(Predictor, -Coefficient),
                                         y = Coefficient)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  coord_flip() +
  labs(title = "Lasso Regression Coefficients",
       x = "Predictor",
       y = "Coefficient") +
  theme_minimal()
 
# Displaying the plot
print(lasso_plot)

                    

Output:


[Figure: Lasso Regression Coefficients, a horizontal bar chart of the non-zero coefficients]


The lasso model coefficients are extracted using the coef function, considering the optimal lambda value determined during model tuning.

  • The coefficients are organized into a data frame for plotting, including the predictor names and their corresponding coefficients.
  • Coefficients that are precisely zero are filtered out, focusing on the relevant predictors with non-zero coefficients.
  • The non-zero coefficients are presented in a horizontal bar plot, with the predictors reordered by the magnitude of their coefficients.
  • The plot uses sky-blue bars with black borders and a minimal theme for clarity.

The resulting plot provides a visual representation of the lasso regression coefficients, aiding in the interpretation of predictor importance in the model.
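
As a final step, the tuned model can score the held-out test set; a short sketch (the test data no longer has an Item_Outlet_Sales column, so these are pure predictions with no ground truth to compare against):

R

# Predict sales for the test set with the tuned lasso model
test_preds <- predict(lasso_model,
                      newdata = test[, -c("Item_Identifier")])
head(test_preds)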

As the regularization parameter increases over the tuned range, the RMSE remains essentially constant.


