
How do L1 and L2 regularization prevent overfitting?

Last Updated : 29 Feb, 2024

Overfitting is a recurring problem in machine learning that can harm a model’s ability to generalize to new data. Regularization is a useful tactic for addressing this problem, since it keeps models from becoming too complex and therefore too closely tailored to the training set. L1 and L2, two widely used regularization techniques, tackle this issue in different ways. In this article, we explore how L1 and L2 regularization prevent overfitting.

How do we avoid Overfitting?

Overfitting occurs when a machine learning model learns the training data too well, to the extent that it starts to memorize noise and random fluctuations in the data rather than capturing the underlying patterns. This can result in poor performance when the model is applied to new, unseen data. Essentially, it’s like a student who memorizes the answers to specific questions without truly understanding the material, and then struggles when faced with new questions or scenarios. Avoiding overfitting is crucial in developing robust and generalizable machine learning models.

To reduce overfitting and improve a model’s generalization, various techniques can be applied. These include dropout, which randomly deactivates neurons during training; adaptive regularization, which adjusts the regularization strength based on the data; early stopping, which halts training once validation performance stops improving; experimenting with different architectures; and applying L1 or L2 regularization. Here, we will focus on L1 and L2 regularization; a short sketch combining a few of these techniques follows.
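As a rough illustration (assuming TensorFlow/Keras is available; the toy data, layer sizes, dropout rate, penalty strength, and patience below are arbitrary choices rather than anything prescribed in this article), several of these techniques can be combined in a single model:

```python
import numpy as np
import tensorflow as tf

# Illustrative sketch: dropout + an L2 weight penalty + early stopping.
# The synthetic data and all hyperparameters are placeholder choices.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)).astype("float32")
y = (X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=500)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-3)),  # L2 penalty on weights
    tf.keras.layers.Dropout(0.5),   # randomly deactivates units during training
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)  # stop once validation loss plateaus
model.fit(X, y, validation_split=0.2, epochs=50, callbacks=[early_stop], verbose=0)
```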

How do L1 and L2 regularization prevent overfitting?

L1 regularization, also known as Lasso regularization, adds a penalty term based on the absolute values of the weights to the model’s cost function. This penalty pushes the weights of uninformative features toward zero, so the model prioritizes a smaller set of significant features, which aids feature selection. By simplifying the model in this way, L1 regularization helps prevent overfitting.

We can represent the modified loss function as:

[Tex]L_{L1} = L_{original} + \lambda \sum_{i=1}^{n}|w_i|[/Tex]

Here,

  • [Tex]L_{L1}[/Tex] is the new loss function with L1 regularization.
  • [Tex]L_{original}[/Tex] is the original loss function without regularization.
  • [Tex]\lambda[/Tex] is the regularization parameter
  • n is the number of features
  • [Tex]w_i[/Tex] are the coefficients of the features.

The term [Tex]\lambda \sum_{i=1}^{n}|w_i|[/Tex] penalizes large coefficients by adding their absolute values to the loss function.
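The following is a minimal NumPy sketch of this L1-regularized loss for a linear model; the data, the weight vector, and the value of lambda (lam) are hypothetical placeholders chosen only for illustration:

```python
import numpy as np

# Sketch of an L1-regularized (Lasso-style) loss: L_original + lambda * sum(|w_i|).
# Here L_original is taken to be the mean squared error of a linear model.
def l1_regularized_loss(w, X, y, lam):
    original_loss = np.mean((X @ w - y) ** 2)   # L_original
    l1_penalty = lam * np.sum(np.abs(w))        # lambda * sum_i |w_i|
    return original_loss + l1_penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, 0.0, 0.0, -1.5, 0.0])   # only two features actually matter
y = X @ true_w + 0.1 * rng.normal(size=100)

w = rng.normal(size=5)                          # some candidate weights
print(l1_regularized_loss(w, X, y, lam=0.1))
```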

L2 regularization, also known as Ridge regularization, incorporates a penalty term proportional to the square of the weights into the model’s cost function. This encourages the model to evenly distribute weights across all features, preventing overreliance on any single feature and thereby reducing overfitting.

We can represent the modified loss function as:

[Tex]L_{L2} = L_{original} + \lambda \sum_{i=1}^{n}w_i^{2}[/Tex]

Here,

  • [Tex]L_{L2}[/Tex] is the new loss function with L2 regularization
  • [Tex]L_{original}[/Tex] is the original loss function without regularization
  • [Tex]\lambda[/Tex] is the regularization parameter
  • n is the number of features
  • [Tex]w_i[/Tex] are the coefficients of the features

The term [Tex]\lambda \sum_{i=1}^{n} w_{i}^{2}[/Tex] penalizes large coefficients by adding their squared values to the loss function.
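A matching NumPy sketch of the L2-regularized loss is shown below; as before, the data, weights, and lambda value are illustrative placeholders:

```python
import numpy as np

# Sketch of an L2-regularized (Ridge-style) loss: L_original + lambda * sum(w_i^2).
def l2_regularized_loss(w, X, y, lam):
    original_loss = np.mean((X @ w - y) ** 2)   # L_original
    l2_penalty = lam * np.sum(w ** 2)           # lambda * sum_i w_i^2
    return original_loss + l2_penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.5, 0.0]) + 0.1 * rng.normal(size=100)

w = rng.normal(size=5)
print(l2_regularized_loss(w, X, y, lam=0.1))
```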

In essence, both L1 and L2 regularization techniques counter overfitting by simplifying the model and promoting more balanced weight distribution across features.
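To see the practical difference, here is a short, hedged comparison using scikit-learn’s Lasso and Ridge estimators; the synthetic dataset and the penalty strength (alpha, scikit-learn’s name for the regularization parameter) are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data in which only 3 of 10 features are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print("Lasso coefficients:", lasso.coef_)   # uninformative features typically end up exactly 0
print("Ridge coefficients:", ridge.coef_)   # weights are shrunk but generally stay non-zero
```

On data like this, Lasso usually zeroes out the uninformative coefficients, while Ridge only shrinks them, which mirrors the feature-selection versus weight-smoothing distinction summarized in the table below.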

L1 vs L2 Regularization


|  | L1 Regularization (Lasso) | L2 Regularization (Ridge) |
| --- | --- | --- |
| Advantages | Feature selection: encourages sparse models by driving irrelevant feature weights to zero. | Smooths the model: encourages a more balanced weight distribution across features, reducing over-reliance on any single feature. |
|  | Robust to outliers: the absolute penalty makes L1 regularization less sensitive to outliers. | Better for multicollinear features: handles multicollinearity well by distributing weights evenly among correlated features. |
|  | Interpretable models: produces simpler, more interpretable models by emphasizing important features. | Generally stable: offers more stability in the presence of correlated predictors. |
| Disadvantages | Non-differentiable at zero: can cause issues in optimization, requiring specialized optimization techniques. | No feature selection: does not drive any weights exactly to zero, leading to less sparse models. |
|  | May shrink coefficients too much: can excessively shrink coefficients, leading to underfitting. | Not robust to outliers: the squared penalty term can make it sensitive to outliers, potentially affecting model performance. |
|  | Works poorly with correlated features: may arbitrarily select one feature over another when features are highly correlated. | Less interpretable models: tends to keep all features in the model, which can make interpretation more challenging. |


