
Weighted Logistic Regression for Imbalanced Dataset

Last Updated : 11 Mar, 2024

In real-world datasets, it’s common to encounter class imbalance, where one class significantly outnumbers the other(s). This class imbalance poses challenges for machine learning models, particularly for classification tasks, as models tend to be biased towards the majority class, leading to suboptimal performance.

What are imbalanced datasets?

Imbalanced datasets refer to datasets where the distribution of instances across different classes is skewed or uneven. In other words, one class (the majority class) has significantly more examples than one or more other classes (the minority class or classes).

What is Weighted Logistic Regression?

  • Weighted logistic regression is an extension of standard logistic regression that allows for the incorporation of sample weights into the model.
  • In logistic regression, the goal is to model the probability that a binary outcome (e.g., success or failure) occurs as a function of one or more predictor variables. This is typically done using maximum likelihood estimation.
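As a minimal sketch of what "incorporating sample weights" can look like in practice, the snippet below assumes scikit-learn's LogisticRegression and a small synthetic dataset (the dataset sizes and the weight of 10 are illustrative choices, not values from this article). It fits one model with a per-class weight and another with the equivalent per-sample weights, then compares the learned coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data: roughly 90% class 0 and 10% class 1
X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

# Option 1: weight by class (every class-1 sample counts 10 times)
by_class = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)

# Option 2: pass the same weights explicitly, one per sample
sample_weight = np.where(y == 1, 10.0, 1.0)
by_sample = LogisticRegression().fit(X, y, sample_weight=sample_weight)

# Both routes should lead to essentially the same fitted model
same_coefficients = np.allclose(by_class.coef_, by_sample.coef_, atol=1e-4)
print("Coefficients match:", same_coefficients)
```

Class weights are a convenient shorthand when every sample of a class should get the same weight; per-sample weights are more general and can also encode things like data quality or recency.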

How is Weighted Logistic Regression Used for an Imbalanced Dataset?

Weighted logistic regression is a technique commonly employed to address the issue of imbalanced datasets in logistic regression models. In imbalanced datasets, where the classes of interest are not equally represented, traditional logistic regression models may exhibit bias towards the majority class, leading to suboptimal performance, especially for predicting rare events.

Here’s how weighted logistic regression works and how it can be used to handle imbalanced datasets:

  1. Understanding Imbalanced Datasets: In imbalanced datasets, one class (majority class) is significantly more prevalent than the other class(es) (minority class). For instance, in a medical dataset, the number of healthy patients might outnumber the number of patients with a rare disease by a large margin.
  2. The Problem with Traditional Logistic Regression: Traditional logistic regression gives every training sample the same weight in the loss function. Consequently, when faced with an imbalanced dataset, the majority class dominates the loss and the model becomes biased towards it. As a result, it may have lower sensitivity (true positive rate) for the minority class, leading to poor performance in predicting rare events.
  3. Weighted Logistic Regression: Weighted logistic regression addresses this issue by assigning different weights to each class based on their prevalence in the dataset. The weights are incorporated into the loss function during model training. By assigning higher weights to the minority class and lower weights to the majority class, the model is encouraged to pay more attention to the minority class, thereby reducing the bias towards the majority class.
  4. Training the Weighted Logistic Regression Model: During model training, the algorithm adjusts the model parameters to minimize a weighted loss (the weighted log loss), in which errors on minority-class samples are multiplied by a higher weight. This encourages the model to focus on correctly classifying instances from the minority class, improving its ability to predict rare events.
  5. Evaluation and Fine-Tuning: After training, the weighted logistic regression model is evaluated using appropriate performance metrics, such as precision, recall, F1-score, and area under the ROC curve (AUC-ROC), considering the imbalanced nature of the dataset. Depending on the evaluation results, the model may be fine-tuned further by adjusting the class weights or other hyperparameters to achieve better performance.
  6. Applications: Weighted logistic regression is widely used in various domains, including healthcare, finance, fraud detection, and anomaly detection, where correctly identifying rare events or minority classes is crucial.
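One common way to choose weights "based on prevalence" is the inverse-frequency heuristic that scikit-learn exposes as class_weight="balanced". The sketch below assumes scikit-learn and a hypothetical label vector with 950 majority and 50 minority samples; it computes the weights by hand and checks them against compute_class_weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 950 majority (0) and 50 minority (1) samples
y = np.array([0] * 950 + [1] * 50)
classes = np.unique(y)

# "balanced" heuristic: n_samples / (n_classes * samples_in_each_class)
manual = len(y) / (len(classes) * np.bincount(y))

# The same weights as computed by scikit-learn
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Dictionary form that can be passed as LogisticRegression(class_weight=...)
weights = dict(zip(classes.tolist(), balanced.tolist()))
print(weights)
```

With these counts the heuristic gives roughly 0.53 for class 0 and 10 for class 1. Passing class_weight="balanced" directly to LogisticRegression applies the same rule without computing the dictionary yourself; a hand-tuned dictionary remains an option when the heuristic weights do not give the desired trade-off.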

Mathematical Concepts Used in Weighted Logistic Regression

  • The core mathematical concept behind Weighted Logistic Regression lies in modifying the logistic regression algorithm to incorporate weights into the calculation of the loss function. The loss function measures how well the model’s predictions match the actual data. In standard logistic regression, each instance in the dataset contributes equally to the loss, regardless of its class. In contrast, Weighted Logistic Regression adjusts this contribution based on the assigned weight of each class.
  • Mathematically, this is achieved by adding a weight multiplier to the loss function for each data point. This weight influences the gradient of the loss function during the optimization process, which is how the model learns.
  • If a data point belongs to the minority class, its weight is increased, making any error in predicting it have a larger impact on the model’s learning process. This encourages the model to adjust its parameters in a way that improves its accuracy on the minority class, thereby addressing the imbalance.
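To make the weight multiplier concrete, here is a minimal NumPy sketch of a class-weighted binary cross-entropy. The function name weighted_log_loss, its parameters, and the toy labels and probabilities are all illustrative assumptions; library implementations typically add a regularization term and may normalize the sum differently.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, w_neg=1.0, w_pos=1.0, eps=1e-12):
    """Mean class-weighted binary cross-entropy (illustrative helper).

    loss = -(1/N) * sum_i w_i * [y_i*log(p_i) + (1 - y_i)*log(1 - p_i)]
    with w_i = w_pos when y_i == 1 and w_i = w_neg otherwise.
    """
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) can never occur
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    w = np.where(y_true == 1, w_pos, w_neg)
    per_sample = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return float(np.mean(w * per_sample))

# Toy example: three majority samples and one poorly predicted minority sample
y = np.array([0, 0, 0, 1])
p = np.array([0.1, 0.2, 0.1, 0.3])   # predicted probability of class 1

unweighted = weighted_log_loss(y, p)
weighted = weighted_log_loss(y, p, w_neg=1.0, w_pos=10.0)
print(f"unweighted loss: {unweighted:.3f}")
print(f"weighted loss:   {weighted:.3f}")
```

Because the single minority sample is predicted badly, multiplying its term by 10 makes it dominate the loss, so the gradient pushes the parameters much harder towards fixing that sample.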

The beauty of this approach is its simplicity and flexibility. By tuning the weights, one can control the balance between precision and recall, optimizing the model’s performance for the specific needs of any given project or domain.
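One way to explore this trade-off is to sweep the minority-class weight and watch precision and recall move. The sketch below assumes scikit-learn and a synthetic dataset with about 5% positives; the candidate weights are arbitrary illustrative values, and the exact numbers will vary with the data and library version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 95% class 0, 5% class 1)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

results = []
for w in [1, 2, 5, 10, 20]:          # candidate weights for the minority class
    model = LogisticRegression(random_state=42, class_weight={0: 1, 1: w})
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # zero_division=0 avoids warnings if a model predicts no positives at all
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    results.append((w, precision, recall))
    print(f"weight={w:>2}  precision={precision:.2f}  recall={recall:.2f}")
```

In a real project the weight would usually be selected with cross-validation against the metric that matters for the application (for example F1 or recall at a minimum precision) rather than on the test set.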

Implementation: Weighted Logistic Regression on an Imbalanced Dataset

Importing Necessary Libraries

Python3




import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


Generating and Splitting the Dataset

Python3




# Generate a synthetic imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, weights=[0.95, 0.05], random_state=42)
 
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


Training Both Models

Python3




# Fit a standard logistic regression model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)
 
# Make predictions with the standard model
y_pred_lr = lr_model.predict(X_test)
 
# Evaluate the standard model
print("Logistic Regression:")
print(classification_report(y_test, y_pred_lr))
 
# Fit a weighted logistic regression model
weights = {0: 1, 1: 10}  # Weight 10 for class 1 (minority class)
weighted_lr_model = LogisticRegression(random_state=42, class_weight=weights)
weighted_lr_model.fit(X_train, y_train)
 
# Make predictions with the weighted model
y_pred_weighted_lr = weighted_lr_model.predict(X_test)
 
# Evaluate the weighted model
print("Weighted Logistic Regression:")
print(classification_report(y_test, y_pred_weighted_lr))


Output:

Logistic Regression:
              precision    recall  f1-score   support

           0       0.96      0.99      0.98       287
           1       0.50      0.15      0.24        13

    accuracy                           0.96       300
   macro avg       0.73      0.57      0.61       300
weighted avg       0.94      0.96      0.95       300

Weighted Logistic Regression:
              precision    recall  f1-score   support

           0       0.98      0.90      0.94       287
           1       0.22      0.62      0.33        13

    accuracy                           0.89       300
   macro avg       0.60      0.76      0.63       300
weighted avg       0.95      0.89      0.91       300

  • Standard logistic regression has high precision and recall for the majority class (0), but its recall for the minority class (1) is only 0.15, indicating a bias towards the majority class.
  • Weighted logistic regression raises the recall for the minority class (1) from 0.15 to 0.62 at the expense of precision, which falls from 0.50 to 0.22. Overall accuracy drops from 0.96 to 0.89, but macro-averaged recall rises from 0.57 to 0.76, giving a better balance between the two classes.

In summary, weighted logistic regression can be beneficial for imbalanced datasets by improving the performance on the minority class, as shown by the higher recall for class 1 compared to standard logistic regression.


