
Handling Imbalanced Data for Classification

Last Updated : 02 Jan, 2024

Handling imbalanced data is a key component of machine learning classification tasks. Imbalanced data is characterized by a skewed class distribution, with one class considerably overrepresented relative to the others. The difficulty this imbalance poses is that models may perform poorly due to bias towards the majority class: they optimize overall accuracy instead of accurately recognizing occurrences of the minority classes.

This problem can be addressed by applying specialized strategies such as resampling (oversampling the minority class, undersampling the majority class), using alternative evaluation metrics (F1-score, precision, recall), and putting algorithms designed for imbalanced datasets into practice.

What is Imbalanced Data and How to handle it?

Imbalanced data pertains to datasets where the distribution of observations in the target class is uneven. In other words, one class label has a significantly higher number of observations, while the other has a notably lower count.

Data is imbalanced when one class greatly outnumbers the others in a classification problem. Machine learning models may become biased in their predictions as a result, favoring the majority class. Resampling techniques, such as oversampling the minority class or undersampling the majority class, are used to remedy this.

Furthermore, model performance can be evaluated more precisely by substituting other assessment measures, such as precision, recall, or F1-score, for accuracy. Specialized techniques, such as ensemble approaches and synthetic data generation, can further improve the handling of imbalanced datasets, yielding more reliable and equitable predictions.

Problem with Handling Imbalanced Data for Classification

  • Algorithms may become biased towards the majority class and thus tend to predict the majority class as the output.
  • Minority-class observations can look like noise to the model and be ignored.
  • An imbalanced dataset yields a misleading accuracy score.

Ways to handle Imbalanced Data for Classification

Addressing imbalanced data in classification is crucial for fair model performance. Techniques include resampling (oversampling or undersampling), synthetic data generation, specialized algorithms, and alternative evaluation metrics. Implementing these strategies ensures more accurate and unbiased predictions across all classes.

1. Different Evaluation Metric

Classifier accuracy is calculated by dividing the total correct predictions by the overall predictions, which is suitable for balanced classes but less effective for imbalanced datasets. Precision gauges how accurate a classifier's positive predictions for a class are, while recall assesses its ability to correctly identify all instances of that class. For imbalanced datasets, the F1 score emerges as the preferred metric, striking a balance between precision and recall and providing a more comprehensive evaluation of a classifier's performance. It is the harmonic mean of precision and recall.

F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}

Precision and F1 score both decrease when the classifier incorrectly predicts the minority class, increasing the number of false positives. Recall and F1 score also drop if the classifier has trouble accurately identifying the minority class, leading to more false negatives. The F1 score therefore improves only when both precision and recall improve.

The F1 score is essentially a comprehensive metric that captures the trade-off between precision and recall, which is critical for assessing classifier performance on imbalanced datasets.
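
As a quick illustration with made-up labels (a hypothetical example, not the article's dataset), scikit-learn's metric functions show how accuracy can look healthy while precision, recall, and F1 expose weak minority-class performance:

Python3

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: class 1 is the rare (minority) class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.8 -- looks fine
print("Precision:", precision_score(y_true, y_pred))  # 0.5 -- 1 TP, 1 FP
print("Recall   :", recall_score(y_true, y_pred))     # 0.5 -- 1 TP, 1 FN
print("F1 score :", f1_score(y_true, y_pred))         # 0.5 -- harmonic mean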

2. Resampling (Undersampling and Oversampling)

This method involves adjusting the balance between the minority and majority classes through upsampling or downsampling. Oversampling duplicates records of the minority class, sampling with replacement, while undersampling randomly removes rows from the majority class until its size aligns with the minority class.

This sampling approach yields a balanced dataset, ensuring comparable representation for both majority and minority classes. Achieving a similar number of records for both classes in the dataset signifies that the classifier will assign equal importance to each class during training.

Python3

import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
 
# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
 
print("Original class distribution:", Counter(y))
 
# Oversampling using RandomOverSampler
oversample = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversample.fit_resample(X, y)
print("Oversampled class distribution:", Counter(y_over))
 
 
# Undersampling using RandomUnderSampler
undersample = RandomUnderSampler(sampling_strategy='majority')
X_under, y_under = undersample.fit_resample(X, y)
print("Undersampled class distribution:", Counter(y_under))


Output:

Original class distribution: Counter({1: 900, 0: 100})
Oversampled class distribution: Counter({1: 900, 0: 900})
Undersampled class distribution: Counter({0: 100, 1: 100})

3. BalancedBaggingClassifier

When dealing with imbalanced datasets, traditional classifiers tend to favor the majority class, neglecting the minority class due to its lower representation. The BalancedBaggingClassifier, imbalanced-learn's extension of sklearn's bagging classifier, addresses this imbalance by incorporating an additional balancing step during training. It exposes parameters such as sampling_strategy, which determines the type of resampling (e.g., 'majority' to resample only the majority class, 'all' to resample all classes), and replacement, which dictates whether sampling occurs with or without replacement. This classifier ensures a more equitable treatment of classes, which is particularly beneficial when handling imbalanced datasets.

Importing Libraries

Python3

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import accuracy_score, classification_report


This code demonstrates the usage of a BalancedBaggingClassifier from the imbalanced-learn library to handle imbalanced datasets. It creates an imbalanced dataset, splits it, and trains a Random Forest classifier with balanced bagging, assessing the model’s performance through accuracy and a classification report.

Creating imbalanced dataset and splitting

Python3

# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
 
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)


This code creates a two-class imbalanced dataset, divides it into training and testing sets, and uses a fixed random state to guarantee reproducibility. The resulting dataset has 20 features, and the minority class weight of 0.1 indicates a notable class imbalance.

Creating a random forest classifier

Python3

# Create a Random Forest Classifier (you can use any classifier)
base_classifier = RandomForestClassifier(random_state=42)


This initializes a Random Forest classifier with a fixed random state, creating the base estimator used in the subsequent ensemble. The random state guarantees reproducible model training.

Creating a balanced bagging classifier

Python3

# Create a BalancedBaggingClassifier
balanced_bagging_classifier = BalancedBaggingClassifier(base_classifier,
                                                        sampling_strategy='auto',  # You can adjust this parameter
                                                        replacement=False,  # Whether to sample with or without replacement
                                                        random_state=42)


This code builds a BalancedBaggingClassifier on top of the previously defined RandomForestClassifier. Options such as sampling_strategy and replacement are supplied to address class imbalance, and a random state is set for reproducibility.

Fitting the model and making predictions

Python3

# Fit the model
balanced_bagging_classifier.fit(X_train, y_train)
 
# Make predictions
y_pred = balanced_bagging_classifier.predict(X_test)


This code uses the training data (X_train, y_train) to train the BalancedBaggingClassifier, then predicts labels for the test data (X_test), storing the results in the variable y_pred.

Evaluation

Python3

# Evaluate the performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Output:

Accuracy: 1.0
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      1.00      1.00       187

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200

This code computes and outputs the balanced bagging classifier's accuracy on the test set. It also prints a comprehensive classification report with each class's precision, recall, and F1-score.

4. SMOTE

The Synthetic Minority Oversampling Technique (SMOTE) addresses imbalanced datasets by synthetically generating new instances of the minority class. Unlike simply duplicating records, SMOTE enhances diversity by creating artificial instances. In simpler terms, SMOTE takes an instance from the minority class, selects one of its k nearest minority-class neighbors at random, and generates a synthetic instance at a random point between the two in feature space.

Python3

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter
 
# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
 
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
 
# Display class distribution before SMOTE
print("Class distribution before SMOTE:", Counter(y_train))
 
# Apply SMOTE to oversample the minority class
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
 
# Display class distribution after SMOTE
print("Class distribution after SMOTE:", Counter(y_train_resampled))


Output:

Class distribution before SMOTE: Counter({1: 713, 0: 87})
Class distribution after SMOTE: Counter({1: 713, 0: 713})

This code demonstrates how to rectify class imbalance in a dataset using SMOTE. Initially, an imbalanced dataset is produced in which roughly 10% of the data belongs to the minority class. After dividing the data into training and testing sets, the class distribution before SMOTE is shown. The minority class is then oversampled using SMOTE to produce synthetic instances, and the class distribution after SMOTE is printed, showing an equal representation of both classes in the resampled training data.

5. Threshold Moving

In classifiers, predictions are often expressed as probabilities of class membership. The conventional threshold for assigning predictions to classes is typically set at 0.5. However, in the context of imbalanced class problems, this default threshold may not yield optimal results. To enhance classifier performance, it is essential to adjust the threshold to a value that efficiently discriminates between the two classes.

Techniques such as ROC Curves and Precision-Recall Curves are employed to identify the optimal threshold. Additionally, grid search methods or exploration within a specified range of values can be utilized to pinpoint the most suitable threshold for the classifier.

Python3

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
 
# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
 
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
 
# Train a classification model (Random Forest as an example)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
 
# Predict probabilities on the test set
y_proba = model.predict_proba(X_test)[:, 1]
 
# Set a threshold (initially 0.5)
threshold = 0.5
 
# Adjust threshold based on your criteria (e.g., maximizing F1-score)
while threshold >= 0:
    y_pred = (y_proba >= threshold).astype(int)
    f1 = f1_score(y_test, y_pred)
 
    print(f"Threshold: {threshold:.2f}, F1 Score: {f1:.4f}")
 
    # Move the threshold (you can customize the step size)
    threshold -= 0.02


Output:

Threshold: 0.50, F1 Score: 1.0000
Threshold: 0.48, F1 Score: 1.0000
Threshold: 0.46, F1 Score: 1.0000
Threshold: 0.44, F1 Score: 1.0000
Threshold: 0.42, F1 Score: 1.0000
Threshold: 0.40, F1 Score: 1.0000
Threshold: 0.38, F1 Score: 1.0000
Threshold: 0.36, F1 Score: 1.0000
Threshold: 0.34, F1 Score: 1.0000
Threshold: 0.32, F1 Score: 1.0000
Threshold: 0.30, F1 Score: 1.0000
Threshold: 0.28, F1 Score: 0.9973
Threshold: 0.26, F1 Score: 0.9973
Threshold: 0.24, F1 Score: 0.9973
Threshold: 0.22, F1 Score: 0.9947
Threshold: 0.20, F1 Score: 0.9947
Threshold: 0.18, F1 Score: 0.9947
Threshold: 0.16, F1 Score: 0.9920
Threshold: 0.14, F1 Score: 0.9920
Threshold: 0.12, F1 Score: 0.9894
Threshold: 0.10, F1 Score: 0.9842
Threshold: 0.08, F1 Score: 0.9740
Threshold: 0.06, F1 Score: 0.9664
Threshold: 0.04, F1 Score: 0.9664
Threshold: 0.02, F1 Score: 0.9664
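
To pick the threshold programmatically rather than reading off the printed scan, one common approach (a minimal sketch, reusing y_test and y_proba from the snippet above) is to sweep the candidate thresholds returned by the precision-recall curve and keep the one that maximizes F1:

Python3

import numpy as np
from sklearn.metrics import precision_recall_curve

# precision and recall have one more entry than thresholds,
# so the last point is dropped when searching for the best threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
f1_scores = 2 * precision * recall / (precision + recall + 1e-12)  # guard against /0
best_idx = np.argmax(f1_scores[:-1])
print(f"Best threshold: {thresholds[best_idx]:.3f}, F1: {f1_scores[best_idx]:.4f}")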

6. Using Tree-Based Models

The hierarchical structure of tree-based models, such as Decision Trees, Random Forests, and Gradient Boosted Trees, allows them to handle imbalanced datasets better than non-tree-based models. (A short class-weighting sketch follows the list below.)

  • Decision Trees: Decision Trees build a tree-like structure by splitting the feature space into regions according to feature values. They can adapt to imbalanced data by adjusting their decision boundaries to capture minority-class patterns, though they are prone to overfitting.
  • Random Forests: Random Forests consist of many Decision Trees trained on random subsets of the data and features. By combining numerous trees, they reduce overfitting, improve generalization, and strengthen robustness against imbalanced datasets.
  • Gradient Boosted Trees: Gradient Boosted Trees are built sequentially, with each new tree correcting the errors of the previous ones. This sequential focus on misclassified instances helps them perform well in imbalanced settings, although they can be sensitive to noise.
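
Scikit-learn's tree-based classifiers also expose a class_weight parameter that complements their structural robustness. The following is a minimal sketch using the same style of synthetic dataset as the earlier examples (the parameter choices here are illustrative, not prescribed by the article):

Python3

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Imbalanced two-class dataset, as in the earlier examples
X, y = make_classification(n_classes=2, weights=[0.1, 0.9], n_features=20,
                           n_informative=3, n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# class_weight='balanced' reweights the splitting criterion inversely to
# class frequencies, pushing the trees to pay attention to the minority class
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))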

7. Using Anomaly Detection Algorithms

  • Anomaly (outlier) detection algorithms are one-class classification algorithms that help identify outliers (rare data points) in a dataset.
  • In an imbalanced dataset, treat the majority-class records as 'normal' data and the minority-class records as outliers.
  • These algorithms are trained on the normal data only.
  • A trained model can then predict whether a new record is normal or an outlier, as sketched below.
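
A minimal sketch of this idea using scikit-learn's IsolationForest, one possible one-class detector (the article does not prescribe a specific algorithm): fit on majority-class records only, then flag records the model considers outliers as minority-class candidates.

Python3

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

# Imbalanced dataset: class 1 is the majority ("normal") class
X, y = make_classification(n_classes=2, weights=[0.1, 0.9], n_features=20,
                           n_informative=3, n_samples=1000, random_state=42)

# Train the detector on normal (majority-class) records only;
# contamination sets the expected proportion of outliers
detector = IsolationForest(contamination=0.1, random_state=42)
detector.fit(X[y == 1])

# predict() returns +1 for inliers (normal) and -1 for outliers
pred = detector.predict(X)
print("Records flagged as outliers:", np.sum(pred == -1))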

Frequently Asked Questions (FAQs)

1. What is imbalanced data in classification?

Imbalanced data in classification refers to a dataset where the distribution of class labels is uneven, with one class significantly outnumbering the other. This imbalance can pose challenges for machine learning models, as they might exhibit bias towards the majority class, leading to poor performance in predicting the minority class.

2. Why is imbalanced data a problem in machine learning?

Imbalanced data is problematic in machine learning because models trained on such datasets may prioritize accuracy on the majority class while neglecting the minority class. This can result in biased models that perform poorly in identifying and generalizing patterns related to the minority class.

3. What are common techniques to handle imbalanced data?

Common techniques to handle imbalanced data include oversampling the minority class, undersampling the majority class, using synthetic data generation methods (e.g., SMOTE), adjusting class weights, and employing specialized algorithms designed for imbalanced datasets.
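
Of the techniques listed above, class weighting is the only one not demonstrated earlier; a minimal sketch on synthetic data (the estimator and weights here are illustrative choices, not prescribed by the article) shows it can be a single argument:

Python3

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# class_weight='balanced' scales each class's loss contribution inversely
# to its frequency, an alternative to physically resampling the data
X, y = make_classification(weights=[0.1, 0.9], random_state=42)
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)
print("Training accuracy:", clf.score(X, y))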

4. How does oversampling and undersampling work?

Oversampling involves creating additional instances of the minority class to balance the class distribution, while undersampling reduces the number of instances in the majority class. Both techniques aim to create a more balanced dataset for training machine learning models.

5. When to use ensemble methods for imbalanced data?

Ensemble methods, such as Random Forests and BalancedBaggingClassifier, are effective for handling imbalanced data as they inherently address class imbalances. They can be particularly useful when there is a need for combining multiple models to achieve better generalization and robustness.


