Handling Imbalanced Data for Classification

Balanced vs. Imbalanced Datasets:

  • Balanced Dataset: In a balanced dataset, the classes in the target column are approximately equally distributed.
  • Imbalanced Dataset: In an imbalanced dataset, the distribution of classes in the target column is highly unequal.

Let’s understand this with the help of an example.
Example: Suppose there is a binary classification problem with the following training data:

  • Total Observations: 1000
  • Target variable class is either ‘Yes’ or ‘No’.

Case 1:
If there are 900 ‘Yes’ and 100 ‘No’ records, the dataset is imbalanced because the two classes are highly unequally distributed.

Case 2:
If there are 550 ‘Yes’ and 450 ‘No’ records, the dataset is balanced because the two classes are approximately equally distributed.

Hence, in an imbalanced dataset there is a significant difference between the sample sizes of the two classes.
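In practice, you can check the class distribution by inspecting the target column directly. A minimal sketch, assuming pandas and a DataFrame with a column named ‘target’ (both the data and the column name are illustrative):

    import pandas as pd

    # Hypothetical training data: 900 'Yes' and 100 'No' records (Case 1)
    df = pd.DataFrame({"target": ["Yes"] * 900 + ["No"] * 100})

    # Absolute counts per class
    print(df["target"].value_counts())
    # Yes    900
    # No     100

    # Relative frequencies make the 90/10 imbalance obvious at a glance
    print(df["target"].value_counts(normalize=True))
    # Yes    0.9
    # No     0.1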



Problems with an imbalanced dataset:

  • Algorithms may become biased towards the majority class and thus tend to predict the majority class as the output.
  • Minority-class observations can look like noise to the model and end up being ignored.
  • An imbalanced dataset yields a misleading accuracy score, as demonstrated below.
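To see why accuracy is misleading, consider a model that always predicts the majority class: on the 900/100 split above it scores 90% accuracy while never detecting a single ‘No’. A minimal sketch using scikit-learn’s DummyClassifier (the feature matrix here is just a placeholder):

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, recall_score

    # Placeholder features; labels follow the 900 'Yes' / 100 'No' split
    X = np.zeros((1000, 1))
    y = np.array(["Yes"] * 900 + ["No"] * 100)

    # A "model" that always predicts the most frequent class
    clf = DummyClassifier(strategy="most_frequent").fit(X, y)
    y_pred = clf.predict(X)

    print(accuracy_score(y, y_pred))                # 0.9 -- looks good
    print(recall_score(y, y_pred, pos_label="No"))  # 0.0 -- misses every 'No'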

Techniques to deal with an imbalanced dataset:

  • Under Sampling:
    In this technique, we reduce the sample size of the majority class to match the sample size of the minority class.

    Example:
    Let’s take an imbalanced training dataset with 1000 records.

    Before Under Sampling:

    • Target class ‘Yes’ = 900 records
    • Target class ‘No’ = 100 records

    After Under Sampling:

    • Target class ‘Yes’ = 100 records
    • Target class ‘No’ = 100 records

    Now, both classes have the same sample size.

    Pros:

    • Less computation power is needed.

    Cons:

    • Some important patterns may be lost because records are dropped.
    • It is only beneficial for huge datasets with millions of records.

    Note: Under Sampling should only be done when we have a huge number of records; a concrete sketch follows below.
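    As a concrete sketch of random under-sampling, scikit-learn’s resample utility can drop majority records at random (the DataFrame and column names are hypothetical):

        import pandas as pd
        from sklearn.utils import resample

        # Hypothetical imbalanced dataset: 900 'Yes' vs 100 'No'
        df = pd.DataFrame({
            "feature": range(1000),
            "target": ["Yes"] * 900 + ["No"] * 100,
        })
        majority = df[df["target"] == "Yes"]
        minority = df[df["target"] == "No"]

        # Randomly drop majority records until both classes have 100 each
        majority_downsampled = resample(
            majority,
            replace=False,             # sample without replacement
            n_samples=len(minority),   # match the minority class size
            random_state=42,           # for reproducibility
        )

        balanced = pd.concat([majority_downsampled, minority])
        print(balanced["target"].value_counts())
        # Yes    100
        # No     100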

  • Over Sampling:
    In this technique, we increase the sample size of the minority class by replicating its records until it matches the sample size of the majority class.

    Example:
    Let’s take the same imbalanced training dataset with 1000 records.

    Before Over Sampling:

    • Target class ‘Yes’ = 900 records
    • Target class ‘No’ = 100 records

    After Over Sampling:

    • Target class ‘Yes’ = 900 records
    • Target class ‘No’ = 900 records

    Pros:

    • No patterns are lost, which can improve model performance.

    Cons:

    • Replicating the data can lead to overfitting.
    • More computation power is needed.
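    A matching sketch for random over-sampling: the minority class is replicated with replacement until it matches the majority class (same hypothetical data as in the under-sampling sketch):

        import pandas as pd
        from sklearn.utils import resample

        # Same hypothetical 900/100 split as before
        df = pd.DataFrame({
            "feature": range(1000),
            "target": ["Yes"] * 900 + ["No"] * 100,
        })
        majority = df[df["target"] == "Yes"]
        minority = df[df["target"] == "No"]

        # Replicate minority records (with replacement) up to the majority size
        minority_upsampled = resample(
            minority,
            replace=True,              # sampling with replacement creates duplicates
            n_samples=len(majority),   # match the majority class size
            random_state=42,
        )

        balanced = pd.concat([majority, minority_upsampled])
        print(balanced["target"].value_counts())
        # Yes    900
        # No     900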

So, which one to choose: ‘Under Sampling’ or ‘Over Sampling’?

It depends on the dataset. If we have a huge dataset, choose ‘Under Sampling’; otherwise, go with ‘Over Sampling’.

Using Tree-Based Models:
Tree-based models deal with imbalanced datasets more easily than non-tree-based models due to their hierarchical structure, which lets successive splits isolate regions dominated by the minority class; a short sketch follows the list below.

Different tree-based models are:

  • Decision Trees
  • Random Forests
  • Gradient Boosted Trees
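As a hedged sketch, here is a random forest trained on synthetic imbalanced data; scikit-learn’s class_weight="balanced" option additionally re-weights errors on the minority class, which often helps on skewed data (the dataset here is purely illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic imbalanced data: roughly 90% class 0 and 10% class 1
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # class_weight="balanced" scales each class's weight inversely to its frequency
    clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
    clf.fit(X, y)
    print(clf.predict(X[:5]))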

Using Anomaly Detection Algorithms:

  • Anomaly or outlier detection algorithms are ‘one-class classification algorithms’ that help identify outliers (rare data points) in the dataset.
  • In an imbalanced dataset, treat the majority-class records as ‘normal’ data and the minority-class records as ‘outlier’ data.
  • These algorithms are trained only on the normal data.
  • The trained model can then predict whether a new record is normal or an outlier, as sketched below.
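A minimal sketch of this idea using scikit-learn’s OneClassSVM, a one-class classifier: train on ‘normal’ (majority-class) records only, then flag new records as normal (+1) or outlier (-1). The data here is synthetic and purely illustrative:

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(42)

    # Treat majority-class records as "normal" data clustered near the origin
    normal_data = rng.normal(loc=0.0, scale=1.0, size=(900, 2))

    # Train the one-class model on normal data only; nu bounds the
    # fraction of training points treated as outliers
    model = OneClassSVM(kernel="rbf", nu=0.05).fit(normal_data)

    # New records: one near the normal cluster, one far away
    new_records = np.array([[0.1, -0.2], [8.0, 9.0]])
    print(model.predict(new_records))  # likely [ 1 -1 ]: +1 = normal, -1 = outlier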

 
