# Handling Imbalanced Data for Classification

• Last Updated : 17 Jul, 2020

Balanced vs Imbalanced Dataset :

• Balanced Dataset: In a Balanced dataset, there is approximately equal distribution of classes in the target column.
• Imbalanced Dataset: In an Imbalanced dataset, there is a highly unequal distribution of classes in the target column.

Let’s understand this with the help of an example :
Example : Suppose there is a Binary Classification problem with the following training data:

• Total Observations : 1000
• Target variable class is either ‘Yes’ or ‘No’.

Case 1:
If there are 900 ‘Yes’ and 100 ‘No’, then it represents an Imbalanced dataset, as there is a highly unequal distribution of the two classes.

Case 2:
If there are 550 ‘Yes’ and 450 ‘No’, then it represents a Balanced dataset, as there is an approximately equal distribution of the two classes.

Hence, there is a significant amount of difference between the sample sizes of the two classes in an Imbalanced Dataset.

Problem with Imbalanced dataset:

• Algorithms may become biased towards the majority class and thus tend to predict the majority class as the output.
• Minority class observations look like noise to the model and get ignored by it.
• An imbalanced dataset gives a misleading accuracy score.
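The misleading-accuracy problem can be illustrated with a quick sketch (the variable names here are hypothetical): a naive “model” that always predicts the majority class on the 900/100 example above scores 90% accuracy while misclassifying every minority record.

```python
# Naive majority-class predictor on the 900 'Yes' / 100 'No' example.
y_true = ['Yes'] * 900 + ['No'] * 100   # 1000 observations
y_pred = ['Yes'] * 1000                 # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9 -- looks good, yet every 'No' case is wrong

# Recall on the minority class exposes the problem:
no_recall = sum(t == p for t, p in zip(y_true, y_pred) if t == 'No') / 100
print(no_recall)  # 0.0
```

This is why metrics such as recall, precision, or F1-score on the minority class are more informative than raw accuracy for imbalanced data.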

Techniques to deal with Imbalanced dataset :

• Under Sampling :
In this technique, we reduce the sample size of the majority class to match the sample size of the minority class.

Example :
Let’s take an imbalanced training dataset with 1000 records.

Before Under Sampling :

• Target class ‘Yes’ = 900 records
• Target class ‘No’ = 100 records

After Under Sampling :

• Target class ‘Yes’ = 100 records
• Target class ‘No’ = 100 records

Now, both classes have the same sample size.

Pros :

• Low computation power needed.

Cons :

• Some important patterns might get lost due to the dropping of records.
• It is practical only for huge datasets with millions of records.

Note : Under Sampling should only be done when we have a huge number of records.
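The under-sampling step above can be sketched in plain Python (the record structure here is a made-up placeholder): randomly keep only as many majority-class records as there are minority-class records.

```python
import random

random.seed(42)  # for reproducibility

# Toy dataset matching the example: 900 'Yes' and 100 'No' records.
# Each record is (features, label); the feature dict is a placeholder.
data = [({'id': i}, 'Yes') for i in range(900)] + \
       [({'id': i}, 'No') for i in range(900, 1000)]

majority = [r for r in data if r[1] == 'Yes']
minority = [r for r in data if r[1] == 'No']

# Under sampling: randomly sample the majority class without replacement
# down to the minority class size.
undersampled = random.sample(majority, len(minority)) + minority
random.shuffle(undersampled)

print(len(undersampled))  # 200 records: 100 'Yes' + 100 'No'
```

Libraries such as imbalanced-learn provide ready-made samplers for this, but the core idea is just random sampling without replacement, as shown.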

• Over Sampling :
In this technique, we increase the sample size of the minority class by replicating its records until it matches the sample size of the majority class.

Example :
Let’s take the same imbalanced training dataset with 1000 records.

Before Over Sampling :

• Target class ‘Yes’ = 900 records
• Target class ‘No’ = 100 records

After Over Sampling :

• Target class ‘Yes’ = 900 records
• Target class ‘No’ = 900 records

Pros :

• No patterns are lost, which enhances model performance.

Cons :

• Replication of the data can lead to overfitting.
• High computation power needed.
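Over sampling by replication can likewise be sketched in plain Python: minority records are drawn with replacement until the class counts match.

```python
import random

random.seed(0)  # for reproducibility

# Labels matching the example: 900 'Yes' and 100 'No' records.
majority = ['Yes'] * 900
minority = ['No'] * 100

# Over sampling: replicate minority records (sampling WITH replacement)
# until the minority class matches the majority class size.
oversampled_minority = random.choices(minority, k=len(majority))
balanced = majority + oversampled_minority

print(balanced.count('Yes'), balanced.count('No'))  # 900 900
```

Because minority records are duplicated rather than newly collected, the overfitting risk mentioned above remains; techniques like SMOTE instead synthesize new minority samples, but plain replication is what this section describes.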

So, which one to choose: ‘Under Sampling’ or ‘Over Sampling’ ?

It depends upon the dataset. If we have a huge dataset, then choose ‘Under Sampling’; otherwise, go with ‘Over Sampling’.

Using Tree Based Models :
‘Tree-based models’ handle imbalanced datasets more easily than non-tree-based models, because their hierarchical structure lets them isolate regions of the feature space where the minority class dominates.

Different Tree Based Models are :

• Decision Trees
• Random Forests
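A minimal sketch, assuming scikit-learn is available: a decision tree is fit on a synthetic 9:1 imbalanced dataset. The `class_weight='balanced'` option (supported by both `DecisionTreeClassifier` and `RandomForestClassifier`) additionally re-weights classes inversely to their frequencies, so minority errors are penalized more.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset: ~900 majority vs ~100 minority records.
X, y = make_classification(
    n_samples=1000,
    weights=[0.9, 0.1],
    random_state=42,
)

# class_weight='balanced' weights each class inversely to its frequency.
clf = DecisionTreeClassifier(class_weight='balanced', random_state=42)
clf.fit(X, y)

# The tree still learns to predict the minority class (label 1).
preds = clf.predict(X)
print(sorted(set(preds)))
```

Evaluating such a model should still use minority-aware metrics (recall, F1) rather than accuracy, for the reasons discussed earlier.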