In machine learning, “imbalanced classes” is a familiar problem that arises in classification when a dataset contains an unequal number of data points per class. Training a model becomes trickier because plain accuracy is no longer a reliable measure of its performance, and if the minority class has very few data points, it may end up being almost completely ignored during training.
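For instance (an illustrative sketch with made-up labels): with 95 samples of class 0 and only 5 of class 1, a classifier that always predicts class 0 still reports 95% accuracy while never detecting the minority class.
Python3
from sklearn.metrics import accuracy_score

# illustrative only: 95 majority-class (0) labels and 5 minority-class (1) labels
y_true = [0] * 95 + [1] * 5

# a "model" that ignores the minority class and always predicts the majority class
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # prints 0.95 despite missing every class-1 sample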
Over/Up-Sample Minority Class
In up-sampling, samples from the minority class are randomly duplicated until it reaches the same size as the majority class. There are several ways to achieve this.
1. Using scikit-learn:
This can be done with the resample function from sklearn.utils.
Syntax: sklearn.utils.resample(*arrays, replace=True, n_samples=None, random_state=None, stratify=None)
Parameters:
- *arrays: Dataframe/lists/arrays
- replace: Implements resampling with or without replacement. Boolean type of value. Default value is True.
- n_samples: Number of samples to generate. Default value is None, in which case the first dimension of the arrays is used. This value cannot be larger than the length of the arrays when replace is False.
- random_state: Controls the randomness of the resampling so that results are reproducible. Default value is None.
- stratify: If not None, data is resampled in a stratified fashion, using this as the class labels. Default value is None.
Return Value: Sequence of resampled data.
Example:
Python3
from sklearn.utils import resample
from sklearn.datasets import make_classification
import pandas as pd

# create an imbalanced two-class dataset (roughly 80% class 0, 20% class 1)
X, y = make_classification(n_classes=2,
                           weights=[0.8, 0.2],
                           n_features=4,
                           n_samples=100,
                           random_state=42)

df = pd.DataFrame(X, columns=['feature_1',
                              'feature_2',
                              'feature_3',
                              'feature_4'])
df['balance'] = y
print(df)

# separate the majority (0) and minority (1) classes
df_major = df[df.balance == 0]
df_minor = df[df.balance == 1]

# up-sample the minority class with replacement to 80 samples
# (the approximate size of the majority class)
df_minor_sample = resample(df_minor,
                           replace=True,
                           n_samples=80,
                           random_state=42)

# combine the original majority class with the up-sampled minority class
df_sample = pd.concat([df_major, df_minor_sample])
print(df_sample.balance.value_counts())
Output:

Explanation:
- Firstly, we’ll divide the data points from each class into separate DataFrames.
- After this, the minority class is resampled with replacement by setting the number of data points equivalent to that of the majority class.
- In the end, we’ll concatenate the original majority class DataFrame and up-sampled minority class DataFrame.
2. Using RandomOverSampler:
This can be done with the help of the RandomOverSampler class present in imblearn. It randomly duplicates existing minority-class samples (sampling with replacement by default) until the requested class balance is reached.
Syntax: RandomOverSampler(sampling_strategy=’auto’, random_state=None, shrinkage=None)
Parameters:
- sampling_strategy: Sampling information for the dataset. Possible values: ‘minority’ (resample only the minority class), ‘not minority’ (resample all classes except the minority class), ‘not majority’ (resample all classes except the majority class), ‘all’ (resample all classes), ‘auto’ (equivalent to ‘not majority’). Default value is ‘auto’.
- random_state: Controls the randomization of the algorithm so that results are reproducible. Default value is None.
- shrinkage: Parameter controlling the shrinkage applied when generating a smoothed bootstrap. Values: a float applies the same shrinkage factor to all classes, a dict gives each class its own shrinkage factor, and None is equivalent to a shrinkage of 0 (plain duplication). Default value is None.
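As a quick, illustrative sketch (not part of the example below, and the specific values are only assumptions), sampling_strategy and shrinkage could be set explicitly like this:
Python3
from imblearn.over_sampling import RandomOverSampler

# illustrative settings, not the defaults used in the example below:
# sampling_strategy=0.5 asks for a minority:majority ratio of 1:2 after resampling,
# shrinkage=0.2 perturbs each duplicated sample (a smoothed bootstrap)
ros = RandomOverSampler(sampling_strategy=0.5,
                        shrinkage=0.2,
                        random_state=42)
# X_res, y_res = ros.fit_resample(X, y)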
Example:
Python3
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# create an imbalanced two-class dataset
X, y = make_classification(n_classes=2,
                           weights=[0.8, 0.2],
                           n_features=4,
                           n_samples=100,
                           random_state=42)

t = [d for d in y if d == 0]
s = [d for d in y if d == 1]
print('Before Over-Sampling: ')
print('Samples in class 0: ', len(t))
print('Samples in class 1: ', len(s))

# randomly duplicate minority-class samples until both classes are the same size
OverS = RandomOverSampler(random_state=42)
X_Over, Y_Over = OverS.fit_resample(X, y)

t = [d for d in Y_Over if d == 0]
s = [d for d in Y_Over if d == 1]
print('After Over-Sampling: ')
print('Samples in class 0: ', len(t))
print('Samples in class 1: ', len(s))
Output:

3. Synthetic Minority Oversampling Technique (SMOTE):
SMOTE is used to generate artificial/synthetic samples for the minority class. It works by randomly choosing a sample from the minority class, finding its K nearest minority-class neighbours, and placing a synthetic sample at a random point on the line segment between the chosen sample and one of those neighbours. The SMOTE class is present in the imblearn module.
Syntax: SMOTE(sampling_strategy=’auto’, random_state=None, k_neighbors=5, n_jobs=None)
Parameters:
- sampling_strategy: Sampling information for the dataset.
- random_state: Controls the randomization of the algorithm so that results are reproducible. Default value is None.
- k_neighbors: Number of nearest neighbours used to generate the artificial/synthetic samples. Default value is 5.
- n_jobs: Number of CPU cores to be used. Default value is None, which means 1 (not 0).
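One practical note (an illustrative sketch, with assumed values): when the minority class contains very few samples, k_neighbors may need to be lowered so that enough minority-class neighbours actually exist.
Python3
from imblearn.over_sampling import SMOTE

# illustrative only: use fewer neighbours for a very small minority class
smote = SMOTE(k_neighbors=3, random_state=42)
# X_res, y_res = smote.fit_resample(X, y)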
Example:
Python3
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# create an imbalanced two-class dataset
X, y = make_classification(n_classes=2,
                           weights=[0.8, 0.2],
                           n_features=4,
                           n_samples=100,
                           random_state=42)

t = [d for d in y if d == 0]
s = [d for d in y if d == 1]
print('Before Over-Sampling: ')
print('Samples in class 0: ', len(t))
print('Samples in class 1: ', len(s))

# generate synthetic minority-class samples by interpolating between neighbours
smote = SMOTE()
X_OverSmote, Y_OverSmote = smote.fit_resample(X, y)

t = [d for d in Y_OverSmote if d == 0]
s = [d for d in Y_OverSmote if d == 1]
print('After Over-Sampling: ')
print('Samples in class 0: ', len(t))
print('Samples in class 1: ', len(s))
Output:

Explanation:
- A minority-class sample is taken as the input vector.
- Its K nearest neighbours (within the minority class) are determined.
- One of these neighbours is picked, and an artificial sample is placed at a random point on the line segment between that neighbour and the sample under consideration (see the sketch below).
- This is repeated until the dataset is balanced.
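The interpolation step can be sketched in a few lines (this is only an illustration with made-up points, not imblearn's implementation):
Python3
import numpy as np

rng = np.random.default_rng(42)

# x is a minority-class sample, neighbor is one of its k nearest minority-class neighbours
x = np.array([1.0, 2.0, 3.0, 4.0])
neighbor = np.array([1.5, 2.5, 2.0, 4.5])

gap = rng.random()                    # random number in [0, 1)
synthetic = x + gap * (neighbor - x)  # new point on the segment between x and its neighbour
print(synthetic)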
Advantages of Over-Sampling:
- No information loss
- Often works better than under-sampling, since no training examples are discarded
Disadvantages of Over-Sampling:
- Increased chance of over-fitting, since duplicates (or near-duplicates) of minority-class samples are created
Down/Under-sample Majority Class
Down/under-sampling is the process of randomly selecting samples from the majority class and removing them, so that the majority class does not dominate the minority class in the dataset.
1. Using scikit-learn:
It is similar to up-sampling and can be done with the resample function from sklearn.utils.
Example:
Python3
from sklearn.utils import resample
from sklearn.datasets import make_classification
import pandas as pd

# create an imbalanced two-class dataset
X, y = make_classification(n_classes=2,
                           weights=[0.8, 0.2],
                           n_features=4,
                           n_samples=100,
                           random_state=42)

df = pd.DataFrame(X, columns=['feature_1',
                              'feature_2',
                              'feature_3',
                              'feature_4'])
df['balance'] = y
print(df)

# separate the majority (0) and minority (1) classes
df_major = df[df.balance == 0]
df_minor = df[df.balance == 1]

# down-sample the majority class without replacement to 20 samples
# (the approximate size of the minority class)
df_major_sample = resample(df_major,
                           replace=False,
                           n_samples=20,
                           random_state=42)

# combine the down-sampled majority class with the original minority class
df_sample = pd.concat([df_major_sample, df_minor])
print(df_sample.balance.value_counts())
Output:

Explanation:
- Firstly, we’ll divide the data points from each class into separate DataFrames.
- After this, the majority class is resampled without replacement by setting the number of data points equivalent to that of the minority class.
- In the end, we’ll concatenate the original minority class DataFrame and the down-sampled majority class DataFrame.
2. Using RandomUnderSampler:
This can be done with the help of the RandomUnderSampler class present in imblearn. It randomly selects a subset of the majority class (or classes) so that the class counts become balanced.
Syntax: RandomUnderSampler(sampling_strategy=’auto’, random_state=None, replacement=False)
Parameters:
- sampling_strategy: Sampling Information for dataset.
- random_state: Controls the randomization of the algorithm so that results are reproducible. Default value is None.
- replacement: Implements resampling with or without replacement. Boolean type of value. Default value is False.
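sampling_strategy also accepts a dict of target counts per class; a minimal sketch (the counts shown are only assumptions, not part of the example below):
Python3
from imblearn.under_sampling import RandomUnderSampler

# illustrative only: keep exactly 20 samples of each class after resampling
# (assumes the dataset has classes 0 and 1 with at least 20 samples each)
rus = RandomUnderSampler(sampling_strategy={0: 20, 1: 20},
                         random_state=42)
# X_res, y_res = rus.fit_resample(X, y)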
Example:
Python3
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# create an imbalanced two-class dataset
X, y = make_classification(n_classes=2,
                           weights=[0.8, 0.2],
                           n_features=4,
                           n_samples=100,
                           random_state=42)

t = [d for d in y if d == 0]
s = [d for d in y if d == 1]
print('Before Under-Sampling: ')
print('Samples in class 0: ', len(t))
print('Samples in class 1: ', len(s))

# randomly drop majority-class samples (here with replacement) until the classes are balanced
UnderS = RandomUnderSampler(random_state=42,
                            replacement=True)
X_Under, Y_Under = UnderS.fit_resample(X, y)

t = [d for d in Y_Under if d == 0]
s = [d for d in Y_Under if d == 1]
print('After Under-Sampling: ')
print('Samples in class 0: ', len(t))
print('Samples in class 1: ', len(s))
Output:

Advantages of Under-Sampling:
- Better run time
- Reduces storage requirements, since the number of training examples decreases
Disadvantages of Under-Sampling:
- May discard potentially important information.
- The retained sample may be biased and not representative of the full majority class.