Open In App

How to Implement Stratified Sampling with Scikit-Learn

In this article, we will learn about How to Implement Stratified Sampling with Scikit-Learn.

What is Stratified sampling?

Stratified sampling is a sampling technique in which the population is subdivided into groups based on specific characteristics relevant to the problem before sampling. The samples are drawn from this group with ample sizes proportional to the size of the subgroup in the population and combined to form the final sample. The purpose is to ensure that all subgroup is represented proportionally in the final sample.



Stratified sampling is particularly useful when there are known variations within the population that could significantly impact the model results.

How to perform stratified sampling?

To perform stratified sampling, you need to follow the steps discussed below:



  1. Define Strata: Identify and define the state of subgroups with the population based on the characteristic(s) of interest such as race, gender, income, level of education, age group, etc.
  2. Sample Size: Determine overall sample size and individual subgroup sample size ensuring that the ratio of each subgroup chosen is proportionally representative of the overall population.
  3. Select samples: Randomly select samples from each identified stratum by applying random sampling techniques, such as simple random sampling or systematic random sampling.
  4. Final Sample: Combine all samples from the various strata into a unified representative sample.

When to use stratified sampling?

  1. Population Heterogeneity: When the population can be divided into mutually exclusive subgroups based on a specific characteristic.
  2. Equal Representation: When we want to ensure a specific characteristics or group of characteristics is adequately represented in the final sample.
  3. Resource Constraints: When you want to generalize the study results to the entire population and ensure that the estimates are valid for each stratum, but the resources are limited.

Examples include Market research where the stratification is generally down by income levels or age groups, and Election exit poll where stratification is generally done by demographics.

Advantages of stratified sampling

  1. Precision: By ensuring representation from all relevant subgroups, stratified sampling often provides more precise and reliable estimates for the entire population.
  2. Generalization: Results from the sample can be more confidently generalized to the entire population because each subgroup is accounted for in the sampling process.
  3. Reduced Variability: Stratified sampling can reduce the overall variability in the sample, leading to more accurate and meaningful statistical analyses.

Comparison started with other Sampling Techniques

Stratified sampling is just one of several sampling techniques used in research. Let’s compare stratified sampling with a few other common sampling techniques:

Stratified Sampling:

  1. Ensures representation from all subgroups. Useful when there is significant variability within the population.
  2. Requires knowledge about the population characteristics for effective stratification.

Simple Random Sampling:

  1. Simple Random Sampling is easy to implement, especially when the population is homogeneous.
  2. May not capture variability within the population, and certain subgroups may be underrepresented.

Cluster Sampling

  1. In cluster sampling, population is naturally grouped into clusters, which may not necessarily be based on characteristics of interest.
  2. The entire cluster becomes the sampling unit.
  3. Clusters are randomly selected, and all individuals within the selected clusters are included in the sample.
  4. Suitable for geographically dispersed populations, reduces costs and time.

Quota Sampling

  1. Quota sampling involves dividing the population into subgroups or quotas based on certain characteristics.
  2. The main difference is that in stratified sampling we draw a random sample from each subgroup (probability sampling). In quota sampling, we set predetermined quotas for specific characteristics based on our knowledge. Also, the samples selected are non-random meaning the researcher can use convenience or judgmental sampling to meet the predetermined quotas.

Systematic sampling

  1. Systematic sampling is a method of sampling where every nth member of a population is selected for inclusion in the sample after the first member is randomly chosen. This is done by selecting a random starting point and then picking every kth element from the population. The value of “k” is determined by dividing the total size of the population by the desired sample size.

Implementing Stratified Sampling

Let us load the iris dataset to implement stratified sampling.




import pandas as pd
from sklearn import datasets
 
iris = datasets.load_iris()
iris_df=pd.DataFrame(iris.data)
 
iris_df['class']=iris.target
iris_df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
 
iris_df['class'].value_counts()

Let us use the scikit-learn’s train_test_split function from scikit-learn’s model_selection module to split a dataset into training and testing sets.

Let us see class distribution when stratify is set to None.




from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X,y, train_size=0.8,
                                                   random_state=None,
                                                   shuffle=True, stratify=None)
 
print("Class distribution of train set")
print(y_train.value_counts())
print()
print("Class distribution of test set")
print(y_test.value_counts())

Output:

 Class distribution of train set
0    43
2    40
1    37
Name: class, dtype: int64
Class distribution of test set
1    13
2    10
0     7
Name: class, dtype: int64

Let us see class distribution when stratify is set to True.




from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X,y, train_size=0.8,
                                                   random_state=None,
                                                   shuffle=True, stratify=y)
print(y_train.value_counts())
print(y_test.value_counts())

Output:

Class distribution of train set
0    40
2    40
1    40
Name: class, dtype: int64
Class distribution of test set
2    10
1    10
0    10
Name: class, dtype: int64

If we want to use stratified sampling with k fold, we can use StratifiedShuffleSplit class from Scikit Learn as below.




import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
skf = StratifiedShuffleSplit(n_splits=2, train_size = .8)
X = iris_df.iloc[:,:-1]
y = iris_df.iloc[:,-1]
 
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold {i}:")
    print(f" {iris_df.iloc[train_index]['class'].value_counts()}")
    print("-"*10)
    print(f" {iris_df.iloc[test_index]['class'].value_counts()}")
    print("*" * 60)

Output:

Fold 0:
 2    40
1    40
0    40
Name: class, dtype: int64
----------
 2    10
1    10
0    10
Name: class, dtype: int64
************************************************************
Fold 1:
 2    40
1    40
0    40
Name: class, dtype: int64
----------
 2    10
0    10
1    10
Name: class, dtype: int64
************************************************************

Conclusion

In this article, we saw how we can use stratified sampling to ensure that the final sample represents the population by ensuring that the characteristic of interest is neither underrepresented nor overrepresented.

Frequently Asked Questions (FAQs)

1. How is the sample size determined in stratified sampling?

The formula for calculating the sample size in each stratum is:

(Size of stratum in population/Total Population size) *Overall sample size

2. How does stratified sampling help in addressing bias?

Stratified sampling helps address bias by ensuring that each stratum, representing a subgroup with specific characteristics, is adequately represented in the sample.

3. Difference between stratified sampling and quota sampling?

The main difference is how, samples are selected after defining the strata. In stratified sampling the samples are drawn randomly, while in quota convenience sampling is used.


Article Tags :