
Stratified Random Sampling – An Overview

Stratified Random Sampling is a technique used in Machine Learning and Data Science to select random samples from a large population for training and test datasets. When the population is heterogeneous or the sample is small, simple random sampling can, purely by chance, over-represent some subgroups and under-represent others, which increases sampling error. Stratified Random Sampling ensures that every subgroup is adequately represented in the sample.

Stratified Random Sampling reduces this problem by dividing the population into smaller sub-groups and randomly picking samples from each of them. In this article, we will dive into the world of Random Sampling and see how Stratified Random Sampling improves on traditional Random Sampling.



What is Stratified Random Sampling?

Unlike the traditional Random Sampling method, in which values are picked randomly from a population without considering any factor or feature, Stratified Random Sampling first splits the entire population into smaller subsets known as strata (the singular is stratum, meaning a single subgroup). The split is based on a particular characteristic present in the data. In simpler terms, the members of the population are grouped according to the value of a chosen feature.



After the population has been divided into sub-groups based on the feature, random sampling takes place within each stratum. Because of this, every category of the stratifying feature that exists in the population also appears in the sample dataset, which reduces the risk of an unrepresentative sample. With simple Random Sampling, there is always a chance that some subgroups end up over- or under-represented purely by chance. With the stratified approach, each subgroup is guaranteed a place in the sample dataset, which helps a Machine Learning model trained or evaluated on it to behave more reliably.

Stratified random sampling

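In simple terms, Stratified Random Sampling consists of two main steps: dividing the population into strata based on a chosen feature, and then drawing a random sample from each stratum.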
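The two steps, splitting the population into strata and then sampling inside each one, can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `stratified_sample`, the `key` callback and the 90/10 toy population are assumptions made for the example, not something the article prescribes.

```python
import random
from collections import defaultdict

def stratified_sample(population, key, fraction, seed=0):
    """Split `population` into strata by `key`, then sample `fraction` of each."""
    rng = random.Random(seed)

    # Step 1: divide the population into strata.
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)

    # Step 2: draw a simple random sample inside every stratum
    # (at least one element, so that no stratum disappears).
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 90 records of class "A" and 10 of class "B".
population = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]
sample = stratified_sample(population, key=lambda rec: rec[0], fraction=0.2)
counts = {label: sum(1 for rec in sample if rec[0] == label) for label in ("A", "B")}
print(counts)  # {'A': 18, 'B': 2}
```

The minority class "B" keeps its 10% share of the sample, whatever random seed is used.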

What are the Types of Stratified Random Sampling?

There are two main types of Stratified Random Sampling –

  1. Proportionate Stratified Random Sampling
  2. Disproportionate Stratified Random Sampling

Proportionate Stratified Random Sampling

It is a type of Stratified Random Sampling in which the number of random samples taken from each stratum depends solely on how big that stratum is compared to the whole population. In other words, the fraction of the sample taken from a stratum matches the fraction of the population that the stratum makes up.

In proportionate stratified random sampling, the sample size for each stratum is proportional to the stratum’s size in the population. This means that if a stratum represents 20% of the population, then 20% of the sample should be selected from that stratum.

This type of stratified random sampling is most commonly used when the strata are of broadly similar size. It keeps the sample representative of the entire population, but it may be less efficient than other sampling techniques if some strata are much smaller than others, because those strata then contribute only a few samples.

Example: Surveying student satisfaction in a university with freshmen, sophomores, juniors, and seniors.
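As a rough sketch of the university example, the allocation for each class year can be computed from its share of the student body. The enrolment figures and the total sample size of 100 below are invented for illustration, and were chosen so that the shares divide evenly.

```python
import random

# Hypothetical enrolment figures; the article gives no actual numbers.
enrolment = {"freshmen": 400, "sophomores": 300, "juniors": 200, "seniors": 100}
total_sample = 100

population_size = sum(enrolment.values())

# Proportionate allocation: n_h = n * N_h / N for each stratum h.
# Integer division is exact here because the toy numbers divide evenly.
allocation = {
    year: total_sample * size // population_size
    for year, size in enrolment.items()
}
print(allocation)
# {'freshmen': 40, 'sophomores': 30, 'juniors': 20, 'seniors': 10}

# Students are represented by integer IDs 0..N_h-1 within each class year.
rng = random.Random(42)
sample = {year: rng.sample(range(enrolment[year]), n) for year, n in allocation.items()}
```

Freshmen make up 40% of the students and therefore 40% of the sample, and so on for the other years.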

Disproportionate Stratified Random Sampling

In disproportionate stratified random sampling, the sample size for each stratum is not proportional to the stratum’s size in the population. This means that a stratum that is considered more important for the analysis may be oversampled, while a stratum that is less important may be undersampled.

This type of stratified random sampling is most commonly used when the strata are heterogeneous in size or when some strata are considered more important than others. It can be more efficient than proportionate stratified random sampling, but it may not be as representative of the entire population.

In this kind of Stratified Random Sampling, the number of samples to draw from each stratum is specified directly, for example as a fixed count per stratum, rather than being derived from the stratum’s share of the population.

Example: Surveying residents’ opinions on a public transportation system in three districts with different population sizes.
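A minimal sketch of the district example follows, assuming three made-up districts and randomly generated satisfaction scores from 1 to 5. Because the same number of residents is drawn from every district, the small district is oversampled; weighting each stratum mean by the district's population share is one common way to correct for this when estimating an overall figure.

```python
import random
from statistics import mean

rng = random.Random(7)

# Hypothetical districts: sizes and satisfaction scores (1-5) are invented.
district_sizes = {"north": 6000, "central": 3000, "south": 1000}
scores = {
    name: [rng.randint(1, 5) for _ in range(size)]
    for name, size in district_sizes.items()
}

# Disproportionate design: the same fixed number of respondents per district.
per_district = 50
samples = {name: rng.sample(values, per_district) for name, values in scores.items()}

# Weight each stratum mean by its population share (W_h = N_h / N),
# so the oversampled small district does not dominate the estimate.
total = sum(district_sizes.values())
weights = {name: size / total for name, size in district_sizes.items()}
weighted_mean = sum(weights[name] * mean(samples[name]) for name in samples)

# For comparison: pooling all respondents ignores the unequal sampling rates.
unweighted_mean = mean(s for values in samples.values() for s in values)
print(weights)
print(round(weighted_mean, 2), round(unweighted_mean, 2))
```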

What are the benefits of Stratified Random Sampling?

The benefits of Stratified Random Sampling are –

  1. Improved Precision – Because the population is divided into smaller sub-groups called “strata” based on the features and characteristics of its elements, every subgroup is represented in the sample dataset, which usually makes estimates more precise than those from a simple random sample of the same size.
  2. Enhanced Comparisons – When the main goal is to compare subgroups of the population with one another, stratified sampling is the preferable option. It ensures that every subgroup is well represented in the sample dataset, which makes it easier to build a less biased and more accurate Machine Learning model.
  3. Resource Efficiency – Dividing the population into strata helps in allocating resources efficiently: more resources can be allocated to the subgroups that need emphasis, and fewer to the rest.
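The precision benefit can be checked with a small simulation. The sketch below uses a synthetic population of two equally sized strata with very different means (all numbers are assumptions chosen for the example) and compares how much the estimated population mean varies across repeated simple random samples and across repeated stratified samples of the same size.

```python
import random
from statistics import mean, pvariance

rng = random.Random(1)

# Synthetic population: two equally sized strata with very different means.
low = [rng.gauss(10, 1) for _ in range(500)]
high = [rng.gauss(50, 1) for _ in range(500)]
everyone = low + high

def srs_estimate(n):
    """Mean of a simple random sample of size n from the whole population."""
    return mean(rng.sample(everyone, n))

def stratified_estimate(n):
    """Mean of a proportionate stratified sample: n/2 from each stratum."""
    half = n // 2
    return mean(rng.sample(low, half) + rng.sample(high, half))

# Repeat both designs many times and compare the spread of the estimates.
srs_runs = [srs_estimate(20) for _ in range(300)]
strat_runs = [stratified_estimate(20) for _ in range(300)]

srs_spread = pvariance(srs_runs)
strat_spread = pvariance(strat_runs)
print(srs_spread > strat_spread)
```

In this setup, most of the variation in a simple random sample comes from how many "low" and "high" members happen to be drawn. Stratification fixes that split in advance, so the estimates cluster much more tightly. The gain is smaller when the strata differ less from one another.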

How to Conduct Stratified Random Sampling?

Now we will see how to perform Stratified Random Sampling step by step.

Step 1: Define your population and subgroups

The first step of any sampling process is to define the population from which we will collect our samples. The main task is then to select the characteristic by which we want to divide the population into subgroups, i.e. strata. This is a very important step, because the chosen characteristic determines how the strata are formed. It is recommended to choose a clear feature that separates the elements unambiguously, so that every element belongs to exactly one stratum. If the categories overlap, forming the strata becomes difficult.

It is also possible to use multiple columns/features to stratify the dataset and create sub-groups, as long as each combination of values identifies a distinct, non-overlapping group.
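One simple way to stratify on several columns at once is to combine their values into a single composite label. The sketch below assumes records stored as dictionaries; the column names `gender` and `region` and the helper `stratum_key` are purely illustrative.

```python
from collections import Counter

# Hypothetical records; the column names are illustrative only.
records = [
    {"gender": "F", "region": "urban"},
    {"gender": "F", "region": "rural"},
    {"gender": "M", "region": "urban"},
    {"gender": "M", "region": "urban"},
    {"gender": "F", "region": "urban"},
]

def stratum_key(record, columns=("gender", "region")):
    """Combine several column values into a single stratum label."""
    return tuple(record[col] for col in columns)

# Each distinct combination of values becomes its own stratum.
stratum_sizes = Counter(stratum_key(r) for r in records)
print(stratum_sizes)
```

Every record maps to exactly one tuple, so the resulting strata cannot overlap. Counting the stratum sizes early also shows whether some combinations are too rare to sample from.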

Step 2: Separate the population into strata

Now, consider every member of the population and assign it to a stratum based on its value of the chosen characteristic. Each individual subgroup is a stratum, and together they are known as the strata.

Step 3: Decide on the sample size for each stratum

Before deciding the sample size of each stratum, it is necessary to decide which type of Stratified Random Sampling to use, proportionate or disproportionate. In proportionate sampling, the size of the sample from each stratum is in proportion to the share of the population that the stratum makes up. If the stratum is a big part of the population, we take a larger number of samples from it, and a smaller number if it is a small part.

In disproportionate sampling, the sample sizes do not need to follow the strata’s proportions in the population.

After deciding which method to use, it is time to decide the sample size. It should be large enough that every stratum is adequately represented in the sample dataset and statistical analysis can be carried out on it properly.
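In practice the proportional shares rarely come out as whole numbers. One possible way to handle this, shown here as an assumption rather than as a method taken from the article, is largest-remainder rounding: round every share down, then give the leftover units to the strata with the largest fractional parts. The function name `allocate` and the stratum sizes are hypothetical.

```python
def allocate(stratum_sizes, total_sample):
    """Proportionate allocation that always sums to `total_sample`."""
    population = sum(stratum_sizes.values())
    exact = {
        name: total_sample * size / population
        for name, size in stratum_sizes.items()
    }
    # Start from the rounded-down share of every stratum.
    allocation = {name: int(share) for name, share in exact.items()}
    # Hand the leftover units to the strata with the largest fractional parts.
    leftover = total_sample - sum(allocation.values())
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - allocation[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation

sizes = {"a": 520, "b": 310, "c": 170}
print(allocate(sizes, 25))  # {'a': 13, 'b': 8, 'c': 4}
```

The exact shares here are 13.0, 7.75 and 4.25, so the single leftover unit goes to stratum "b". A very small stratum can still receive zero samples under this scheme, so a minimum per stratum may be worth adding for real data.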

Step 4: Randomly sample from each stratum

Now we use simple random sampling to collect data from each stratum and form our sample dataset. Once we have sampled from each stratum, we combine all of the samples into one representative sample. This can be done by simply concatenating the samples together.
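The sampling and concatenation step might look like the sketch below, which takes the strata and a previously decided allocation (both hypothetical here) and returns one combined sample. The final shuffle is an optional addition so that the combined dataset is not ordered by stratum.

```python
import random

def draw_sample(strata, allocation, seed=0):
    """Draw `allocation[name]` members from each stratum and merge them."""
    rng = random.Random(seed)
    combined = []
    for name, members in strata.items():
        # Simple random sampling without replacement inside the stratum.
        combined.extend(rng.sample(members, allocation[name]))
    # Shuffle so the final dataset is not ordered by stratum.
    rng.shuffle(combined)
    return combined

# Hypothetical strata: 20 "small" members and 80 "large" members.
strata = {
    "small": [f"s{i}" for i in range(20)],
    "large": [f"l{i}" for i in range(80)],
}
sample = draw_sample(strata, {"small": 2, "large": 8})
print(len(sample))  # 10
```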

Applications of Stratified Random Sampling

Stratified Random Sampling is commonly used in numerous research and data collection scenarios.

When to use Stratified Random Sampling?

There are certain scenarios in which Stratified Random Sampling works better than simple Random Sampling, for example when the population is heterogeneous with clearly distinct subgroups, or when minority groups must be represented in the sample.

Comparison with other Sampling Methods

| Criteria | Simple Random Sampling | Systematic Sampling | Stratified Random Sampling | Cluster Sampling |
|---|---|---|---|---|
| Definition | Every member has an equal chance of being included. | A pre-defined, fixed sampling interval is used. | The population is divided into strata and data are collected randomly from each stratum. | The population is divided into clusters and a subset of those clusters is used for analysis. |
| Advantage | Simple and easy to implement. | Can be more efficient than Random Sampling if there is an order in the population. | Ensures each subgroup is present in the sample dataset. | Efficient for a vast and geographically dispersed population. |
| Disadvantage | May not represent every subgroup in the sample dataset. | Sensitive to periodicities present in the population. | More complex to implement; if done wrongly, the resulting model will be erroneous. | Can introduce bias if the clusters differ from one another. |
| When to Use | The population is homogeneous. | The population possesses a certain pattern. | The population has distinct subgroups. | The clusters are similar and capable of representing the entire dataset. |
| Efficiency Consideration | Less efficient when the dataset is heterogeneous. | Efficient for an ordered population. | Most efficient for a heterogeneous population. | Efficient for a vast and geographically dispersed population. |
| Complexity of Implementation | Low | Moderate | Moderate | Moderate |

Characteristics of Stratified Random Sampling

  1. Division into Strata – The population is divided into sub-groups known as “strata” based on their features/characteristics.
  2. Homogeneity Within Strata – Strata are formed in such a manner that the elements of each stratum are homogeneous, i.e. they share the characteristic that identifies the stratum.
  3. Random Sampling Within Strata – Random Sampling is carried out inside every stratum, so that each stratum contributes to the sample dataset.
  4. Statistical Weighting – After the sample dataset is formed, each stratum’s results can be weighted by its share of the population, so that combined estimates reflect the population correctly.
  5. Increased Precision – The aim of Stratified Sampling is to increase the precision of estimates, and of the Machine Learning model which will be created using that sample dataset.

Advantages of Stratified Random Sampling

  1. Improved Representation of the Population – Dividing the entire population based on some features and then doing random sampling within each group ensures that every subgroup present in the population is reflected in the sample dataset. This makes sure that no subgroup is left out and the Machine Learning model gets to analyse all of them.
  2. Reduction of Sampling Error – Stratified Random Sampling can reduce the sampling error substantially compared to simple random sampling, because it ensures that the subgroups present in the population are reflected proportionately in the sample dataset.
  3. Increased Efficiency – When there is a need to put emphasis or allocate extra resources on a certain feature or characteristic present in the population, dividing the population into strata makes it possible to allocate resources based on the importance of each stratum.
  4. Reduced Bias – Stratified Random Sampling reduces the bias that arises when some groups are undersampled or oversampled by chance. This is important when there are minority groups or when unique characteristics are of interest.
  5. Greater Precision – As each stratum is sampled independently based on its characteristics, the precision of estimates from the sample dataset, and of the Machine Learning model built on it, can be improved.
  6. Generalizability – Sample datasets made using proportionate Stratified Random Sampling reflect the population and its subgroups in their true proportions, so results obtained from the sample generalize better to the entire population.

Disadvantages of Stratified Random Sampling

Even though Stratified Random Sampling has its own advantages, it comes with several disadvantages too.

  1. Complexity – Stratified Random Sampling is more complex than Simple Random Sampling, as the former consists of more steps than the latter. Developers need to find specific and relevant characteristics of the elements present in the population, divide the population into strata based on them, and then do the random sampling within each stratum. These extra steps may lead to issues if the stratification is not done properly or the population is not very diverse.
  2. Selection of Stratification Feature – Choosing the right feature to stratify the data is crucial. If it is not done properly, the sample dataset might not represent the population properly, which will eventually lead to a faulty machine learning model and result.
  3. Resource Intensive – When the population is huge and has lots of distinct subgroups, this approach creates a very large number of strata. The sample size then also becomes large, and using that large and diverse sample to train the model requires a lot of resources as well as time.
  4. Overstratification – Dividing the data into too many sub-groups can lead to overstratification: the strata become very small and may no longer differ meaningfully from one another. It can then be difficult to gather enough data from each stratum to form the sample dataset.
  5. Practical Challenges – Sometimes it is impractical to apply Stratified Random Sampling, because the population has too little variety or because the distinguishing features are not well defined and cannot be separated cleanly.
  6. Relation between the strata – Sometimes there is a hidden or underlying relationship between the features or the strata which the developers are unaware of. This may lead to misleading results if it is not taken into account while forming the strata.

Conclusion

In conclusion, Stratified Random Sampling is one of the most useful techniques in statistical sampling. By taking the diversity of the population into account and dividing it into smaller groups called “strata”, this approach increases the representativeness of a sample. Thoughtful creation of the strata ensures that every subgroup of the population is included in the sample dataset, which leads to more accurate and reliable results. Stratified Random Sampling is most useful when dealing with a heterogeneous population. Its potential to improve precision and reduce sampling error makes it a preferred choice for research and analysis where comprehensive, well-structured sampling is required.

Frequently Asked Questions (FAQ)

1. What is Stratified Random Sampling?

It is a sampling technique in which the population is divided into subgroups known as strata, based on the features they possess. Each single subgroup is called a “stratum”. Random samples are then collected from each stratum to build the final sample dataset. This technique ensures that every subgroup present in the population is considered while making the sample dataset.

2. Why is Stratified Random Sampling used?

The main reason to use Stratified Random Sampling is to make sure every subgroup in the population is represented, so that the sample dataset carries less bias and sampling error. A machine learning model trained on a stratified sample is also likely to give more reliable results than one trained on a simple random sample of the same size.

3. How is Stratified Random Sampling Different from Simple Random Sampling?

In Simple Random Sampling, the samples are picked randomly from the entire population without regard to any constraint such as subgroup membership. As a result, if the population is vast and has many distinct subgroups, some of them might be left out of the sample dataset. In Stratified Random Sampling, the population is first divided into groups and random sampling is then done within each group, so every subgroup is included in the sample dataset.

4. What is the Purpose of Stratification in Stratified Random Sampling?

The main purpose of stratification in Stratified Random Sampling is to group together elements that have features in common. Doing this helps in forming the sample dataset more precisely, because every subgroup is then included in it.

5. Can Stratified Random Sampling Improve Precision?

Yes, Stratified Random Sampling can increase the precision of estimates, especially when the strata differ clearly from one another. By considering each stratum separately, researchers get a more accurate view of the population, which eventually leads to a more precise Machine Learning model.

