Cluster Sampling

Sampling is a technique widely used in data analysis and research: we select a small part of a population and use it to draw conclusions about the whole population. Sampling can be done in many ways, and one common type is cluster sampling (also called clustered sampling). In this article, we look at cluster sampling and its implementation in Python.

What is Clustered Sampling?

Clustered sampling is a type of sampling in which the entire population is first divided into clusters, or groups. One or more clusters are then selected at random, and data is collected only from the selected clusters instead of from individuals spread across the entire population. Cluster sampling is most often used when it is not practical to sample individuals directly from the entire population.

A few examples of clusters that are already available are:

Geographic Clusters: To conduct a national survey, we can first select a random sample of states or cities, and then survey all individuals within those selected areas. This reduces the cost and challenges associated with surveying individuals across the entire country.
Schools or Classrooms: In educational research, we might randomly select a sample of schools or classrooms and then collect data from all students within those clusters.
Businesses or Organizations: When studying the performance of businesses or organizations, we could randomly select a sample of companies and then collect data from all employees within those companies.

This type of sampling is useful when the population is large or when its elements fall into natural groups, such as those mentioned above.

Steps to Perform Clustered Sampling

The steps to perform simple clustered sampling are as follows:

Step 1: Define the Population

Firstly, we need to clearly define the population we want to study. This can be a geographical area, an organization, or any other group of interest.

Step 2: Create Groups/Clusters

Next, we divide the population into clusters (groups) that do not overlap. Ideally, each cluster is internally diverse while the clusters are similar to one another, so that each cluster is a small-scale representation of the entire population. Naturally occurring clusters, such as schools or cities, can also be used.

Step 3: Randomly Select Clusters

Because the clusters are similar to one another, we can now apply random sampling to the clusters themselves, i.e., select a random sample of clusters from all the clusters formed. It is important that each cluster has a known chance of being selected; in simple cluster sampling, every cluster has an equal chance.

Step 4: List Elements from Selected Cluster

Within each selected cluster, list all the elements of that cluster. For example, if the selected cluster is the grade 8 class of one school, we need to list all the students in that class. Listing the elements makes the data collection in the next step easier to organize.

Step 5: Collect Data

Collect data from every individual in the list we made. The data collection can be done in various ways like surveys, interviews, observations, or any other method according to the type of population and our topic of interest.

Step 6: Analyze the Data

The final step is to analyze the collected data and draw conclusions about the population. Various data analysis techniques can be used, and decisions can be made based on the results.

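The six steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a made-up population of 12 classes with 5 students each; the names `population`, `selected`, `listed`, and `scores` are hypothetical, and the scores are simulated rather than collected.

```python
import random

rng = random.Random(42)  # local generator so the sketch is reproducible

# Steps 1-2: define the population and group it into non-overlapping clusters
# (hypothetical data: 12 classes, each with 5 students and a simulated score)
population = {
    f"class_{c}": {f"student_{c}_{s}": rng.randint(40, 100) for s in range(1, 6)}
    for c in range(1, 13)
}

# Step 3: randomly select clusters, each with an equal chance of selection
selected = rng.sample(sorted(population), k=3)

# Step 4: list every element in the selected clusters
listed = [(c, s) for c in selected for s in population[c]]

# Step 5: "collect" data from every listed element (here: look up the score)
scores = [population[c][s] for c, s in listed]

# Step 6: analyze the data
sample_mean = sum(scores) / len(scores)
print(selected)
print(len(scores), round(sample_mean, 2))
```

In a real study, Step 5 would be a survey or measurement rather than a dictionary lookup; the structure of the other steps stays the same.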

Types of Clustered Sampling

Each type of cluster sampling has its own advantages and disadvantages. The three main types of clustered sampling are:

1. Single-Stage Cluster Sampling

The process for this type of cluster sampling is the same as the steps given above.

Process

In single-stage cluster sampling, the entire population is first divided into clusters.
Then, a random sample of clusters is selected, i.e., a few clusters are randomly selected.
All elements within those selected clusters are surveyed to collect the data.
This data is then analyzed for purposes such as making business decisions.

Example

Suppose our population is a city with around 100 houses, and we want to find the average household income. Instead of surveying every house, we use single-stage cluster sampling.

First, we divide the population into clusters based on geographical proximity; say we get 10 clusters (neighbourhoods), each containing 10 houses.
Next, we randomly select 3 clusters and collect data from every house in those clusters.
This data can then be used to estimate the average household income for the entire city.
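The household example can be simulated as follows. This is a rough sketch, assuming the 100 houses are grouped into 10 clusters of 10 houses each and that incomes are random numbers invented for illustration; the names `city`, `chosen`, and `surveyed` are hypothetical.

```python
import random

rng = random.Random(7)

# Hypothetical city: 10 clusters of 10 houses each, with simulated incomes
city = {
    cluster_id: [rng.randint(20_000, 120_000) for _ in range(10)]
    for cluster_id in range(1, 11)
}

# Single-stage: pick 3 clusters at random and survey every house in them
chosen = rng.sample(list(city), k=3)
surveyed = [income for cluster_id in chosen for income in city[cluster_id]]

# The sample mean serves as the estimate of the city-wide average income
estimate = sum(surveyed) / len(surveyed)
true_mean = sum(sum(incomes) for incomes in city.values()) / 100

print(f"clusters surveyed: {sorted(chosen)}")
print(f"estimated mean income: {estimate:.0f} (true mean: {true_mean:.0f})")
```

Only 30 of the 100 houses are surveyed, so the estimate will differ from the true mean by some sampling error; changing the seed shows how much it varies from sample to sample.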

Pros:

This type of cluster sampling is very helpful if there is a huge population and data collection from everyone is not possible.
It also reduces the need for extensive travel and data collection.
This can be a stepping stone for more complex cluster sampling processes, where cluster sampling is done first and further sub-sampling is done within the selected clusters.

Cons:

Single-stage cluster sampling may not be suitable for small populations, because it may result in an insufficient sample size.
Analyzing data collected through single-stage cluster sampling can be more complex than for simple random sampling, because the analysis has to account for the clustering effect, which is the tendency of elements within the same cluster to be similar to each other.
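One common way to quantify the clustering effect is the design effect. A standard approximation for equal-sized clusters is deff = 1 + (m - 1) * rho, where m is the cluster size and rho is the intraclass correlation (how similar elements in the same cluster are). The sketch below applies that formula; the function names and the ICC value of 0.2 are assumptions chosen for illustration, not values from this article.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n: int, cluster_size: float, icc: float) -> float:
    """Roughly how many independent observations a clustered sample is worth."""
    return n / design_effect(cluster_size, icc)


# 30 surveyed houses in clusters of 10, with an assumed ICC of 0.2
deff = design_effect(cluster_size=10, icc=0.2)
n_eff = effective_sample_size(n=30, cluster_size=10, icc=0.2)

print(round(deff, 2))   # 2.8
print(round(n_eff, 1))  # 10.7
```

Under these assumptions, 30 clustered observations carry about as much information as 11 independent ones, which is why clustered data should not be analyzed as if it came from a simple random sample.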

2. Double-Stage Cluster Sampling

The process for this type of cluster sampling differs at Step 4: instead of listing and surveying every element, we sample elements within the selected clusters. This type is more commonly used and can be more efficient than single-stage sampling. Let us see the process in detail.

Process

In double-stage (two-stage) cluster sampling, a random sample of clusters is selected first.
Then, within each selected cluster, a random sample of elements is selected instead of all the elements, i.e., only a few elements within those selected clusters are chosen for data collection.
From here, the process is the same as above: the collected data is analyzed, and the results can further be used for various purposes.

Example

Suppose we want to measure customer satisfaction for a retail chain across a large region. We can use double-stage cluster sampling. Let us say our population is a geographic area with around 20 cities.

The first stage of sampling is to randomly select 5 of those cities.
The second stage is to further divide each selected city into zones/colonies and sample from them. If each city has 10 zones, we randomly select 5 zones from each city.
Now, within each selected zone, we collect data from customers to understand customer satisfaction.
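The two stages of this example can be sketched with the standard `random` module. The city and zone names below are made up, and the dictionary `region` is a hypothetical stand-in for a real list of cities and their zones.

```python
import random

rng = random.Random(3)

# Hypothetical region: 20 cities, each divided into 10 zones
region = {
    f"city_{c}": [f"city_{c}_zone_{z}" for z in range(1, 11)]
    for c in range(1, 21)
}

# Stage 1: randomly select 5 of the 20 cities
cities = rng.sample(sorted(region), k=5)

# Stage 2: within each selected city, randomly select 5 of its 10 zones
zones = {city: rng.sample(region[city], k=5) for city in cities}

for city, picked in zones.items():
    print(city, picked)
```

The result is 25 zones (5 cities times 5 zones) in which customers would then be surveyed, instead of the 200 zones in the whole region.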

Pros

This type of sampling is often more cost-effective than simple random sampling.
It often saves time in data collection.
It is a very efficient way to get a representative sample if the population is organized into clusters.
This is practical if we want data from a large population.

Cons

If the clusters are not properly defined, the sample will not represent the entire population, resulting in biased data.
This sampling method is not beneficial for small populations.
Two-stage cluster sampling may be more complex to design and implement than simple random sampling, which may lead to an increase in errors.

3. Multi-Stage Cluster Sampling

Multi-stage cluster sampling involves more than two stages of sampling and is also more complex. The process is as mentioned below.

Process

It starts with the selection of larger clusters, followed by the selection of smaller clusters within those and, in some cases, even smaller clusters within those.
This method is used when the population is organized hierarchically, and smaller clusters can be selected within larger clusters.

Example

Suppose the national government wants to assess the academic performance of students, so the population is the entire country.

The first stage is to divide the country into clusters such as states, and take a random sample of around 10 states.
The second stage is to sample the cities/districts within the selected states, say around 20 districts in total across those states.
The third stage is to divide each selected district into urban and rural areas and sample around 2 schools from each category.
The fourth stage is to select certain classes from those schools and assess their students.

In this way, we can assess the academic performance of students across the entire nation. This is only one example; there can be even more stages, which makes the sampling process more complex.
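A multi-stage design like this can be expressed as repeated sampling down a hierarchy. The sketch below is a simplified illustration, not the exact design described above: it assumes a made-up country with 8 states, 5 districts per state, 4 schools per district, and 6 classes per school, and it skips the urban/rural split. The helper `multistage_sample` and the nested dictionary `country` are hypothetical.

```python
import random

rng = random.Random(11)

# Hypothetical hierarchy: state -> district -> school -> list of class names
country = {
    f"state_{s}": {
        f"district_{s}_{d}": {
            f"school_{s}_{d}_{k}": [f"class_{g}" for g in range(1, 7)]
            for k in range(1, 5)
        }
        for d in range(1, 6)
    }
    for s in range(1, 9)
}


def multistage_sample(node, sizes, path=()):
    """Draw sizes[0] units at this level, then recurse into each drawn unit."""
    units = sorted(node) if isinstance(node, dict) else list(node)
    drawn = rng.sample(units, k=min(sizes[0], len(units)))
    if len(sizes) == 1:
        return [path + (unit,) for unit in drawn]
    result = []
    for unit in drawn:
        result.extend(multistage_sample(node[unit], sizes[1:], path + (unit,)))
    return result


# 3 states, then 2 districts per state, 2 schools per district, 1 class per school
sample = multistage_sample(country, sizes=[3, 2, 2, 1])
print(len(sample))  # 12
print(sample[0])
```

Each entry of `sample` is a (state, district, school, class) tuple, so the number of classes to assess is the product of the stage sizes rather than the size of the whole country.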

Pros

Multi-stage cluster sampling can be really efficient when data collection needs to be done across a highly diverse geographic region.
It may also reduce costs, because we sample at every stage and the number of units to survey decreases drastically.
Random selection of clusters helps keep the sample diverse and representative of the entire population.
It is very useful for dealing with hierarchical populations such as states, districts, schools, and classes.

Cons

As this sampling involves many stages, the sampling process may become more complex.
The method can be susceptible to bias if cluster selection at any stage is not done randomly or if there is a pattern in the population's distribution.
Data collection can be very time-consuming and requires extensive planning.

Implementation of Clustered Sampling in Python

Let us take schools as the clusters of our population and collect data about their students. We will now see how to perform single-stage cluster sampling in the Python programming language.

Load the necessary Libraries

Python3

import pandas as pd
import random

Import the required libraries; here they are 'pandas' and 'random'.

Set random seed

Python3

# Set the random seed so that the results are reproducible
random.seed(1)

Now, we set the random seed. This ensures that the same sequence of random numbers is generated every time we run the code with the same seed.
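To see what seeding does, here is a small check that is separate from the walkthrough. It uses two independent `random.Random` generators (so the global generator seeded above is left untouched); the names `rng_a` and `rng_b` are hypothetical.

```python
import random

# Two independent generators created with the same seed
rng_a = random.Random(1)
rng_b = random.Random(1)

draws_a = [rng_a.randint(50, 500) for _ in range(5)]
draws_b = [rng_b.randint(50, 500) for _ in range(5)]

# The same seed gives the same sequence of "random" numbers
print(draws_a == draws_b)  # True
```

A different seed would give a different sequence, so the seed should be recorded whenever a sample needs to be reproduced later.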

Create the custom Dataframe

Python3

# Create a DataFrame for the population
population_data = pd.DataFrame(
    {
        'school_id': list(range(1, 51)),
        'students_count': [random.randint(50, 500) for _ in range(50)]
    }
)

print(population_data.head())

Output:

   school_id  students_count
0          1             138
1          2             237
2          3             330
3          4             409
4          5             447

Create a DataFrame for schools. Here we create a simulated population and store it as a DataFrame. Each row represents a school (which is the group/cluster here), and each school is given its own randomly generated student count.

The 50 schools are created using the range function, so each school has a unique ID from 1 to 50. Each school is assigned a random student count between 50 and 500 (inclusive) using the random.randint() function.

Select 10 random samples of schools

Python3

# Select a random sample of 10 schools
selected_clusters = random.sample(population_data['school_id'].tolist(), k=10)

print(selected_clusters)

Output:

[30, 39, 2, 15, 41, 12, 36, 38, 45, 6]

Now, we randomly select some clusters (schools) from the 50 schools, using the school_id column of the population_data DataFrame. The tolist() function converts the column to a list, and the sample() function from the random library picks 10 schools from it without replacement; the result is stored in a list called selected_clusters. Here, k=10 is the parameter of sample() that sets how many schools are selected out of the 50.

Select all students from the randomly selected Schools

Python3

# Select all students from the selected schools
sampled_data = population_data[population_data['school_id'].isin(selected_clusters)]

After selecting the clusters, we keep the rows of the population data that belong to the selected schools. Because each row stands for a whole school, keeping a row corresponds to taking all the students of that school. This step varies with the type of cluster sampling: in single-stage sampling we take all the members of each selected cluster, while in double-stage sampling we draw a further sample within each selected cluster. In the code snippet, isin() keeps only the schools whose school_id appears in selected_clusters.
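To make the difference between the single-stage and double-stage variants concrete, here is a separate sketch that works on student-level rows. The table `students` (6 schools with 8 students each and simulated scores) is hypothetical and is not part of the walkthrough data; the second stage uses pandas' `groupby(...).sample(...)`, available in pandas 1.1 and later.

```python
import random

import pandas as pd

random.seed(1)

# Hypothetical student-level table: 6 schools with 8 students each
students = pd.DataFrame(
    [
        {"school_id": school, "student_id": f"{school}-{i}", "score": random.randint(40, 100)}
        for school in range(1, 7)
        for i in range(1, 9)
    ]
)

# Stage 1: randomly select 3 schools (clusters)
selected_schools = random.sample(sorted(students["school_id"].unique().tolist()), k=3)
in_selected = students[students["school_id"].isin(selected_schools)]

# Single-stage sampling would stop here and keep every student in those schools.
# Stage 2: keep a random sample of 4 students from each selected school
two_stage = in_selected.groupby("school_id").sample(n=4, random_state=1)

print(len(in_selected), len(two_stage))  # 24 12
```

The single-stage sample contains all 24 students of the 3 selected schools, while the double-stage sample keeps 4 students per school, 12 in total.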

Print the data of students

Python3

# Print the data of the sampled schools
print(sampled_data)

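Output (the rows depend on which schools were drawn; the school_id values below do not match the list printed earlier, so this listing evidently comes from a separate run):

    school_id  students_count
0           1             138
8           9              94
17         18             251
24         25             207
32         33             137
33         34             136
35         36             166
38         39             152
42         43             168
47         48             345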

Finally, now that we have the data, we can perform our analysis on the sampled data, such as calculating statistics or drawing conclusions based on the selected clusters and their students.
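As one possible analysis, the sketch below rebuilds the simulated population from the walkthrough and estimates the total number of students in all 50 schools from the 10 sampled schools. It uses the usual expansion estimator for equal-probability cluster samples, (N / n) times the sum of the sampled cluster totals; the variable names `n_clusters`, `n_sampled`, and `estimated_total` are hypothetical.

```python
import random

import pandas as pd

random.seed(1)

# Rebuild the simulated population from the walkthrough: 50 schools
population_data = pd.DataFrame({
    "school_id": list(range(1, 51)),
    "students_count": [random.randint(50, 500) for _ in range(50)],
})

# Single-stage cluster sample: 10 schools, keeping every student in each
selected_clusters = random.sample(population_data["school_id"].tolist(), k=10)
sampled_data = population_data[population_data["school_id"].isin(selected_clusters)]

n_clusters = len(population_data)  # N = 50 schools in the population
n_sampled = len(sampled_data)      # n = 10 schools in the sample

# Mean school size in the sample, and the expansion estimate of the total:
# (N / n) * sum of the sampled cluster totals
mean_students = sampled_data["students_count"].mean()
estimated_total = n_clusters / n_sampled * sampled_data["students_count"].sum()
true_total = population_data["students_count"].sum()

print(f"mean students per sampled school: {mean_students:.1f}")
print(f"estimated total students: {estimated_total:.0f} (true total: {true_total})")
```

Because the population here is simulated, the true total is known and can be compared with the estimate; in a real survey only the estimate would be available.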

Pros and Cons

Pros of Cluster Sampling:

Cheap: This method is often cheaper than other sampling methods, such as simple random sampling or stratified sampling, because data is collected only within the selected clusters, which reduces the travel and effort needed to reach individuals spread across the whole population.
Practical: It remains feasible when we cannot survey each individual in a population, because clusters/groups are easier to identify and access.
Increased Efficiency: This method increases efficiency in data collection when the clusters are naturally occurring groups (for example, households, schools, geographic regions) that are easier to sample together.

Cons of Cluster Sampling:

Less Precise: Because samples are collected only from the selected clusters, this method may in some cases give less precise results than other sampling methods such as simple random sampling.
Complex Analysis: Analyzing clustered data can be more difficult and complex. We need to account for the clustering effect in the analysis, which may require specialized statistical methods such as multilevel modeling.
Risk of Bias: If the clusters are not a good representation of the entire population or are not evenly distributed, the results may be biased or wrong.

Conclusion

Cluster sampling is very useful in cases where it is not possible to sample individual elements of a population, i.e., to pick elements one by one from a big population. However, it also has a few disadvantages, such as less precise results and a more complex analysis process. Hence, before using this sampling technique, you must consider both its advantages and disadvantages.

FAQs on Clustered Sampling

1. What is cluster sampling?

Cluster sampling is a type of sampling in which an entire population is divided into clusters or groups. Clusters are then selected at random for data collection, instead of selecting individual elements as in simple random sampling.

2. Why is cluster sampling used?

Cluster sampling is used to save time and resources when it is not possible to collect samples one by one from the population. It is also useful when the population contains natural groups that are easy to identify.

3. What is an example of cluster sampling in real life?

A common example of clusters is schools. The schools are the clusters, choosing two or three schools at random is the sampling, and collecting data from all the students in the selected schools is the data collection.
