
Cluster Sampling

Sampling is a technique widely used in data analysis and research, in which we select a small part of a population in order to gain insights and draw conclusions about the whole population. Sampling can be done in many ways, and one common type is clustered sampling. In this article, we look at cluster sampling and its implementation in Python.

What is Clustered Sampling?

Clustered sampling is a type of sampling in which the entire population is first divided into clusters or groups. Then one or more clusters are selected at random, and data is collected only from the selected clusters instead of from individuals spread across the entire population. Cluster sampling is most often used when it is not practical to draw a sample of individuals directly from the entire population.



A few examples of clusters that already exist naturally are schools, cities, households, and geographic regions.

This type of sampling is useful when the population is large or when its elements fall into natural groupings such as those mentioned above.



Steps to Perform Clustered Sampling

The steps to perform simple clustered sampling are as follows:

Step 1: Define the Population

First, we need to clearly define the population we want to study. This can be a geographical area, an organization, or any other group of interest.

Population

Step 2: Create Groups/Clusters

Now, we divide the population into clusters or groups that do not overlap, so that every element belongs to exactly one cluster. Ideally, the clusters are internally diverse (heterogeneous) and similar to one another, so that each cluster is a small-scale representation of the entire population. There are also naturally occurring clusters such as schools and cities.

Create groups/clusters

Step 3: Randomly Select Clusters

Because the clusters are similar to each other, we can now apply random sampling to the clusters themselves, i.e., select a random sample of clusters from all the clusters formed. It is important that each cluster has a known and equal chance of being selected.

Randomly select clusters

Step 4: List Elements from Selected Clusters

Within each selected cluster, list all the elements. For example, if a selected cluster is the 8th-grade class in one school, we need to list all the students in that class. This list makes the next step, data collection, easier to organize.

Step 5: Collect Data

Collect data from every individual in the list we made. The data collection can be done in various ways like surveys, interviews, observations, or any other method according to the type of population and our topic of interest.

data collection from clusters

Step 6: Analyze the Data

The final step is to analyze the collected data and draw conclusions about the population. This can be done with various data analysis techniques, and decisions can then be made based on the results.

data analysis from collected data

Types of Clustered Sampling

Each type of cluster sampling has its own advantages and disadvantages. The three main types of clustered sampling are:

1. Single-Stage Cluster Sampling

The process for this type of cluster sampling is the same as the steps given above.

Process

single-stage cluster sampling

Example

Suppose our population is a city with around 100 houses, and we want to find the average household income. Instead of surveying every house, we use single-stage cluster sampling.
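The idea can be sketched in a few lines of Python. This is only an illustrative sketch: the grouping of the 100 houses into 10 streets of 10 houses each, the income range, the seed, and the choice of 3 clusters are all assumptions made for the example, not figures from a real survey.

```python
import random
import statistics

random.seed(0)  # assumed seed, used only to make the sketch reproducible

# Hypothetical population: 100 houses grouped into 10 streets (clusters)
# of 10 houses each. Incomes are simulated values.
streets = {
    street_id: [random.randint(20_000, 90_000) for _ in range(10)]
    for street_id in range(1, 11)
}

# Single stage: pick 3 streets at random, each with an equal chance.
chosen_streets = random.sample(sorted(streets), k=3)

# Survey every house in the chosen streets.
surveyed_incomes = [
    income for street_id in chosen_streets for income in streets[street_id]
]

# Estimate the average household income from the surveyed houses and
# compare it with the true average of the simulated population.
estimated_mean = statistics.mean(surveyed_incomes)
true_mean = statistics.mean(
    [income for incomes in streets.values() for income in incomes]
)

print(f"Chosen streets: {chosen_streets}")
print(f"Estimated mean income: {estimated_mean:.0f}")
print(f"True mean income:      {true_mean:.0f}")
```

Because all the streets here have the same number of houses, the plain mean of the surveyed incomes is a reasonable estimate; with clusters of unequal size, a weighted estimate would be more appropriate.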

Pros:

Cons:

2. Double-Stage Cluster Sampling

The process for this type of cluster sampling differs at step 4: instead of surveying every element, we draw a random sample of elements within each selected cluster. This form of cluster sampling is more commonly used and can be more efficient than single-stage sampling. Let us see the process in detail.

Process

double-stage cluster sampling

Example

Suppose we want to measure customer satisfaction for a retail chain across a large region, and our population is a geographic area with around 20 cities. We use double-stage cluster sampling.
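A minimal sketch of this two-stage idea is shown below. The number of customers per city, the 1 to 5 rating scale, the seed, and the sample sizes (5 cities, 30 customers per city) are hypothetical values chosen for illustration.

```python
import random
import statistics

random.seed(0)  # assumed seed, used only to make the sketch reproducible

# Hypothetical population: 20 cities, each with 200 customers who rate
# their satisfaction on a 1-5 scale (simulated values).
cities = {
    f"city_{i}": [random.randint(1, 5) for _ in range(200)]
    for i in range(1, 21)
}

# Stage 1: randomly select 5 of the 20 cities (clusters).
chosen_cities = random.sample(sorted(cities), k=5)

# Stage 2: within each chosen city, randomly select 30 customers
# instead of surveying all 200.
two_stage_sample = {
    city: random.sample(cities[city], k=30) for city in chosen_cities
}

# Pool the sampled ratings and estimate the average satisfaction.
all_scores = [s for scores in two_stage_sample.values() for s in scores]
estimated_satisfaction = statistics.mean(all_scores)

print(f"Chosen cities: {chosen_cities}")
print(f"Customers surveyed: {len(all_scores)}")
print(f"Estimated satisfaction: {estimated_satisfaction:.2f}")
```

Only 150 of the 4,000 simulated customers are surveyed, which is where the efficiency gain over single-stage sampling comes from.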

Pros

Cons

3. Multi-Stage Cluster Sampling

Multi-stage cluster sampling involves more than two stages of sampling and is therefore more complex. The process is described below.

Process

multi-stage cluster sampling

Example

Suppose the national government wants to assess the academic performance of students. The population is then the entire country.

In this way, we can assess the academic performance of students across the entire nation. This is only one example; adding more stages makes the sampling process even more complex.
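One way to picture multi-stage sampling is as repeated random selection through nested clusters. The sketch below assumes a hypothetical hierarchy of states, districts, schools, and students; the hierarchy, the sizes, the score range, and the seed are all illustrative assumptions.

```python
import random
import statistics

random.seed(0)  # assumed seed, used only to make the sketch reproducible


def build_country(n_states=6, n_districts=5, n_schools=8, n_students=40):
    """Simulate nested clusters: state -> district -> school -> scores."""
    return {
        f"state_{s}": {
            f"district_{d}": {
                f"school_{k}": [
                    random.randint(35, 100) for _ in range(n_students)
                ]
                for k in range(n_schools)
            }
            for d in range(n_districts)
        }
        for s in range(n_states)
    }


country = build_country()

sampled_scores = []
# Stage 1: randomly select 2 states.
for state in random.sample(sorted(country), k=2):
    districts = country[state]
    # Stage 2: randomly select 2 districts within each chosen state.
    for district in random.sample(sorted(districts), k=2):
        schools = districts[district]
        # Stage 3: randomly select 3 schools within each chosen district.
        for school in random.sample(sorted(schools), k=3):
            # Stage 4: randomly select 10 students within each chosen school.
            sampled_scores.extend(random.sample(schools[school], k=10))

estimated_avg_score = statistics.mean(sampled_scores)
print(f"Students sampled: {len(sampled_scores)}")
print(f"Estimated average score: {estimated_avg_score:.1f}")
```

Each extra stage narrows the sample further, which reduces fieldwork but adds another source of sampling variability.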

Pros

Cons

Implementation of Clustered Sampling in Python

Let us take schools as the population and collect data from their students. We will now see how to perform single-stage cluster sampling in Python.

Load the necessary Libraries




import pandas as pd
import random

Import the required libraries, here ‘pandas’ and ‘random’.

Set random seed




#setting the random seed
random.seed(1)

Now, we set the random seed. This ensures that the same random numbers are generated every time we run the code with the same seed.

Create the custom Dataframe




# create a dataframe for population
population_data = pd.DataFrame(
    {
        'school_id': list(range(1, 51)),
        'students_count': [random.randint(50, 500) for _ in range(50)]
    }
)
print(population_data.head())

Output:

   school_id  students_count
0          1             138
1          2             237
2          3             330
3          4             409
4          5             447

Create a data frame for schools. Here we create a simulated population and store it as a DataFrame. Each row represents a school (the group/cluster here), and each school is assigned its own number of students.

The 50 schools are created with the range function, each with a unique id from 1 to 50. Each school is assigned a random student count between 50 and 500 (both inclusive) using the random.randint() function.

Select 10 random samples of schools




#select 10 random samples of schools
selected_clusters = random.sample(population_data['school_id'].tolist(), k=10)
print(selected_clusters)

Output:

[30, 39, 2, 15, 41, 12, 36, 38, 45, 6]

Now, we randomly select some clusters (schools) from the 50 schools using the school_id column of the dataframe population_data. The tolist() function converts that column to a Python list, and the sample() function from the random library picks 10 schools from it without replacement. The result is stored in a list called selected_clusters. Here, k=10 is the parameter of the sample function that specifies how many of the 50 schools to select.

Select all students from the randomly selected Schools




#select all students from the selected schools
sampled_data = population_data[population_data['school_id'].isin(selected_clusters)]

After selecting the clusters, we select all the students within those clusters from the population data. This step varies with the type of cluster sampling: in single-stage sampling we take every member of each selected cluster, while in double-stage sampling we draw a further sample within each selected cluster. Because our simulated dataset has one row per school, the code snippet keeps exactly the rows whose school_id appears in selected_clusters, and each row stands for all the students of that school.

Print the data of students




#print the data of students
print(sampled_data)

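Output (the exact rows depend on the random draw; they are always the rows whose school_id is in selected_clusters):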

    school_id  students_count
0           1             138
8           9              94
17         18             251
24         25             207
32         33             137
33         34             136
35         36             166
38         39             152
42         43             168
47         48             345

Finally, now that we have the data, we can perform our analysis on the sampled data, such as calculating statistics or drawing conclusions based on the selected clusters and their students.
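As one possible analysis, the sketch below estimates the total number of students across all 50 schools from the 10 sampled schools. It rebuilds the same simulated data so that it runs on its own. The estimator used (number of clusters in the population multiplied by the mean cluster total in the sample) is a standard choice for a simple random sample of clusters, but it is an addition for illustration and not part of the walkthrough above.

```python
import random

import pandas as pd

random.seed(1)

# Rebuild the simulated population and the single-stage cluster sample.
population_data = pd.DataFrame(
    {
        "school_id": list(range(1, 51)),
        "students_count": [random.randint(50, 500) for _ in range(50)],
    }
)
selected_clusters = random.sample(population_data["school_id"].tolist(), k=10)
sampled_data = population_data[
    population_data["school_id"].isin(selected_clusters)
]

n_total_schools = len(population_data)  # N: clusters in the population
n_sampled_schools = len(sampled_data)   # n: clusters in the sample

# Estimated total = N * (mean number of students per sampled school).
mean_students_per_school = sampled_data["students_count"].mean()
estimated_total_students = n_total_schools * mean_students_per_school

# The true total is known here only because the population is simulated.
actual_total_students = population_data["students_count"].sum()

print(f"Sampled schools: {sorted(selected_clusters)}")
print(f"Estimated total students: {estimated_total_students:.0f}")
print(f"Actual total students:    {actual_total_students}")
```

Comparing the estimate with the actual total gives a feel for how much a cluster-based estimate can deviate when only 10 of 50 clusters are observed.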

Pros and Cons

Pros of Cluster Sampling:

  1. Cheap: This method is usually cheaper than other sampling methods, such as simple random sampling or stratified sampling, because data is collected only within the selected clusters, which reduces the effort needed to reach individuals scattered across the whole population.
  2. Practical: It remains feasible when we cannot list or reach each individual in a population, because clusters/groups are easier to identify and access.
  3. Increased Efficiency: Data collection is more efficient when the clusters are naturally occurring groups (for example, households, schools, geographic regions) that are easy to sample together.

Cons of Cluster Sampling:

  1. Less Precise: Because elements within the same cluster tend to resemble each other, a cluster sample may give less precise results than other sampling methods, such as simple random sampling, for the same number of elements.
  2. Complex Analysis: Analyzing clustered data can be more difficult. The analysis needs to account for the clustering effect, which can require specialized statistical methods such as multilevel modeling.
  3. Risk of Bias: If the selected clusters do not represent the entire population well, or the population is not evenly distributed across clusters, the results may be biased or wrong.
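The loss of precision can be quantified with the design effect. The sketch below uses the common approximation DEFF = 1 + (m - 1) * ICC, where m is the average cluster size and ICC is the intraclass correlation (how similar elements within a cluster are). The formula assumes roughly equal cluster sizes, and the numbers plugged in are hypothetical.

```python
def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Approximate design effect for clusters of roughly equal size."""
    return 1 + (avg_cluster_size - 1) * icc


def effective_sample_size(n: int, avg_cluster_size: float, icc: float) -> float:
    """Size of a simple random sample with about the same precision."""
    return n / design_effect(avg_cluster_size, icc)


# Hypothetical survey: 300 students from 10 classes of 30, with ICC = 0.05.
deff = design_effect(avg_cluster_size=30, icc=0.05)
n_eff = effective_sample_size(n=300, avg_cluster_size=30, icc=0.05)

print(f"Design effect: {deff:.2f}")           # roughly 2.45
print(f"Effective sample size: {n_eff:.0f}")  # roughly 122
```

Under these assumed values, 300 clustered observations carry about as much information as 122 independent ones, which is why the clustering effect has to be accounted for in the analysis.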

Conclusion

Cluster sampling is very useful when it is not feasible to sample individual elements one by one from a large population. However, it also has a few disadvantages, such as less precise results and a more complex analysis process. Hence, before using this sampling technique, you must weigh both its advantages and its disadvantages.

FAQs on Clustered Sampling

1. What is cluster sampling?

Cluster sampling is a type of sampling where the entire population is divided into clusters or groups. One or more clusters are then selected at random for data collection, instead of selecting individual elements as in simple random sampling.

2. Why is cluster sampling used?

Cluster sampling is used to save time and resources when it is not feasible to sample individuals one by one from the population. It is also useful when the population contains natural groups that are easy to identify.

3. What is an example of cluster sampling in real life?

A common example of clusters is schools. Each school is a cluster: choosing two or three schools at random is the sampling step, and collecting data from all the students in those selected schools is the data collection step.

