Systematic Sampling in R

Last Updated : 21 Mar, 2024

Sampling is a method used in research to gather information about a population by selecting a subset, or sample, of individuals or items from that population. Instead of studying every single member of the population, researchers collect data from a smaller group that represents the whole. Sampling is used across many fields to estimate population characteristics when measuring every member would be too slow or too costly.

What is Systematic Sampling?

Systematic sampling is a statistical sampling method in which elements are selected from a larger population at a fixed interval. After a random start, every kth element is selected from a list, where k is the sampling interval.

For example, if a teacher wanted to sample 100 students from a school with 1000 students using systematic sampling, the sampling interval would be 1000/100 = 10. The teacher would pick a random starting student among the first 10 on a list sorted by, say, student ID number, and then select every 10th student after that.

Step 1: Determine the Population Size (N)

Identify the total number of elements in the population that we want to sample from.

Step 2: Calculate the Sampling Interval (k)

Decide on the sampling interval, which represents the gap between selected elements. The sampling interval (k) is calculated as N/n, where n is the sample size, that is, the number of elements we want to sample. If N/n is not a whole number, k is rounded to an integer; the examples below round up.

k = N / n

Step 3: Random Start

Choose a random starting point between 1 and k. This starting point determines which element will be the first in the sample.

Step 4: Select Systematic Sample

Start from the randomly chosen point and select every kth element until you reach the end of the population. Systematic sampling is easy to implement and, in certain situations, more efficient than simple random sampling.
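
The four steps above can be sketched in R as follows. This is a minimal illustration rather than a definitive implementation: the function name `systematic_indices` is hypothetical, and it assumes N is an exact multiple of the sample size so that every start yields exactly n elements. The numbers come from the teacher example (100 students out of 1000).

```r
# Hypothetical helper: returns the positions selected by systematic sampling.
# Assumption: N is an exact multiple of n (kept simple for illustration).
systematic_indices <- function(N, n) {
  stopifnot(N %% n == 0)
  k <- N / n                                   # Step 2: sampling interval
  start <- sample.int(k, 1)                    # Step 3: random start between 1 and k
  seq(from = start, by = k, length.out = n)    # Step 4: every kth element
}

# Teacher example: N = 1000 students, n = 100, so k = 10
set.seed(1)                                    # arbitrary seed, for reproducibility only
idx <- systematic_indices(N = 1000, n = 100)

length(idx)         # 100 positions are selected
unique(diff(idx))   # consecutive positions are always 10 apart
```

The first selected position depends on the seed, but the spacing between positions is always k.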

Systematic Sampling in R

Systematic sampling selects elements from a larger population at regular intervals. In the R programming language, it can be implemented with the ‘seq()’ function, which generates the regularly spaced positions to select.

R
# Example data: population
population <- 1:100  # Assuming 100 individuals in the population

# Sample size
sample_size <- 10  # Desired sample size

# Calculate sampling interval
interval <- ceiling(length(population) / sample_size)

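# Generate the positions to select: begin at 1 and step by the interval.
# The start is fixed at 1 here to keep the output predictable;
# Step 3 above calls for a random start between 1 and the interval.
indices <- seq(from = 1, by = interval, length.out = sample_size)

# Select the sample using systematic sampling
systematic_sample <- population[indices]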

# Print the systematic sample
print(systematic_sample)

Output:

 [1]  1 11 21 31 41 51 61 71 81 91

The population consists of the numbers from 1 to 100.

  • We want to obtain a systematic sample of size 10.
  • We calculate the sampling interval by dividing the population size by the sample size and rounding up to the nearest integer, which gives 10.
  • We use the seq() function to generate the positions to select, beginning at position 1 and stepping by the interval.
  • Finally, we index the population with these positions to obtain the systematic sample.

The output [1] 1 11 21 31 41 51 61 71 81 91 represents the systematic sample obtained from the population using the specified parameters.

  • 1 is the first element of the systematic sample. It corresponds to the first element of the population.
  • 11 is the second element of the systematic sample. It corresponds to the element in the population that is 10 positions away from the first element.
  • Similarly, 21, 31, 41, 51, 61, 71, 81, and 91 are the subsequent elements of the systematic sample, each obtained by adding the sampling interval to the previous selected element.
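
The example above always begins at position 1, which keeps the output easy to follow but skips the random start described in Step 3. A variant with a random start might look like the sketch below. The seed value is arbitrary, and the exact sample printed depends on it.

```r
population <- 1:100
sample_size <- 10

# N is a multiple of the sample size here, so integer division gives the interval
interval <- length(population) %/% sample_size    # 10

set.seed(42)                                # arbitrary seed, for reproducibility only
random_start <- sample.int(interval, 1)     # random start between 1 and interval

# Positions: random_start, random_start + 10, ..., random_start + 90
indices <- seq(from = random_start, by = interval, length.out = sample_size)
systematic_sample <- population[indices]

print(systematic_sample)
```

Whatever start is drawn, the sample still has 10 elements spaced 10 apart; only the offset changes.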

Systematic Sampling on mtcars dataset

Here we use the built-in ‘mtcars’ dataset in R, which contains 11 measurements (such as miles per gallon, number of cylinders, and horsepower) for 32 car models.

R
# Load the mtcars dataset
data(mtcars)

# Display the first few rows of the mtcars dataset
head(mtcars)

# Define the sample size and calculate the sampling interval
sample_size <- 5
sampling_interval <- ceiling(nrow(mtcars) / sample_size)

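# Randomly select a starting point between 1 and the sampling interval
# (no seed is set, so the start and the resulting sample change between runs)
random_start <- sample(1:sampling_interval, 1)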

# Perform systematic sampling
systematic_sample_indices <- seq(from = random_start, to = nrow(mtcars), 
                                 by = sampling_interval)
systematic_sample <- mtcars[systematic_sample_indices, ]

# Display the systematic sample
print(systematic_sample)

Output:

# First six rows of mtcars
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Systematic sample
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Merc 450SL     17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2

First, we load the mtcars dataset.

  • We define the sample size (sample_size) and calculate the sampling interval (sampling_interval) from the total number of rows in the mtcars dataset: ceiling(32 / 5) = 7.
  • We randomly select a starting point (random_start) within the first sampling_interval observations.
  • Using the seq() function, we generate a sequence of indices for systematic sampling, starting from the random start point and incrementing by the sampling interval until the last row is reached.
  • We extract the systematic sample from the mtcars dataset using the calculated indices.

In the run shown above, the random start was 6, so rows 6, 13, 20, and 27 were selected. The sample has 4 rows rather than the requested 5 because the next index, 34, lies beyond the 32 rows of the dataset.
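
How the number of selected rows varies with the random start can be checked directly. The sketch below enumerates every possible start for the mtcars example, using only base R and the same interval calculation as the code above; the variable names are introduced here for illustration.

```r
N <- nrow(mtcars)                 # 32 rows
sample_size <- 5
k <- ceiling(N / sample_size)     # sampling interval: 7

# For each possible start 1..k, count how many indices fall within 1..N
sizes <- sapply(seq_len(k), function(start) {
  length(seq(from = start, to = N, by = k))
})
names(sizes) <- seq_len(k)

print(sizes)
# Starts 1 to 4 give 5 rows; starts 5 to 7 give only 4 rows
```

Because 32 is not a multiple of 5, rounding the interval up means that some starts run past the last row before a fifth element is reached.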
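
Systematic Sampling on a Simulated Data Frame

Here we simulate a data frame of 500 students with random names and exam scores, and draw a systematic sample of 10 students from it.

R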
# Set seed for reproducibility
set.seed(123)

# Create simple function to generate random first names
randomNames <- function(n = 5000) {
  do.call(paste0, replicate(5, sample(letters, n, TRUE), FALSE))
}

# Create data frame
students <- data.frame(first_name = randomNames(500),
                       exam_score = round(rnorm(500, mean = 75, sd = 5), 1))

# View first six rows of data frame
head(students)

# Define function to obtain systematic sample
obtain_sys <- function(N, n) {
  k <- ceiling(N / n)
  r <- sample(1:k, 1)
  seq(r, r + k * (n - 1), k)
}

# Obtain systematic sample of size 10
sys_sample_students <- students[obtain_sys(nrow(students), 10), ]

# View first six rows of systematic sample
head(sys_sample_students)

# View dimensions of systematic sample
dim(sys_sample_students)

Output:

# View first six rows of data frame
  first_name exam_score
1      onxel       82.9
2      snmqc       74.7
3      nfwxh       78.1
4      crlpw       67.3
5      jbhim       74.4
6      rxfkp       74.9

# View first six rows of systematic sample
    first_name exam_score
40       jgozy       71.0
90       pygyq       74.2
140      unoio       71.0
190      mzgwl       70.1
240      iucgb       76.0
290      mvngl       74.8

# View dimensions of systematic sample
[1] 10  2

First, we set a seed for reproducibility.

  • We create a function ‘randomNames’ that generates random first names, each consisting of five lowercase letters.
  • We create a data frame named ‘students’ containing 500 rows of randomly generated first names and corresponding exam scores, drawn from a normal distribution with mean 75 and standard deviation 5.
  • The ‘head()’ function displays the first six rows of the ‘students’ data frame.
  • The function ‘obtain_sys’ takes a population size N and a desired sample size n, computes the interval k, picks a random start r between 1 and k, and returns the indices r, r + k, ..., r + k(n - 1).
  • A systematic sample of size 10 is obtained from the ‘students’ data frame using ‘obtain_sys’. Here k = 500/10 = 50, and the output shows a random start of 40, so rows 40, 90, 140, and so on are selected.
  • The ‘head()’ function displays the first six rows of the systematic sample.
  • The ‘dim()’ function confirms that the sample has 10 rows and 2 columns.
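
‘obtain_sys’ works cleanly here because 500 is a multiple of 10. When N is not a multiple of n, the last index r + k(n - 1) can exceed N, and indexing a data frame with it returns rows of NA. One possible guard, sketched below under the hypothetical name `obtain_sys_safe`, is to drop indices beyond N, at the cost of sometimes returning fewer than n rows.

```r
# Hypothetical variant of obtain_sys that never returns an index beyond N
obtain_sys_safe <- function(N, n) {
  k <- ceiling(N / n)                            # sampling interval, rounded up
  r <- sample.int(k, 1)                          # random start between 1 and k
  idx <- seq(from = r, by = k, length.out = n)   # r, r + k, ..., r + k * (n - 1)
  idx[idx <= N]                                  # drop positions past the end
}

set.seed(123)                                    # arbitrary seed
idx <- obtain_sys_safe(N = 32, n = 5)

all(idx <= 32)   # TRUE: every index refers to an existing row
length(idx)      # 4 or 5, depending on the random start
```

Whether dropping indices is acceptable depends on the study; it is shown here only as one simple way to avoid NA rows.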

Uses of Systematic Sampling

  1. Large Populations: When dealing with a large population, it can be challenging and expensive to draw a simple random sample. Systematic sampling provides a more practical way to obtain a representative sample by selecting every kth element.
  2. Efficiency: Systematic sampling often requires less effort than simple random sampling, because only one random number (the start) has to be drawn and the remaining elements follow from the interval. This makes it a suitable choice when time and budget are tight.
  3. Homogeneous Population: If the population is relatively homogeneous and there is no significant order or pattern in the data, systematic sampling can give representative results.
  4. Regular Data Collection: In situations where data is collected at regular intervals, systematic sampling can align with the natural order of the data collection process. This can simplify the sampling procedure and make it more practical.

Limitations of Systematic Sampling

  1. Bias Risk: Systematic sampling may introduce bias if there’s a hidden pattern or periodicity in the population aligned with the sampling interval.
  2. Skewed Representation: It can lead to skewed representation if the sampling interval coincides with certain characteristics, causing under- or overrepresentation.
  3. Dependency on Ordering: The effectiveness relies on the order of elements; specific arrangements may affect representativeness.
  4. Sensitivity to Outliers: Outliers can have a significant impact, especially if they are consistently spaced based on the sampling interval.
  5. Inapplicability for Unordered Populations: Not suitable for populations without a clear order or listing.
  6. Complexity in Unequal Probability: Adjusting for unequal probabilities can add complexity, potentially negating the simplicity of systematic sampling.
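
The periodicity risk described in the first two points can be made concrete with a small simulation. The data below are invented for illustration: a hypothetical series of 28 days of sales with a peak on days 6 and 7 of every week. A sampling interval of 7 then picks the same day of the week every time.

```r
# Hypothetical daily sales over 4 weeks: 100 on days 1-5, 200 on days 6 and 7
day_of_week <- rep(1:7, times = 4)
sales <- ifelse(day_of_week >= 6, 200, 100)

mean(sales)   # population mean: 900 / 7, about 128.57

k <- 7        # interval equal to the period of the pattern

# Start at day 1: every selected day is a day-1 value
weekday_sample <- sales[seq(from = 1, to = length(sales), by = k)]

# Start at day 6: every selected day is a day-6 value
weekend_sample <- sales[seq(from = 6, to = length(sales), by = k)]

mean(weekday_sample)   # 100, underestimates the population mean
mean(weekend_sample)   # 200, overestimates the population mean
```

No start gives an estimate close to the population mean here, because every possible sample contains only one day of the week.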

Conclusion

In summary, systematic sampling in R is a straightforward and efficient method suitable for ordered populations. It’s easy to implement and resource-efficient for large datasets. However, caution is needed to avoid biases caused by hidden patterns aligned with the sampling interval. While offering simplicity and practicality, systematic sampling may not be ideal for all scenarios, and researchers should be mindful of its limitations.


