Take random sample based on groups in R

Last Updated : 18 Jul, 2021

R provides several packages for drawing random samples from data frames or data tables on a per-group basis, so that a fixed number of rows is taken from each group.

Method 1: Using plyr library

The “plyr” package, once installed and loaded into the working space, is used for data manipulation and statistics. Its ddply() function applies a function to each subset of the specified data frame and then combines the results into a single data frame.

Syntax:

ddply( .data, .variables, .fun = NULL)

Parameters –

.data – the data frame to use

.variables – the variables to group by

.fun – the function to apply to each group. In this case, function(x) x[sample(nrow(x), y), ] is used: sample(nrow(x), y) draws y row indices at random, without replacement, from the group x, and the indexing returns those y rows. The groups are defined by the variables given as the second argument of ddply().
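To make the role of sample(nrow(x), y) concrete, here is a minimal base R sketch on a single group. The data frame `group` and the names `idx` and `picked` are illustrative assumptions, not part of the article's example.

```r
# Hypothetical single group with 6 rows (names are illustrative only)
group <- data.frame(col1 = rep("G1", 6), col2 = letters[1:6])

set.seed(1)                      # fix the random stream so the draw is repeatable
idx <- sample(nrow(group), 3)    # 3 distinct row numbers drawn from 1..6
picked <- group[idx, ]           # keep only the drawn rows

print(idx)
print(picked)
```

Because sample() draws without replacement by default, no row of the group can appear twice in the result.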

Example:

R

# importing required libraries 
library("plyr") 
  
# create dataframe 
data_frame<-data.frame(col1=c(rep('G1',50),rep('G2',50),rep('G3',50)),  
                col2=rep(letters[1:5],30) 
                ) 
  
print("Original DataFrame") 
head(data_frame) 
  
# pick 5 samples from each group of the data frame 
data_mod <- ddply(data_frame,.(col1),function(x) x[sample(nrow(x),5),]) 
print("Modified DataFrame") 
print (data_mod)

Output

[1] "Original DataFrame" 
  col1 col2 
1   G1    a 
2   G1    b 
3   G1    c 
4   G1    d 
5   G1    e 
6   G1    a 
[1] "Modified DataFrame" 
   col1 col2 
1    G1    d 
2    G1    e 
3    G1    d 
4    G1    a 
5    G1    a 
6    G2    b 
7    G2    c 
8    G2    d 
9    G2    d 
10   G2    e 
11   G3    c 
12   G3    e 
13   G3    b 
14   G3    b 
15   G3    d
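The example above works because every group has at least 5 rows. sample() raises an error when asked for more rows than a group contains (with the default replace = FALSE). One possible way to guard against this, sketched here with hypothetical data and a hypothetical `n_per_group` variable, is to cap the sample size at the group size:

```r
library(plyr)

# Hypothetical data with unequal group sizes: G1 has 10 rows, G2 has only 2
df <- data.frame(col1 = c(rep("G1", 10), rep("G2", 2)),
                 col2 = letters[1:12])

n_per_group <- 5   # requested sample size (illustrative value)

set.seed(42)       # makes the random draw repeatable
capped <- ddply(df, .(col1), function(x) {
  k <- min(n_per_group, nrow(x))          # never ask for more rows than exist
  x[sample(nrow(x), k), , drop = FALSE]
})

print(table(capped$col1))
```

With this cap, small groups are returned whole instead of stopping the whole call with an error.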

Method 2: Using dplyr library

The “dplyr” package, once installed and loaded into the working space, is used for data manipulation. It offers a wide range of functions to filter, subset, and extract data based on constraints and conditions. Several operations can be chained on a data frame using the pipe operator (%>%).

The group_by() function divides the data into groups based on the values in the specified columns. The columns to group by are passed as arguments of this function, and more than one column name may be given.

Syntax:

group_by(col1, col2, …)

Next, the sample_n() function selects random rows from the grouped data frame; its argument gives the number of rows to sample from each group.
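Since group_by() accepts several columns, the same pattern can sample within every combination of values. The following is a small sketch under assumed data; the data frame `df` and its `rep` column are made up for illustration.

```r
library(dplyr)

# Hypothetical data: 3 values of col1 crossed with 2 values of col2, 10 rows each
df <- expand.grid(col1 = c("G1", "G2", "G3"),
                  col2 = c("a", "b"),
                  rep  = 1:10,
                  stringsAsFactors = FALSE)

set.seed(7)
picked <- df %>%
  group_by(col1, col2) %>%   # one group per (col1, col2) combination
  sample_n(2) %>%            # draw 2 rows from each of the 6 groups
  ungroup()                  # drop the grouping from the result

print(picked)
```

Here the result has 2 rows for each of the 6 combinations, 12 rows in total.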

Example:

R

# importing required libraries 
library("dplyr") 
  
# create dataframe 
data_frame<-data.frame(col1=c(rep('G1',50),rep('G2',50), 
                              rep('G3',50)),  
                col2=rep(letters[1:5],30) 
                ) 
  
print("Original DataFrame") 
head(data_frame) 
  
# pick 3 samples of each from data frame 
data_mod <- data_frame %>% group_by(col1) %>% sample_n(3) 
print("Modified DataFrame") 
print (data_mod)

Output

[1] "Original DataFrame" 
  col1 col2 
1   G1    a 
2   G1    b 
3   G1    c 
4   G1    d 
5   G1    e 
6   G1    a 
[1] "Modified DataFrame" 
# A tibble: 9 x 2 
# Groups:   col1 [3]   
  col1  col2    
  <chr> <chr> 
1 G1    d     
2 G1    e     
3 G1    c     
4 G2    a     
5 G2    a     
6 G2    c     
7 G3    b     
8 G3    a     
9 G3    a
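In dplyr 1.0.0 and later, sample_n() is marked as superseded in favor of slice_sample(). As a sketch, assuming such a version of dplyr is installed, the same per-group sample can be written as:

```r
library(dplyr)

data_frame <- data.frame(col1 = c(rep("G1", 50), rep("G2", 50), rep("G3", 50)),
                         col2 = rep(letters[1:5], 30))

set.seed(123)                # makes the random draw repeatable
data_mod <- data_frame %>%
  group_by(col1) %>%
  slice_sample(n = 3)        # 3 random rows from each group

print(data_mod)
```

sample_n() still works, so either form can be used; slice_sample() is simply the one recommended for new code.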

Method 3: Using data.table

The data.table package provides fast aggregation of large data organized into tabular structures. The package must be installed and then loaded into the working space.

Within the data.table indexing syntax, .SD holds the subset of data for the current group, as defined by the “by” argument, and .N is the number of rows in that group. Indexing .SD with sample(x = .N, size = n) selects n random rows from each group, so the number of rows chosen per group depends on the size argument. The output is returned in the form of a data.table.

Syntax:

data_frame[ , .SD[sample(x = .N, size = n)], by = ]
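To see what .N and .SD refer to inside a grouped call, here is a minimal sketch on a small table; the table `dt` and the names `sizes` and `picked` are assumptions made for illustration.

```r
library(data.table)

# Hypothetical table with unequal group sizes: G1 has 4 rows, G2 has 2
dt <- data.table(col1 = c(rep("G1", 4), rep("G2", 2)),
                 col2 = letters[1:6])

# .N is the number of rows in the current group
sizes <- dt[, .N, by = col1]
print(sizes)

# .SD holds the current group's rows (without the grouping column),
# so .SD[i] subsets the group by row number
set.seed(1)
picked <- dt[, .SD[sample(x = .N, size = 2)], by = col1]
print(picked)
```

As with the other methods, size must not exceed .N for any group unless sampling with replacement is requested.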

Example:

R

# importing required libraries 
library("data.table") 
  
# create dataframe 
data_frame<-data.table(col1=c(rep('G1',50),rep('G2',50), 
                              rep('G3',50)),  
                col2=rep(letters[1:5],30) 
                ) 
  
print("Original DataFrame") 
head(data_frame) 
  
# pick 5 samples from each group of the data table 
data_mod <- data_frame[, .SD[sample(x = .N, size = 5)], by = col1] 
print("Modified DataFrame") 
print (data_mod)

Output

[1] "Original DataFrame" 
   col1 col2 
1:   G1    a 
2:   G1    b 
3:   G1    c 
4:   G1    d 
5:   G1    e 
6:   G1    a 
[1] "Modified DataFrame" 
    col1 col2
 1:   G1    a
 2:   G1    e
 3:   G1    d
 4:   G1    e
 5:   G1    a
 6:   G2    c
 7:   G2    c
 8:   G2    c
 9:   G2    d
10:   G2    e
11:   G3    b
12:   G3    e
13:   G3    d
14:   G3    d
15:   G3    d
