Take random sample based on groups in R
R provides several packages for drawing random samples from data frames or data tables one group at a time, so that a fixed number of rows is taken from every group. This article covers three of them: plyr, dplyr, and data.table.
Method 1: Using plyr library
The “plyr” package, once installed and loaded into the workspace, provides tools for data manipulation and statistics. Its ddply() function splits a data frame into subsets, applies a function to each subset, and combines the results into a single data frame.
ddply( .data, .variables, .fun = NULL)
.data – the data frame to use
.variables – the grouping variables
.fun – the function to apply to each subset. Here each group x is indexed with sample(nrow(x), y), which picks y random row positions, so y rows are returned for every group defined by the second argument of ddply().
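The call described above can be sketched as follows. This is a minimal illustration, not the article's original listing: the data frame contents, the name `sampled`, the group size of 5, and the fixed seed are all assumptions, and the rows drawn will change with a different seed.

```r
library(plyr)

set.seed(42)  # assumption: fixed seed only so the sketch is repeatable

# Hypothetical data: three groups (G1, G2, G3) with ten rows each
data_frame <- data.frame(
  col1 = rep(c("G1", "G2", "G3"), each = 10),
  col2 = rep(letters[1:5], times = 6),
  stringsAsFactors = FALSE
)

print("Original DataFrame")
print(head(data_frame))

# For every value of col1, keep 5 randomly chosen rows of that group
sampled <- ddply(data_frame, .(col1), function(x) x[sample(nrow(x), 5), ])

print("Modified DataFrame")
print(sampled)
```

Because sample() draws without replacement by default, every group must contain at least as many rows as requested, otherwise the call stops with an error.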
 "Original DataFrame" col1 col2 1 G1 a 2 G1 b 3 G1 c 4 G1 d 5 G1 e 6 G1 a  "Modified DataFrame" col1 col2 1 G1 d 2 G1 e 3 G1 d 4 G1 a 5 G1 a 6 G2 b 7 G2 c 8 G2 d 9 G2 d 10 G2 e 11 G3 c 12 G3 e 13 G3 b 14 G3 b 15 G3 d
Method 2: Using dplyr library
The “dplyr” package, once installed and loaded into the workspace, is used for data manipulation. It offers a wide range of functions to filter, subset, and extract data according to constraints and conditions. Several operations can be chained on the same data frame with the pipe operator.
The group_by() function divides the data into groups based on the values in the specified columns. The columns to group by are passed as arguments, and more than one column name may be given.
group_by(col1, col2, …)
The grouped data is then passed to sample_n(), which selects random rows from the data frame. Its argument gives the number of rows to sample from each group.
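The two steps can be chained as in the sketch below. The data frame, the name `sampled`, the choice of 3 rows per group, and the seed are assumptions made for illustration; the rows drawn depend on the seed.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed only so the sketch is repeatable

# Hypothetical data: three groups (G1, G2, G3) with ten rows each
data_frame <- data.frame(
  col1 = rep(c("G1", "G2", "G3"), each = 10),
  col2 = rep(letters[1:5], times = 6),
  stringsAsFactors = FALSE
)

print("Original DataFrame")
print(head(data_frame))

# Group by col1, then draw 3 random rows from each group
sampled <- data_frame %>%
  group_by(col1) %>%
  sample_n(3)

print("Modified DataFrame")
print(sampled)
```

The result is a grouped tibble with 3 rows per group. In dplyr 1.0.0 and later, sample_n() is marked as superseded; slice_sample(n = 3) on the grouped data is the suggested replacement and gives the same kind of result.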
 "Original DataFrame" col1 col2 1 G1 a 2 G1 b 3 G1 c 4 G1 d 5 G1 e 6 G1 a  "Modified DataFrame" # A tibble: 9 x 2 # Groups: col1  col1 col2 <chr> <chr> 1 G1 d 2 G1 e 3 G1 c 4 G2 a 5 G2 a 6 G2 c 7 G3 b 8 G3 a 9 G3 a
Method 3: Using data.table
The data.table package is designed for fast aggregation of large tabular data. It can be installed and loaded into the workspace like the other packages.
Inside the square brackets, .SD stands for the subset of the data belonging to the current group, and the “by” argument names the grouping column. The expression sample(x = .N, size = n) draws n row positions out of the .N rows in the group, so the size argument sets how many rows are taken from each group. The result is returned as a data.table.
data_frame[ , .SD[sample(x = .N, size = n)], by = ]
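A minimal sketch of this indexing pattern is shown below. The table contents, the names `data_table` and `sampled`, the grouping column col1, the size of 5, and the seed are assumptions chosen for illustration.

```r
library(data.table)

set.seed(42)  # assumption: fixed seed only so the sketch is repeatable

# Hypothetical data: three groups (G1, G2, G3) with ten rows each
data_table <- data.table(
  col1 = rep(c("G1", "G2", "G3"), each = 10),
  col2 = rep(letters[1:5], times = 6)
)

print("Original DataFrame")
print(head(data_table))

# .N is the row count of the current group; .SD is that group's rows.
# sample() picks 5 row positions, and .SD[...] keeps those rows.
sampled <- data_table[, .SD[sample(x = .N, size = 5)], by = col1]

print("Modified DataFrame")
print(sampled)
```

As with the other methods, each group needs at least as many rows as the requested size, since sample() draws without replacement unless replace = TRUE is passed.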
 "Original DataFrame" col1 col2 1: G1 a 2: G1 b 3: G1 c 4: G1 d 5: G1 e 6: G1 a  "Modified DataFrame" col1 col2 1: G1 a 2: G1 e 3: G1 d 4: G1 e 5: G1 a 6: G2 c 7: G2 c 8: G2 c 9: G2 d 10: G2 e 11: G3 b 12: G3 e 13: G3 d 14: G3 d 15: G3 d