Plyr Package in R Programming

Last Updated : 23 Aug, 2023

Data analysis is an essential part of any research or business process, and R is a popular programming language for data analysis and statistical computing. One of the main advantages of R is its large collection of packages for complex data manipulation tasks. plyr is one such package. In the following sections, we will explore the main functions of the plyr package and how they can be used for data manipulation.

What is the plyr Package?

Plyr is a package for data manipulation in R that provides a set of functions for splitting, applying, and combining data. It is based on the concept of split-apply-combine, where a dataset is first split into smaller subsets, a function is applied to each subset, and the results are then combined into a single output. This process is useful for tasks such as aggregating data, summarizing data, and transforming data.
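To make the split-apply-combine idea concrete, here is a minimal sketch of the three steps written by hand in base R, using the built-in mtcars dataset. It only illustrates the pattern and is not a description of how plyr is implemented internally; the variable names pieces, means, and result are chosen for this example.

```r
# Split: break mtcars into a list of data frames, one per cylinder count
pieces <- split(mtcars, mtcars$cyl)

# Apply: compute the mean mpg within each piece
means <- lapply(pieces, function(piece) {
  data.frame(cyl = piece$cyl[1], mean_mpg = mean(piece$mpg))
})

# Combine: stack the per-group results into a single data frame
result <- do.call(rbind, means)
result
```

The plyr functions covered below wrap these three steps into a single call.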

Installing and Loading Plyr Package:

Before using the plyr package, it needs to be installed and loaded into R. The package can be installed using the following command:

R




install.packages("plyr")


After the package is installed, it can be loaded into R using the following command:

R




library(plyr)


1. Splitting Data using ddply( ) function:

The ddply() function splits a data frame into smaller subsets, applies a function to each subset, and then combines the results into a new data frame. plyr function names encode the input and output types: the first letter is the input type and the second letter is the output type. In “ddply”, the first “d” means the input is a data frame and the second “d” means the output is a data frame. Here are the main arguments of ddply():

`.data`: The input data frame that you want to split and process.

`.variables`: One or more grouping variables that define how the data should be split.

`.fun`: A function that you want to apply to each subset of the data frame.

`...`: Additional arguments that are passed to the function specified in .fun.

Here’s an example of how to use ddply() to calculate the mean miles per gallon (mpg) of cars in the mtcars dataset, grouped by the number of cylinders in the engine:

R




library(plyr)
 
# Using ddply to group by number of cylinders and calculate mean mpg
ddply(mtcars, .(cyl), summarise, mean_mpg = mean(mpg))


In this example, ddply() is used to group the mtcars dataset by the cyl variable (number of cylinders), and then the summarise() function is used to calculate the mean mpg for each group. The resulting output is a data frame with two columns: cyl and mean_mpg.
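The same call can compute several summaries at once. The sketch below is one possible variation on the example above; the output column names n_cars, mean_mpg, and max_hp are illustrative choices, not part of plyr.

```r
library(plyr)

# Group mtcars by number of cylinders and compute three summaries per group
by_cyl <- ddply(mtcars, .(cyl), summarise,
                n_cars   = length(mpg),  # number of cars in the group
                mean_mpg = mean(mpg),    # average fuel economy
                max_hp   = max(hp))      # highest horsepower in the group

by_cyl
```

Each named expression after summarise becomes one column of the result, alongside the grouping column cyl.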

2. Combining the results using ldply( ) function:

The ldply() function applies a function to each element of a list and combines the results into a single data frame. Following plyr’s naming scheme, the “l” means the input is a list and the “d” means the output is a data frame. When the list elements are data frames, the result contains all of their rows stacked on top of each other. Here are the main arguments of ldply():

`.data`: The input list that you want to convert to a data frame.

`.fun`: An optional function that you want to apply to each element of the list before converting it to a data frame.

`...`: Additional arguments that are passed to the function specified in .fun.

Example:

R




library(plyr)
 
# Create a list of data frames
countries_1 <- data.frame(country = c("USA", "Canada", "Mexico"), population = c(328, 37, 130))
countries_2 <- data.frame(country = c("Brazil", "Argentina", "Chile"), population = c(211, 45, 19))
countries_list <- list(countries_1, countries_2)
 
# Use ldply() to combine the list of data frames into a single data frame
combined_df <- ldply(countries_list, data.frame)
 
# View the resulting data frame
combined_df


In this example, we first create two data frames (countries_1 and countries_2) using the data.frame() function and place them in a list called countries_list. We then use ldply() to combine all the data frames in countries_list into a single data frame called combined_df. The resulting data frame has six rows, one for each country in the original data frames.
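When the input list has names, ldply() records which element each row came from in an extra column called .id. The sketch below illustrates this with two small data frames; the list names north and south and the variable names are made up for this example.

```r
library(plyr)

# Two small data frames stored in a named list
north <- data.frame(country = c("USA", "Canada"), population = c(328, 37))
south <- data.frame(country = c("Brazil", "Chile"), population = c(211, 19))
regions <- list(north = north, south = south)

# identity returns each element unchanged; ldply() then stacks the pieces
# and adds a .id column holding the name of the originating list element
labelled <- ldply(regions, identity)

labelled
```

Keeping the source label is useful when the stacked rows need to be traced back to the data frame they came from.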

3. Combining Data using adply( ) function:

The adply() function applies a function to each slice of an array (or each row of a data frame) and then combines the results into a new data frame. The “a” in adply() stands for “array”, the input type, and the “d” stands for “data frame”, the output type. It can be used with arrays of any number of dimensions. The arguments for adply() are:

`.data`: The input array, matrix, or data frame.

`.margins`: The dimensions of the array to split over (the example below uses 1 to split over the first dimension, that is, by row).

`.fun`: The function to apply to each slice of the array (the example below uses an anonymous function that calculates the sum of each row).

`...`: Additional arguments to pass to the function specified in .fun (if any).

Example:

R




library(plyr)
 
# Create a sample matrix
mat <- matrix(1:9, nrow = 3)
 
# Display created matrix
mat
 
# Use adply() to calculate the sum of each row
result <- adply(mat, 1, function(x) sum(x))
 
# View the result
result


In this example, the adply() function is used to apply the sum() function to each row of the matrix mat. The second argument (1) specifies that the array is split along its first dimension, so each piece consists of one row and all columns. The third argument is an anonymous function that calculates the sum of the row it receives. The resulting result data frame has three rows (one for each row in mat) and two columns: X1, which identifies the row, and V1, which holds the sum of that row (12, 15, and 18).
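Splitting over the second dimension works the same way. As a minimal sketch, the call below passes 2 as the margin so that each piece is one column of mat, and returns a named vector so that each statistic gets its own output column; the names col_stats, mean, and sd are illustrative.

```r
library(plyr)

# Same 3 x 3 matrix as above
mat <- matrix(1:9, nrow = 3)

# Margin 2 splits by column; the named vector returned for each column
# becomes one row of the result, with one output column per name
col_stats <- adply(mat, 2, function(x) c(mean = mean(x), sd = sd(x)))

col_stats
```

Returning a named vector is a convenient way to produce several summary columns from a single adply() call.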

4. Join Two Data Frames using join( ) function:

join() is a function from the plyr package in R that is used to join two data frames by a common column. The join() function takes several arguments, including:

`x`, `y`: The data frames to join.

`by`: The column(s) to join the data frames on.

`type`: The type of join to perform: “left” (the default), “right”, “inner”, or “full”.

`match`: How to handle duplicate keys: “all” (the default) keeps every matching row, while “first” keeps only the first match.

Example:

R




library(plyr)
 
# Create two sample data frames
df1 <- data.frame(
  id = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie")
)
 
df2 <- data.frame(
  id = c(2, 3, 4),
  age = c(25, 30, 35)
)
 
# Print the created dataset
df1
df2
 
# Use join() to combine the data frames
result <- join(df1, df2, by = "id")
 
# View the result
result


In this example, the join() function is used to combine two data frames (df1 and df2) based on a common column (id). The by argument specifies the name of the common column. Because type defaults to “left”, every row of df1 is kept: the result data frame has three columns (id, name, age) and three rows. Bob and Charlie (id 2 and 3) get their ages from df2, while Alice (id 1) has no match in df2, so her age is NA. The row with id 4 exists only in df2 and is dropped.
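The type argument changes which rows survive the join. The sketch below reuses the same two data frames and compares an inner join with a full join; the result variable names are chosen for this example.

```r
library(plyr)

df1 <- data.frame(id = c(1, 2, 3), name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(id = c(2, 3, 4), age = c(25, 30, 35))

# Inner join: keep only ids present in both data frames (2 and 3)
inner_result <- join(df1, df2, by = "id", type = "inner")

# Full join: keep every id from either data frame, filling gaps with NA
full_result <- join(df1, df2, by = "id", type = "full")

inner_result
full_result
```

An inner join is the stricter choice when only complete records are wanted, while a full join is useful for spotting ids that are missing from one side.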

5. Summary Statistics using summarise( ) function:

The summarise() function in the plyr package creates a new data frame of summary statistics computed from an existing data frame. Called on its own, it summarises the whole data frame; to calculate statistics by group, it is passed as the function argument to ddply(). The summarise() function takes the following arguments:

`.data`: The data frame to summarize.

`...`: Named expressions that calculate summary statistics (e.g. mean(value), sd(value), etc.)

Example:

R




# Load the plyr package
library(plyr)
 
# Create a data frame with two columns: group and value
df <- data.frame(group = c("A", "A", "B", "B", "B"), value = c(2, 4, 6, 8, 10))
 
# Summarize the data by group, calculating the
# mean and standard deviation of the value column
summary_df <- ddply(df, .(group), summarise, mean = mean(value), sd = sd(value))
 
# Print the summary data frame to the console
summary_df


In this code, ddply() splits df by the group column and applies plyr’s summarise() function to each piece. We calculate the mean and standard deviation of the value column using the mean() and sd() functions, respectively, and give the resulting columns the names mean and sd. The resulting summary_df data frame has a row for each group in the original df data frame, with columns for group, mean, and sd. Note that group_by() belongs to the dplyr package, not plyr, so it is not available when only plyr is loaded.
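summarise() can also be called directly on a data frame without any grouping, in which case the whole data frame collapses to a single row. The sketch below assumes the same df as in the example above; the column names mean_value and sd_value are illustrative, and the plyr:: prefix is used only to avoid a clash if dplyr happens to be loaded as well.

```r
library(plyr)

df <- data.frame(group = c("A", "A", "B", "B", "B"), value = c(2, 4, 6, 8, 10))

# No grouping: one row summarising every value in df
overall <- plyr::summarise(df, mean_value = mean(value), sd_value = sd(value))

overall
```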


