How to Use aggregate and Not Drop Rows with NA in R
Last Updated: 01 Mar, 2024
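In the R programming language, the aggregate() function computes summary statistics by group. When called with a formula, aggregate() uses na.action = na.omit by default, which drops every row that has a missing value (NA) in any variable named in the formula before the groups are formed. Specifying na.action = na.pass passes those rows through to the summary function instead. Rows whose grouping value is itself NA are still left out of the result.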
The sections below show how to use aggregate() without dropping rows that contain NA values.
Syntax:
aggregate(formula, data, FUN, na.action = na.pass)
Where:
formula: A formula such as y ~ g, with the variable(s) to be aggregated on the left and the grouping variable(s) on the right.
data: The data frame containing the variables.
FUN: The function applied to each group (e.g., mean, sum, max).
na.action: Specifies how rows with NA values are handled before aggregation. The default, na.omit, removes them; na.action = na.pass keeps them, so FUN receives the NA values.
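The difference between the default and na.pass is easiest to see side by side. The following is a minimal sketch on a small hypothetical data frame (the names df, default_result and pass_result are illustrative and not part of the examples in this article) in which every value in group "A" is missing.

```r
# Hypothetical data: every Value in group "A" is missing
df <- data.frame(Group = c("A", "B", "A", "B"),
                 Value = c(NA, 2, NA, 4))

# Default na.action (na.omit): rows with NA in Value are removed before
# grouping, so group "A" disappears from the result
default_result <- aggregate(Value ~ Group, data = df, FUN = sum)

# na.pass: the NA rows reach sum(), so group "A" is kept with an NA total
pass_result <- aggregate(Value ~ Group, data = df, FUN = sum,
                         na.action = na.pass)

print(default_result)
print(pass_result)
```

With the default, the result has a single row for group "B"; with na.pass, group "A" is reported as well, with NA as its sum.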
Aggregating with Sum
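In this example, the dataset has two columns, “Group” and “Value”. We sum “Value” by “Group” and use na.action = na.pass so that rows with a missing “Value” are kept. Both values in group A are NA, so its sum is NA. The row whose “Group” is NA does not appear in the output, because aggregate() omits rows whose grouping value is missing.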
R
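# Sample data with NA in both the grouping and the value column
df1 <- data.frame(Group = c("A", "B", "A", "B", NA),
                  Value = c(NA, 2, NA, 4, 5))
# Sum Value by Group, keeping rows where Value is NA
result1 <- aggregate(Value ~ Group, data = df1, FUN = sum, na.action = na.pass)
print(result1)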
Output:
  Group Value
1     A    NA
2     B     6
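The output above has no row for the observation whose Group is NA, even with na.action = na.pass. One possible workaround, sketched here as an assumption rather than as the only option, is to recode the missing group to an explicit label before aggregating. The label "Missing" is an arbitrary choice, and stringsAsFactors = FALSE is set so the recoding also works on R versions where strings default to factors.

```r
df1 <- data.frame(Group = c("A", "B", "A", "B", NA),
                  Value = c(NA, 2, NA, 4, 5),
                  stringsAsFactors = FALSE)

# Recode the missing group to an explicit label (the label text is arbitrary)
df1$Group[is.na(df1$Group)] <- "Missing"

# The formerly NA group now forms its own group in the result
result_labelled <- aggregate(Value ~ Group, data = df1, FUN = sum,
                             na.action = na.pass)
print(result_labelled)
```

The result now contains three groups, and the row that was previously omitted contributes its value of 5 to the "Missing" group.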
Aggregating with Custom Function
In this example, the dataset df4 has two columns, “Group” and “Rating”, and we want the median of “Rating” within each “Group”. We apply a custom function that calls median() with na.rm = TRUE, so missing ratings are ignored inside each group, while na.action = na.pass keeps the rows with NA from being dropped before the function is applied.
R
df4 <- data.frame(Group = c("A", "B", "A", "B", NA),
                  Rating = c(3.5, 4.2, NA, 3.8, 4.5))
# Custom function: median that ignores NA values within a group
median_custom <- function(x) {
  median(x, na.rm = TRUE)
}
result4 <- aggregate(Rating ~ Group, data = df4, FUN = median_custom,
                     na.action = na.pass)
print(result4)
Output:
  Group Rating
1     A    3.5
2     B    4.0
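A wrapper function is not strictly required for this. Extra arguments supplied to aggregate() after FUN are forwarded to FUN, so na.rm = TRUE can be passed directly to median(). The sketch below reuses the data from this example; the name result_dots is illustrative.

```r
df4 <- data.frame(Group = c("A", "B", "A", "B", NA),
                  Rating = c(3.5, 4.2, NA, 3.8, 4.5))

# na.rm = TRUE is forwarded to median(); na.action = na.pass keeps the NA rows
result_dots <- aggregate(Rating ~ Group, data = df4, FUN = median,
                         na.rm = TRUE, na.action = na.pass)
print(result_dots)
```

This should give the same medians as the custom function, 3.5 for group A and 4.0 for group B.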
Aggregating with Count
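In this example, we count the non-missing entries in “Purchases” and “Returns” for each customer. With na.action = na.pass, rows containing NA reach the counting function, so a customer with a missing value gets a count of 0 in that column instead of being dropped.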
R
customer_data <- data.frame(
  Customer = c('Jayesh', 'Anurag', 'Vipul', 'Shivang', 'Pratham'),
  Purchases = c(5, 8, NA, 12, NA),
  Returns = c(NA, 2, 1, NA, 3)
)
# Count the non-NA entries in every other column, grouped by Customer
aggregate(. ~ Customer, data = customer_data, FUN = function(x) sum(!is.na(x)),
          na.action = na.pass)
Output:
  Customer Purchases Returns
1   Anurag         1       1
2   Jayesh         1       0
3  Pratham         0       1
4  Shivang         1       0
5    Vipul         0       1
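Because each customer appears only once in the data above, every count is either 0 or 1. The same pattern becomes more informative when a customer has several rows. The following sketch uses a hypothetical order log (the orders data frame and its Amount column are assumptions made for illustration) with one row per order.

```r
# Hypothetical order log: one row per order, NA where no amount was recorded
orders <- data.frame(
  Customer = c("Jayesh", "Jayesh", "Anurag", "Anurag", "Anurag", "Vipul"),
  Amount   = c(5, NA, 8, 3, NA, NA)
)

# Count the recorded (non-NA) amounts per customer; na.pass keeps the NA rows,
# so a customer with only missing amounts still appears with a count of 0
recorded <- aggregate(Amount ~ Customer, data = orders,
                      FUN = function(x) sum(!is.na(x)),
                      na.action = na.pass)
print(recorded)
```

Here Anurag has two recorded amounts, Jayesh has one, and Vipul has none but is still listed.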
Aggregating with Mean
In this example, we calculate the mean score in each subject for each student. Because na.action = na.pass keeps rows with missing scores, all five students appear in the output. A subject mean is NA wherever the student's score is missing, since mean() returns NA when its input contains NA.
R
student_scores <- data.frame(
  Student = c('Jayesh', 'Anurag', 'Vipul', 'Shivang', 'Pratham'),
  Math = c(80, NA, 75, 90, 85),
  Science = c(NA, 70, 85, 88, 92),
  English = c(78, 85, 82, NA, 90)
)
# Mean of every subject column, grouped by Student, keeping rows with NA
aggregate(. ~ Student, data = student_scores, FUN = mean, na.action = na.pass)
Output:
  Student Math Science English
1  Anurag   NA      70      85
2  Jayesh   80      NA      78
3 Pratham   85      92      90
4 Shivang   90      88      NA
5   Vipul   75      85      82
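With several value columns, the default na.action removes a row as soon as any one of its columns is NA, which discards the valid scores in that row too. As a rough check, the sketch below runs the same data with and without na.pass and compares which students remain (default_means and pass_means are illustrative names).

```r
student_scores <- data.frame(
  Student = c("Jayesh", "Anurag", "Vipul", "Shivang", "Pratham"),
  Math    = c(80, NA, 75, 90, 85),
  Science = c(NA, 70, 85, 88, 92),
  English = c(78, 85, 82, NA, 90)
)

# Default na.action (na.omit): a student with an NA in any subject is dropped
default_means <- aggregate(. ~ Student, data = student_scores, FUN = mean)

# na.pass: all students are kept, with NA where a score is missing
pass_means <- aggregate(. ~ Student, data = student_scores, FUN = mean,
                        na.action = na.pass)

print(as.character(default_means$Student))
print(as.character(pass_means$Student))
```

Only Pratham and Vipul have complete records, so they are the only students left under the default, while na.pass keeps all five.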
Conclusion
In this article, we saw that aggregate() is a convenient tool for computing summary statistics by group. With the formula interface, it drops by default any row that has a missing value (NA) in a variable used in the formula, which can silently remove data from the analysis. Specifying na.action = na.pass keeps those rows, and the summary function then decides how to treat the NA values, for example by returning NA or by ignoring them with na.rm = TRUE. Rows with a missing grouping value are still omitted from the result.