How to Calculate Correlation By Group in R

Last Updated : 15 Mar, 2024

Calculating correlation by group in R Programming Language means finding the correlation coefficient between two variables within each subgroup defined by a third variable. In R, this can be done by combining the cor() function with group_by() and summarise() from the ‘dplyr’ package, or with base R tools such as split() and sapply().

Syntax:

library(dplyr)

correlation_data <- dataset %>%
  group_by(grouping_variable) %>%
  summarise(correlation = cor(variable1, variable2))

‘dataset’: This is the name of your dataset containing the variables of interest.
‘group_by(grouping_variable)’: This function groups the dataset by the variable that defines your groups.
‘summarise(correlation = cor(variable1, variable2))’: Within each group defined by the grouping variable, this line calculates the correlation coefficient between two variables of interest (variable1 and variable2) using the cor() function. The result is stored in a new column named ‘correlation’.
‘correlation_data’: This is the resulting dataframe containing the correlation coefficients for each group.

Replace dataset, grouping_variable, variable1, and variable2 with the actual names of your dataset and variables.
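The cor() function also accepts the ‘use’ and ‘method’ arguments, which matter when a group contains missing values or when a rank-based coefficient is preferred. The following is a minimal sketch on a small made-up data frame; the name ‘scores’ and its values are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical data with one missing value in group "b"
scores <- data.frame(
  grouping_variable = rep(c("a", "b"), each = 5),
  variable1 = c(1, 2, 3, 4, 5, 2, 4, NA, 8, 10),
  variable2 = c(2, 1, 4, 3, 5, 1, 3, 5, 7, 9)
)

result <- scores %>%
  group_by(grouping_variable) %>%
  summarise(
    # Default call: returns NA for any group that contains a missing value
    pearson_default = cor(variable1, variable2),
    # Drop incomplete pairs before computing the Pearson correlation
    pearson_complete = cor(variable1, variable2, use = "complete.obs"),
    # Rank-based (Spearman) correlation on the complete pairs
    spearman_complete = cor(variable1, variable2,
                            use = "complete.obs", method = "spearman")
  )

print(result)
```

With the default settings, group ‘b’ gets NA because of its missing value, while use = "complete.obs" computes the coefficient from the remaining pairs.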

Why Calculate Correlation By Group

Identifying Group-Specific Relationships: Correlation by group helps identify how the relationship between two variables varies across different subsets or groups in your data.
Application in Various Fields: It’s commonly used in fields like marketing, biology, social sciences, finance, and education to analyse diverse datasets.
Insight into Group Dynamics: By calculating correlations within each group, you gain insights into specific trends or relationships that might be obscured when looking at the entire dataset.
Nuanced Analysis: Allows for a more nuanced analysis by considering the unique characteristics of each group within your data.
Enhanced Decision Making: Helps in making informed decisions tailored to specific groups or contexts within your dataset.

Calculate Correlation By Group Using Simulated Data

# Generate some sample data
set.seed(123)
df <- data.frame(
  group = rep(letters[1:3], each = 20),
  x = rnorm(60),
  y = rnorm(60)
)

# Using dplyr
library(dplyr)
correlation_by_group <- df %>%
  group_by(group) %>%
  summarise(correlation = cor(x, y))

# Print the result
print(correlation_by_group)

Output:

# A tibble: 3 × 2
  group correlation
  <chr>       <dbl>
1 a          0.122 
2 b          0.366 
3 c         -0.0242

First, we generate a sample dataset with three columns: ‘group’, ‘x’, and ‘y’.

Using dplyr, we group the data by the ‘group’ column and calculate the correlation between ‘x’ and ‘y’ within each group.
Because ‘x’ and ‘y’ are drawn independently with rnorm(), the correlations shown above reflect only sampling variation within each group of 20 observations.
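The same per-group correlations can also be computed without dplyr. The following is a minimal base R sketch that uses split() and sapply() instead of group_by() and summarise(); the names ‘groups’, ‘cor_base’, and ‘correlation_base’ are introduced here for illustration.

```r
# Recreate the simulated data from the example above
set.seed(123)
df <- data.frame(
  group = rep(letters[1:3], each = 20),
  x = rnorm(60),
  y = rnorm(60)
)

# Split the data frame into one piece per group, then compute
# the correlation between x and y inside each piece
groups <- split(df, df$group)
cor_base <- sapply(groups, function(g) cor(g$x, g$y))

# Collect the named vector into a data frame for easier reading
correlation_base <- data.frame(
  group = names(cor_base),
  correlation = unname(cor_base)
)
print(correlation_base)
```

This approach has no package dependencies, at the cost of a little more bookkeeping than the dplyr pipeline.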

# Load the required library
library(dplyr)

# Example dataset
data <- data.frame(
  group = c("A", "A", "B", "B", "B", "C", "C", "C"),
  var1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  var2 = c(2, 4, 3, 6, 5, 8, 7, 9)
)

# Calculate correlation by group
correlation <- data %>%
  group_by(group) %>%
  summarise(correlation = cor(var1, var2))

# View the result
print(correlation)

Output:

# A tibble: 3 × 2
  group correlation
  <chr>       <dbl>
1 A           1    
2 B           0.655
3 C           0.5

Calculate Correlation by Group

‘data %>% group_by(group) %>% summarise(correlation = cor(var1, var2))’: This line performs the following operations:
‘%>%’: The pipe operator, which takes the output from the left-hand side and passes it as the first argument to the function on the right-hand side.
‘group_by(group)’: Groups the dataset by the ‘group’ variable.
‘summarise(correlation = cor(var1, var2))’: Within each group, calculates the correlation coefficient between ‘var1’ and ‘var2’ using the cor() function. The result is stored in a new column named ‘correlation’.
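A correlation coefficient is only as informative as the number of observations behind it. As a sketch, the same pipeline can report the group size next to each coefficient by adding n() inside summarise(); the column name ‘n’ and the object name ‘correlation_with_n’ are choices made for this illustration.

```r
library(dplyr)

# Same example dataset as above
data <- data.frame(
  group = c("A", "A", "B", "B", "B", "C", "C", "C"),
  var1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  var2 = c(2, 4, 3, 6, 5, 8, 7, 9)
)

correlation_with_n <- data %>%
  group_by(group) %>%
  summarise(
    n = n(),                        # number of rows in the group
    correlation = cor(var1, var2)   # Pearson correlation within the group
  )

print(correlation_with_n)
```

Group ‘A’ has only two observations, and two distinct points always lie exactly on a line, so its correlation of 1 says nothing about the strength of the relationship.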

Calculate Correlation By Group Using Real Data

This example calculates correlation by group using the ‘mtcars’ dataset (available by default in R), which contains data about various car models. We’ll calculate the correlation between the variables ‘mpg’ (miles per gallon) and ‘hp’ (horsepower) for each level of ‘cyl’ (number of cylinders):

# View the first few rows of the mtcars dataset
head(mtcars)

Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Now calculate the correlation by group:

# Load the required library
library(dplyr)

# Calculate correlation by group
correlation_by_group <- mtcars %>%
  group_by(cyl) %>%
  summarise(correlation = cor(mpg, hp))

# View the result
print(correlation_by_group)

Output:

# A tibble: 3 × 2
    cyl correlation
  <dbl>       <dbl>
1     4      -0.524
2     6      -0.127
3     8      -0.284

Here is what the code does:

We group the ‘mtcars’ dataset by the ‘cyl’ variable, which represents the number of cylinders in each car.
Within each group (each level of ‘cyl’), we calculate the correlation coefficient between ‘mpg’ and ‘hp’ using the cor() function.
The resulting dataframe ‘correlation_by_group’ contains one correlation coefficient per level of ‘cyl’. All three are negative, so within each cylinder class higher horsepower is associated with lower miles per gallon; the association is strongest for 4-cylinder cars (-0.524) and weakest for 6-cylinder cars (-0.127).
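The coefficients alone do not show whether a within-group relationship could plausibly be due to chance, especially when groups are small. One possible extension, sketched below, adds the group size and the p-value from cor.test() to the summary; the column names ‘n’ and ‘p_value’ and the object name ‘correlation_tests’ are assumptions made for this illustration.

```r
library(dplyr)

correlation_tests <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n = n(),                      # number of cars in the group
    correlation = cor(mpg, hp),   # Pearson correlation within the group
    # cor.test() runs a Pearson test by default; its p.value element holds
    # the two-sided p-value for the null hypothesis of zero correlation
    p_value = cor.test(mpg, hp)$p.value
  )

print(correlation_tests)
```

Reading the coefficient together with ‘n’ and ‘p_value’ helps avoid over-interpreting a correlation that rests on only a handful of cars.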

Conclusion

In summary, calculating correlation by group in R gives a more detailed understanding of how the relationship between two variables varies across subgroups. Using the ‘dplyr’ package, analysts can compute a correlation coefficient for each group in a few lines of code, revealing patterns that a single correlation over the whole dataset can hide.
