Pearson Correlation Testing in R Programming

Last Updated : 19 Mar, 2024

Correlation is a statistical measure that indicates how strongly two variables are related. It can also be examined across several variables, one pair at a time. For instance, to find out whether the heights of fathers and sons are related, a correlation coefficient can be calculated. The coefficient lies between -1 and +1. It is a scaled version of covariance and gives both the direction and the strength of a relationship.

Pearson Correlation Testing in R

There are mainly two types of correlation:

Parametric Correlation – Pearson correlation (r): It measures the linear dependence between two variables (x and y). It is known as a parametric correlation test because it depends on the distribution of the data.
Non-Parametric Correlation – Kendall (tau) and Spearman (rho): These are rank-based correlation coefficients and are known as non-parametric correlations.
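
As a rough illustration of the difference, the sketch below computes all three coefficients on the same data through the method argument of cor(). The vectors are made up for this example (an assumption, not data from this article): y increases with x, but not along a straight line.

```r
# Illustrative data (an assumption): y grows monotonically but non-linearly with x
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- x^3

# Same function, three different methods
pearson  <- cor(x, y, method = "pearson")   # measures linear association
spearman <- cor(x, y, method = "spearman")  # rank-based
kendall  <- cor(x, y, method = "kendall")   # rank-based

cat("Pearson: ", pearson, "\n")
cat("Spearman:", spearman, "\n")
cat("Kendall: ", kendall, "\n")
```

Because y always increases when x increases, both rank-based coefficients equal 1 here, while Pearson's r stays below 1 because the relationship is not linear.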

Pearson Correlation Coefficient Formula

The Pearson correlation is a parametric correlation (it is not rank-based). The Pearson correlation coefficient is probably the most widely used measure of the linear relationship between two normally distributed variables, and is thus often just called the “correlation coefficient”. The formula for calculating it is as follows:

[Tex]r = \frac{\sum (x - m_x)(y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}[/Tex]

where,

r: Pearson correlation coefficient
x and y: two vectors of length n
m_x and m_y: the means of x and y, respectively.
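
To make the formula concrete, here is a minimal sketch that evaluates it term by term and compares the result with the built-in cor(). It uses the same example vectors as the implementation section of this article; the variable names (numerator, denominator, r_manual, r_builtin) are chosen only for this illustration.

```r
# Example vectors (the same ones used in the implementation section)
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(1, 3, 6, 2, 7, 4, 5)

# Means of x and y
m_x <- mean(x)
m_y <- mean(y)

# Numerator: sum of the products of the deviations from the means
numerator <- sum((x - m_x) * (y - m_y))

# Denominator: square root of the product of the sums of squared deviations
denominator <- sqrt(sum((x - m_x)^2) * sum((y - m_y)^2))

r_manual  <- numerator / denominator
r_builtin <- cor(x, y, method = "pearson")

cat("Manual r:  ", r_manual, "\n")
cat("Built-in r:", r_builtin, "\n")
```

For these vectors the numerator is 15 and the denominator is 28, so both lines print 0.5357143.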

Note:

r takes a value between -1 (perfect negative correlation) and 1 (perfect positive correlation).
r = 0 means no linear correlation.
It is not appropriate for ordinal variables.
The sample size should be at least moderate (20-30) for a good estimate.
It is not robust to outliers, which can lead to misleading values.
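
The last point can be seen in a small sketch. The data below are invented for this illustration (an assumption): y is roughly 2 * x, and then one value is replaced with an extreme outlier. Spearman's rho is computed alongside only for comparison.

```r
# Made-up data (an assumption): y is roughly 2 * x
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 18.1, 19.9)

# Pearson correlation on the clean data
r_clean <- cor(x, y, method = "pearson")

# Replace the last observation with an extreme outlier
y_out <- y
y_out[10] <- -30

# Pearson and Spearman correlation on the contaminated data
r_outlier   <- cor(x, y_out, method = "pearson")
rho_outlier <- cor(x, y_out, method = "spearman")

cat("Pearson, clean data:     ", r_clean, "\n")
cat("Pearson, with outlier:   ", r_outlier, "\n")
cat("Spearman, with outlier:  ", rho_outlier, "\n")
```

In this example a single extreme point is enough to turn a Pearson coefficient close to 1 into a negative one, while the rank-based Spearman coefficient stays positive. Plotting the data before computing r is a simple way to spot such points.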

Implementation in R

R provides two functions for calculating the Pearson correlation coefficient: cor() and cor.test(). cor() computes only the correlation coefficient, whereas cor.test() tests for association between paired samples and returns both the correlation coefficient and the p-value of the test.

Syntax: cor(x, y, method = "pearson")
cor.test(x, y, method = "pearson")

Parameters:

x, y: numeric vectors with the same length
method: correlation method ("pearson", "kendall" or "spearman")

Correlation Coefficient Test In R Using cor() method

# R program to illustrate
# pearson Correlation Testing
# Using cor()

# Taking two numeric
# Vectors with same length
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)

# Calculating
# Correlation coefficient
# Using cor() method
result = cor(x, y, method = "pearson")

# Print the result
cat("Pearson correlation coefficient is:", result)

Output:

Pearson correlation coefficient is: 0.5357143

Correlation Coefficient Test In R Using cor.test() method

# R program to illustrate
# pearson Correlation Testing
# Using cor.test()

# Taking two numeric
# Vectors with same length
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)

# Calculating
# Correlation coefficient
# Using cor.test() method
result = cor.test(x, y, method = "pearson")

# Print the result
print(result)

Output:

Pearson's product-moment correlation

data: x and y
t = 1.4186, df = 5, p-value = 0.2152
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3643187 0.9183058
sample estimates:
cor
0.5357143

In the output above:

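t is the value of the test statistic (t = 1.4186).
df is the degrees of freedom of the test (df = n - 2 = 5).
p-value is the p-value of the test (p-value = 0.2152); it is compared against the chosen significance level.
alternative hypothesis is a character string describing the alternative hypothesis (true correlation is not equal to 0).
95 percent confidence interval is the confidence interval for the correlation coefficient (-0.3643 to 0.9183).
sample estimates is the estimated correlation coefficient. For the Pearson correlation coefficient it is named cor (cor = 0.5357).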
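
The printed values can also be read from the object returned by cor.test(), which is a list of class "htest". The sketch below pulls out the individual components and, as a cross-check, recomputes the test statistic from r using t = r * sqrt(n - 2) / sqrt(1 - r^2). The names t_manual and p_manual are introduced only for this illustration.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(1, 3, 6, 2, 7, 4, 5)

res <- cor.test(x, y, method = "pearson")

# Individual components of the "htest" result
r      <- unname(res$estimate)    # correlation coefficient (cor)
t_stat <- unname(res$statistic)   # test statistic (t)
dof    <- unname(res$parameter)   # degrees of freedom (df)
p_val  <- res$p.value             # p-value
ci     <- as.numeric(res$conf.int) # 95 percent confidence interval

# Cross-check: recompute t and the two-sided p-value from r
n        <- length(x)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

cat("r =", r, " t =", t_stat, " df =", dof, " p-value =", p_val, "\n")
cat("95% CI:", ci[1], "to", ci[2], "\n")
cat("t recomputed from r:", t_manual, " p recomputed:", p_manual, "\n")
```

Extracting the components this way is convenient when the values are needed later in a script, for example to store them in a table or compare them against a significance level.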

Correlation Coefficient Test on External Dataset

Data: Download the CSV file here.

# R program to illustrate
# Pearson Correlation Testing

# Import data into RStudio
df = read.csv("Auto.csv")

# Taking two column
# Vectors with same length
x = df$mpg
y = df$weight

# Calculating
# Correlation coefficient
# Using cor() method
result = cor(x, y, method = "pearson")

# Print the result
cat("Pearson correlation coefficient is:", result)

# Using cor.test() method
res = cor.test(x, y, method = "pearson")
print(res)

Output:

Pearson correlation coefficient is: -0.8782815
Pearson's product-moment correlation

data: x and y
t = -31.709, df = 298, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9018288 -0.8495329
sample estimates:
cor
-0.8782815
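
If Auto.csv is not at hand, the same workflow can be tried on R's built-in mtcars data set, which also has a fuel-economy column (mpg) and a weight column (wt, in units of 1000 lbs). This is a stand-in chosen for this sketch, not the data used above, so the numbers differ from the output shown. The sketch also passes use = "complete.obs" to cor(), which matters for real CSV files with missing values; mtcars itself has none.

```r
# mtcars ships with R (datasets package), so no file is needed
x <- mtcars$mpg
y <- mtcars$wt

# use = "complete.obs" makes cor() skip pairs with missing values;
# without it, a single NA would make the result NA
result <- cor(x, y, use = "complete.obs", method = "pearson")
cat("Pearson correlation coefficient is:", result, "\n")

# cor.test() drops incomplete pairs on its own
res <- cor.test(x, y, method = "pearson")
print(res)
```

As with the Auto.csv example, heavier cars tend to have lower mpg, so the coefficient is strongly negative.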

Visualize Pearson Correlation Testing in R Programming

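library(ggplot2)

# Pearson correlation between weight and mpg, used in the plot annotation
correlation = cor(df$weight, df$mpg, method = "pearson")

# Scatter plot with correlation coefficient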
ggplot(data = df, aes(x = weight, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  annotate("text", x = mean(df$weight), y = max(df$mpg), 
           label = paste("Correlation =", round(correlation, 2)), 
           color = "red", hjust = 0, vjust = 1) +
  labs(title = "Scatter Plot of MPG vs. Weight with Correlation Coefficient", 
       x = "Weight", y = "MPG") +
  theme_minimal()

Output:

[Figure: scatter plot of MPG vs. weight with a fitted regression line and the correlation coefficient annotated]

In this code, geom_point() draws the scatter plot and geom_smooth() with method = "lm" overlays a fitted linear regression line. annotate() places the calculated Pearson correlation coefficient on the plot, in red for visibility; adjust the position and appearance of the text as needed. The resulting plot gives both a visual representation of the relationship and the numeric correlation coefficient.

Frequently Asked Questions

Q.1 Why is the Pearson correlation test required?

The Pearson correlation test is a valuable statistical tool for assessing and understanding relationships between variables, guiding decision-making, and ensuring the validity of statistical analyses in various fields. Typical uses include:

Quantify Strength and Direction of Linear Relationships
Test Hypotheses about Relationships
Decision Making in Research and Analysis
Feature Selection in Data Analysis
Assumption Checking in Regression Analysis
Quality Control in Scientific Studies
Risk Assessment in Finance

Q.2 How do you interpret a positive correlation coefficient?

A positive correlation coefficient indicates a positive association between the two variables: when one increases, the other tends to increase as well. The closer the coefficient is to +1, the stronger this linear tendency.

Q.3 In a Pearson correlation test, what is the null hypothesis?

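The null hypothesis states that there is no linear correlation between the two variables in the population, that is, the true correlation coefficient is equal to 0. The alternative hypothesis, as printed by cor.test(), is that the true correlation is not equal to 0.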
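
A minimal sketch of how this hypothesis is usually checked in R: compare the p-value returned by cor.test() with a chosen significance level. The level alpha = 0.05 and the variable names are assumptions made for this illustration; the vectors are the example vectors used earlier in the article.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(1, 3, 6, 2, 7, 4, 5)

alpha <- 0.05  # significance level (an assumption; choose it before testing)

# Two-sided test: H0 is "true correlation is equal to 0"
res <- cor.test(x, y, method = "pearson", alternative = "two.sided")
reject_null <- res$p.value < alpha

if (reject_null) {
  cat("p =", res$p.value, "< alpha: reject the null hypothesis\n")
} else {
  cat("p =", res$p.value, ">= alpha: fail to reject the null hypothesis\n")
}

# One-sided variant: the alternative is "true correlation is greater than 0"
res_greater <- cor.test(x, y, method = "pearson", alternative = "greater")
cat("One-sided p-value:", res_greater$p.value, "\n")
```

For these vectors the two-sided p-value is about 0.2152, so the null hypothesis is not rejected at the 5% level even though the sample coefficient is 0.5357. With only seven observations, the test has little power to detect a moderate correlation.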
