Spearman’s Rank Correlation
Last Updated : 23 Jan, 2024
Correlation measures the strength of the association between two variables. For instance, to find out whether there is a relationship between the heights of fathers and sons, we can calculate a correlation coefficient.
Methods for correlation analysis: There are mainly two types of correlation:
- Parametric Correlation – Pearson correlation (r): It measures the linear dependence between two variables (x and y). It is known as a parametric correlation test because it depends on the distribution of the data. It is used for numerical data.
- Non-Parametric Correlation – Kendall (tau) and Spearman (rho): These are rank-based correlation coefficients that make no assumption about the distribution of the data. They are used for ordinal (ranked) data and for numerical data that can be converted to ranks.
What is Spearman’s Correlation
Spearman’s Rank Correlation is a statistical measure of the strength and direction of the monotonic relationship between two continuous or ordinal variables. It works on ranks rather than raw values, so the observations of each variable are first ranked, that is, put in order. It is denoted by the symbol “rho” (ρ) and takes values from -1 to +1. A positive value of rho indicates a positive relationship between the two variables, while a negative value of rho indicates a negative relationship. A rho value of 0 indicates no monotonic association between the two variables.
Spearman’s Correlation formula
$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
where,
ρ = Spearman correlation coefficient
rank = the position or order of a variable’s value relative to other values within a dataset
d_i = the difference between the ranks given to the two variables’ values for each item of the data
n = total number of observations
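As a minimal sketch, the formula ρ = 1 − 6Σd²/(n(n² − 1)) can be written as a small Python function. The function name `spearman_simple` and the sample ranks are made up for illustration. Note that this simplified formula is exact only when there are no tied ranks; with ties it gives an approximation.

```python
def spearman_simple(rank_x, rank_y):
    """Spearman's rho from two rank lists via 1 - 6*sum(d^2) / (n*(n^2 - 1)).

    Hypothetical helper for illustration; assumes the inputs are already ranks.
    """
    if len(rank_x) != len(rank_y):
        raise ValueError("both rank lists must have the same length")
    n = len(rank_x)
    if n < 2:
        raise ValueError("at least two observations are required")
    # Sum of squared rank differences d_i^2
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Illustrative ranks (no ties): the squared differences are 1, 1, 1, 1, 0
rho = spearman_simple([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(rho)
```

Here Σd² = 4 and n = 5, so the function returns 1 − 24/120 = 0.8.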
Compute Spearman’s Rank Correlation Stepwise
Converting the original data into ranks:
Creating ranks involves assigning a numerical order to the values in a dataset, where the smallest value gets the rank of 1, the second smallest gets the rank of 2, and so on.
Data:
X1 | Y1 |
7 | 5 |
6 | 4 |
4 | 5 |
5 | 6 |
8 | 10 |
7 | 7 |
10 | 9 |
3 | 2 |
9 | 8 |
2 | 1 |
Creating Ranks for X1:
- Sort the values of X1 in ascending order: 2, 3, 4, 5, 6, 7, 7, 8, 9, 10.
- Assign ranks based on the sorted order: 1, 2, 3, 4, 5, 6.5, 6.5, 8, 9, 10. The value 7 appears twice and occupies positions 6 and 7, so both occurrences get the average rank (6 + 7) / 2 = 6.5.
Note: When values are tied, each tied value is assigned the average of the ranks they would otherwise occupy.
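The ranking step, including the averaging of tied ranks, can be reproduced with `scipy.stats.rankdata`. This is a short sketch using the X1 values from the data table above; `method="average"` is the tie-handling rule described in the note.

```python
from scipy.stats import rankdata

# X1 values in their original order
x1 = [7, 6, 4, 5, 8, 7, 10, 3, 9, 2]

# method="average" gives tied values the mean of the ranks they occupy
ranks_x1 = rankdata(x1, method="average")
print(ranks_x1)
```

The two occurrences of 7 both receive the rank 6.5, and the ranks come back in the original order of the data rather than in sorted order.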
Doing the same for Y1 we get:
Rank X1 | Rank Y1 |
6.5 | 4.5 |
5 | 3 |
3 | 4.5 |
4 | 6 |
8 | 10 |
6.5 | 7 |
10 | 9 |
2 | 2 |
9 | 8 |
1 | 1 |
Spearman’s Correlation Calculations:
In Spearman’s rank correlation, the process involves converting the original data into ranks. This is done to assess the monotonic relationship between two variables without relying on the specific numerical values of the data points.
Let’s take 10 data points for the variables X1 and Y1 and follow these steps, ranking each variable separately:
- Arrange the values in ascending order, from the smallest to the largest.
- Assign ranks to each value based on its position in the sorted order. The smallest value gets a rank of 1, the second smallest gets a rank of 2, and so on.
- Then, for each item of the data, find the square of the difference between its rank in X1 and its rank in Y1.
X1 | 7 | 6 | 4 | 5 | 8 | 7 | 10 | 3 | 9 | 2 |
Y1 | 5 | 4 | 5 | 6 | 10 | 7 | 9 | 2 | 8 | 1 |
Rank X1 | 6.5 | 5 | 3 | 4 | 8 | 6.5 | 10 | 2 | 9 | 1 |
Rank Y1 | 4.5 | 3 | 4.5 | 6 | 10 | 7 | 9 | 2 | 8 | 1 |
d² | 4 | 4 | 2.25 | 4 | 4 | 0.25 | 1 | 0 | 1 | 0 |
Calculate d²
Once you have the ranks, compute the difference between the two ranks for each data point. For the first data point the ranks are 6.5 and 4.5, so the difference is 2 and its square is 4. For the second data point the ranks are 5 and 3, so the difference is again 2 and its square is 4. Repeating this for every data point gives the d² (d-squared) values. We then sum these values and use the sum in the above formula to compute the Spearman coefficient.
Putting the values Σd² = 20.5 and n = 10 into the formula:
$$\rho = 1 - \frac{6 \times 20.5}{10(10^2 - 1)} = 1 - \frac{123}{990} \approx 0.876$$
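The whole calculation can be checked in a few lines of Python. This is a sketch that assumes scipy is available; the variable names are illustrative. It also compares the hand formula with `scipy.stats.spearmanr`, which computes the Pearson correlation of the ranks and can therefore differ slightly from the simplified formula when the data contain ties, as they do here.

```python
from scipy.stats import rankdata, spearmanr

x1 = [7, 6, 4, 5, 8, 7, 10, 3, 9, 2]
y1 = [5, 4, 5, 6, 10, 7, 9, 2, 8, 1]

# Step 1: rank each variable separately (ties get the average rank)
rank_x = rankdata(x1)
rank_y = rankdata(y1)

# Step 2: squared rank differences and their sum
d2 = (rank_x - rank_y) ** 2
sum_d2 = d2.sum()

# Step 3: plug into rho = 1 - 6*sum(d^2) / (n*(n^2 - 1))
n = len(x1)
rho_formula = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Reference value from scipy (Pearson correlation of the ranks)
rho_scipy = spearmanr(x1, y1).correlation

print("sum of d^2:", sum_d2)
print("rho (simplified formula):", round(rho_formula, 4))
print("rho (scipy):", round(rho_scipy, 4))
```

The simplified formula gives about 0.876, while scipy reports 0.875. The small gap comes from the tied values in X1 and Y1.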
Properties Of Spearman Correlation
- rs (another common notation for ρ) takes a value between -1 (perfect negative association) and 1 (perfect positive association).
- rs = 0 means no monotonic association.
- It can be used when the association is not linear.
- It can be applied to ordinal variables.
Monotonic and Non-monotonic relationships
A monotonic relationship is a relationship between two variables in which the direction stays consistent: as one variable increases, the other either never decreases or never increases. The relationship does not have to be a straight line.
A non-monotonic relationship is a relationship between two variables in which the direction changes, for example when one variable first rises and then falls as the other increases.
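A small sketch can make the distinction concrete. The two functions below are illustrative choices: y = x³ is monotonic but not linear, while y = x² over a range symmetric around zero is non-monotonic.

```python
from scipy.stats import spearmanr

x = list(range(-5, 6))  # -5, -4, ..., 5

# Monotonic but non-linear: y always increases as x increases
y_monotonic = [v ** 3 for v in x]

# Non-monotonic: y falls until x = 0 and then rises again
y_non_monotonic = [v ** 2 for v in x]

rho_mono = spearmanr(x, y_monotonic).correlation
rho_non = spearmanr(x, y_non_monotonic).correlation

print("x^3:", round(rho_mono, 4))
print("x^2:", round(rho_non, 4))
```

The cubic relationship gives a coefficient of 1 even though it is not a straight line, while the symmetric U-shape gives a coefficient of essentially 0 even though y depends entirely on x.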
Spearman Correlation for Anscombe’s Data
Anscombe’s data, also known as Anscombe’s quartet, comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. The code below reads the four sets of 11 data points from a CSV file and plots them; the discussion that follows focuses on the first three datasets.
Python code for plotting the data
Python3
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import spearmanr

# Placeholder path: point this at the Anscombe CSV file (the original download link is not available here).
# The file is assumed to contain the columns x1, y1, ..., x4, y4.
data_url = "anscombe.csv"
anscombe_data = pd.read_csv(data_url, index_col=0)
subset_data = anscombe_data[['x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'x4', 'y4']]

fig, axs = plt.subplots(2, 2, figsize=(12, 8))

# Pair the columns as (x1, y1), (x2, y2), ... and draw one panel per dataset
for i, (x_col, y_col) in enumerate(zip(subset_data.columns[::2], subset_data.columns[1::2])):
    row = i // 2
    col = i % 2
    ax = axs[row, col]
    subset_data.plot.scatter(x=x_col, y=y_col, ax=ax, title=f'Dataset {i+1}')
    correlation = spearmanr(subset_data[x_col], subset_data[y_col]).correlation
    ax.text(0.5, 0.9, f'Spearman correlation: {correlation:.2f}',
            ha='center', va='center', transform=ax.transAxes)

    # Label the direction of the association
    if correlation > 0:
        direction = 'Positive correlation'
    elif correlation < 0:
        direction = 'Negative correlation'
    else:
        direction = 'No correlation'
    ax.text(0.5, 0.8, direction, ha='center', va='center', transform=ax.transAxes)

    # Spearman measures monotonic (not necessarily linear) association,
    # so the strength label refers to monotonicity
    if abs(correlation) > 0.7:
        strength = 'Strong monotonic relationship'
    else:
        strength = 'Weaker monotonic relationship'
    ax.text(0.5, 0.7, strength, ha='center', va='center', transform=ax.transAxes)

plt.tight_layout()
plt.show()
Output:
Applying the Spearman correlation coefficient to each of these datasets gives a clearly positive value in every case: about 0.82 for the first dataset (top left), about 0.69 for the second (top right), and about 0.99 for the third (bottom left). The key point is that a high Spearman correlation coefficient does not by itself mean that the relationship is linear. The second dataset, for example, follows a curve rather than a straight line, and it still gives a reasonably high value.
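The coefficients can also be checked without the CSV file. The sketch below types in the first three datasets of the quartet directly (they share the same x values) and prints Pearson's r next to Spearman's rho; the dictionary keys and variable names are illustrative.

```python
from scipy.stats import pearsonr, spearmanr

# First three Anscombe datasets; x is shared between them
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = {
    "Dataset 1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "Dataset 2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "Dataset 3": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
}

results = {}
for name, y in ys.items():
    r = pearsonr(x, y)[0]      # linear association
    rho = spearmanr(x, y)[0]   # monotonic association (rank-based)
    results[name] = (r, rho)
    print(f"{name}: Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Pearson's r comes out at roughly 0.82 for all three datasets, whereas Spearman's rho is roughly 0.82, 0.69 and 0.99. The third dataset is a straight line with a single outlier, which barely affects the ranks.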
Python Implementation of Spearman’s Rank Correlation
For implementing Spearman’s Rank correlation formula we will use the scipy library. It is one of the most used Python libraries for mathematical computation.
Python3
from scipy.stats import spearmanr

x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
corr, pval = spearmanr(x, y)
print("Spearman's correlation coefficient:", corr)
print("p-value:", pval)
Output:
Spearman's correlation coefficient: -0.9999999999999999
p-value: 1.4042654220543672e-24
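When the data already lives in a pandas DataFrame, the same coefficient is available through `DataFrame.corr(method="spearman")`. The sketch below reuses the X1 and Y1 values from the worked example; the column names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "X1": [7, 6, 4, 5, 8, 7, 10, 3, 9, 2],
    "Y1": [5, 4, 5, 6, 10, 7, 9, 2, 8, 1],
})

# Pairwise Spearman correlations between all columns
corr_matrix = df.corr(method="spearman")
rho = corr_matrix.loc["X1", "Y1"]

print(corr_matrix)
print("Spearman rho between X1 and Y1:", round(rho, 4))
```

The result is a symmetric matrix with 1.0 on the diagonal, which is convenient when more than two variables need to be compared at once.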
Advantages of Spearman’s Rank Correlation:
- It is simple to understand and to compute by hand.
- It is well suited to qualitative attributes that can be ranked but not measured precisely, such as intelligence or physical appearance.
- This method is suitable when the series gives only the order of preference and not the actual value of the variable.
- It is robust to outliers in the data, because an extreme value only affects its rank.
- It is designed to capture monotonic relationships between variables, that is, relationships in which one variable consistently increases or consistently decreases as the other increases.
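The robustness to outliers can be illustrated with a small made-up dataset in which one y value is far larger than the rest. The numbers below are invented purely for demonstration.

```python
from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# y follows 2*x except for the last point, which is an extreme value
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 500]

r = pearsonr(x, y)[0]
rho = spearmanr(x, y)[0]

print("Pearson r:", round(r, 3))
print("Spearman rho:", round(rho, 3))
```

The extreme value pulls Pearson's r down to roughly 0.55, while Spearman's rho stays at 1 because the largest y is still simply ranked last.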
Disadvantages of Spearman’s Rank Correlation:
- It is not applicable in the case of grouped data.
- Calculating it by hand is practical only for a limited number of observations or items, since every value has to be ranked.
- It ignores non-monotonic relationships between the variables. For example, it does not capture a U-shaped (curvilinear) association in which one variable first rises and then falls as the other increases.
- It only considers the ranks of the data points and ignores the actual magnitude of differences between the values of the variables.
- Converting the data into ranks for Spearman’s rank correlation discards the original values of the variables and replaces them with their respective ranks. This transformation may result in a loss of information in the data, especially if the variables of the data have meaningful magnitudes or units.
Difference between Spearman and Pearson Correlation
Spearman | Pearson |
Spearman correlation is used for ordinal, interval, or ratio data. | Pearson correlation is used for continuous, numerical data. |
Non-parametric and does not assume a specific distribution. | Assumes that the data follows a bivariate normal distribution. |
Detects monotonic relationships, including non-linear associations. | Assumes linear relationships. |
Less sensitive to outliers. | Sensitive to outliers. |