Spearman’s Rank Correlation


Correlation measures the strength of the association between two variables. For instance, if we are interested in knowing whether there is a relationship between the heights of fathers and sons, a correlation coefficient can be calculated to answer this question.

Methods for correlation analysis: There are mainly two types of correlation:

  • Parametric Correlation: Pearson correlation (r). It measures the linear dependence between two variables (x and y) and is known as a parametric correlation test because it depends on the distribution of the data. It is used for numerical data.
  • Non-Parametric Correlation: Kendall (tau) and Spearman (rho). They are rank-based correlation coefficients and are known as non-parametric correlations. They are used for ordinal data, or for numerical data that has been converted to ranks.

What is Spearman’s Correlation

Spearman’s Rank Correlation is a statistical measure of the strength and direction of the monotonic relationship between two variables, which may be continuous or ordinal. It is computed on ranks rather than raw values: the observations of each variable are first put in order and replaced by their ranks. It is denoted by the symbol “rho” (ρ) and takes values from -1 to +1. A positive value of rho indicates a positive (increasing) relationship between the two variables, while a negative value of rho indicates a negative (decreasing) relationship. A rho value of 0 indicates no monotonic association between the two variables.

Spearman’s Correlation formula

\rho = 1 - \frac{6\sum d_{i}^{2}}{n(n^2-1)}

where, 

\rho = Spearman Correlation coefficient 

rank = the position or order of a variable’s value relative to other values within a dataset

d_i = the difference between the ranks of the two variables for the i-th observation

n = total number of observations
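
The shortcut formula above is exact only when all ranks are distinct. When tied ranks occur, the coefficient is more generally defined as the Pearson correlation computed on the ranks. As a sketch, writing R(x_i) and R(y_i) for the ranks of the i-th observation and \bar{R}_x, \bar{R}_y for their means (notation introduced here for illustration, not taken from the article):

```latex
\rho = \frac{\sum_{i=1}^{n}\left(R(x_i)-\bar{R}_x\right)\left(R(y_i)-\bar{R}_y\right)}{\sqrt{\sum_{i=1}^{n}\left(R(x_i)-\bar{R}_x\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(R(y_i)-\bar{R}_y\right)^2}}
```

With no ties this expression reduces to the shortcut formula; with a few ties the two usually differ only slightly.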

Compute Spearman’s Rank Correlation Stepwise

Converting the original data into ranks:

Creating ranks involves assigning a numerical order to the values in a dataset, where the smallest value gets the rank of 1, the second smallest gets the rank of 2, and so on.

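Data:

| Number | X1 | Y1 |
|---|---|---|
| 1 | 7 | 5 |
| 2 | 6 | 4 |
| 3 | 4 | 5 |
| 4 | 5 | 6 |
| 5 | 8 | 10 |
| 6 | 7 | 7 |
| 7 | 10 | 9 |
| 8 | 3 | 2 |
| 9 | 9 | 8 |
| 10 | 2 | 1 |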

Creating Ranks for X1:

  1. Sort the values of X1 in ascending order: 2, 3, 4, 5, 6, 7, 7, 8, 9, 10.
  2. Assign ranks based on the sorted order: 1, 2, 3, 4, 5, 6.5, 6.5, 8, 9, 10. The value 7 occurs twice and occupies positions 6 and 7, so both occurrences receive the average rank (6.5).

Note: If values are tied, each of them receives the average of the ranks they would otherwise occupy.
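
The ranking step can be sketched with SciPy's `rankdata` function, which averages tied positions when `method="average"` is used. The variable names below are illustrative; the values are the X1 column from the data above.

```python
from scipy.stats import rankdata

# X1 values in their original (unsorted) order
x1 = [7, 6, 4, 5, 8, 7, 10, 3, 9, 2]

# method="average" gives tied values the mean of the positions they occupy
ranks_x1 = rankdata(x1, method="average")

print(ranks_x1.tolist())
```

The two 7s both come out with rank 6.5, matching the manual ranking.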

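Doing the same for Y1 we get:

| Number | Rank X1 | Rank Y1 |
|---|---|---|
| 1 | 6.5 | 4.5 |
| 2 | 5 | 3 |
| 3 | 3 | 4.5 |
| 4 | 4 | 6 |
| 5 | 8 | 10 |
| 6 | 6.5 | 7 |
| 7 | 10 | 9 |
| 8 | 2 | 2 |
| 9 | 9 | 8 |
| 10 | 1 | 1 |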

Spearman’s Correlation Calculations

In Spearman’s rank correlation, the process involves converting the original data into ranks. This is done to assess the monotonic relationship between two variables without relying on the specific numerical values of the data points.

Let’s consider taking 10 different data points in variables X1 and Y1. Then follow the steps:

  • Arrange the values in ascending order, from the smallest to the largest.
  • Assign ranks to each value based on its position in the sorted order. The smallest value gets a rank of 1, the second smallest gets a rank of 2, and so on.
  • Then find out the square of the difference in the ranks given to the two variables values for each item of the data.
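
| Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 7 | 6 | 4 | 5 | 8 | 7 | 10 | 3 | 9 | 2 |
| Y1 | 5 | 4 | 5 | 6 | 10 | 7 | 9 | 2 | 8 | 1 |
| Rank X1 | 6.5 | 5 | 3 | 4 | 8 | 6.5 | 10 | 2 | 9 | 1 |
| Rank Y1 | 4.5 | 3 | 4.5 | 6 | 10 | 7 | 9 | 2 | 8 | 1 |
| d² | 4 | 4 | 2.25 | 4 | 4 | 0.25 | 1 | 0 | 1 | 0 |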

Calculate d²

Once the ranks are available, compute the difference between the two ranks for each data point and square it. For the first data point the ranks are 6.5 and 4.5, so the difference is 2 and its square is 4. For the second data point the ranks are 5 and 3, so the difference is again 2 and its square is 4. Repeating this for every data point gives the d² values. We then sum these values and substitute the sum into the formula above to obtain the Spearman coefficient.

Substituting the sum of the d² values and n = 10:

\begin{aligned}  \rho &= 1 - \frac{6\sum d_{i}^{2}}{n(n^2-1)} \\&=1-\frac{6(4+4+2.25+4+4+0.25+1+0+1+0)}{10(10^2-1)} \\&=1-\frac{6\times20.5}{990} \\&=1-\frac{123}{990} \\&=1-0.12424242424242424 \\&=0.8757575757575757 \\&\approx 0.88 \end{aligned}
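
The hand calculation can be checked with a short script. This is a minimal sketch using SciPy; the variable names are illustrative. It also calls `spearmanr`, which computes the Pearson correlation of the ranks and therefore handles the tied values slightly differently from the shortcut formula.

```python
from scipy.stats import rankdata, spearmanr

x1 = [7, 6, 4, 5, 8, 7, 10, 3, 9, 2]
y1 = [5, 4, 5, 6, 10, 7, 9, 2, 8, 1]

# Rank both variables, averaging tied positions
rank_x = rankdata(x1)
rank_y = rankdata(y1)

# Squared rank differences and the shortcut formula
d_squared = (rank_x - rank_y) ** 2
n = len(x1)
rho_shortcut = 1 - 6 * d_squared.sum() / (n * (n**2 - 1))

# SciPy's tie-aware result (Pearson correlation of the ranks)
rho_scipy, _ = spearmanr(x1, y1)

print("sum of d^2:      ", d_squared.sum())
print("shortcut formula:", round(rho_shortcut, 4))
print("scipy spearmanr: ", round(rho_scipy, 4))
```

The sum of squared differences is 20.5 and the shortcut formula gives about 0.8758, as in the derivation above. SciPy reports 0.875; the small gap comes from the tied ranks in X1 and Y1.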

Properties Of Spearman Correlation

  • ρ (also written r_s) takes a value between -1 (perfect negative association) and 1 (perfect positive association).
  • ρ = 0 means no monotonic association.
  • It can be used when the association is not linear.
  • It can be applied to ordinal variables.

Monotonic and Non-monotonic relationships

A monotonic relationship is a mathematical relationship between two variables where the direction of the relationship (increase or decrease) remains consistent.

A non-monotonic relationship is a mathematical relationship between two variables where the direction of the relationship is not consistently increasing or decreasing.
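
To make the distinction concrete, here is a small sketch with synthetic data (the arrays are made up for illustration): a cubic curve is monotonic but not linear, while a parabola centred on zero is non-monotonic.

```python
import numpy as np
from scipy.stats import spearmanr

x = np.arange(-5, 6)        # -5, -4, ..., 5

y_monotonic = x ** 3        # always increases with x, but not linearly
y_non_monotonic = x ** 2    # decreases first, then increases (U-shape)

rho_mono, _ = spearmanr(x, y_monotonic)
rho_non_mono, _ = spearmanr(x, y_non_monotonic)

print("monotonic (x^3):    ", round(rho_mono, 3))
print("non-monotonic (x^2):", round(rho_non_mono, 3))
```

The cubic gives a coefficient of 1 because the rank order is perfectly preserved. The symmetric parabola gives a coefficient of essentially 0, even though y is completely determined by x.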

Spearman Correlation for Anscombe’s Data

Anscombe’s data, also known as Anscombe’s quartet, comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. The code below assumes that the four sets of 11 data points are available as a CSV file, and plots each set together with its Spearman coefficient.

Python code for plotting the data  

Python3

import pandas as pd
from scipy.stats import spearmanr
import matplotlib.pyplot as plt
 
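# Load Anscombe's data from CSV.
# NOTE: placeholder path (the original download link is not available); point it at
# your own copy, which is assumed to contain the columns x1, y1, ..., x4, y4.
data_url = 'anscombe.csv'
anscombe_data = pd.read_csv(data_url, index_col=0)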
 
# Selecting four sets of 11 data points
subset_data = anscombe_data[['x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'x4', 'y4']]
 
# Create subplots
fig, axs = plt.subplots(2, 2, figsize=(12, 8))
 
# Plot scatter plots and display Spearman correlation
for i, (x_col, y_col) in enumerate(zip(subset_data.columns[::2], subset_data.columns[1::2])):
    row = i // 2
    col = i % 2
 
    # Scatter plot
    subset_data.plot.scatter(x=x_col, y=y_col, ax=axs[row, col], title=f'Dataset {i+1}')
 
    # Calculate and display Spearman correlation
    correlation = spearmanr(subset_data[x_col], subset_data[y_col]).correlation
    axs[row, col].text(0.5, 0.9, f'Spearman correlation: {correlation:.2f}',
                       ha='center', va='center', transform=axs[row, col].transAxes)
 
    # Display whether the correlation is positive or negative
    if correlation > 0:
        axs[row, col].text(0.5, 0.8, 'Positive correlation', ha='center', va='center', transform=axs[row, col].transAxes)
    elif correlation < 0:
        axs[row, col].text(0.5, 0.8, 'Negative correlation', ha='center', va='center', transform=axs[row, col].transAxes)
    else:
        axs[row, col].text(0.5, 0.8, 'No correlation', ha='center', va='center', transform=axs[row, col].transAxes)
 
    # Label the strength of the monotonic association (0.7 is an illustrative threshold).
    # A high Spearman coefficient does not by itself imply a linear relationship.
    if abs(correlation) > 0.7:
        axs[row, col].text(0.5, 0.7, 'Strong monotonic relationship', ha='center', va='center', transform=axs[row, col].transAxes)
    else:
        axs[row, col].text(0.5, 0.7, 'Weaker monotonic relationship', ha='center', va='center', transform=axs[row, col].transAxes)
 
# Adjust layout for better spacing
plt.tight_layout()
 
# Show the plots
plt.show()

                    

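Output:

[Figure: 2×2 grid of scatter plots titled Dataset 1 to Dataset 4, each annotated with its Spearman correlation]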

Although the four datasets share nearly identical summary statistics, including the Pearson correlation, their Spearman coefficients are not guaranteed to match, because the rank order of the points differs from one set to another. In the first dataset (top left) the Spearman coefficient is reasonably high, and the scatter plot does look roughly linear. The key point, however, is that a high Spearman coefficient alone does not let us conclude that the relationship is linear: the second dataset (top right) follows a clearly non-linear curve and still gives a reasonably high value.
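
As a quick numerical check that does not need the CSV file, the sketch below types in the second dataset using the commonly published values of the quartet. These values are an assumption here; verify them against your own copy of the data. It compares the Pearson and Spearman coefficients.

```python
from scipy.stats import pearsonr, spearmanr

# Anscombe's second dataset (the curved one), commonly published values
x2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

pearson_r, _ = pearsonr(x2, y2)
spearman_rho, _ = spearmanr(x2, y2)

print("Pearson r:   ", round(pearson_r, 3))
print("Spearman rho:", round(spearman_rho, 3))
```

With these values Pearson comes out at about 0.816 and Spearman at about 0.691. Both are fairly high even though the points lie on a curve, so neither number is a substitute for looking at the plot.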

Python Implementation of Spearman’s Rank Correlation

To compute Spearman’s rank correlation in Python we will use the spearmanr function from the scipy library, one of the most widely used Python libraries for scientific computation.

Python3

from scipy.stats import spearmanr
 
# sample data
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
 
# calculate Spearman's correlation coefficient and p-value
corr, pval = spearmanr(x, y)
 
# print the result
print("Spearman's correlation coefficient:", corr)
print("p-value:", pval)

                    

Output:

Spearman's correlation coefficient: -0.9999999999999999
p-value: 1.4042654220543672e-24 
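
If the data already lives in a pandas DataFrame, the same coefficient can be obtained through the `corr` method with `method="spearman"`. This is a sketch using the same sample data; the column names are illustrative. Unlike `spearmanr`, it returns no p-value, and depending on the pandas version the Series variant may require SciPy to be installed.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1]})

# Coefficient between two columns
rho = df["x"].corr(df["y"], method="spearman")

# Full pairwise matrix for all numeric columns
rho_matrix = df.corr(method="spearman")

print(rho)
print(rho_matrix)
```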

Advantages of Spearman’s Rank Correlation:

  • The method is simple to understand and compute.
  • It is well suited to qualitative attributes that can be ranked but not measured precisely, such as intelligence or physical appearance.
  • It is suitable when the series gives only the order of preference and not the actual values of the variable.
  • It is robust to outliers in the data, because an extreme value affects only its rank.
  • It is designed to capture monotonic relationships, in which one variable consistently increases (or consistently decreases) as the other increases, whether or not the relationship is linear.

Disadvantages of Spearman’s Rank Correlation:

  • It is not applicable in the case of grouped data.
  • Ranking by hand becomes laborious when the number of observations is large.
  • It ignores non-monotonic relationships between the variables; for example, it does not capture U-shaped (curvilinear) associations in which the direction of the relationship changes.
  • It only considers the ranks of the data points and ignores the actual magnitude of differences between the values of the variables.
  • Converting the data into ranks for Spearman’s rank correlation discards the original values of the variables and replaces them with their respective ranks. This transformation may result in a loss of information in the data, especially if the variables of the data have meaningful magnitudes or units.

Difference between Spearman and Pearson Correlation

| Spearman | Pearson |
|---|---|
| Spearman correlation is used for ordinal, interval, or ratio data. | Pearson correlation is used for continuous, numerical data. |
| Non-parametric and does not assume a specific distribution. | Assumes that the data follows a bivariate normal distribution. |
| Detects monotonic relationships, including non-linear associations. | Measures linear relationships only. |
| Less sensitive to outliers. | Sensitive to outliers. |
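
The difference in outlier sensitivity can be illustrated with a small synthetic example (the data is made up for this sketch). One value of a perfectly linear series is replaced by an extreme value that is still the largest, so the rank order is unchanged.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11)
y = x.astype(float)     # perfectly linear: y = x
y[-1] = 1000.0          # extreme outlier; still the largest value

pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)

print("Pearson r:   ", round(pearson_r, 3))
print("Spearman rho:", round(spearman_rho, 3))
```

Spearman stays at 1 because the ranks are untouched, while Pearson falls well below 1 because the outlier dominates the variance of y. An outlier that also changes the rank order would affect Spearman too, though typically less than Pearson.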





Last Updated : 23 Jan, 2024