How to Calculate Mahalanobis Distance in R?

Last Updated : 27 Jan, 2022

In this article, we calculate the Mahalanobis distance in the R programming language.

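Mahalanobis distance measures the distance between two points or vectors in a multivariate space, taking the covariance between the variables into account. It is commonly used in statistical analyses involving several variables, for example to measure how far each observation lies from the centre of the data. To start with, we need a data frame.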

Example: Create dataframe

R




set.seed(700)
score_1 <- rnorm(20, 12, 1)
score_2 <- rnorm(20, 11, 12)
score_3 <- rnorm(20, 15, 23)
score_4 <- rnorm(20, 16, 3)

df <- data.frame(score_1, score_2, score_3, score_4)
df


Output:

    score_1     score_2    score_3   score_4
1  11.91218  20.3843568  68.179655 12.864159
2  11.77103  13.5718323 -30.953642 15.241168
3  11.91570  29.9250800  42.570528  7.179686
4  10.25905  10.7594514  17.879960 19.639647
5  13.01343  15.7463448   3.185857 12.776482
6  11.78211  14.9688992  31.368892 16.043620
7  13.51328  10.5017826  58.985715 14.701817
8  11.10565  20.4965614   6.806652 15.876947
9  11.20834  12.7588547  10.461229 16.991393
10 11.10233 -10.3961351  18.082209 15.258644
11 12.34732  -0.8615359  57.411750 13.400421
12 12.08361  15.0248600 -17.853098 13.999682
13 12.86457  -6.1221908  23.184838 20.389762
14 10.58871  17.1000715  20.900155 12.560962
15 10.74134   6.3728076  39.173259 17.865589
16 11.20248   8.8909128  24.696939 14.384012
17 12.89797  34.8522136  10.035498 14.975053
18 11.37993  14.4232355  28.129197 16.395271
19 11.78309  14.9324201  23.584362 14.765245
20 12.77480  30.7969171  -9.635902 10.203178

The mahalanobis() function is used to calculate the Mahalanobis distance in R. It is a built-in function from the stats package, and it returns the squared Mahalanobis distance.

Syntax: mahalanobis(x, center, cov)

where:

  • x: matrix or vector of data
  • center: mean vector of the distribution
  • cov: covariance matrix of the distribution
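To make the three arguments concrete, here is a minimal sketch for a single point. The names point, centre, identity_cov and stretched_cov are made up for this illustration and are not part of the article's dataset. With an identity covariance matrix the result is the squared Euclidean distance; changing the variance of one variable changes how much that variable contributes.

```r
# Hypothetical two-dimensional example (not the article's data)
point  <- c(3, 4)
centre <- c(0, 0)

# Identity covariance: squared Mahalanobis distance equals squared Euclidean distance
identity_cov <- diag(2)
d2_identity  <- mahalanobis(point, centre, identity_cov)  # 3^2 + 4^2 = 25
sqrt(d2_identity)                                          # 5

# Give the first variable a variance of 9: its contribution shrinks to 3^2 / 9 = 1
stretched_cov <- diag(c(9, 1))
d2_stretched  <- mahalanobis(point, centre, stretched_cov) # 1 + 16 = 17
```

Taking the square root, as in the sqrt() call, converts the returned value to the distance itself.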

Example: Calculate Mahalanobis distance

R




mahalanobis(df, colMeans(df), cov(df))


Output:

4.46866714558536 4.61260586529474 7.41513071619846 5.21448589688871
2.84292222223026 0.673116763926688 6.04984394951585 1.72865361097932
1.03750690527476 7.21856549018804 4.85579110162481 2.90808365141091
7.57223884458172 3.27702692226183 2.68208130355785 0.916110244005359
6.79796970070888 0.829693729587342 0.0356208551487593 4.86388508103035
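To see what mahalanobis() computes, the sketch below evaluates the quadratic form (x - mu)' Sigma^-1 (x - mu) by hand for every row and compares it with the built-in result. The data frame is rebuilt under the name scores so the block runs on its own; scores, manual_d2 and builtin_d2 are names chosen for this illustration.

```r
# Rebuild the example data so this block is self-contained
set.seed(700)
scores <- data.frame(
  score_1 = rnorm(20, 12, 1),
  score_2 = rnorm(20, 11, 12),
  score_3 = rnorm(20, 15, 23),
  score_4 = rnorm(20, 16, 3)
)

x     <- as.matrix(scores)
mu    <- colMeans(x)   # mean vector
sigma <- cov(x)        # sample covariance matrix

# Subtract the column means from each row, then evaluate the quadratic form row by row
centred   <- sweep(x, 2, mu)
manual_d2 <- rowSums((centred %*% solve(sigma)) * centred)

# Built-in result for comparison
builtin_d2 <- mahalanobis(x, mu, sigma)

# The two should agree up to floating-point tolerance
all.equal(unname(manual_d2), unname(builtin_d2))

# Square root gives the distances on the original (unsquared) scale
sqrt(builtin_d2)
```

Because the mean vector and covariance matrix are estimated from the same 20 rows, the squared distances sum to (n - 1) * k = 19 * 4 = 76, which is a quick sanity check on the output.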

Calculate the Mahalanobis distance for each row

Some of the distances are much larger than others. To determine whether a large distance is statistically significant, we need to calculate a p-value for each one. First, we store the distances in a new column of the data frame.

Example: Calculate Mahalanobis distance for each row

R




# create a new column holding the (squared) Mahalanobis distance of each row
df$mahalnobis <- mahalanobis(df, colMeans(df), cov(df))
df


Output:

    score_1     score_2    score_3   score_4 mahalnobis
1  11.91218  20.3843568  68.179655 12.864159 4.46866715
2  11.77103  13.5718323 -30.953642 15.241168 4.61260587
3  11.91570  29.9250800  42.570528  7.179686 7.41513072
4  10.25905  10.7594514  17.879960 19.639647 5.21448590
5  13.01343  15.7463448   3.185857 12.776482 2.84292222
6  11.78211  14.9688992  31.368892 16.043620 0.67311676
7  13.51328  10.5017826  58.985715 14.701817 6.04984395
8  11.10565  20.4965614   6.806652 15.876947 1.72865361
9  11.20834  12.7588547  10.461229 16.991393 1.03750691
10 11.10233 -10.3961351  18.082209 15.258644 7.21856549
11 12.34732  -0.8615359  57.411750 13.400421 4.85579110
12 12.08361  15.0248600 -17.853098 13.999682 2.90808365
13 12.86457  -6.1221908  23.184838 20.389762 7.57223884
14 10.58871  17.1000715  20.900155 12.560962 3.27702692
15 10.74134   6.3728076  39.173259 17.865589 2.68208130
16 11.20248   8.8909128  24.696939 14.384012 0.91611024
17 12.89797  34.8522136  10.035498 14.975053 6.79796970
18 11.37993  14.4232355  28.129197 16.395271 0.82969373
19 11.78309  14.9324201  23.584362 14.765245 0.03562086
20 12.77480  30.7969171  -9.635902 10.203178 4.86388508

Calculate the p-value

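Under the assumption of multivariate normality, the squared Mahalanobis distance approximately follows a chi-square distribution with k degrees of freedom, where k is the number of variables (here k = 4). The p-value for each distance is the upper-tail probability of that distribution at the observed distance.

The pchisq() function computes the cumulative distribution function of the chi-square distribution. Setting lower.tail = FALSE returns the upper-tail probability instead.

Syntax: pchisq(q, df, lower.tail = TRUE)

Parameters:

  • q: vector of quantiles (here, the squared Mahalanobis distances)
  • df: degrees of freedom
  • lower.tail: if TRUE (the default), returns P(X ≤ q); if FALSE, returns P(X > q)

Example: Calculate p-value

R




# create new column for p-value (upper-tail chi-square probability, k = 4 variables)
df$pvalue <- pchisq(df$mahalnobis, df = 4, lower.tail = FALSE)
df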


Output:

As a common rule of thumb, an observation whose p-value is less than 0.001 is considered an outlier. In this case, all the p-values are greater than 0.001, so none of the observations is flagged as an outlier.
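The 0.001 rule can be applied programmatically. The sketch below is one possible way to do it, not a definitive recipe: it assumes the degrees of freedom equal the number of variables and that the p-value is the upper-tail probability. The names scores, d2, alpha, cutoff and outlier_rows are chosen for this illustration, and the data frame is rebuilt so the block runs on its own.

```r
# Rebuild the example data so this block is self-contained
set.seed(700)
scores <- data.frame(
  score_1 = rnorm(20, 12, 1),
  score_2 = rnorm(20, 11, 12),
  score_3 = rnorm(20, 15, 23),
  score_4 = rnorm(20, 16, 3)
)

k  <- ncol(scores)  # number of variables (assumed degrees of freedom)
d2 <- mahalanobis(scores, colMeans(scores), cov(scores))

# Upper-tail chi-square probability for each squared distance
p_value <- pchisq(d2, df = k, lower.tail = FALSE)

# Rows whose p-value falls below the chosen threshold
alpha        <- 0.001
outlier_rows <- which(p_value < alpha)

# Equivalent check on the distance scale: compare against the chi-square quantile
cutoff <- qchisq(1 - alpha, df = k)
which(d2 > cutoff)

outlier_rows  # row indices flagged as outliers (may be empty)
```

Comparing d2 against cutoff avoids computing a p-value per row and makes the threshold easy to report alongside the distances.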


