How To Calculate Mahalanobis Distance in Python

Last Updated : 21 Feb, 2022

Mahalanobis distance measures how far a point lies from a distribution (or from another point) in multivariate space, taking into account the variance of each variable and the correlations between variables. It is commonly used in multivariate statistical analysis, for example to detect outliers.
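For reference, the standard definition can be written as below. The symbols are introduced here and do not appear in the rest of the article: $x$ is an observation, $\mu$ is the mean vector and $\Sigma$ is the covariance matrix of the distribution.

```latex
D_M(x) = \sqrt{(x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)}
```

The calculateMahalanobis function used later in this article returns the squared value, $D_M(x)^2$, which is the quantity usually compared against a Chi-Square distribution.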

The user needs to install and import the following libraries for calculating Mahalanobis Distance in Python:

numpy
pandas
scipy

Command to install all the above packages:

pip3 install numpy pandas scipy

Step 1: The first step is to import all the libraries installed above.

Python3

# Importing libraries 
  
import numpy as np 
import pandas as pd  
from scipy import stats

Step 2: Creating a dataset. Consider data on 10 cars of different brands. The data has five columns:

Price
Distance
Emission generated
Performance
Mileage

Python3

# data  
data = { 'Price': [100000, 800000, 650000, 700000, 
                   860000, 730000, 400000, 870000, 
                   780000, 400000], 
         'Distance': [16000, 60000, 300000, 10000, 
                      252000, 350000, 260000, 510000, 
                      2000, 5000], 
         'Emission': [300, 400, 1230, 300, 400, 104, 
                      632, 221, 142, 267], 
         'Performance': [60, 88, 90, 87, 83, 81, 72,  
                         91, 90, 93], 
         'Mileage': [76, 89, 89, 57, 79, 84, 78, 99,  
                     97, 99] 
           } 
  
# Creating dataset 
df = pd.DataFrame(data,columns=['Price', 'Distance', 
                                'Emission','Performance', 
                                'Mileage']) 

Step 3: Determining the Mahalanobis distance for each observation.

Python3

# Importing libraries

import numpy as np
import pandas as pd
from scipy import stats

# calculateMahalanobis function to calculate
# the squared Mahalanobis distance of each row of y
def calculateMahalanobis(y=None, data=None, cov=None):

    # centre each row on the column means of the reference data
    y_mu = (y - data.mean()).to_numpy()
    # estimate the covariance matrix unless one is supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)
    # the diagonal holds one squared distance per row
    return mahal.diagonal()

# create new column in dataframe that contains
# Mahalanobis distance for each row
cols = ['Price', 'Distance', 'Emission', 'Performance', 'Mileage']
df['calculateMahalanobis'] = calculateMahalanobis(y=df[cols], data=df[cols])
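As a cross-check on the step above, the following is a minimal sketch, not the article's own method, that computes the same squared distances without building the full n x n matrix and compares them with scipy.spatial.distance.mahalanobis. The helper name squared_mahalanobis and the random sample are assumptions made only for this illustration.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis


def squared_mahalanobis(points, reference):
    """Squared Mahalanobis distance of each row of `points` from the
    mean of `reference` (both 2-D arrays, one observation per row)."""
    diff = points - reference.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(reference, rowvar=False))
    # row-wise quadratic form diff_i @ inv_cov @ diff_i,
    # computed without materialising an n x n matrix
    return np.einsum('ij,jk,ik->i', diff, inv_cov, diff)


# small made-up sample, for illustration only (not the car data)
rng = np.random.default_rng(0)
sample = rng.normal(size=(20, 3))

d2 = squared_mahalanobis(sample, sample)

# scipy returns the unsquared distance between two vectors,
# given the inverse covariance matrix
inv_cov = np.linalg.inv(np.cov(sample, rowvar=False))
mean = sample.mean(axis=0)
d_scipy = np.array([mahalanobis(row, mean, inv_cov) for row in sample])

# the two approaches should agree up to floating-point tolerance
print(np.allclose(d2, d_scipy ** 2))
```

The einsum form only matters for larger datasets; for the 10 rows used in this article, taking the diagonal of the full matrix product is just as practical.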

Combining all steps:

Example:

Python3

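# Importing libraries

import numpy as np
import pandas as pd
from scipy import stats

# calculateMahalanobis function to calculate
# the squared Mahalanobis distance of each row of y
def calculateMahalanobis(y=None, data=None, cov=None):

    # centre each row on the column means of the reference data
    y_mu = (y - data.mean()).to_numpy()
    # estimate the covariance matrix unless one is supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)
    # the diagonal holds one squared distance per row
    return mahal.diagonal()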
  
# data 
data = { 'Price': [100000, 800000, 650000, 700000,  
                   860000, 730000, 400000, 870000, 
                   780000, 400000], 
         'Distance': [16000, 60000, 300000, 10000,  
                      252000, 350000, 260000, 510000,  
                      2000, 5000], 
         'Emission': [300, 400, 1230, 300, 400, 104, 
                      632, 221, 142, 267], 
         'Performance': [60, 88, 90, 87, 83, 81, 72,  
                         91, 90, 93], 
         'Mileage': [76, 89, 89, 57, 79, 84, 78, 99,  
                     97, 99] 
           } 
  
# Creating dataset 
df = pd.DataFrame(data,columns=['Price', 'Distance', 
                                'Emission','Performance',  
                                'Mileage']) 
  
# Creating a new column in the dataframe that holds 
# the Mahalanobis distance for each row 
df['calculateMahalanobis'] = calculateMahalanobis(y=df, data=df[[ 
  'Price', 'Distance', 'Emission','Performance', 'Mileage']]) 
  
# Display the dataframe 
print(df) 

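Output: the dataframe of 10 rows with a new calculateMahalanobis column holding the squared distance for each row.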

Computing the p-value for every Mahalanobis distance

Now let us compute the p-value for the Mahalanobis distance of each observation in the dataset. As you can see from the above output, some of the Mahalanobis distances are noticeably larger than others. To decide whether any of the distances is statistically significant, we need to find its p-value. Assuming the data are approximately multivariate normal, the squared Mahalanobis distance (which is what calculateMahalanobis returns) approximately follows a Chi-Square distribution with degrees of freedom equal to k, where k = number of variables. So, in this case, we'll use 5 degrees of freedom.
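The p-value computation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper name chi2_p_value and the variables dof and alpha are hypothetical, and the degrees of freedom are left as a parameter, with 5 used here purely as an example value.

```python
from scipy.stats import chi2


def chi2_p_value(d2, dof):
    """Upper-tail Chi-Square probability of a squared distance d2."""
    # sf is the survival function, i.e. 1 - cdf; it is usually more
    # accurate than writing 1 - chi2.cdf(...) far out in the tail
    return chi2.sf(d2, dof)


dof = 5        # illustrative value only
alpha = 0.001  # illustrative significance cut-off

# squared distance whose p-value equals alpha
critical = chi2.ppf(1 - alpha, dof)

print(critical, chi2_p_value(critical, dof))
```

A squared distance above the critical value corresponds to a p-value below the chosen cut-off, so either quantity can be used to screen observations.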

Example:

Python3

# Importing libraries

import numpy as np
import pandas as pd
from scipy.stats import chi2

# calculateMahalanobis function to calculate
# the squared Mahalanobis distance of each row of y
def calculateMahalanobis(y=None, data=None, cov=None):

    # centre each row on the column means of the reference data
    y_mu = (y - data.mean()).to_numpy()
    # estimate the covariance matrix unless one is supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)
    # the diagonal holds one squared distance per row
    return mahal.diagonal()
  
# data 
data = { 'Price': [100000, 800000, 650000, 700000, 
                   860000, 730000, 400000, 870000, 
                   780000, 400000], 
         'Distance': [16000, 60000, 300000, 10000,  
                      252000, 350000, 260000, 510000, 
                      2000, 5000], 
         'Emission': [300, 400, 1230, 300, 400, 104, 
                      632, 221, 142, 267], 
         'Performance': [60, 88, 90, 87, 83, 81, 72, 
                         91, 90, 93], 
         'Mileage': [76, 89, 89, 57, 79, 84, 78, 99, 
                     97, 99] 
           } 
  
# Creating dataset 
df = pd.DataFrame(data,columns=['Price', 'Distance', 
                                'Emission','Performance', 
                                'Mileage']) 
  
# Creating a new column in the dataframe that holds 
# the Mahalanobis distance for each row 
df['Mahalanobis'] = calculateMahalanobis(y=df, data=df[[ 
  'Price', 'Distance', 'Emission','Performance', 'Mileage']]) 
  
# calculate p-value for each Mahalanobis distance
# (degrees of freedom = number of variables = 5)
df['p'] = 1 - chi2.cdf(df['Mahalanobis'], 5)

# display the dataframe
print(df)

Output: the dataframe of 10 rows with the added Mahalanobis and p columns.

Interpretation:

Generally, an observation with a p-value less than 0.001 is considered an outlier. In this example, there are no outliers, as all the p-values are greater than 0.001.
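The rule above can be applied programmatically. The sketch below uses made-up p-values (not the output of this article's example) and hypothetical names result, threshold and outlier, purely to show the comparison.

```python
import pandas as pd

# hypothetical p-values, made up for illustration only
result = pd.DataFrame({'p': [0.52, 0.0004, 0.13, 0.0009, 0.87]})

# cut-off taken from the rule of thumb stated above
threshold = 0.001

# flag every row whose p-value falls below the cut-off
result['outlier'] = result['p'] < threshold

# show only the flagged rows
print(result[result['outlier']])
```

With the car dataset from this article, the same comparison on the p column would flag no rows.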
