How To Calculate Mahalanobis Distance in Python

Mahalanobis distance measures how far a point lies from a distribution in multivariate space, taking into account the variance of each variable and the correlations between variables. It is commonly used in statistical analyses that involve several variables, for example to detect multivariate outliers.
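
For reference, the usual textbook form of the distance is sketched below. The symbols are standard notation rather than names from this article: x is an observation, \mu is the mean vector of the distribution, and \Sigma is its covariance matrix.

```latex
D_M(x) = \sqrt{(x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)}
```

The chi-square comparison used later for p-values applies to the squared quantity D_M(x)^2.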

The user needs to install and import the following libraries for calculating Mahalanobis Distance in Python:

  • numpy
  • pandas
  • scipy

Command to install all the above packages:

pip3 install numpy pandas scipy

Step 1: The first step is to import all the libraries installed above.

Python3




# Importing libraries

import numpy as np
import pandas as pd
from scipy import stats


Step 2: Creating a dataset. Consider a dataset of 10 cars of different brands. It has five columns:

  • Price
  • Distance
  • Emission generated
  • Performance
  • Mileage

Python3




# data 
data = { 'Price': [100000, 800000, 650000, 700000,
                   860000, 730000, 400000, 870000,
                   780000, 400000],
         'Distance': [16000, 60000, 300000, 10000,
                      252000, 350000, 260000, 510000,
                      2000, 5000],
         'Emission': [300, 400, 1230, 300, 400, 104,
                      632, 221, 142, 267],
         'Performance': [60, 88, 90, 87, 83, 81, 72,
                         91, 90, 93],
         'Mileage': [76, 89, 89, 57, 79, 84, 78, 99,
                     97, 99]
           }
  
# Creating dataset
df = pd.DataFrame(data,columns=['Price', 'Distance',
                                'Emission','Performance',
                                'Mileage'])
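
Before computing any distances, it can help to look at the covariance matrix of this dataset, since the distance depends on it. The sketch below rebuilds the same data and shows three equivalent ways to obtain the matrix; the variable names cov_a, cov_b and cov_c are only illustrative.

```python
import numpy as np
import pandas as pd

data = {'Price': [100000, 800000, 650000, 700000, 860000,
                  730000, 400000, 870000, 780000, 400000],
        'Distance': [16000, 60000, 300000, 10000, 252000,
                     350000, 260000, 510000, 2000, 5000],
        'Emission': [300, 400, 1230, 300, 400, 104, 632, 221, 142, 267],
        'Performance': [60, 88, 90, 87, 83, 81, 72, 91, 90, 93],
        'Mileage': [76, 89, 89, 57, 79, 84, 78, 99, 97, 99]}
df = pd.DataFrame(data)

# np.cov treats each ROW as a variable by default, so the (10, 5)
# array is transposed to put the five variables in rows.
cov_a = np.cov(df.values.T)
# rowvar=False tells np.cov that the variables are in columns instead.
cov_b = np.cov(df.values, rowvar=False)
# DataFrame.cov() also uses the n-1 denominator.
cov_c = df.cov().values

print(cov_a.shape)  # (5, 5)
print(np.allclose(cov_a, cov_b), np.allclose(cov_b, cov_c))
```

All three calls should agree, which is why the transpose matters when passing a DataFrame's values to np.cov.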


Step 3: Determining the Mahalanobis distance for each observation.

Python3




# Importing libraries

import numpy as np
import pandas as pd
from scipy import stats

# calculateMahalanobis function to calculate the (squared)
# Mahalanobis distance of each row of y from the mean of data
def calculateMahalanobis(y=None, data=None, cov=None):

    # Center each row on the column means of the reference data
    y_mu = y - data.mean()
    # Estimate the covariance matrix unless one is supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)
    # The diagonal holds the squared distance of each row
    return mahal.diagonal()

# create new column in dataframe that contains
# Mahalanobis distance for each row
cols = ['Price', 'Distance', 'Emission', 'Performance', 'Mileage']
df['calculateMahalanobis'] = calculateMahalanobis(y=df[cols], data=df[cols])
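
The function above builds a full n x n matrix and keeps only its diagonal. As a rough alternative, the same row-wise quantity can be computed directly with np.einsum and checked against scipy.spatial.distance.mahalanobis, which returns the unsquared distance. The helper name squared_mahalanobis and the small sample X are made up for this illustration.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

def squared_mahalanobis(X):
    """Squared Mahalanobis distance of each row of X from the column means."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # Row-wise quadratic form: sum over j, k of c[i, j] * inv_cov[j, k] * c[i, k]
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

# Small made-up sample, used only for illustration
X = np.array([[2.0, 1.0], [3.0, 4.0], [5.0, 3.0], [6.0, 7.0], [8.0, 6.0]])
d2 = squared_mahalanobis(X)

# Cross-check the first row against SciPy (unsquared distance, so square it)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d_scipy = mahalanobis(X[0], X.mean(axis=0), inv_cov)
print(np.isclose(d2[0], d_scipy ** 2))
```

Avoiding the n x n intermediate matters mostly for large datasets; for 10 rows either approach is fine.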


Combining all steps:

Example:

Python3




# Importing libraries

import numpy as np
import pandas as pd
from scipy import stats

# calculateMahalanobis function to calculate the (squared)
# Mahalanobis distance of each row of y from the mean of data
def calculateMahalanobis(y=None, data=None, cov=None):

    # Center each row on the column means of the reference data
    y_mu = y - data.mean()
    # Estimate the covariance matrix unless one is supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)
    # The diagonal holds the squared distance of each row
    return mahal.diagonal()

# data
data = { 'Price': [100000, 800000, 650000, 700000,
                   860000, 730000, 400000, 870000,
                   780000, 400000],
         'Distance': [16000, 60000, 300000, 10000,
                      252000, 350000, 260000, 510000,
                      2000, 5000],
         'Emission': [300, 400, 1230, 300, 400, 104,
                      632, 221, 142, 267],
         'Performance': [60, 88, 90, 87, 83, 81, 72,
                         91, 90, 93],
         'Mileage': [76, 89, 89, 57, 79, 84, 78, 99,
                     97, 99]
           }

# Creating dataset
df = pd.DataFrame(data,columns=['Price', 'Distance',
                                'Emission','Performance',
                                'Mileage'])

# Creating a new column in the dataframe that holds
# the Mahalanobis distance for each row
cols = ['Price', 'Distance', 'Emission', 'Performance', 'Mileage']
df['calculateMahalanobis'] = calculateMahalanobis(y=df[cols], data=df[cols])

# Display the dataframe
print(df)


Output:

Computing the p-value for every Mahalanobis distance

Now let us compute the p-value for the Mahalanobis distance of each observation in the dataset. As you can see from the above output, some of the Mahalanobis distances are considerably larger than others. To determine whether any of them is statistically significant, we need their p-values. Each p-value is taken from the Chi-Square distribution, using the value returned by calculateMahalanobis (the squared distance) as the test statistic, with degrees of freedom equal to k-1, where k = number of variables. So, in this case, we'll use 5-1 = 4 degrees of freedom. Note that some references use k degrees of freedom instead; this article follows the k-1 convention.
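
As a small illustration of the p-value step on its own, the sketch below applies the Chi-Square distribution to a few made-up squared distances. It also shows chi2.sf, SciPy's survival function, which computes the same upper-tail probability as 1 - chi2.cdf and tends to be more accurate for very small p-values. The array d2 and the name dof are assumptions for this example only.

```python
import numpy as np
from scipy.stats import chi2

# Made-up squared Mahalanobis distances, for illustration only
d2 = np.array([0.5, 2.0, 4.0, 9.0])
dof = 4  # k - 1 with k = 5, the convention described above

# Upper-tail probability, written two equivalent ways
p_cdf = 1 - chi2.cdf(d2, dof)
p_sf = chi2.sf(d2, dof)

print(np.allclose(p_cdf, p_sf))
print(p_sf)  # larger distances give smaller p-values
```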

Example:

Python3




# Importing libraries

import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import chi2

# calculateMahalanobis function to calculate the (squared)
# Mahalanobis distance of each row of y from the mean of data
def calculateMahalanobis(y=None, data=None, cov=None):

    # Center each row on the column means of the reference data
    y_mu = y - data.mean()
    # Estimate the covariance matrix unless one is supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)
    # The diagonal holds the squared distance of each row
    return mahal.diagonal()

# data
data = { 'Price': [100000, 800000, 650000, 700000,
                   860000, 730000, 400000, 870000,
                   780000, 400000],
         'Distance': [16000, 60000, 300000, 10000,
                      252000, 350000, 260000, 510000,
                      2000, 5000],
         'Emission': [300, 400, 1230, 300, 400, 104,
                      632, 221, 142, 267],
         'Performance': [60, 88, 90, 87, 83, 81, 72,
                         91, 90, 93],
         'Mileage': [76, 89, 89, 57, 79, 84, 78, 99,
                     97, 99]
           }

# Creating dataset
df = pd.DataFrame(data,columns=['Price', 'Distance',
                                'Emission','Performance',
                                'Mileage'])

# Creating a new column in the dataframe that holds
# the Mahalanobis distance for each row
cols = ['Price', 'Distance', 'Emission', 'Performance', 'Mileage']
df['Mahalanobis'] = calculateMahalanobis(y=df[cols], data=df[cols])

# calculate p-value for each mahalanobis distance
# using k-1 = 4 degrees of freedom (k = 5 variables)
df['p'] = 1 - chi2.cdf(df['Mahalanobis'], 4)

# display the dataframe
print(df)


Output:

Interpretation:

Generally, an observation with a p-value less than 0.001 is considered an outlier. In this example there are no outliers, as all the p-values are greater than 0.001.
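
The same rule can be expressed as a cutoff on the distance itself. The sketch below, which assumes the 0.001 threshold and the 4 degrees of freedom used in this article, looks up the Chi-Square critical value; any squared distance above it would have a p-value below 0.001. The names alpha, dof and critical are illustrative.

```python
from scipy.stats import chi2

alpha = 0.001  # outlier threshold on the p-value
dof = 4        # k - 1 with k = 5 variables

# Squared distance above which the p-value drops below alpha
critical = chi2.ppf(1 - alpha, dof)
print(round(critical, 2))  # about 18.47
```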



Last Updated : 21 Feb, 2022