
Create a Correlation Matrix using Python

In data science and machine learning, a correlation matrix helps in understanding the relationships between variables. It summarizes, in a single table, how each pair of variables in a dataset varies together.

For anyone working with data, knowing how to build and read a correlation matrix is a skill that makes it easier to derive meaningful insights. In this article, we will explore the step-by-step process of creating a correlation matrix in Python.



What is correlation?

Correlation is a statistical indicator that quantifies the degree to which two variables change in relation to each other. It indicates the strength and direction of the linear relationship between two variables. The correlation coefficient is denoted by “r”, and it ranges from -1 to 1.

There are two popular methods used to find the correlation coefficients:



Pearson’s product-moment correlation coefficient

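The Pearson correlation coefficient (r) is a measure of the linear relationship between two variables.

r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}

Here, n is the number of paired observations, x and y are the values of the two variables, ∑xy is the sum of the products of the paired values, and ∑x² and ∑y² are the sums of the squared values.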
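The formula above can be checked by hand. The following is a minimal sketch that applies the raw-sum formula directly and compares the result with np.corrcoef(). The helper name pearson_r is hypothetical, and the data is the age and glucose example used later in this article.

```python
import math
import numpy as np


def pearson_r(x, y):
    """Pearson's r computed from the raw-sum formula (hypothetical helper)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# age (x) and glucose level (y)
x = [43, 21, 25, 42, 57, 59]
y = [99, 65, 79, 75, 87, 81]

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal cell of the 2x2 matrix
print(r_manual, r_numpy)
```

Both values should agree up to floating-point rounding, since np.corrcoef() computes the same Pearson coefficient.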

Spearman’s rank correlation coefficient

The Spearman’s rank correlation coefficient is a measure of statistical dependence between two variables. It is based on the ranks of the data rather than the actual data values.

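\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}

Here, d is the difference between the two ranks of each observation and n is the number of observations. This simplified formula applies when there are no tied ranks.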
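As a rough illustration of the rank-based formula, the sketch below ranks two variables, applies the formula, and compares the result with the Spearman option of the Pandas corr() method. It assumes the data contains no tied values (true for the ice cream sales data used later in this article); the column names are chosen only for this example.

```python
import pandas as pd

sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]
temperature = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1,
               19.4, 25.1, 23.4, 18.1, 22.6, 17.2]

df = pd.DataFrame({"sales": sales, "temperature": temperature})

# rank each column separately (1 = smallest value); this data has no ties
ranks = df.rank()

# d is the per-observation difference between the two ranks
d = ranks["sales"] - ranks["temperature"]
n = len(df)
rho_manual = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

# the same coefficient via Pandas
rho_pandas = df.corr(method="spearman").loc["sales", "temperature"]
print(rho_manual, rho_pandas)
```

With tied values the simplified formula is no longer exact, so in that case it is safer to rely on the library result.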

What is a Correlation Matrix?

A correlation matrix is a table of correlation coefficients that shows the strength and direction of the relationship between each pair of variables in a dataset. Each cell holds the correlation between two specific variables. The matrix serves as a summary of the relationships in the data, as an input to more sophisticated analyses, and as a diagnostic aid for advanced analytical procedures. This makes it useful in the preliminary stages of an analysis for spotting patterns, guiding further work, and identifying areas of interest or concern in the dataset.

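Interpreting the correlation matrix

A coefficient close to 1 indicates a strong positive linear relationship: as one variable increases, the other tends to increase. A coefficient close to -1 indicates a strong negative linear relationship: as one variable increases, the other tends to decrease. A coefficient close to 0 indicates little or no linear relationship between the two variables.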
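A correlation matrix also has a few structural properties that help when reading it: it is square, it is symmetric (the correlation of x with y equals that of y with x), its diagonal is all 1s (each variable correlates perfectly with itself), and every value lies between -1 and 1. The sketch below checks these properties on a small three-column table; the values are illustrative and the variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [45, 37, 42, 35, 39],
    "y": [38, 31, 26, 28, 33],
    "z": [10, 15, 17, 21, 12],
})

values = df.corr().to_numpy()

# one row and one column per variable
is_square = values.shape == (df.shape[1], df.shape[1])
# corr(a, b) == corr(b, a)
is_symmetric = bool(np.allclose(values, values.T))
# each variable has correlation 1 with itself
has_unit_diagonal = bool(np.allclose(np.diag(values), 1.0))
# every coefficient lies in [-1, 1] (small tolerance for rounding)
in_range = bool(((values >= -1 - 1e-12) & (values <= 1 + 1e-12)).all())

print(is_square, is_symmetric, has_unit_diagonal, in_range)
```

Because of the symmetry, only the cells on one side of the diagonal need to be read.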

How to create correlation matrix in Python?

A correlation matrix can be created using either of the following two libraries:

  1. NumPy Library
  2. Pandas Library

Creating a correlation matrix using NumPy Library

NumPy is a library for mathematical computations. It can be used to create correlation matrices that help analyze the relationships between variables through a matrix representation.

Example 1

Suppose an ice cream shop keeps track of total sales of ice creams versus the temperature on that day. To learn the correlation, we will use NumPy library.

In the following code snippet, x and y represent the total sales in dollars and the corresponding temperature for each day of sale, and the np.corrcoef() function is used to compute the correlation matrix.

Python3
import numpy as np

# x represents the total sale in dollars
x = [215, 325, 185, 332, 406, 522, 412,
     614, 544, 421, 445, 408]

# y represents the temperature on each day of sale
y = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 
     19.4, 25.1, 23.4, 18.1, 22.6, 17.2]

# create correlation matrix
matrix = np.corrcoef(x, y)
print(matrix)

Output:

[[1.         0.95750662]
 [0.95750662 1.        ]]

In the matrix above, cells (0, 1) and (1, 0) hold the same value, 0.95750662, because the correlation of x with y equals the correlation of y with x. A value this close to 1 indicates a strong positive linear relationship: days with higher temperatures tend to have higher sales. Let’s have a look at another example.

Example 2

Suppose we are given the glucose level in the body for people of different ages. To find the correlation between age (x) and glucose level (y), we will again use the NumPy library. In the following code snippet, the variables x and y represent the ages and the corresponding glucose levels, and np.corrcoef() is used to calculate the correlation between them.

Python3
import numpy as np

# x represents the age
x = [43, 21, 25, 42, 57, 59]

# y represents the glucose level corresponding to that age
y = [99, 65, 79, 75, 87, 81]

# correlation matrix
matrix = np.corrcoef(x, y)
print(matrix)

Output:

[[1.        0.5298089]
 [0.5298089 1.       ]]

In the above correlation matrix, the coefficient between age and glucose level is 0.5298089, which indicates a moderate positive correlation between the two variables.
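np.corrcoef() is not limited to two variables. As a minimal sketch, the example below passes a 2D array in which each column is a variable and each row is an observation, and sets rowvar=False so that NumPy treats the columns as the variables. The data is illustrative.

```python
import numpy as np

# columns are the variables x, y, z; each row is one observation
data = np.array([
    [45, 38, 10],
    [37, 31, 15],
    [42, 26, 17],
    [35, 28, 21],
    [39, 33, 12],
])

# rowvar=False: treat each column (not each row) as a variable
matrix = np.corrcoef(data, rowvar=False)

print(matrix.shape)
print(np.round(matrix, 6))
```

With the default rowvar=True, NumPy would instead treat each of the five rows as a variable and return a 5 x 5 matrix.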

Creating a correlation matrix using Pandas Library

Pandas is a data analysis library with built-in functionality for analyzing and interpreting the relationships between variables.

Example 1:

In this example, we create a data frame with three variables and calculate its correlation matrix by calling the corr() method on the data frame.

Python3
import pandas as pd

# collect data
data = {
    'x': [45, 37, 42, 35, 39],
    'y': [38, 31, 26, 28, 33],
    'z': [10, 15, 17, 21, 12]
}

# form dataframe
dataframe = pd.DataFrame(data, columns=['x', 'y', 'z'])
print("Dataframe is : ")
print(dataframe)

# form correlation matrix
matrix = dataframe.corr()
print("Correlation matrix is : ")
print(matrix)

Output:

Dataframe is : 
    x   y   z
0  45  38  10
1  37  31  15
2  42  26  17
3  35  28  21
4  39  33  12
Correlation matrix is : 
          x         y         z
x  1.000000  0.518457 -0.701886
y  0.518457  1.000000 -0.860941
z -0.701886 -0.860941  1.000000
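Once the matrix is available, a common follow-up is to find the pair of variables with the strongest relationship. The following is one possible sketch: it keeps only the cells above the diagonal (so each pair appears once and self-correlations are skipped) and picks the largest absolute value. The variable names upper, pairs, strongest_pair and strongest_value are introduced here for illustration.

```python
import numpy as np
import pandas as pd

dataframe = pd.DataFrame({
    'x': [45, 37, 42, 35, 39],
    'y': [38, 31, 26, 28, 33],
    'z': [10, 15, 17, 21, 12]
})
matrix = dataframe.corr()

# boolean mask that is True only above the main diagonal
upper = np.triu(np.ones(matrix.shape, dtype=bool), k=1)

# keep the upper triangle, then flatten to a Series indexed by (row, column)
pairs = matrix.where(upper).stack().dropna()

# the pair with the largest absolute correlation
strongest_pair = pairs.abs().idxmax()
strongest_value = pairs.loc[strongest_pair]
print(strongest_pair, round(strongest_value, 6))
```

For the data above this selects y and z, whose coefficient of -0.860941 is the largest in magnitude.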

Example 2:

In this example, we will use the Iris dataset and find the correlation between its features.

Python3
from sklearn import datasets
import pandas as pd

# load iris dataset
dataset = datasets.load_iris()
dataframe = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
dataframe["target"] = dataset.target

# correlation matrix
matrix = dataframe.corr()
print(matrix)

Output:

                   sepal length (cm)  sepal width (cm)  petal length (cm)  \
sepal length (cm)           1.000000         -0.117570           0.871754   
sepal width (cm)           -0.117570          1.000000          -0.428440   
petal length (cm)           0.871754         -0.428440           1.000000   
petal width (cm)            0.817941         -0.366126           0.962865   
target                      0.782561         -0.426658           0.949035

                   petal width (cm)  target  
sepal length (cm)          0.817941  0.782561  
sepal width (cm)          -0.366126 -0.426658  
petal length (cm)          0.962865  0.949035  
petal width (cm)           1.000000  0.956547  
target                     0.956547  1.000000
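To read such a matrix more easily, it can help to pull out a single column and sort it. The sketch below, which assumes scikit-learn is installed as in the example above, lists the features by the strength of their correlation with target. Note that target is an integer-encoded class label, so the Pearson coefficient treats it as a numeric value; the ranking is only a rough indication. The names target_corr and ranked are introduced for this example.

```python
from sklearn import datasets
import pandas as pd

# load iris dataset
dataset = datasets.load_iris()
dataframe = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
dataframe["target"] = dataset.target

# correlation of every feature with the target, without the target itself
target_corr = dataframe.corr()["target"].drop("target")

# order the features by absolute correlation, strongest first
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top instead of pushing them to the bottom of the list.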

How to visualize correlation matrix in Python?

There are two popular libraries for data visualization, Matplotlib and Seaborn. Let’s first visualize the correlation matrix of the Iris dataset using Matplotlib.

Correlation matrix using Matplotlib

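In the visualization part of the code, plt.imshow() draws the matrix as a color-coded grid, plt.colorbar() adds the color scale, and plt.xticks() and plt.yticks() label the axes with the variable names.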

Python3

import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd

# load iris dataset and compute the correlation matrix
dataset = datasets.load_iris()
dataframe = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
dataframe["target"] = dataset.target
matrix = dataframe.corr()

# plotting correlation matrix
plt.imshow(matrix, cmap='Blues')

# adding colorbar
plt.colorbar()

# extracting variable names
variables = list(matrix.columns)

# adding labels to the matrix
plt.xticks(range(len(matrix)), variables, rotation=45, ha='right')
plt.yticks(range(len(matrix)), variables)

# display the plot
plt.show()

Output:

Heatmap of correlation matrix created using matplotlib
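A plain imshow() heatmap shows colors but not the numbers behind them. As a minimal sketch, the hypothetical helper plot_annotated_heatmap below writes each coefficient into its cell with ax.text() and fixes the color scale to the range -1 to 1. It uses a small illustrative data frame so that it runs without scikit-learn.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_annotated_heatmap(matrix):
    """Draw a correlation matrix with its values printed in each cell
    (hypothetical helper, not part of Matplotlib)."""
    labels = list(matrix.columns)
    n = len(labels)

    fig, ax = plt.subplots()
    # fix the color scale to the full range of a correlation coefficient
    image = ax.imshow(matrix.to_numpy(), cmap="Blues", vmin=-1, vmax=1)
    fig.colorbar(image, ax=ax)

    # label both axes with the variable names
    ax.set_xticks(range(n))
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_yticks(range(n))
    ax.set_yticklabels(labels)

    # write the coefficient into each cell (x = column, y = row)
    for i in range(n):
        for j in range(n):
            ax.text(j, i, f"{matrix.iloc[i, j]:.2f}", ha="center", va="center")
    return fig, ax


dataframe = pd.DataFrame({
    'x': [45, 37, 42, 35, 39],
    'y': [38, 31, 26, 28, 33],
    'z': [10, 15, 17, 21, 12]
})
fig, ax = plot_annotated_heatmap(dataframe.corr())
```

Call plt.show() afterwards to display the figure. The same helper can be applied to the Iris correlation matrix computed above.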

Correlation Matrix using Seaborn

Let’s visualize using Seaborn.

Python3

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
import pandas as pd

# load iris dataset and compute the correlation matrix
dataset = datasets.load_iris()
dataframe = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
dataframe["target"] = dataset.target
matrix = dataframe.corr()

# plotting correlation matrix; annot=True prints the value in each cell
sns.heatmap(matrix, cmap="Greens", annot=True)

# display the plot
plt.show()

Output:

Heatmap of correlation values using Seaborn
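Because a correlation matrix is symmetric, the cells above the diagonal repeat the ones below it. One possible refinement, sketched below, hides the upper triangle with the mask parameter of sns.heatmap() and fixes the color scale with vmin and vmax. The small data frame is illustrative and stands in for any data set.

```python
import numpy as np
import pandas as pd
import seaborn as sns

dataframe = pd.DataFrame({
    'x': [45, 37, 42, 35, 39],
    'y': [38, 31, 26, 28, 33],
    'z': [10, 15, 17, 21, 12]
})
matrix = dataframe.corr()

# True marks the cells to hide: everything above the main diagonal
mask = np.triu(np.ones(matrix.shape, dtype=bool), k=1)

# draw only the diagonal and the lower triangle
ax = sns.heatmap(matrix, mask=mask, cmap="Greens", annot=True, vmin=-1, vmax=1)
```

Import matplotlib.pyplot and call its show() function to display the figure when running this as a script.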
