Exploring Correlation in Python

This article explains the correlation matrix, an important technique for multivariate data exploration, and shows how to compute and plot it in Python.

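A correlation matrix is a standardized covariance matrix: each covariance is divided by the product of the two variables' standard deviations, so every entry lies between -1 and +1 and every diagonal entry equals 1. (The names auto-covariance matrix, dispersion matrix, variance matrix, and variance-covariance matrix refer to the covariance matrix, not to the correlation matrix.) The entry at position (i, j) is the correlation between the ith and jth variables of the given dataset.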
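
To make the link between the two matrices concrete, the sketch below rescales a covariance matrix into a correlation matrix. The data is synthetic and the names `X`, `cov`, `std` and `corr` are illustrative assumptions, not part of the house price dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 rows, 3 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Make column 1 depend linearly on column 0, plus some noise.
X[:, 1] = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Covariance matrix; rowvar=False means each column is one variable.
cov = np.cov(X, rowvar=False)

# Divide entry (i, j) by std_i * std_j to get the correlation matrix.
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)

print(np.round(corr, 3))
```

The result should agree with `np.corrcoef(X, rowvar=False)`, which performs the same normalization internally.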

When the data points follow a roughly straight-line trend, the variables are said to have an approximately linear relationship. In some cases the data points fall close to a straight line, but more often there is considerable variability around the trend. The correlation, denoted by r, summarizes the strength and direction of the linear (straight-line) association between two quantitative variables. It takes values between -1 and +1. A positive value for r indicates a positive association, and a negative value indicates a negative association.
The closer r is to +1 or -1, the closer the data points fall to a straight line and the stronger the linear association. The closer r is to 0, the weaker the linear association.
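
As a small illustration of how r behaves, the sketch below computes the Pearson correlation coefficient directly from its definition, using only the standard library. The helper name `pearson_r` and the toy data are assumptions made for this example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # 1.0  (perfect increasing line)
print(pearson_r(x, [10, 8, 6, 4, 2]))  # -1.0 (perfect decreasing line)
print(pearson_r(x, [2, 1, 4, 3, 5]))   # 0.8  (upward trend with scatter)
```

The sign of the result gives the direction of the association, and its distance from 0 gives the strength.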



The examples below use the House Price dataset, read from a file named House Price.csv.

Loading Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm

Loading Data

data = pd.read_csv("House Price.csv")
data.shape

Output:

(1460, 81)

'SalePrice' Description

data['SalePrice'].describe()

Output:

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

Histogram

plt.figure(figsize = (9, 5))
data['SalePrice'].plot(kind ="hist")

Output: [Figure: histogram of SalePrice]

Code #1: Correlation Matrix


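# Pairwise Pearson correlations between the numeric columns.
# numeric_only=True (pandas 1.5 or later) skips the text columns;
# since pandas 2.0, corr() raises an error on them otherwise.
corrmat = data.corr(numeric_only=True)

f, ax = plt.subplots(figsize=(9, 8))
sns.heatmap(corrmat, ax=ax, cmap="YlGnBu", linewidths=0.1)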

Output: [Figure: heatmap of the correlation matrix]
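
A heatmap shows the overall pattern, but it can also help to list the strongest pairs explicitly. The sketch below does this on a small synthetic DataFrame. The column names (`area`, `rooms`, `age`, `price`) are hypothetical stand-ins and are not columns of the house price file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (column names are hypothetical).
rng = np.random.default_rng(42)
n = 500
area = rng.normal(1500, 400, n)
df = pd.DataFrame({
    "area": area,
    "rooms": area / 300 + rng.normal(0, 0.5, n),     # strongly tied to area
    "age": rng.normal(30, 10, n),                    # unrelated to the rest
    "price": 100 * area + rng.normal(0, 20000, n),   # driven mostly by area
})

corr = df.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair
# appears once, then flatten to a Series indexed by (row, column).
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)

print(pairs.head(3))
```

Sorting by absolute value keeps strong negative correlations near the top of the list as well.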

Code #2: Grid Correlation Matrix

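corrmat = data.corr(numeric_only=True)

# clustermap reorders rows and columns by hierarchical clustering,
# so strongly correlated variables end up next to each other.
cg = sns.clustermap(corrmat, cmap="YlGnBu", linewidths=0.1)
plt.setp(cg.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)

plt.show()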

Output: [Figure: clustered heatmap of the correlation matrix]

Code #3: Correlation for SalePrice

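# SalePrice correlation matrix
# k: number of variables shown in the heatmap
k = 15

# Columns with the k largest correlations with SalePrice
# (SalePrice itself comes first, since its self-correlation is 1).
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index

# DataFrame.corr() excludes missing values pairwise; np.corrcoef would
# return NaN for any column that contains missing values.
cm = data[cols].corr()
f, ax = plt.subplots(figsize=(12, 10))

sns.heatmap(cm, ax=ax, cmap="YlGnBu",
            linewidths=0.1, yticklabels=cols.values,
            xticklabels=cols.values)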

Output: [Figure: heatmap of the 15 variables most correlated with SalePrice]
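
The selection step in Code #3 can be tried in isolation on synthetic data. In the sketch below, `target` plays the role of SalePrice, and all column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data; "target" stands in for SalePrice (names are hypothetical).
rng = np.random.default_rng(1)
n = 400
strong = rng.normal(size=n)
weak = rng.normal(size=n)
df = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "noise": rng.normal(size=n),
    "target": 3 * strong + 1.5 * weak + rng.normal(size=n),
})

corrmat = df.corr()

# Rows of corrmat with the k largest values in the "target" column.
# The target itself comes first because its self-correlation is 1.
k = 3
cols = corrmat.nlargest(k, "target")["target"].index
print(list(cols))

# Correlation matrix restricted to the selected columns.
print(df[cols].corr().round(2))
```

Note that `nlargest` ranks by signed value, so a variable with a strong negative correlation would not be selected this way. Ranking by absolute value is one alternative if such variables matter.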


