How to create a seaborn correlation heatmap in Python?
Seaborn is a Python library that is based on matplotlib and is used for data visualization. It provides a medium to present data in a statistical graph format as an informative and attractive medium to impart some information. A heatmap is one of the components supported by seaborn where variation in related data is portrayed using a color palette. This article centrally focuses on a correlation heatmap and how seaborn in combination with pandas and matplotlib can be used to generate one for a dataframe.
Like any another Python library, seaborn can be easily installed using pip:
pip install seaborn
This library is a part of Anaconda distribution and usually works just by import if your IDE is supported by Anaconda, but it can be installed too by the following command:
conda install seaborn
A correlation heatmap is a heatmap that shows a 2D correlation matrix between two discrete dimensions, using colored cells to represent data from usually a monochromatic scale. The values of the first dimension appear as the rows of the table while of the second dimension as a column. The color of the cell is proportional to the number of measurements that match the dimensional value. This makes correlation heatmaps ideal for data analysis since it makes patterns easily readable and highlights the differences and variation in the same data. A correlation heatmap, like a regular heatmap, is assisted by a colorbar making data easily readable and comprehensible.
The following steps show how a correlation heatmap can be produced:
- Import all required modules first
- Import the file where your data is stored
- Plot a heatmap
- Display it using matplotlib
For plotting heatmap method of the seaborn module will be used.
Syntax: heatmap(data, vmin, vmax, center, cmap,……………………………………………………)
Except for data all other attributes are optional and data obviously will be the data to be plotted. The data here has to be passed with corr() method to generate a correlation heatmap. Also, corr() itself eliminates columns which will be of no use while generating a correlation heatmap and selects those which can be used.
For the example given below, here a dataset downloaded from kaggle.com is being used. The plot shows data related to bestseller novels on amazon.
Dataset used – Bestsellers
The above example deals with small data. The following example depicts how the output will look like for a large dataset,
The dataset used in this example is an exoplanet space research dataset compiled by NASA.
Dataset used – cumulative