A guide to correlation heatmaps in Python

If you are reading this blog, I am sure you have already seen heatmaps. They are beautiful, yet they reveal just about as much as they conceal. When done right, they are easily readable. When not, they are still great to look at, just maybe not as functional.

From now on, we are going to take a look at one of the many great uses of heatmaps, the correlation heatmap. Correlation matrices are an essential tool of exploratory data analysis. Correlation heatmaps contain the same information in a visually appealing way. What's more, they show at a glance which variables are correlated, to what degree, and in which direction, and they alert us to potential multicollinearity problems.

Let’s see how we can work with Seaborn in Python to create a basic correlation heatmap.

For our purposes, we are going to use the Ames housing dataset available on Kaggle.com. This dataset contains over 30 features that potentially affect the variance in sales price, our y-variable.

Since Seaborn is built on the Matplotlib data visualization library and it is often easier to use the two in combination, besides the usual imports we are going to import matplotlib.pyplot as well.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

The following code creates the correlation matrix between all the features we are examining and our y-variable.

dataframe.corr()
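The article never shows how dataframe is loaded, so here is a minimal, self-contained sketch of the same step on made-up data. The column names and values are hypothetical stand-ins, not the Ames dataset. The text column is filtered out first because, depending on the pandas version, corr() may raise an error when it meets non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data: three numeric columns and one text column.
rng = np.random.default_rng(0)
area = rng.normal(1500, 300, size=200)
dataframe = pd.DataFrame({
    "Area": area,
    "Sale Price": 100 * area + rng.normal(0, 20000, size=200),
    "Age": rng.integers(0, 100, size=200),
    "Neighborhood": rng.choice(["A", "B", "C"], size=200),
})

# Keep only the numeric columns, then compute the pairwise Pearson correlations.
corr = dataframe.select_dtypes(include="number").corr()
print(corr.round(2))
```

The result is a square, symmetric table with 1s on the diagonal, since every variable correlates perfectly with itself.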

For our dataset, the result is a correlation matrix with a mere 13 variables. Not exactly unreadable. However, why not make life easier?

Basic Seaborn Heatmap

sns.heatmap(dataframe.corr());

About as pretty as useless.

Seaborn is easy to use but hard to navigate. It comes with a flood of built-in features and extensive documentation. It can be hard to figure out exactly which arguments to use if you do not want all the bells and whistles.

Let’s make our basic heatmap functional with as little effort as possible.

Here are the Seaborn heatmap arguments that do most of the work:

vmin, vmax: set the range of values that serve as the basis for the colormap
cmap: sets the specific colormap we want to use (the Matplotlib and Seaborn documentation list a wide range of color palettes)
center: takes a number at which to center the colormap; if no cmap is specified, setting it changes the default colormap; it expects a number, not a True/False switch
annot: when set to True, the correlation values become visible on the colored cells
cbar: when set to False, the colorbar (that serves as a legend) disappears
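To get a feel for these arguments, the sketch below draws the same matrix twice, side by side. It is only an illustration: the random data, the column names, and the Agg backend line (used so that the script runs without a display) are assumptions, not part of the original walkthrough.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data standing in for the housing features.
rng = np.random.default_rng(1)
corr = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("ABCD")).corr()

fig, (left, right) = plt.subplots(1, 2, figsize=(12, 4))

# Left: fixed value range, annotated cells, colorbar drawn in its own axes.
sns.heatmap(corr, vmin=-1, vmax=1, annot=True, ax=left)

# Right: same data, colormap centered on 0, colorbar switched off.
sns.heatmap(corr, vmin=-1, vmax=1, center=0, annot=True, cbar=False, ax=right)

# Two heatmap axes plus one colorbar axes (only the left plot has a colorbar).
n_axes = len(fig.axes)
```

Comparing the two panels shows how center changes the default colors and how cbar=False frees up the space the legend would otherwise take.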

# Increase the size of the heatmap.
plt.figure(figsize=(16, 6))
# Store the heatmap object in a variable to easily access it when you want to include more features (such as a title).
# Set the range of values to be displayed on the colormap from -1 to 1, and set annot to True to display the correlation values on the heatmap.
heatmap = sns.heatmap(dataframe.corr(), vmin=-1, vmax=1, annot=True)
# Give a title to the heatmap. pad defines the distance of the title from the top of the heatmap.
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);

A diverging color palette, with markedly different colors at the two ends of the value range and a pale, almost colorless midpoint, works much better with correlation heatmaps than the default colormap. While illustrating this statement, let’s add one more little detail: how to save a heatmap to a PNG file with all the x- and y-axis labels (xticklabels and yticklabels) visible.

plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(dataframe.corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=12);
# save heatmap as .png file
# dpi - sets the resolution of the saved image in dots/inches
# bbox_inches - when set to 'tight' - does not allow the labels to be cropped
plt.savefig('heatmap.png', dpi=300, bbox_inches='tight')
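Since the snippets in this article assume an existing dataframe, here is a self-contained version of the same steps that can be run as a script. Treat it as a sketch under stated assumptions: the data is randomly generated, the column names are invented, the Agg backend is chosen so that no display is needed, and the file is written to a temporary directory rather than the working directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: three columns related to a shared signal, one unrelated column.
rng = np.random.default_rng(42)
base = rng.normal(size=300)
dataframe = pd.DataFrame({
    "Sale Price": base + rng.normal(scale=0.5, size=300),
    "Living Area": base + rng.normal(scale=1.0, size=300),
    "Age": -base + rng.normal(scale=1.5, size=300),
    "Unrelated": rng.normal(size=300),
})

plt.figure(figsize=(8, 6))
heatmap = sns.heatmap(dataframe.corr(), vmin=-1, vmax=1, annot=True, cmap="BrBG")
heatmap.set_title("Correlation Heatmap", fontdict={"fontsize": 18}, pad=12)

# Save to a temporary directory; bbox_inches="tight" keeps the tick labels from being cropped.
out_path = os.path.join(tempfile.mkdtemp(), "heatmap.png")
plt.savefig(out_path, dpi=300, bbox_inches="tight")
plt.close()
```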

Stronger correlations at both ends of the spectrum stand out in darker shades, weaker correlations in lighter ones.

Triangle Correlation Heatmap

Take a look at any of the correlation heatmaps above. If you cut away half of it along the diagonal line marked by 1s, you would not lose any information. Let’s cut the heatmap in half, then, and keep only the lower triangle.

The Seaborn heatmap ‘mask’ argument comes in handy when we want to cover part of the heatmap.

mask: takes a boolean array or a dataframe as an argument; when defined, cells become invisible for values where the mask is True

Let’s use the np.ones_like() function to build an array of 1s with the same shape as the correlation matrix, then the np.triu() numpy function to keep its upper triangle (diagonal included) while turning all the values in the lower triangle into 0. (The np.tril() function would do the same, only for the lower triangle.)

np.triu(np.ones_like(dataframe.corr()))
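To see what this call produces, here is a small sketch on a hand-written 3x3 matrix (the numbers are made up). It also shows the boolean version that the mask argument expects, and, as an optional variation not used in this article, the k=1 offset that leaves the diagonal out of the mask.

```python
import numpy as np

# A made-up 3x3 correlation matrix.
corr = np.array([[1.0, 0.8, -0.3],
                 [0.8, 1.0, 0.1],
                 [-0.3, 0.1, 1.0]])

# Array of 1s shaped like corr, with everything below the diagonal set to 0.
upper = np.triu(np.ones_like(corr))
# [[1. 1. 1.]
#  [0. 1. 1.]
#  [0. 0. 1.]]

# Same pattern as booleans: True marks the cells the heatmap will hide.
mask = np.triu(np.ones_like(corr, dtype=bool))
# [[ True  True  True]
#  [False  True  True]
#  [False False  True]]

# Optional variation: k=1 starts one step above the diagonal, so the diagonal stays visible.
mask_keep_diagonal = np.triu(np.ones_like(corr, dtype=bool), k=1)
```

Note that the plain call includes the diagonal in the mask, so the row of 1s disappears from the plot along with the upper triangle.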

When we set the datatype to boolean, every 1 turns into True and every 0 into False.

plt.figure(figsize=(16, 6))
# Define the mask to set the values in the upper triangle to True.
# dtype=bool is used instead of np.bool, which is unavailable in some NumPy releases.
mask = np.triu(np.ones_like(dataframe.corr(), dtype=bool))
heatmap = sns.heatmap(dataframe.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize':18}, pad=16);

Correlation of Independent Variables with the Dependent Variable

Often, however, what we want is a colored map that shows the strength of the correlation between the dependent variable and every independent variable that we want to include in our model.

The following code returns the correlation of all features with ‘Sale Price’, a single dependent variable, sorted by ‘Sale Price’ in descending order.

dataframe.corr()[['Sale Price']].sort_values(by='Sale Price', ascending=False)
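The sketch below runs the same expression on made-up data so the result can be inspected directly. Only the 'Sale Price' name comes from the article; the other columns are hypothetical. It also shows one optional extra step, dropping the target's own row, since a variable's correlation with itself is always 1 and carries no information.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one target and three features of varying relevance.
rng = np.random.default_rng(7)
price = rng.normal(size=250)
dataframe = pd.DataFrame({
    "Sale Price": price,
    "Quality": price + rng.normal(scale=0.5, size=250),
    "Age": -price + rng.normal(scale=1.0, size=250),
    "Noise": rng.normal(size=250),
})

# Double brackets keep the result a one-column DataFrame, which sns.heatmap can plot.
target_corr = dataframe.corr()[["Sale Price"]].sort_values(by="Sale Price", ascending=False)

# Optional: remove the target's own row (its self-correlation of 1).
features_only = target_corr.drop(index="Sale Price")
print(features_only.round(2))
```

Using double brackets matters here: single brackets would return a Series, which has no second dimension for the heatmap to draw.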

Let’s use this sorted column as the data in our heatmap.

plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(dataframe.corr()[['Sale Price']].sort_values(by='Sale Price', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with Sales Price', fontdict={'fontsize':18}, pad=16);

I hope you found what you were looking for in this article.