The corr_plot builder takes a dataframe (a pandas DataFrame or a plain Python dict) as input and builds a correlation plot.
It lets you combine 'tile', 'point' and 'label' layers in a matrix of type 'full', 'lower' or 'upper'.
A call to the terminal build() method creates the resulting 'plot' object.
This 'plot' object can be further refined using the regular Lets-Plot (ggplot) API, such as + ggtitle(), + ggsize() and so on.
The Ames Housing dataset for this demo was downloaded from House Prices - Advanced Regression Techniques (train.csv), (c) Kaggle.
import numpy as np
import pandas as pd
from lets_plot import *
from lets_plot.bistro.corr import *
LetsPlot.setup_html()
mpg_df = (pd.read_csv('https://raw.githubusercontent.com/JetBrains/lets-plot-docs/master/data/mpg.csv')
          .drop(columns=['Unnamed: 0'])
          .select_dtypes(include=np.number))
print(mpg_df.shape)
mpg_df.head()
(234, 5)
| | displ | year | cyl | cty | hwy |
|---|---|---|---|---|---|
| 0 | 1.8 | 1999 | 4 | 18 | 29 |
| 1 | 1.8 | 1999 | 4 | 21 | 29 |
| 2 | 2.0 | 2008 | 4 | 20 | 31 |
| 3 | 2.0 | 2008 | 4 | 21 | 30 |
| 4 | 2.8 | 1999 | 6 | 16 | 26 |
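def group(plots, width=400, height=300):
    """
    Arrange the given plots in a two-column grid (helper for this demo).
    """
    bunch = GGBunch()
    for idx, p in enumerate(plots):
        x = (idx % 2) * width      # column offset: 0 or `width`
        y = (idx // 2) * height    # row offset: one row per two plots
        bunch.add_plot(p, x, y, width, height)
    return bunch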
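To see which layout the helper above produces, the offset arithmetic can be checked on its own. The following is a minimal sketch that reproduces only the position calculation, without Lets-Plot; grid_offsets and its columns parameter are hypothetical names introduced for illustration and are not part of the library.

```python
def grid_offsets(count, width=400, height=300, columns=2):
    """Return the (x, y) top-left offset of each plot in a row-major grid.

    Hypothetical helper: it mirrors the arithmetic of the demo's group()
    function and is not part of Lets-Plot.
    """
    return [((idx % columns) * width, (idx // columns) * height)
            for idx in range(count)]


# Four plots fill a 2 x 2 grid, left to right and then top to bottom.
for idx, (x, y) in enumerate(grid_offsets(4)):
    print(f"plot {idx}: x={x}, y={y}")
```

With the default sizes, every second plot starts a new row, so an odd number of plots leaves the last cell of the grid empty.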
When combining layers, corr_plot chooses an acceptable plot configuration by default.
group([
    corr_plot(mpg_df).tiles().build() + ggtitle("Tiles"),
    corr_plot(mpg_df).points().build() + ggtitle("Points"),
    corr_plot(mpg_df).tiles().labels().build() + ggtitle("Tiles and labels"),
    corr_plot(mpg_df).points().labels().tiles().build() + ggtitle("Tiles, points and labels")
])
The default plot configuration adapts to changing options: compare the 'Tiles and labels' plots above and below.
You can also override the default plot configuration with the 'type' parameter: compare the 'Tiles, points and labels' plots above and below.
group([
    corr_plot(mpg_df).tiles().labels(color="white").build() + ggtitle("Tiles and labels"),
    (corr_plot(mpg_df)
     .tiles(type="upper")
     .points(type="lower")
     .labels(type="full").build() + ggtitle("Tiles, points and labels"))
])
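To make the three matrix types concrete, here is a small sketch of which cells each type would cover in an n x n correlation matrix. It is illustrative only: visible_cells is a hypothetical helper, rows are assumed to be counted from top to bottom, and the code reflects the usual meaning of 'full', 'lower' and 'upper' rather than the Lets-Plot implementation.

```python
def visible_cells(n, matrix_type="full", diag=True):
    """Return the (row, col) cells drawn for an n x n correlation matrix.

    Hypothetical helper for illustration; not the Lets-Plot implementation.
    """
    if matrix_type not in ("full", "lower", "upper"):
        raise ValueError(f"unknown matrix type: {matrix_type!r}")
    cells = []
    for row in range(n):
        for col in range(n):
            if row == col:
                keep = diag                 # diagonal is optional
            elif matrix_type == "lower":
                keep = row > col            # cells below the diagonal
            elif matrix_type == "upper":
                keep = row < col            # cells above the diagonal
            else:
                keep = True                 # 'full' keeps everything
            if keep:
                cells.append((row, col))
    return cells


# Text rendering of each type for a 4-variable matrix.
for matrix_type in ("full", "lower", "upper"):
    shown = set(visible_cells(4, matrix_type))
    print(matrix_type)
    for row in range(4):
        print(" ".join("#" if (row, col) in shown else "." for col in range(4)))
```

Because 'lower' and 'upper' cover complementary triangles, assigning one layer to each, as in the example above, lets two layers share the matrix without overlapping off the diagonal.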
Instead of the default blue-grey-red gradient, you can define your own lower-middle-upper colors or choose one of the available 'Brewer' diverging palettes.
Let's create a gradient resembling one of the Seaborn gradients.
bld = corr_plot(mpg_df).points().labels().tiles()

# Configure gradient resembling one of Seaborn gradients.
gradient = (bld
            .palette_gradient(low='#417555', mid='#EDEDED', high='#963CA7')
            .build()) + ggtitle("Custom gradient")

# Configure Brewer 'BrBG' palette.
brewer = (bld
          .palette_BrBG()
          .build()) + ggtitle("Brewer")

group([
    gradient,
    brewer
])
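For intuition about what a low-mid-high gradient does, the sketch below maps a correlation value in [-1, 1] to a color by piecewise linear interpolation in RGB. This is an assumption made for illustration: diverging_color and hex_to_rgb are hypothetical helpers, and Lets-Plot may interpolate differently (for example, in another color space).

```python
def hex_to_rgb(color):
    """Convert '#RRGGBB' to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def diverging_color(value, low="#417555", mid="#EDEDED", high="#963CA7"):
    """Map a correlation in [-1, 1] to a hex color.

    Negative values blend from `low` to `mid`, positive values from `mid`
    to `high`. Linear RGB blending is an assumption of this sketch.
    """
    value = max(-1.0, min(1.0, value))          # clamp to the valid range
    start, end = (low, mid) if value < 0 else (mid, high)
    t = value + 1.0 if value < 0 else value     # position within the segment
    mixed = (round(a + (b - a) * t)
             for a, b in zip(hex_to_rgb(start), hex_to_rgb(end)))
    return "#{:02x}{:02x}{:02x}".format(*mixed)


# The three anchor colors sit at -1, 0 and 1; other values fall in between.
for r in (-1, -0.5, 0, 0.5, 1):
    print(r, diverging_color(r))
```

The neutral middle color at zero is what makes such a palette diverging: weak correlations fade toward it, while strong negative and positive correlations move toward opposite ends.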
The Kaggle House Prices dataset contains 81 variables; only the numeric ones are kept for the correlation plots below.
housing_df = (pd.read_csv("https://raw.githubusercontent.com/JetBrains/lets-plot-docs/master/data/Ames_house_prices_train.csv")
              .select_dtypes(include=np.number))
print(housing_df.shape)
housing_df.head()
(1460, 38)
| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | ... | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | 208500 |
| 1 | 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | ... | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | 181500 |
| 2 | 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | ... | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | 223500 |
| 3 | 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | ... | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | 140000 |
| 4 | 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | ... | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | 250000 |
5 rows × 38 columns
A correlation plot that shows every correlation in this dataset is too large to be of much use.
corr_plot(housing_df).tiles(type='lower').palette_BrBG().build()
The 'threshold' parameter lets us specify a minimum absolute correlation value; correlations below it are not shown.
(corr_plot(housing_df, threshold=.5)
 .tiles(diag=False)
 .palette_BrBG().build()
 + ggtitle("Threshold: 0.5")
 + ggsize(550, 400))
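The effect of a threshold can be sketched in plain Python: compute the Pearson correlation for every pair of variables and keep only the pairs whose absolute value reaches the cutoff. The helpers pearson and strong_pairs and the toy data are made up for illustration; whether Lets-Plot uses an inclusive comparison and how it drops variables from the axes are assumptions not verified here.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strong_pairs(data, threshold):
    """Keep variable pairs whose absolute correlation is >= threshold.

    Hypothetical helper; the inclusive comparison is an assumption.
    """
    result = {}
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            result[(a, b)] = r
    return result


# Made-up toy data, for illustration only.
toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # moves exactly with "a"
    "c": [5, 3, 4, 1, 2],   # moves against "a"
    "d": [2, 1, 3, 1, 3],   # only loosely related to "a"
}

for cutoff in (0.5, 0.9):
    print(cutoff, sorted(strong_pairs(toy, cutoff)))
```

Note that the filter uses the absolute value, so strong negative correlations survive a high threshold just as strong positive ones do.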
Let's further increase our threshold in order to see only highly correlated variables.
(corr_plot(housing_df, threshold=.8)
 .tiles(diag=False)
 .palette_BrBG().build()
 + ggtitle("Threshold: 0.8")
 + ggsize(550, 400))