Correlation Plot

The CorrPlot builder takes a dataframe (Kotlin Map<*, *>) as the input and builds a correlation plot.
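
As a rough illustration of the expected input shape, the sketch below builds such a map by hand: keys are variable names and values are equally sized lists of observations. The name `tinyData` is hypothetical, and the values are the first three rows of the MPG table shown later in this notebook. It is assumed, not confirmed here, that a map of this shape can be passed to CorrPlot in place of a converted dataframe.

```kotlin
// Hypothetical hand-built "dataframe": column name -> list of values.
// Values are the first three rows of the MPG dataset used in this demo.
val tinyData: Map<String, List<Number>> = mapOf(
    "displ" to listOf(1.8, 1.8, 2.0),
    "cty" to listOf(18, 21, 20),
    "hwy" to listOf(29, 29, 31)
)

// Every column must hold the same number of observations;
// single() fails if the columns differ in length.
val rowCount = tinyData.values.map { it.size }.distinct().single()
println("${tinyData.size} variables, $rowCount rows")  // prints "3 variables, 3 rows"
```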

If the input has NxN shape and contains only numbers in range [-1..1], then it is plotted as is. Otherwise CorrPlot will compute correlation coefficients using Pearson's method.
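
For readers who want to see what Pearson's method computes, here is a minimal plain-Kotlin sketch of the coefficient for two samples. The function name `pearson` is hypothetical, and this is an illustration of the formula, not CorrPlot's internal implementation. The result always lies between -1 and 1.

```kotlin
import kotlin.math.sqrt

// Pearson's r = cov(x, y) / (std(x) * std(y)), computed from centered sums.
fun pearson(x: List<Double>, y: List<Double>): Double {
    require(x.size == y.size && x.size > 1) { "Need two equally sized samples" }
    val meanX = x.average()
    val meanY = y.average()
    var cov = 0.0
    var varX = 0.0
    var varY = 0.0
    for (i in x.indices) {
        val dx = x[i] - meanX
        val dy = y[i] - meanY
        cov += dx * dy
        varX += dx * dx
        varY += dy * dy
    }
    return cov / sqrt(varX * varY)
}

val up = pearson(listOf(1.0, 2.0, 3.0), listOf(2.0, 4.0, 6.0))    // perfectly increasing
val down = pearson(listOf(1.0, 2.0, 3.0), listOf(6.0, 4.0, 2.0))  // perfectly decreasing
println("up = $up, down = $down")  // prints "up = 1.0, down = -1.0"
```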

CorrPlot lets you combine 'tile', 'point' and 'label' layers in a matrix of "full", "lower" or "upper" type.
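
To make the three matrix types concrete, the following plain-Kotlin sketch lists which cells of an N x N matrix each type covers. The helper `cells` is hypothetical, and it assumes that "lower" means the cells on or below the main diagonal in (row, column) index terms. How CorrPlot orients the triangle on screen may differ.

```kotlin
// Illustration only: which (row, column) cells each matrix type covers,
// assuming "lower" means cells on or below the main diagonal.
fun cells(n: Int, type: String): List<Pair<Int, Int>> =
    (0 until n).flatMap { row -> (0 until n).map { col -> row to col } }
        .filter { (row, col) ->
            when (type) {
                "full" -> true
                "lower" -> row >= col
                "upper" -> row <= col
                else -> throw IllegalArgumentException("Unknown type: $type")
            }
        }

println(cells(3, "lower"))  // prints [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)]
```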

A call to the terminal build() method creates the resulting 'plot' object. This 'plot' object can be further refined using the regular Lets-Plot (ggplot) API, for example + ggsize().

The Ames Housing dataset for this demo was downloaded from House Prices - Advanced Regression Techniques (train.csv), (c) Kaggle.

In [1]:
%useLatestDescriptors
%use lets-plot
%use krangl
In [2]:
// Cars MPG dataset
var mpg_df = DataFrame.readCSV("https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/data/mpg.csv")
mpg_df.head(3)
Out[2]:
    manufacturer  model  displ  year  cyl  trans       drv  cty  hwy  fl  class
1   audi          a4     1.8    1999  4    auto(l5)    f    18   29   p   compact
2   audi          a4     1.8    1999  4    manual(m5)  f    21   29   p   compact
3   audi          a4     2.0    2008  4    manual(m6)  f    20   31   p   compact

Shape: 3 x 12.

In [3]:
// Drop the unnamed first column (row index)
mpg_df = mpg_df.remove("")
mpg_df.head(3)
Out[3]:
manufacturer  model  displ  year  cyl  trans       drv  cty  hwy  fl  class
audi          a4     1.8    1999  4    auto(l5)    f    18   29   p   compact
audi          a4     1.8    1999  4    manual(m5)  f    21   29   p   compact
audi          a4     2.0    2008  4    manual(m6)  f    20   31   p   compact

Shape: 3 x 11.

In [4]:
val mpg_dat = mpg_df.toMap()

Combining 'tile', 'point' and 'label' layers.

When combining layers, CorrPlot chooses an acceptable plot configuration by default.

In [5]:
gggrid(
    listOf(
        CorrPlot(mpg_dat, "Tiles").tiles().build(),
        CorrPlot(mpg_dat, "Points").points().build(), 
        CorrPlot(mpg_dat, "Tiles and labels").tiles().labels().build(),
        CorrPlot(mpg_dat, "Tiles, points and labels").points().labels().tiles().build()
    ), 2, 400, 320)
Out[5]:

The default plot configuration adapts to changing options: compare the "Tiles and labels" plots above and below.

You can also override the default plot configuration using the type parameter: compare the "Tiles, points and labels" plots above and below.

In [6]:
gggrid(
    listOf(
        CorrPlot(mpg_dat, "Tiles and labels").tiles().labels(color="white").build(),
        CorrPlot(mpg_dat, "Tiles, points and labels")
         .tiles(type="upper")
         .points(type="lower")
         .labels(type="full").build()
    ), 2, 400, 320)
Out[6]:

Customizing colors.

Instead of the default blue-grey-red gradient you can define your own lower-middle-upper colors, or choose one of the available 'Brewer' diverging palettes.

Let's create a gradient resembling one of Seaborn gradients.

In [7]:
val corrPlot = CorrPlot(mpg_dat).points().labels().tiles()

// Configure gradient resembling one of Seaborn gradients.
val withGradientColors = (corrPlot
            .paletteGradient(low="#417555", mid="#EDEDED", high="#963CA7")
            .build()) + ggtitle("Custom gradient")

// Configure Brewer 'Spectral' palette.
val withBrewerColors = (corrPlot
            .paletteSpectral()
            .build()) + ggtitle("Brewer 'Spectral'")

// Show both plots
gggrid(listOf(withGradientColors, withBrewerColors), 2, 400, 320)
Out[7]:

Correlation plot with a large number of variables in the dataset.

The Kaggle House Prices dataset contains 81 variables.

In [8]:
val housing_df = DataFrame.readCSV("../data/Ames_house_prices_train.csv")
housing_df.head(3)
Out[8]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1 60 RL 65 8450 Pave null Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 null Attchd 2003 RFn 2 548 TA TA Y 0 61 0 0 0 0 null null null 0 2 2008 WD Normal 208500
2 20 RL 80 9600 Pave null Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976 RFn 2 460 TA TA Y 298 0 0 0 0 0 null null null 0 5 2007 WD Normal 181500
3 60 RL 68 11250 Pave null IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001 RFn 2 608 TA TA Y 0 42 0 0 0 0 null null null 0 9 2008 WD Normal 223500

Shape: 3 x 81.

A correlation plot that shows all the correlations in this dataset is too large to be useful.

In [9]:
CorrPlot(housing_df.toMap())
    .tiles(type="lower")
    .paletteBrBG()
    .build()
Out[9]:

The threshold parameter.

The threshold parameter lets us specify a minimum absolute correlation value; variables whose correlations all fall below it are not shown.
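
The idea behind such a cutoff can be sketched in plain Kotlin. This is an illustration only, assuming the threshold is compared with the absolute value of each coefficient; it is not CorrPlot's actual code. The helper `keptVariables`, the variable names and the coefficient values are all made up for the example.

```kotlin
import kotlin.math.abs

// Made-up coefficients for hypothetical variables "a".."d".
val coefficients: Map<Pair<String, String>, Double> = mapOf(
    ("a" to "b") to 0.92,
    ("a" to "c") to -0.85,
    ("b" to "c") to -0.40,
    ("a" to "d") to 0.10,
    ("b" to "d") to 0.05,
    ("c" to "d") to -0.20
)

// Keep the pairs whose |r| reaches the threshold, then list the
// variables that still take part in at least one remaining pair.
fun keptVariables(corr: Map<Pair<String, String>, Double>, threshold: Double): Set<String> =
    corr.filterValues { abs(it) >= threshold }
        .keys
        .flatMap { listOf(it.first, it.second) }
        .toSet()

println(keptVariables(coefficients, 0.5))  // prints [a, b, c]
println(keptVariables(coefficients, 0.9))  // prints [a, b]
```

In this toy example, variable "d" disappears at a threshold of 0.5 because none of its correlations is strong enough, and raising the threshold to 0.9 leaves only the strongly related pair.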

In [10]:
CorrPlot(housing_df.toMap(), "Threshold: 0.5", threshold = 0.5, adjustSize = 0.7)
    .tiles(type="full", diag=false)
    .paletteBrBG()
    .build()
Out[10]:

Let's further increase our threshold in order to see only highly correlated variables.

In [11]:
CorrPlot(housing_df.toMap(), "Threshold: 0.8", threshold = 0.8)
    .tiles(diag=false)
    .labels(color="white", diag=false)
    .paletteBrBG()
    .build()
Out[11]: