PiML Toolbox: Uploading Custom Data in Two Ways¶

In addition to the built-in datasets (e.g. BikeSharing, CaliforniaHousing, TaiwanCredit) provided for demonstration purposes, PiML supports uploading custom data for model development and validation.

This example notebook demonstrates how to upload/read custom data in two ways.

Upload new data through the piml.Experiment.data_loader() widget (file size limit: 10MB).
Read the data manually with pandas.read_csv(), then register it with piml.Experiment.

For simplicity, we use the CASP dataset, which you can download from https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv (~3.4MB) and save to your local drive.
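Since the widget only accepts files up to 10MB, it can be useful to check a file's size before choosing between the two ways. The following is a minimal sketch using only the Python standard library; the helper names `download_file` and `fits_widget_limit` and the local file name `CASP.csv` are illustrative assumptions and are not part of PiML.

```python
import os
import urllib.request

CASP_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv"
WIDGET_LIMIT_MB = 10  # upload limit stated for the data_loader() widget


def download_file(url, path):
    """Download url to path and return the path (hypothetical helper)."""
    urllib.request.urlretrieve(url, path)
    return path


def fits_widget_limit(path, limit_mb=WIDGET_LIMIT_MB):
    """Return True if the file at path is no larger than limit_mb megabytes."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb <= limit_mb


# Example usage (requires network access, so it is left commented out):
# local_path = download_file(CASP_URL, "CASP.csv")
# print(fits_widget_limit(local_path))
```

If the check fails for your own data, the second way (reading with pandas and registering the DataFrame) is the one to use.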

Stage 0: Install PiML package on Google Colab¶

Run !pip install piml to install the latest version of PiML.
In Colab, you'll need to restart the runtime in order to use the newly installed PiML version.

In [ ]:

!pip install piml

Stage 1: Initialize an experiment, Load and Prepare data ¶

In [1]:

from piml import Experiment
exp = Experiment()

In [2]:

# The first way: upload new data from your local drive by the widget (with file size limit 10MB)
# Choose "Upload new data"
exp.data_loader() 

VBox(children=(Dropdown(layout=Layout(width='20%'), options=('Select Data', 'CoCircles', 'Friedman', 'BikeShar…

In [3]:

# The second way: manually read data from your local drive or URL, then register to piml.Experiment
import pandas as pd
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv") 
exp.data_loader(data=data)

	RMSD	F1	F2	F3	F4	F5	F6	F7	F8	F9
0	17.284	13558.30	4305.35	0.31754	162.1730	1.872791e+06	215.3590	4287.87	102.0	27.0302
1	6.021	6191.96	1623.16	0.26213	53.3894	8.034467e+05	87.2024	3328.91	39.0	38.5468
2	9.275	7725.98	1726.28	0.22343	67.2887	1.075648e+06	81.7913	2981.04	29.0	38.8119
3	15.851	8424.58	2368.25	0.28111	67.8325	1.210472e+06	109.4390	3248.22	70.0	39.0651
4	7.962	7460.84	1736.94	0.23280	52.4123	1.021020e+06	94.5234	2814.42	41.0	39.9147
...	...	...	...	...	...	...	...	...	...	...
45725	3.762	8037.12	2777.68	0.34560	64.3390	1.105797e+06	112.7460	3384.21	84.0	36.8036
45726	6.521	7978.76	2508.57	0.31440	75.8654	1.116725e+06	102.2770	3974.52	54.0	36.0470
45727	10.356	7726.65	2489.58	0.32220	70.9903	1.076560e+06	103.6780	3290.46	46.0	37.4718
45728	9.791	8878.93	3055.78	0.34416	94.0314	1.242266e+06	115.1950	3421.79	41.0	35.6045
45729	18.827	12732.40	4444.36	0.34905	157.6300	1.788897e+06	229.4590	4626.85	141.0	29.8118

45730 rows × 10 columns
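When reading data manually, a quick sanity check of the DataFrame before registering it can catch missing values or non-numeric columns early. The sketch below uses only pandas; the function name `precheck` is a hypothetical helper, and the small `sample` frame is a stand-in built from the first three rows and three columns of the output shown above.

```python
import pandas as pd


def precheck(df, target):
    """Summarize basic properties of df before handing it to a modeling tool."""
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "has_target": target in df.columns,
        "missing_values": int(df.isna().sum().sum()),
        "non_numeric_columns": non_numeric,
    }


# Tiny stand-in frame (first three rows, three columns of the CASP output).
sample = pd.DataFrame(
    {
        "RMSD": [17.284, 6.021, 9.275],
        "F1": [13558.30, 6191.96, 7725.98],
        "F2": [4305.35, 1623.16, 1726.28],
    }
)
report = precheck(sample, target="RMSD")
print(report)
```

In practice you would pass the full frame returned by pd.read_csv() instead of `sample`.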

In [4]:

exp.data_summary()

VBox(children=(HTML(value='Data Shape:(45730, 10)'), Tab(children=(Output(), Output()), _dom_classes=('data-su…

In [5]:

# Select RMSD as the target variable and click "UPDATE"; this is a regression task
exp.data_prepare()

VBox(children=(HBox(children=(VBox(children=(HTML(value='<p>Target Variable:</p>'), HTML(value='<p>Split Metho…
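To make the data preparation step more concrete, here is a rough sketch of a random train/test split with RMSD as the target, written with NumPy and pandas. It is not PiML's implementation; the function name `random_split`, the 0.2 test ratio, and the seed are assumptions chosen for illustration, and the `demo` frame is synthetic.

```python
import numpy as np
import pandas as pd


def random_split(df, test_ratio=0.2, random_state=0):
    """Shuffle row positions and split df into (train, test) frames."""
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(df))
    n_test = int(round(len(df) * test_ratio))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return df.iloc[train_idx], df.iloc[test_idx]


# Synthetic stand-in data; replace with the CASP frame in practice.
demo = pd.DataFrame({"RMSD": np.arange(10.0), "F1": np.arange(10.0) * 2})
train, test = random_split(demo, test_ratio=0.2, random_state=0)

# Separate the regression target from the features.
X_train, y_train = train.drop(columns="RMSD"), train["RMSD"]
print(len(train), len(test))
```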

In [6]:

exp.feature_select()

HBox(children=(Output(), Output()))

VBox(children=(ToggleButtons(layout=Layout(width='100%'), options=('Correlation', 'Distance Correlation', 'Fea…
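The "Correlation" option in the widget above suggests ranking features by their correlation with the target. The sketch below shows one simple way to do that with pandas, using absolute Pearson correlation. It is an illustration under that assumption, not PiML's own code; `rank_by_correlation` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
import pandas as pd


def rank_by_correlation(df, target):
    """Return features sorted by absolute Pearson correlation with target."""
    corr = df.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Synthetic data: x1 drives y strongly, x2 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
y = 3.0 * x1 + 0.1 * rng.normal(size=500)
demo = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

ranking = rank_by_correlation(demo, target="y")
print(ranking)
```

Pearson correlation only captures linear association, which is one reason a tool may also offer alternatives such as distance correlation.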

In [8]:

exp.eda()

VBox(children=(HBox(children=(VBox(children=(HTML(value='<h4>Univariate:</h4>'), HBox(children=(Dropdown(layou…

Stage 2. Train interpretable models ¶

In [9]:

# Choose the GLM, GAM, and Tree models with default settings, then click "RUN" to train.
# After training finishes, register the three trained models one by one.
exp.model_train()

VBox(children=(Box(children=(Box(children=(HTML(value="<h4 style='margin: 10px 0px;'>Choose Model</h4>"), Box(…

Stage 3. Explain and Interpret ¶

In [10]:

# Model-agnostic post-hoc explanation by Permutation Feature Importance, PDP (1D and 2D) vs. ALE (1D and 2D), LIME vs. SHAP
exp.model_explain()

VBox(children=(Dropdown(layout=Layout(width='20%'), options=('Select Model', 'GAM', 'GLM', 'Tree'), style=Desc…
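Of the post-hoc methods listed above, permutation feature importance is the easiest to sketch from scratch: shuffle one feature at a time and measure how much the model's error grows. The NumPy example below illustrates the general idea, not PiML's implementation. The function name, the use of MSE as the score, the least-squares stand-in model, and the synthetic data are all assumptions.

```python
import numpy as np


def permutation_importance_mse(predict, X, y, n_repeats=5, random_state=0):
    """Mean increase in MSE when each column of X is shuffled in turn."""
    rng = np.random.default_rng(random_state)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link to y but keeps its distribution.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((predict(X_perm) - y) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances


# Synthetic data: y depends on column 0 only; column 1 is noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=400)

# Stand-in model: ordinary least squares with an intercept.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)


def predict(Z):
    return np.column_stack([Z, np.ones(len(Z))]) @ coef


scores = permutation_importance_mse(predict, X, y)
print(scores)
```

Because the method only needs a `predict` function, the same sketch applies to any fitted model, which is what makes it model-agnostic.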

In [11]:

# Model-specific inherent interpretation including feature importance, main effects and pairwise interactions.
exp.model_interpret()

VBox(children=(Dropdown(layout=Layout(width='20%'), options=('Select Model', 'GAM', 'GLM', 'Tree'), style=Desc…

Stage 4. Diagnose and Compare¶

In [12]:

exp.model_diagnose()

VBox(children=(Dropdown(layout=Layout(width='20%'), options=('Select Model', 'GAM', 'GLM', 'Tree'), style=Desc…

In [13]:

exp.model_compare()

VBox(children=(HBox(children=(Dropdown(layout=Layout(width='30%'), options=('Select Model', 'GAM', 'GLM', 'Tre…
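For a regression task such as this one, models are commonly compared on test-set metrics like MSE, MAE, and R2. The sketch below computes these with NumPy for two sets of predictions. It is only an illustration: `regression_metrics` is a hypothetical helper, the model names and predictions are made up, and the metrics shown by the comparison widget may differ.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mse = float(np.mean(resid ** 2))
    mae = float(np.mean(np.abs(resid)))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - float(np.sum(resid ** 2)) / ss_tot
    return {"MSE": mse, "MAE": mae, "R2": r2}


# Hypothetical predictions from two models on the same four test targets.
y_true = [1.0, 2.0, 3.0, 4.0]
preds = {
    "model_a": [1.0, 2.0, 3.0, 4.0],
    "model_b": [2.0, 2.0, 3.0, 3.0],
}
table = {name: regression_metrics(y_true, p) for name, p in preds.items()}
for name, metrics in table.items():
    print(name, metrics)
```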
