Set-up instructions: On binder, this should run out-of-the-box.
To run locally instead, ensure that skpro
with basic dependency requirements is installed in your python environment.
skpro
provides scikit-learn
-like, scikit-base
compatible interfaces to:
scikit-learn
regressors into probabilistic skpro
regressors, such as bootstrap or conformalpandas.DataFrame
-s and a pandas
-like interfaceSection 1 provides an overview of common probabilistic supervised regression workflows supported by skpro
.
Section 2 gives an more detailed introduction to prediction modes, performance metrics, and benchmarking tools.
Section 3 discusses advanced composition patterns, including various ways to add probabilistic capability to any sklearn
regressor, pipeline building, tuning, ensembling.
Section 4 gives an introduction to how to write custom estimators compliant with the skpro
interface.
# hide warnings
import warnings
warnings.filterwarnings("ignore")
skpro
revolves around supervised probabilistic regressors:
fit(X, y)
with tabular features X
, labels y
, same rows, both pd.DataFrame
predict_interval(X_test, coverage=0.90)
for interval predictions of labelspredict_quantiles(X_test, alpha=[0.05, 0.95])
for quantile predictions of labelspredict_var(X_test)
for variance predictions of labelspredict(X_test)
for mean predictionspredict_proba(X_test)
for distributional predictionskpro
regressors are used via fit
then predict_proba
etc.
Same as sklearn
regressors - X
and y
should be pd.DataFrame
(numpy
is also ok but not recommended)
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from skpro.regression.residual import ResidualDouble
# step 1: data specification
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_new, y_train, _ = train_test_split(X, y)
# step 2: specifying the regressor
# example - random forest for mean prediction
# linear regression for variance prediction
reg_mean = RandomForestRegressor()
reg_resid = LinearRegression()
reg_proba = ResidualDouble(reg_mean, reg_resid)
# step 3: fitting the model to training data
reg_proba.fit(X_train, y_train)
# step 4: predicting labels on new data
# probabilistic prediction modes - pick any or multiple
# we show the return types in detail below
# full distribution prediction
y_pred_proba = reg_proba.predict_proba(X_new)
# interval prediction
y_pred_interval = reg_proba.predict_interval(X_new, coverage=0.9)
# quantile prediction
y_pred_quantiles = reg_proba.predict_quantiles(X_new, alpha=[0.05, 0.5, 0.95])
# variance prediction
y_pred_var = reg_proba.predict_var(X_new)
# mean prediction is same as "classical" sklearn predict, also available
y_pred_mean = reg_proba.predict(X_new)
y_pred_proba
is an skpro
distribution - it has index and columns like pd.DataFrame
"we predict that true labels are distributed according to y_pred_proba
"
(here: distribution marginal by row/columns)
y_pred_proba = reg_proba.predict_proba(X_new)
y_pred_proba
Normal(columns=Index(['target'], dtype='object'), index=Index([287, 134, 213, 82, 251, 45, 350, 285, 22, 236, ... 218, 235, 166, 361, 272, 333, 219, 310, 178, 233], dtype='int64', length=111), mu=array([[158.75], [171.91], [ 86.32], [ 94.1 ], [251.72], [100.72], [262.36], [209.04], [ 85.69], [171.23], [182.31], [237.8 ], [179.06], [109.77], [ 90.58], [184.38], [135.62], [179.12], [ 70.68], [178.36]... [16.83771976], [19.10672381], [19.63878908], [16.46098141], [15.70910931], [12.90608099], [19.66465255], [20.89400588], [19.28096697], [22.39693358], [15.26815129], [18.49072135], [16.44625929], [14.43024188], [16.19206731], [21.27391581], [19.50839963], [12.64715474], [13.93531633], [17.53348762], [20.01785524], [19.57531732], [21.54329846], [13.07775327], [13.55384321]]))Please rerun this cell to show the HTML repr or trust the notebook.
Normal(columns=Index(['target'], dtype='object'), index=Index([287, 134, 213, 82, 251, 45, 350, 285, 22, 236, ... 218, 235, 166, 361, 272, 333, 219, 310, 178, 233], dtype='int64', length=111), mu=array([[158.75], [171.91], [ 86.32], [ 94.1 ], [251.72], [100.72], [262.36], [209.04], [ 85.69], [171.23], [182.31], [237.8 ], [179.06], [109.77], [ 90.58], [184.38], [135.62], [179.12], [ 70.68], [178.36]... [16.83771976], [19.10672381], [19.63878908], [16.46098141], [15.70910931], [12.90608099], [19.66465255], [20.89400588], [19.28096697], [22.39693358], [15.26815129], [18.49072135], [16.44625929], [14.43024188], [16.19206731], [21.27391581], [19.50839963], [12.64715474], [13.93531633], [17.53348762], [20.01785524], [19.57531732], [21.54329846], [13.07775327], [13.55384321]]))
skpro
distribution objects are pandas-like
y_pred_proba.shape
(111, 1)
y_pred_proba.index # same index as X_new
Index([287, 134, 213, 82, 251, 45, 350, 285, 22, 236, ... 218, 235, 166, 361, 272, 333, 219, 310, 178, 233], dtype='int64', length=111)
y_pred_proba.columns # same columns as X_new
Index(['target'], dtype='object')
distribution objects have sample
and methods such as mean
, var
:
y_pred_proba.sample().head()
target | |
---|---|
287 | 176.508550 |
134 | 161.825887 |
213 | 94.371796 |
82 | 107.464801 |
251 | 247.966573 |
y_pred_proba.mean().head()
target | |
---|---|
287 | 158.75 |
134 | 171.91 |
213 | 86.32 |
82 | 94.10 |
251 | 251.72 |
y_pred_proba.var().head()
target | |
---|---|
287 | 362.869338 |
134 | 271.792914 |
213 | 289.466582 |
82 | 125.589059 |
251 | 479.705452 |
more details on skpro
distributions in the "distributions" tutorial!
interval prediction y_pred_interval
is a pd.DataFrame
:
X_new
"we predict that value in row falls between bottom/upper with 90% chance"
y_pred_interval = reg_proba.predict_interval(X_new, coverage=0.9)
y_pred_interval.head()
target | ||
---|---|---|
0.9 | ||
lower | upper | |
287 | 127.416970 | 190.083030 |
134 | 144.792708 | 199.027292 |
213 | 58.334925 | 114.305075 |
82 | 75.666697 | 112.533303 |
251 | 215.694121 | 287.745879 |
multiple coverages can be passed:
coverage = [0.5, 0.9, 0.95]
y_pred_ints = reg_proba.predict_interval(X_new, coverage=coverage)
y_pred_ints
target | ||||||
---|---|---|---|---|---|---|
0.50 | 0.90 | 0.95 | ||||
lower | upper | lower | upper | lower | upper | |
287 | 145.901557 | 171.598443 | 127.416970 | 190.083030 | 121.414392 | 196.085608 |
134 | 160.790265 | 183.029735 | 144.792708 | 199.027292 | 139.597753 | 204.222247 |
213 | 74.844422 | 97.795578 | 58.334925 | 114.305075 | 52.973726 | 119.666274 |
82 | 86.541228 | 101.658772 | 75.666697 | 112.533303 | 72.135365 | 116.064635 |
251 | 236.947205 | 266.492795 | 215.694121 | 287.745879 | 208.792518 | 294.647482 |
... | ... | ... | ... | ... | ... | ... |
333 | 176.038162 | 203.041838 | 156.613558 | 222.466442 | 150.305725 | 228.774275 |
219 | 150.266649 | 176.673351 | 131.271468 | 195.668532 | 125.103083 | 201.836917 |
310 | 190.349266 | 219.410734 | 169.444427 | 240.315573 | 162.655911 | 247.104089 |
178 | 77.299189 | 94.940811 | 64.609010 | 107.630990 | 60.488075 | 111.751925 |
233 | 155.038072 | 173.321928 | 141.885912 | 186.474088 | 137.614955 | 190.745045 |
111 rows × 6 columns
predict_interval
output spec:
pandas.DataFrame
Row index is as for X_new
Column has multi-index:
1st level = variable names from y
in fit
2nd level = coverage fractions in coverage
3rd level = string "lower"
or "upper"
Entries = interval prediction of lower/upper interval at nominal coverage in 2nd lvl, for var in 1st lvl, for data index in row
quantile prediction y_pred_quantiles
is a pd.DataFrame
:
X_new
"we predict the 5%, 50%, 95% quantile points for the row to be here"
y_pred_quantiles = reg_proba.predict_quantiles(X_new, alpha=[0.05, 0.5, 0.95])
y_pred_quantiles.head()
target | |||
---|---|---|---|
0.05 | 0.50 | 0.95 | |
287 | 127.416970 | 158.75 | 190.083030 |
134 | 144.792708 | 171.91 | 199.027292 |
213 | 58.334925 | 86.32 | 114.305075 |
82 | 75.666697 | 94.10 | 112.533303 |
251 | 215.694121 | 251.72 | 287.745879 |
multiple quantiles:
alpha = [0.1, 0.25, 0.5, 0.75, 0.9]
y_pred_quantiles = reg_proba.predict_quantiles(X_new, alpha=alpha)
y_pred_quantiles
target | |||||
---|---|---|---|---|---|
0.10 | 0.25 | 0.50 | 0.75 | 0.90 | |
287 | 134.337558 | 145.901557 | 158.75 | 171.598443 | 183.162442 |
134 | 150.782158 | 160.790265 | 171.91 | 183.029735 | 193.037842 |
213 | 64.516044 | 74.844422 | 86.32 | 97.795578 | 108.123956 |
82 | 79.738097 | 86.541228 | 94.10 | 101.658772 | 108.461903 |
251 | 223.651228 | 236.947205 | 251.72 | 266.492795 | 279.788772 |
... | ... | ... | ... | ... | ... |
333 | 163.886086 | 176.038162 | 189.54 | 203.041838 | 215.193914 |
219 | 138.383221 | 150.266649 | 163.47 | 176.673351 | 188.556779 |
310 | 177.271152 | 190.349266 | 204.88 | 219.410734 | 232.488848 |
178 | 69.360185 | 77.299189 | 86.12 | 94.940811 | 102.879815 |
233 | 146.810051 | 155.038072 | 164.18 | 173.321928 | 181.549949 |
111 rows × 5 columns
predict_quantiles
output spec:
pandas.DataFrame
Row index is same as X_new
Column has multi-index:
1st level = variable names from y
in fit
2nd level = quantile points in alpha
Entries = quantiles prediction at quantile point in 2nd lvl, for var in 1st lvl, for data index in row
mean and variance predictions y_pred_mean
, y_pred_var
are pd.DataFrame
-s:
X_new
X_new
entries are predictive mean and variance in row/column
y_pred_mean = reg_proba.predict(X_new)
y_pred_var = reg_proba.predict_var(X_new)
y_pred_mean.head()
target | |
---|---|
287 | 158.75 |
134 | 171.91 |
213 | 86.32 |
82 | 94.10 |
251 | 251.72 |
y_pred_var.head()
target | |
---|---|
287 | 362.869338 |
134 | 271.792914 |
213 | 289.466582 |
82 | 125.589059 |
251 | 479.705452 |
this is the same as taking the distribution prediction and taking mean/variance
(for distribution objects that estimate these precisely)
y_pred_proba.mean().head()
target | |
---|---|
287 | 158.75 |
134 | 171.91 |
213 | 86.32 |
82 | 94.10 |
251 | 251.72 |
y_pred_proba.var().head()
target | |
---|---|
287 | 362.869338 |
134 | 271.792914 |
213 | 289.466582 |
82 | 125.589059 |
251 | 479.705452 |
for simple evaluation:
Note:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from skpro.metrics import CRPS
from skpro.regression.residual import ResidualDouble
# step 1: data specification
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# step 2: specifying the regressor
# example - linear regression for mean prediction
# random forest for variance prediction
reg_mean = LinearRegression()
reg_resid = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, reg_resid)
# step 3: fitting the model to training data
reg_proba.fit(X_train, y_train)
# step 4: predicting labels on new data
y_pred_proba = reg_proba.predict_proba(X_test)
# step 5: specifying evaluation metric
metric = CRPS()
# step 6: evaluat metric, compare predictions to actuals
metric(y_test, y_pred_proba)
29.341032127400844
how do we know that metric is of right type? Via scitype:y_pred
tag
metric.get_tags()
# scitype:y_pred is pred_proba - for proba predictions
{'estimator_type': 'estimator', 'object_type': 'metric', 'reserved_params': ['multioutput', 'score_average'], 'scitype:y_pred': 'pred_proba', 'lower_is_better': True}
how do we find metrics for a prediction type?
from skpro.registry import all_objects
all_objects("metric", as_dataframe=True, return_tags="scitype:y_pred")
name | object | scitype:y_pred | |
---|---|---|---|
0 | CRPS | <class 'skpro.metrics._classes.CRPS'> | pred_proba |
1 | ConstraintViolation | <class 'skpro.metrics._classes.ConstraintViola... | pred_interval |
2 | EmpiricalCoverage | <class 'skpro.metrics._classes.EmpiricalCovera... | pred_interval |
3 | LinearizedLogLoss | <class 'skpro.metrics._classes.LinearizedLogLo... | pred_proba |
4 | LogLoss | <class 'skpro.metrics._classes.LogLoss'> | pred_proba |
5 | PinballLoss | <class 'skpro.metrics._classes.PinballLoss'> | pred_quantiles |
6 | SquaredDistrLoss | <class 'skpro.metrics._classes.SquaredDistrLoss'> | pred_proba |
extra note: quantile metrics can be applied to interval predictions as well
more details on metrics below
some useful diagnostic visualisations: variants of crossplots for probabilistic predictions
A. crossplot ground truth vs prediction intervals.
Works with both proba and interval predictions.
What to look for: intervals shouhld cut through the x = y line (green points)
from skpro.utils.plotting import plot_crossplot_interval
plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)
<Axes: xlabel='Correct label $y_i$', ylabel='Prediction interval $\\widehat{y}_i$'>
from skpro.utils.plotting import plot_crossplot_interval
y_pred_interval = reg_proba.predict_interval(X_test, coverage=0.9)
plot_crossplot_interval(y_test, y_pred_interval)
<Axes: xlabel='Correct label $y_i$', ylabel='Prediction interval $\\widehat{y}_i$'>
B. crossplot residuals vs predictive standard deviation
Works with both proba and variance predictions.
What to look for: should be close to a line, high linear correlation
from skpro.utils.plotting import plot_crossplot_std
plot_crossplot_std(y_test, y_pred_proba)
<Axes: xlabel='Absolute errors $|y_i - \\widehat{y}_i|$', ylabel='Predictive standard deviation of $\\widehat{y}_i$'>
from skpro.utils.plotting import plot_crossplot_std
y_pred_var = reg_proba.predict_var(X_test)
plot_crossplot_std(y_test, y_pred_var)
<Axes: xlabel='Absolute errors $|y_i - \\widehat{y}_i|$', ylabel='Predictive standard deviation of $\\widehat{y}_i$'>
C. crossplot ground truth vs loss values
Loss and prediction type should agree.
What to look for: association between accuracy and ground truth value
Diagnostic of which values we can predict more accurately,
e.g., to inform modelling or identify unusual outliers
from skpro.utils.plotting import plot_crossplot_loss
crps_metric = CRPS()
plot_crossplot_loss(y_test, y_pred_proba, crps_metric)
<Axes: title={'center': 'mean CRPS: 29.34 +/- 1.97 sterr of mean'}, xlabel='Correct label $y_i$', ylabel='CRPS($y_i$, $\\widehat{y}_i$)'>
skpro
objects - scikit-base
interface, searching for regressors and metrics¶skpro
objects - skbase
interface points get_tags
, get_params
/set_params
all_objects
skpro
object interface ¶metrics and estimators are first-class citizens in skpro
, with a scikit-base
compatible interface
# example object 1: CRPS metric
from skpro.metrics import CRPS
crps_metric = CRPS()
# example object 2: ResidualDouble regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from skpro.regression.residual import ResidualDouble
reg_mean = LinearRegression()
reg_resid = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, reg_resid)
e.g., all have get_tags
interface
crps_metric.get_tags()
{'estimator_type': 'estimator', 'object_type': 'metric', 'reserved_params': ['multioutput', 'score_average'], 'scitype:y_pred': 'pred_proba', 'lower_is_better': True}
reg_proba.get_tags()
{'estimator_type': 'regressor_proba', 'object_type': 'regressor_proba', 'capability:multioutput': False, 'capability:missing': True}
the tag object_type
indicates the type of object, e.g., metric or proba regressor
all objects also have the get_params
/set_params
interface known from scikit-learn
= reading or setting hyper-parameters
get_params
returns dict
{paramname: paramvalue}
; set_params
writes it
crps_metric.get_params()
{'multioutput': 'uniform_average', 'multivariate': False}
composite objects have the nested param interface, keys componentname__paramname
# note that reg_proba has components LinearRegression and RandomForestaregressor
# each with their own parameters
reg_proba
ResidualDouble(estimator=LinearRegression(), estimator_resid=RandomForestRegressor())Please rerun this cell to show the HTML repr or trust the notebook.
ResidualDouble(estimator=LinearRegression(), estimator_resid=RandomForestRegressor())
LinearRegression()
RandomForestRegressor()
so reg_proba
will have parameters coming from itself and either component:
reg_proba.get_params()
{'cv': None, 'distr_loc_scale_name': None, 'distr_params': None, 'distr_type': 'Normal', 'estimator': LinearRegression(), 'estimator_resid': RandomForestRegressor(), 'min_scale': 1e-10, 'residual_trafo': 'absolute', 'use_y_pred': False, 'estimator__copy_X': True, 'estimator__fit_intercept': True, 'estimator__n_jobs': None, 'estimator__normalize': 'deprecated', 'estimator__positive': False, 'estimator_resid__bootstrap': True, 'estimator_resid__ccp_alpha': 0.0, 'estimator_resid__criterion': 'squared_error', 'estimator_resid__max_depth': None, 'estimator_resid__max_features': 1.0, 'estimator_resid__max_leaf_nodes': None, 'estimator_resid__max_samples': None, 'estimator_resid__min_impurity_decrease': 0.0, 'estimator_resid__min_samples_leaf': 1, 'estimator_resid__min_samples_split': 2, 'estimator_resid__min_weight_fraction_leaf': 0.0, 'estimator_resid__n_estimators': 100, 'estimator_resid__n_jobs': None, 'estimator_resid__oob_score': False, 'estimator_resid__random_state': None, 'estimator_resid__verbose': 0, 'estimator_resid__warm_start': False}
further common interface points are get_config
, set_config
, and get_fitted_params
(only fittable estimators)
as first-class citizens, all objects in skpro
are indexed via the registry
utility all_objects
.
To find probabilistic supervised regressors, use all_objects
with the type regressor_proba
:
from skpro.registry import all_objects
all_objects("regressor_proba", as_dataframe=True).head()
name | object | |
---|---|---|
0 | BaggingRegressor | <class 'skpro.regression.ensemble.BaggingRegre... |
1 | BootstrapRegressor | <class 'skpro.regression.bootstrap.BootstrapRe... |
2 | GridSearchCV | <class 'skpro.model_selection._tuning.GridSear... |
3 | Pipeline | <class 'skpro.regression.compose._pipeline.Pip... |
4 | RandomizedSearchCV | <class 'skpro.model_selection._tuning.Randomiz... |
a full list can also be found in the online API reference.
for metrics, as seen above:
from skpro.registry import all_objects
all_objects("metric", as_dataframe=True, return_tags="scitype:y_pred")
name | object | scitype:y_pred | |
---|---|---|---|
0 | CRPS | <class 'skpro.metrics._classes.CRPS'> | pred_proba |
1 | ConstraintViolation | <class 'skpro.metrics._classes.ConstraintViola... | pred_interval |
2 | EmpiricalCoverage | <class 'skpro.metrics._classes.EmpiricalCovera... | pred_interval |
3 | LinearizedLogLoss | <class 'skpro.metrics._classes.LinearizedLogLo... | pred_proba |
4 | LogLoss | <class 'skpro.metrics._classes.LogLoss'> | pred_proba |
5 | PinballLoss | <class 'skpro.metrics._classes.PinballLoss'> | pred_quantiles |
6 | SquaredDistrLoss | <class 'skpro.metrics._classes.SquaredDistrLoss'> | pred_proba |
all tags can be printed by the all_tags
utility:
# all tags applicable to metrics
from skpro.registry import all_tags
all_tags("metric", as_dataframe=True)
name | scitype | type | description | |
---|---|---|---|---|
0 | lower_is_better | metric | bool | whether lower (True) or higher (False) is better |
1 | scitype:y_pred | metric | str | expected input type for y_pred in performance ... |
# all tags applicable to probabilistic regressors
from skpro.registry import all_tags
all_tags("regressor_proba", as_dataframe=True)
name | scitype | type | description | |
---|---|---|---|---|
0 | capability:missing | regressor_proba | bool | whether estimator supports missing values |
1 | capability:multioutput | regressor_proba | bool | whether estimator supports multioutput regression |
filtering in search can be done with the filter_tags
argument in all_objects
, see docstring:
from skpro.registry import all_objects
# "retrieve all genuinely probabilistic loss functions"
all_objects("metric", as_dataframe=True, filter_tags={"scitype:y_pred": "pred_proba"})
name | object | |
---|---|---|
0 | CRPS | <class 'skpro.metrics._classes.CRPS'> |
1 | LinearizedLogLoss | <class 'skpro.metrics._classes.LinearizedLogLo... |
2 | LogLoss | <class 'skpro.metrics._classes.LogLoss'> |
3 | SquaredDistrLoss | <class 'skpro.metrics._classes.SquaredDistrLoss'> |
This section gives more details on:
readers familir with, or less interested in theory, may like to skip section 2.1
In supervised learning - probabilistic or not:
Let $y$ be the (true) value, for an observed feature $x$
(we consider $y$ a random variable)
Name | param | prediction/estimate of | skpro |
---|---|---|---|
point prediction | conditional expectation $\mathbb{E}[y|x]$ | predict |
|
variance prediction | conditional variance $Var[y|x]$ | predict_var |
|
quantile prediction | $\alpha\in (0,1)$ | $\alpha$-quantile of $y|x$ | predict_quantiles |
interval prediction | $c\in (0,1)$ | $[a,b]$ s.t. $P(a\le y \le b| x) = c$ | predict_interval |
distribution prediction | the law/distribution of $y|x$ | predict_proba |
let's consider the toy example again
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_new, y_train, _ = train_test_split(X, y)
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from skpro.regression.residual import ResidualDouble
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_new, y_train, _ = train_test_split(X, y)
reg_mean = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean)
reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_new)
Intuition: "out of many repetitions/worlds, this value is the arithmetic average of all observations".
# if y_pred_proba were *true*, here's how many repetitions would look like:
# repeating this line is "one repetition"
y_pred_proba.sample().head()
target | |
---|---|
421 | 216.595619 |
81 | 122.436986 |
326 | 147.317343 |
105 | 98.226950 |
416 | 171.539410 |
many_samples = y_pred_proba.sample(100)
many_samples
target | ||
---|---|---|
0 | 421 | 174.503233 |
81 | 128.299773 | |
326 | 174.547102 | |
105 | 85.523671 | |
416 | 166.025376 | |
... | ... | ... |
99 | 399 | 208.666436 |
109 | 188.566325 | |
22 | 76.930673 | |
174 | 156.354496 | |
106 | 96.456699 |
11100 rows × 1 columns
# "doing many times and taking the mean" -> usual point prediction
mean_prediction = many_samples.groupby(level=1, sort=False).mean()
mean_prediction.head()
target | |
---|---|
421 | 213.432031 |
81 | 126.286757 |
326 | 173.065740 |
105 | 100.126654 |
416 | 178.876668 |
# if we would do this infinity times instead of 100:
y_pred_proba.mean().head()
target | |
---|---|
421 | 213.57 |
81 | 122.14 |
326 | 172.48 |
105 | 98.97 |
416 | 180.31 |
Intuition: "out of many repetitions/worlds, this value is the average squared distance of the observation to the perfect point prediction".
# same as above - take many samples, and then compute element-wise statistics
var_prediction = many_samples.groupby(level=1, sort=False).var()
var_prediction.head()
target | |
---|---|
421 | 237.798822 |
81 | 262.881415 |
326 | 275.523556 |
105 | 401.240249 |
416 | 267.851337 |
# e.g., predict_var should give the same result as infinite large sample's variance
y_pred_proba.var().head()
target | |
---|---|
421 | 289.154099 |
81 | 289.154099 |
326 | 289.154099 |
105 | 289.154099 |
416 | 289.154099 |
Intuition: "out of many repetitions/worlds, a fraction of exactly $\alpha$ will have equal or smaller than this value."
Intuition: "out of many repetitions/worlds, a fraction of exactly $c$ will be contained in the interval $[a,b]$, and being above is equally likely as being below".
(similar - exercise left to the reader)
Intuition: exhaustive description of the generating mechanism of many repetitions/worlds.
note: the true distribution is unknown, and not accessible easily!
y_pred_proba
is a distribution, but in general not equal to the true one!
that is, there are:
y_pred_proba_true
- unknown and unknowable but estimabley_pred_proba
- our guess at y_pred_proba_true
y_true
is one y_pred_proba_true.sample()
predict
produces guess of y_pred_proba_true.mean()
predict_var
produces guess of y_pred_proba_true.var()
predict_quantiles([0.05, 0.5, 0.95])
produces guess of y_pred_proba_true.quantiles([0.05, 0.5, 0.95])
predict_proba
produces guess of y_pred_proba_true
the guesses are algorithm specific, and some algorithms are more accurate than others, given data
General usage pattern same as for sklearn
metrics:
but: need to use dedicated metric for probabilistic predictions
y_true
samplesy_predict_proba
, y_predict_interval
metric(y_true: 2D pd.DataFrame, y_pred: proba_prediction_type) -> float
Recall methods available for all probabilistic regressors:
predict_interval
produces interval predictions.
Argument coverage
(nominal interval coverage) must be provided.predict_quantiles
produces quantile predictions.
Argument alpha
(quantile values) must be provided.predict_var
produces variance predictions. Same args as predict
.predict_proba
produces full distributional predictions. Same args as predict
.Name | param | prediction/estimate of | skpro |
---|---|---|---|
point prediction | conditional expectation $\mathbb{E}[y|x]$ | predict |
|
variance prediction | conditional variance $Var[y|x]$ | predict_var |
|
quantile prediction | $\alpha\in (0,1)$ | $\alpha$-quantile of $y|x$ | predict_quantiles |
interval prediction | $c\in (0,1)$ | $[a,b]$ s.t. $P(a\le y \le b| x) = c$ | predict_interval |
distribution prediction | the law/distribution of $y|x$ | predict_proba |
let's produce some probabilistic predictions!
# 1. get some actuals and predictions
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# actuals = y_test
from sklearn.ensemble import RandomForestRegressor
from skpro.regression.residual import ResidualDouble
reg_mean = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean)
reg_proba.fit(X_train, y_train)
# use any of the probabilistic methods, we have seen this
y_pred_int = reg_proba.predict_interval(X_test, coverage=0.95)
y_pred_q = reg_proba.predict_quantiles(X_test, alpha=[0.05, 0.95])
y_pred_proba = reg_proba.predict_proba(X_test)
recall, all have their own output format:
y_pred_int # lower/upper intervals
target | ||
---|---|---|
0.95 | ||
lower | upper | |
104 | 112.79411 | 180.58589 |
334 | 63.02411 | 130.81589 |
271 | 116.11411 | 183.90589 |
342 | 155.14411 | 222.93589 |
243 | 51.14411 | 118.93589 |
... | ... | ... |
9 | 79.16411 | 146.95589 |
46 | 86.54411 | 154.33589 |
154 | 136.21411 | 204.00589 |
402 | 133.12411 | 200.91589 |
284 | 145.72411 | 213.51589 |
111 rows × 2 columns
y_pred_q # quantiles
target | ||
---|---|---|
0.05 | 0.95 | |
104 | 118.243673 | 175.136327 |
334 | 68.473673 | 125.366327 |
271 | 121.563673 | 178.456327 |
342 | 160.593673 | 217.486327 |
243 | 56.593673 | 113.486327 |
... | ... | ... |
9 | 84.613673 | 141.506327 |
46 | 91.993673 | 148.886327 |
154 | 141.663673 | 198.556327 |
402 | 138.573673 | 195.466327 |
284 | 151.173673 | 208.066327 |
111 rows × 2 columns
y_pred_proba # sktime/skpro BaseDistribution
Normal(columns=Index(['target'], dtype='object'), index=Index([104, 334, 271, 342, 243, 120, 183, 333, 315, 357, ... 313, 111, 137, 198, 413, 9, 46, 154, 402, 284], dtype='int64', length=111), mu=array([[146.69], [ 96.92], [150.01], [189.04], [ 85.04], [ 91.04], [202.31], [226.03], [ 90.7 ], [153.92], [263.72], [139.12], [ 81.82], [138.16], [144.98], [ 90.28], [229.32], [200.46], [ 83.39], [196.02],... [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897]]))Please rerun this cell to show the HTML repr or trust the notebook.
Normal(columns=Index(['target'], dtype='object'), index=Index([104, 334, 271, 342, 243, 120, 183, 333, 315, 357, ... 313, 111, 137, 198, 413, 9, 46, 154, 402, 284], dtype='int64', length=111), mu=array([[146.69], [ 96.92], [150.01], [189.04], [ 85.04], [ 91.04], [202.31], [226.03], [ 90.7 ], [153.92], [263.72], [139.12], [ 81.82], [138.16], [144.98], [ 90.28], [229.32], [200.46], [ 83.39], [196.02],... [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897], [17.29413897]]))
we now need to apply a suitable metric, metric(y_test, y_pred)
IMPORTANT: sequence matters, y_test
first; y_pred
has very different type!
# 2. specify metric
# CRPS = continuous ranked probability score, for distribution predictions
from skpro.metrics import CRPS
crps = CRPS()
# 3. evaluate metric
crps(y_test, y_pred_proba)
40.363339883507926
how do we find a metric that fits the prediction type?
answer: metrics are tagged
important tag: scitype:y_pred
"pred_proba"
- distributional, can applied to distributions, predict_proba
output"pred_quantiles"
- quantile forecast metric, can be applied to quantile predictions, interval predictions, distributional predictionspredict_quantiles
, predict_interval
, predict_proba
outputs"pred_interval"
- interval forecast metric, can be applied to interval predictions, distributional predictionspredict_interval
, predict_proba
outputscrps.get_tags()
{'estimator_type': 'estimator', 'object_type': 'metric', 'reserved_params': ['multioutput', 'score_average'], 'scitype:y_pred': 'pred_proba', 'lower_is_better': True}
listing metrics with the tag, filtering for probabilistic tags:
(let's try to find a quantile prediction metric!)
from skpro.registry import all_objects
all_objects(
"metric",
as_dataframe=True,
return_tags="scitype:y_pred",
)
name | object | scitype:y_pred | |
---|---|---|---|
0 | CRPS | <class 'skpro.metrics._classes.CRPS'> | pred_proba |
1 | ConstraintViolation | <class 'skpro.metrics._classes.ConstraintViola... | pred_interval |
2 | EmpiricalCoverage | <class 'skpro.metrics._classes.EmpiricalCovera... | pred_interval |
3 | LinearizedLogLoss | <class 'skpro.metrics._classes.LinearizedLogLo... | pred_proba |
4 | LogLoss | <class 'skpro.metrics._classes.LogLoss'> | pred_proba |
5 | PinballLoss | <class 'skpro.metrics._classes.PinballLoss'> | pred_quantiles |
6 | SquaredDistrLoss | <class 'skpro.metrics._classes.SquaredDistrLoss'> | pred_proba |
PinballLoss
is a quantile forecast metric:
from skpro.metrics import PinballLoss
pinball_loss = PinballLoss()
pinball_loss(y_test, y_pred_q)
13.648123260286354
... this is by default an average (grand average, float)
y_pred
/ y_test
(rows)alpha
values, quantile pointswhat if we don't want these averages?
multioutput
arg."raw_values"
prevents averaging, "uniform_average"
computes arithmetic mean.alpha
) or coverage (coverage
) is controlled by score_average
argevaluate_by_index
method# Example 1: Pinball loss by quantile point
loss_multi = PinballLoss(score_average=False)
loss_multi(y_test, y_pred_q)
0.05 13.133967 0.95 14.162279 Name: 0, dtype: float64
# Example 2: CRPS by test sample index
crps.evaluate_by_index(y_test, y_pred_proba)
target | |
---|---|
104 | 41.946574 |
334 | 16.320990 |
271 | 14.728135 |
342 | 6.761400 |
243 | 28.452058 |
... | ... |
9 | 187.182827 |
46 | 59.803051 |
154 | 18.026285 |
402 | 4.063702 |
284 | 15.229778 |
111 rows × 1 columns
Caveat: not every metric is an average over time points, e.g., RMSE
In this case, evaluate_by_index
computes jackknife pseudo-samples
(for mean statistics, jackknife pseudo-samples are equal to individual samples)
for quick evaluation and benchmarking,
the benchmarking.evaluate
utility can be used:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from skpro.benchmarking.evaluate import evaluate
from skpro.metrics import CRPS
from skpro.regression.residual import ResidualDouble
# 1. specify dataset
X, y = load_diabetes(return_X_y=True, as_frame=True)
# 2. specify estimator
estimator = ResidualDouble(LinearRegression())
# 3. specify cross-validation schema
cv = KFold(n_splits=3)
# 4. specify evaluation metric
crps = CRPS()
# 5. evaluate - run the benchmark
results = evaluate(estimator=estimator, X=X, y=y, cv=cv, scoring=crps)
# results are pd.DataFrame
# each row is one repetition of the cross-validation on one fold fit/predict/evaluate
# columns report performance, runtime, and other optional information (see docstring)
results
test_CRPS | fit_time | pred_time | len_y_train | |
---|---|---|---|---|
0 | 31.783451 | 0.002778 | 0.001045 | 294 |
1 | 33.574329 | 0.003086 | 0.001094 | 295 |
2 | 29.909655 | 0.003807 | 0.001278 | 295 |
we introduce a number of composition patterns available in skpro
:
sklearn
regressors into probabilistic onessklearn
transformers with skpro
regressorsskpro
probabilistic regressors via grid/random search, minimizing a probabilistic metricskpro
probabilistic regressorsdata used in this section:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
evaluation metric used in this section:
crps = CRPS()
sklearn
regressors probabilistic ¶there are many common algorithms that turn a non-probabilistic tabular regressor probabilistic
formally, this is a type of "reduction" - of probabilistic supervised tabular to non-probabilistic supervised tabular
Examples:
ResidualDouble
with standard settingsResidualDouble
BootstrapRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
# estimator specification - use any sklearn regressor for reg_mean
reg_mean = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, cv=KFold(5))
# cv is used to estimate out-of-sample residual variance via 5-fold CV
# note - in-sample predictions will usually underestimate the variance!
# fit and predict
reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_test)
# evaluate
crps(y_test, y_pred_proba)
36.25453698363302
from skpro.utils.plotting import plot_crossplot_interval
plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)
<Axes: xlabel='Correct label $y_i$', ylabel='Prediction interval $\\widehat{y}_i$'>
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
# estimator specification - use any sklearn regressor for reg_mean and reg_resid
reg_mean = RandomForestRegressor()
reg_resid = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, estimator_resid=reg_resid, cv=KFold(5))
# cv is used to estimate out-of-sample residual variance via 5-fold CV
# fit and predict
reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_test)
# evaluate
crps(y_test, y_pred_proba)
36.860503015336164
from skpro.utils.plotting import plot_crossplot_interval
plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)
<Axes: xlabel='Correct label $y_i$', ylabel='Prediction interval $\\widehat{y}_i$'>
from sklearn.linear_model import LinearRegression
from skpro.regression.bootstrap import BootstrapRegressor
# estimator specification - use any sklearn regressor for reg_mean
reg_mean = LinearRegression()
reg_proba = BootstrapRegressor(reg_mean, n_bootstrap_samples=100)
# fit and predict
reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_test)
# evaluate
crps(y_test, y_pred_proba)
43.13652636195608
from skpro.utils.plotting import plot_crossplot_interval
plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)
<Axes: xlabel='Correct label $y_i$', ylabel='Prediction interval $\\widehat{y}_i$'>
skpro
regressor and sklearn
transformers ¶skpro
regressors can be pipelined with sklearn
transformers, using the skpro
pipeline.
This ensure presence of predict_proba
etc in the pipeline object.
The syntax is exactly the same as for sklearn
's pipeline.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.impute import SimpleImputer as Imputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from skpro.regression.compose import Pipeline
from skpro.regression.residual import ResidualDouble
# estimator specification
reg_mean = LinearRegression()
reg_proba = ResidualDouble(reg_mean)
# pipeline is specified as a list of tuples (name, estimator)
pipe = Pipeline(
steps=[
("imputer", Imputer()), # an sklearn transformer
("scaler", MinMaxScaler()), # an sklearn transformer
("regressor", reg_proba), # an skpro regressor
]
)
pipe
Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', MinMaxScaler()), ('regressor', ResidualDouble(estimator=LinearRegression()))])Please rerun this cell to show the HTML repr or trust the notebook.
Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', MinMaxScaler()), ('regressor', ResidualDouble(estimator=LinearRegression()))])
SimpleImputer()
MinMaxScaler()
ResidualDouble(estimator=LinearRegression())
LinearRegression()
# the pipeline behaves as any skpro regressor
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X=X_test)
y_pred_proba = pipe.predict_proba(X=X_test)
the pipeline provides the familiar nested get_params
, set_params
interface:
nested parameters are keyed componentname__parametername
pipe.get_params()
{'steps': [('imputer', SimpleImputer()), ('scaler', MinMaxScaler()), ('regressor', ResidualDouble(estimator=LinearRegression()))], 'imputer': SimpleImputer(), 'scaler': MinMaxScaler(), 'regressor': ResidualDouble(estimator=LinearRegression()), 'imputer__add_indicator': False, 'imputer__copy': True, 'imputer__fill_value': None, 'imputer__missing_values': nan, 'imputer__strategy': 'mean', 'imputer__verbose': 'deprecated', 'scaler__clip': False, 'scaler__copy': True, 'scaler__feature_range': (0, 1), 'regressor__cv': None, 'regressor__distr_loc_scale_name': None, 'regressor__distr_params': None, 'regressor__distr_type': 'Normal', 'regressor__estimator': LinearRegression(), 'regressor__estimator_resid': None, 'regressor__min_scale': 1e-10, 'regressor__residual_trafo': 'absolute', 'regressor__use_y_pred': False, 'regressor__estimator__copy_X': True, 'regressor__estimator__fit_intercept': True, 'regressor__estimator__n_jobs': None, 'regressor__estimator__normalize': 'deprecated', 'regressor__estimator__positive': False}
pipelines can also be created via simple lists of estimators,
in this case names are generated automatically:
# pipeline is specified as a list of tuples (name, estimator)
pipe = Pipeline(
steps=[
Imputer(), # an sklearn transformer
MinMaxScaler(), # an sklearn transformer
reg_proba, # an skpro regressor
]
)
skpro
regressors via grid and random search ¶skpro
provides grid and random search tuners to tune arbitrary probabilistic regressors,
using probabilistic metrics. Besides this, they function as the sklearn
tuners do.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from skpro.metrics import CRPS
from skpro.model_selection import GridSearchCV
from skpro.regression.residual import ResidualDouble
# cross-validation specification for tuner
cv = KFold(n_splits=3)
# estimator to be tuned
estimator = ResidualDouble(LinearRegression())
# tuning grid - do we fit an intercept in the linear regression?
param_grid = {"estimator__fit_intercept": [True, False]}
# metric to be optimized
crps_metric = CRPS()
# specification of the grid search tuner
gscv = GridSearchCV(
estimator=estimator,
param_grid=param_grid,
cv=cv,
scoring=crps_metric,
)
gscv
GridSearchCV(cv=KFold(n_splits=3, random_state=None, shuffle=False), estimator=ResidualDouble(estimator=LinearRegression()), param_grid={'estimator__fit_intercept': [True, False]}, scoring=CRPS())Please rerun this cell to show the HTML repr or trust the notebook.
GridSearchCV(cv=KFold(n_splits=3, random_state=None, shuffle=False), estimator=ResidualDouble(estimator=LinearRegression()), param_grid={'estimator__fit_intercept': [True, False]}, scoring=CRPS())
LinearRegression()
CRPS()
the grid search tuner behaves like any skpro
probabilistic regressor:
gscv.fit(X_train, y_train)
y_pred = gscv.predict(X_test)
y_pred_proba = gscv.predict_proba(X_test)
random search is similar, except that instead of a grid a parameter sampler should be specified:
from skpro.model_selection import RandomizedSearchCV
# only difference to GridSearchCV is the param_distributions argument
# specification of the random search parameter sampler
param_distributions = {"estimator__fit_intercept": [True, False]}
# specification of the random search tuner
rscv = RandomizedSearchCV(
estimator=estimator,
param_distributions=param_distributions,
cv=cv,
scoring=crps_metric,
)
Classical bagging does the following, for a wrapped estimator:
In fit
:
X
, y
to X_subs
, y_subs
X_subs
, y_subs
n_estimators
times, store that many fitted clones.In predict
, for X_test
:
X_test
- these are distributionsfrom sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.linear_model import LinearRegression
from skpro.regression.ensemble import BaggingRegressor
from skpro.regression.residual import ResidualDouble
reg_mean = LinearRegression()
reg_proba = ResidualDouble(reg_mean)
ens = BaggingRegressor(reg_proba, n_estimators=10)
ens.fit(X_train, y_train)
y_pred = ens.predict_proba(X_test)
# y_pred is a mixture distribution!
str(y_pred)
"Mixture(columns=Index(['target'], dtype='object'),\n distributions=[Normal(columns=Index(['target'], dtype='object'),\n index=Index([367, 248, 134, 148, 21, 310, 37, 305, 23, 373,\n ...\n 365, 204, 309, 379, 7, 440, 178, 160, 119, 441],\n dtype='int64', length=111),\n mu=array([[270.91496133],\n [227.0209857 ],\n [153.30255219],\n [133.24435057],\n [108.55506881],\n [206.51166703],\n [153.83348539],\n [120.5...\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484],\n [45.96784484]]))],\n index=Index([367, 248, 134, 148, 21, 310, 37, 305, 23, 373,\n ...\n 365, 204, 309, 379, 7, 440, 178, 160, 119, 441],\n dtype='int64', length=111))"
[type(x) for x in y_pred.distributions]
[skpro.distributions.normal.Normal, skpro.distributions.normal.Normal, skpro.distributions.normal.Normal, skpro.distributions.normal.Normal, skpro.distributions.normal.Normal, skpro.distributions.normal.Normal, skpro.distributions.normal.Normal, skpro.distributions.normal.Normal, skpro.distributions.normal.Normal, skpro.distributions.normal.Normal]
skpro
is meant to be easily extensible, for direct contribution to skpro
as well as for local/private extension with custom methods.
To get started:
python
file with todo
blocks that mark the places in which changes need to be added.skpro
or affiliated repository (if contributed extension), inside skpro.regression
; rename the file and update the file docstring appropriately.__init__
, _fit
, and at least one of the probabilistic prediction methods, preferably _predict_proba
(for details see the extension template). You can add private methods as long as they do not override the default public interface. For more details, see the extension template.skpro.utils.check_estimator
on your estimator. You can call this on a class or object instance. Ensure you have specified test parameters in the get_test_params
method, according to the extension template.In case of direct contribution to skpro
or one of its affiliated packages, additionally:
CODEOWNERS
for the new estimator file(s).skpro
is a unified interface toolbox for probabilistic supervised regression, that is, for prediction intervals, quantiles, fully distributional predictions, in a tabular regression setting. The interface is fully interoperable with scikit-learn
and scikit-base
interface specifications.
skpro
comes with rich composition functionality that allows to build complex pipelines easily, and connect easily with other parts of the open source ecosystem, such as scikit-learn
and individual algorithm libraries.
skpro
is easy to extend, and comes with user friendly tools to facilitate implementing and testing your own probabilistic regressors and composition principles.