skpro introduction notebook¶

Set-up instructions: On binder, this should run out-of-the-box.

To run locally instead, ensure that skpro with basic dependency requirements is installed in your python environment.

skpro provides scikit-learn-like, scikit-base compatible interfaces to:

tabular supervised regressors with probabilistic prediction modes - interval, quantile and distribution predictions
performance metrics to evaluate probabilistic predictions, e.g., pinball loss, empirical coverage, CRPS
reductions to turn non-probabilistic, scikit-learn regressors into probabilistic skpro regressors, such as bootstrap or conformal
tools for building pipelines and composite machine learning models, including tuning via probabilistic performance metrics
symbolic an lazy probability distributions with a value domain of pandas.DataFrame-s and a pandas-like interface

Section 1 provides an overview of common probabilistic supervised regression workflows supported by skpro.

Section 2 gives an more detailed introduction to prediction modes, performance metrics, and benchmarking tools.

Section 3 discusses advanced composition patterns, including various ways to add probabilistic capability to any sklearn regressor, pipeline building, tuning, ensembling.

Section 4 gives an introduction to how to write custom estimators compliant with the skpro interface.

In [1]:

# hide warnings
import warnings

warnings.filterwarnings("ignore")

1. Basic probabilistic supervised regression workflows ¶

skpro revolves around supervised probabilistic regressors:

fit(X, y) with tabular features X, labels y, same rows, both pd.DataFrame
predict_interval(X_test, coverage=0.90) for interval predictions of labels
predict_quantiles(X_test, alpha=[0.05, 0.95]) for quantile predictions of labels
predict_var(X_test) for variance predictions of labels
predict(X_test) for mean predictions
predict_proba(X_test) for distributional prediction

1.1 basic deployment workflow¶

skpro regressors are used via fit then predict_proba etc.

Same as sklearn regressors - X and y should be pd.DataFrame (numpy is also ok but not recommended)

In [2]:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from skpro.regression.residual import ResidualDouble

# step 1: data specification
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_new, y_train, _ = train_test_split(X, y)

# step 2: specifying the regressor
# example - random forest for mean prediction
# linear regression for variance prediction
reg_mean = RandomForestRegressor()
reg_resid = LinearRegression()
reg_proba = ResidualDouble(reg_mean, reg_resid)

# step 3: fitting the model to training data
reg_proba.fit(X_train, y_train)

# step 4: predicting labels on new data

# probabilistic prediction modes - pick any or multiple
# we show the return types in detail below

# full distribution prediction
y_pred_proba = reg_proba.predict_proba(X_new)

# interval prediction
y_pred_interval = reg_proba.predict_interval(X_new, coverage=0.9)

# quantile prediction
y_pred_quantiles = reg_proba.predict_quantiles(X_new, alpha=[0.05, 0.5, 0.95])

# variance prediction
y_pred_var = reg_proba.predict_var(X_new)

# mean prediction is same as "classical" sklearn predict, also available
y_pred_mean = reg_proba.predict(X_new)

1.1.1 distribution predictions¶

y_pred_proba is an skpro distribution - it has index and columns like pd.DataFrame

"we predict that true labels are distributed according to y_pred_proba"

(here: distribution marginal by row/columns)

In [3]:

y_pred_proba = reg_proba.predict_proba(X_new)
y_pred_proba

Out[3]:

Normal(columns=Index(['target'], dtype='object'),
       index=Index([287, 134, 213,  82, 251,  45, 350, 285,  22, 236,
       ...
       218, 235, 166, 361, 272, 333, 219, 310, 178, 233],
      dtype='int64', length=111),
       mu=array([[158.75],
       [171.91],
       [ 86.32],
       [ 94.1 ],
       [251.72],
       [100.72],
       [262.36],
       [209.04],
       [ 85.69],
       [171.23],
       [182.31],
       [237.8 ],
       [179.06],
       [109.77],
       [ 90.58],
       [184.38],
       [135.62],
       [179.12],
       [ 70.68],
       [178.36]...
       [16.83771976],
       [19.10672381],
       [19.63878908],
       [16.46098141],
       [15.70910931],
       [12.90608099],
       [19.66465255],
       [20.89400588],
       [19.28096697],
       [22.39693358],
       [15.26815129],
       [18.49072135],
       [16.44625929],
       [14.43024188],
       [16.19206731],
       [21.27391581],
       [19.50839963],
       [12.64715474],
       [13.93531633],
       [17.53348762],
       [20.01785524],
       [19.57531732],
       [21.54329846],
       [13.07775327],
       [13.55384321]]))

Please rerun this cell to show the HTML repr or trust the notebook.

skpro distribution objects are pandas-like

In [4]:

y_pred_proba.shape

Out[4]:

(111, 1)

In [5]:

y_pred_proba.index  # same index as X_new

Out[5]:

Index([287, 134, 213,  82, 251,  45, 350, 285,  22, 236,
       ...
       218, 235, 166, 361, 272, 333, 219, 310, 178, 233],
      dtype='int64', length=111)

In [6]:

y_pred_proba.columns  # same columns as X_new

Out[6]:

Index(['target'], dtype='object')

distribution objects have sample and methods such as mean, var:

In [7]:

y_pred_proba.sample().head()

Out[7]:

	target
287	176.508550
134	161.825887
213	94.371796
82	107.464801
251	247.966573

In [8]:

y_pred_proba.mean().head()

Out[8]:

	target
287	158.75
134	171.91
213	86.32
82	94.10
251	251.72

In [9]:

y_pred_proba.var().head()

Out[9]:

	target
287	362.869338
134	271.792914
213	289.466582
82	125.589059
251	479.705452

more details on skpro distributions in the "distributions" tutorial!

1.1.2 interval predictions¶

interval prediction y_pred_interval is a pd.DataFrame:

rows are the same as X_new
columns indicate variables, nominal coverage, and bottom/upper bound

"we predict that value in row falls between bottom/upper with 90% chance"

In [10]:

y_pred_interval = reg_proba.predict_interval(X_new, coverage=0.9)
y_pred_interval.head()

Out[10]:

	target
	0.9
	lower	upper
287	127.416970	190.083030
134	144.792708	199.027292
213	58.334925	114.305075
82	75.666697	112.533303
251	215.694121	287.745879

multiple coverages can be passed:

In [11]:

coverage = [0.5, 0.9, 0.95]
y_pred_ints = reg_proba.predict_interval(X_new, coverage=coverage)
y_pred_ints

Out[11]:

	target
	0.50		0.90		0.95
	lower	upper	lower	upper	lower	upper
287	145.901557	171.598443	127.416970	190.083030	121.414392	196.085608
134	160.790265	183.029735	144.792708	199.027292	139.597753	204.222247
213	74.844422	97.795578	58.334925	114.305075	52.973726	119.666274
82	86.541228	101.658772	75.666697	112.533303	72.135365	116.064635
251	236.947205	266.492795	215.694121	287.745879	208.792518	294.647482
...	...	...	...	...	...	...
333	176.038162	203.041838	156.613558	222.466442	150.305725	228.774275
219	150.266649	176.673351	131.271468	195.668532	125.103083	201.836917
310	190.349266	219.410734	169.444427	240.315573	162.655911	247.104089
178	77.299189	94.940811	64.609010	107.630990	60.488075	111.751925
233	155.038072	173.321928	141.885912	186.474088	137.614955	190.745045

111 rows × 6 columns

predict_interval output spec:

pandas.DataFrame
Row index is as for X_new
Column has multi-index:
1st level = variable names from y in fit
2nd level = coverage fractions in coverage
3rd level = string "lower" or "upper"

Entries = interval prediction of lower/upper interval at nominal coverage in 2nd lvl, for var in 1st lvl, for data index in row

1.1.3 quantile predictions¶

quantile prediction y_pred_quantiles is a pd.DataFrame:

rows are the same as X_new
columns indicate variables, quantile points

"we predict the 5%, 50%, 95% quantile points for the row to be here"

In [12]:

y_pred_quantiles = reg_proba.predict_quantiles(X_new, alpha=[0.05, 0.5, 0.95])
y_pred_quantiles.head()

Out[12]:

	target
	0.05	0.50	0.95
287	127.416970	158.75	190.083030
134	144.792708	171.91	199.027292
213	58.334925	86.32	114.305075
82	75.666697	94.10	112.533303
251	215.694121	251.72	287.745879

multiple quantiles:

In [13]:

alpha = [0.1, 0.25, 0.5, 0.75, 0.9]
y_pred_quantiles = reg_proba.predict_quantiles(X_new, alpha=alpha)
y_pred_quantiles

Out[13]:

	target
	0.10	0.25	0.50	0.75	0.90
287	134.337558	145.901557	158.75	171.598443	183.162442
134	150.782158	160.790265	171.91	183.029735	193.037842
213	64.516044	74.844422	86.32	97.795578	108.123956
82	79.738097	86.541228	94.10	101.658772	108.461903
251	223.651228	236.947205	251.72	266.492795	279.788772
...	...	...	...	...	...
333	163.886086	176.038162	189.54	203.041838	215.193914
219	138.383221	150.266649	163.47	176.673351	188.556779
310	177.271152	190.349266	204.88	219.410734	232.488848
178	69.360185	77.299189	86.12	94.940811	102.879815
233	146.810051	155.038072	164.18	173.321928	181.549949

111 rows × 5 columns

predict_quantiles output spec:

pandas.DataFrame
Row index is same as X_new
Column has multi-index:
1st level = variable names from y in fit
2nd level = quantile points in alpha

Entries = quantiles prediction at quantile point in 2nd lvl, for var in 1st lvl, for data index in row

1.1.4 mean and variance predictions¶

mean and variance predictions y_pred_mean, y_pred_var are pd.DataFrame-s:

rows are the same as X_new
columns are the same as X_new

entries are predictive mean and variance in row/column

In [14]:

y_pred_mean = reg_proba.predict(X_new)
y_pred_var = reg_proba.predict_var(X_new)

In [15]:

y_pred_mean.head()

Out[15]:

	target
287	158.75
134	171.91
213	86.32
82	94.10
251	251.72

In [16]:

y_pred_var.head()

Out[16]:

	target
287	362.869338
134	271.792914
213	289.466582
82	125.589059
251	479.705452

this is the same as taking the distribution prediction and taking mean/variance

(for distribution objects that estimate these precisely)

In [17]:

y_pred_proba.mean().head()

Out[17]:

	target
287	158.75
134	171.91
213	86.32
82	94.10
251	251.72

In [18]:

y_pred_proba.var().head()

Out[18]:

	target
287	362.869338
134	271.792914
213	289.466582
82	125.589059
251	479.705452

1.2 simple evaluation workflow for probabilistic predictions¶

for simple evaluation:

split the data into train/test set
make predictions of either type for test features
compute metric on test set, comparing test predictions to hend out test labels

Note:

metrics will compare tabular ground truth to probabilistic prediction
the metric will needs to be of a compatible type, e.g., for proba predictions

In [19]:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from skpro.metrics import CRPS
from skpro.regression.residual import ResidualDouble

# step 1: data specification
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# step 2: specifying the regressor
# example - linear regression for mean prediction
# random forest for variance prediction
reg_mean = LinearRegression()
reg_resid = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, reg_resid)

# step 3: fitting the model to training data
reg_proba.fit(X_train, y_train)

# step 4: predicting labels on new data
y_pred_proba = reg_proba.predict_proba(X_test)

# step 5: specifying evaluation metric
metric = CRPS()

# step 6: evaluat metric, compare predictions to actuals
metric(y_test, y_pred_proba)

Out[19]:

29.341032127400844

how do we know that metric is of right type? Via scitype:y_pred tag

In [20]:

metric.get_tags()
# scitype:y_pred is pred_proba - for proba predictions

Out[20]:

{'estimator_type': 'estimator',
 'object_type': 'metric',
 'reserved_params': ['multioutput', 'score_average'],
 'scitype:y_pred': 'pred_proba',
 'lower_is_better': True}

how do we find metrics for a prediction type?

In [21]:

from skpro.registry import all_objects

all_objects("metric", as_dataframe=True, return_tags="scitype:y_pred")

Out[21]:

	name	object	scitype:y_pred
0	CRPS	<class 'skpro.metrics._classes.CRPS'>	pred_proba
1	ConstraintViolation	<class 'skpro.metrics._classes.ConstraintViola...	pred_interval
2	EmpiricalCoverage	<class 'skpro.metrics._classes.EmpiricalCovera...	pred_interval
3	LinearizedLogLoss	<class 'skpro.metrics._classes.LinearizedLogLo...	pred_proba
4	LogLoss	<class 'skpro.metrics._classes.LogLoss'>	pred_proba
5	PinballLoss	<class 'skpro.metrics._classes.PinballLoss'>	pred_quantiles
6	SquaredDistrLoss	<class 'skpro.metrics._classes.SquaredDistrLoss'>	pred_proba

extra note: quantile metrics can be applied to interval predictions as well

more details on metrics below

1.3 diagnostic visualisations¶

some useful diagnostic visualisations: variants of crossplots for probabilistic predictions

A. crossplot ground truth vs prediction intervals.

Works with both proba and interval predictions.

What to look for: intervals shouhld cut through the x = y line (green points)

In [22]:

from skpro.utils.plotting import plot_crossplot_interval

plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)

Out[22]:

<Axes: xlabel='Correct label $y_i$', ylabel='Prediction interval $\\widehat{y}_i$'>

In [23]:

from skpro.utils.plotting import plot_crossplot_interval

y_pred_interval = reg_proba.predict_interval(X_test, coverage=0.9)
plot_crossplot_interval(y_test, y_pred_interval)

Out[23]:

<Axes: xlabel='Correct label $y_i$', ylabel='Prediction interval $\\widehat{y}_i$'>

B. crossplot residuals vs predictive standard deviation

Works with both proba and variance predictions.

What to look for: should be close to a line, high linear correlation

In [24]:

from skpro.utils.plotting import plot_crossplot_std

plot_crossplot_std(y_test, y_pred_proba)

Out[24]:

<Axes: xlabel='Absolute errors $|y_i - \\widehat{y}_i|$', ylabel='Predictive standard deviation of $\\widehat{y}_i$'>

In [25]:

from skpro.utils.plotting import plot_crossplot_std

y_pred_var = reg_proba.predict_var(X_test)
plot_crossplot_std(y_test, y_pred_var)

Out[25]:

<Axes: xlabel='Absolute errors $|y_i - \\widehat{y}_i|$', ylabel='Predictive standard deviation of $\\widehat{y}_i$'>

C. crossplot ground truth vs loss values

Loss and prediction type should agree.

What to look for: association between accuracy and ground truth value

Diagnostic of which values we can predict more accurately,

e.g., to inform modelling or identify unusual outliers

In [26]:

from skpro.utils.plotting import plot_crossplot_loss

crps_metric = CRPS()
plot_crossplot_loss(y_test, y_pred_proba, crps_metric)

Out[26]:

<Axes: title={'center': 'mean CRPS: 29.34 +/- 1.97 sterr of mean'}, xlabel='Correct label $y_i$', ylabel='CRPS($y_i$, $\\widehat{y}_i$)'>

1.4 `skpro` objects - `scikit-base` interface, searching for regressors and metrics¶

skpro objects - skbase interface points get_tags, get_params/set_params
searching estimators and metrics via all_objects

1.4.1 primer on `skpro` object interface ¶

metrics and estimators are first-class citizens in skpro, with a scikit-base compatible interface

In [27]:

# example object 1: CRPS metric
from skpro.metrics import CRPS

crps_metric = CRPS()

# example object 2: ResidualDouble regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

from skpro.regression.residual import ResidualDouble

reg_mean = LinearRegression()
reg_resid = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, reg_resid)

e.g., all have get_tags interface

In [28]:

crps_metric.get_tags()

Out[28]:

{'estimator_type': 'estimator',
 'object_type': 'metric',
 'reserved_params': ['multioutput', 'score_average'],
 'scitype:y_pred': 'pred_proba',
 'lower_is_better': True}

In [29]:

reg_proba.get_tags()

Out[29]:

{'estimator_type': 'regressor_proba',
 'object_type': 'regressor_proba',
 'capability:multioutput': False,
 'capability:missing': True}

the tag object_type indicates the type of object, e.g., metric or proba regressor

all objects also have the get_params/set_params interface known from scikit-learn

= reading or setting hyper-parameters

get_params returns dict {paramname: paramvalue}; set_params writes it

In [30]:

crps_metric.get_params()

Out[30]:

{'multioutput': 'uniform_average', 'multivariate': False}

composite objects have the nested param interface, keys componentname__paramname

In [31]:

# note that reg_proba has components LinearRegression and RandomForestaregressor
# each with their own parameters
reg_proba

Out[31]:

ResidualDouble(estimator=LinearRegression(),
               estimator_resid=RandomForestRegressor())

Please rerun this cell to show the HTML repr or trust the notebook.

so reg_proba will have parameters coming from itself and either component:

In [32]:

reg_proba.get_params()

Out[32]:

{'cv': None,
 'distr_loc_scale_name': None,
 'distr_params': None,
 'distr_type': 'Normal',
 'estimator': LinearRegression(),
 'estimator_resid': RandomForestRegressor(),
 'min_scale': 1e-10,
 'residual_trafo': 'absolute',
 'use_y_pred': False,
 'estimator__copy_X': True,
 'estimator__fit_intercept': True,
 'estimator__n_jobs': None,
 'estimator__normalize': 'deprecated',
 'estimator__positive': False,
 'estimator_resid__bootstrap': True,
 'estimator_resid__ccp_alpha': 0.0,
 'estimator_resid__criterion': 'squared_error',
 'estimator_resid__max_depth': None,
 'estimator_resid__max_features': 1.0,
 'estimator_resid__max_leaf_nodes': None,
 'estimator_resid__max_samples': None,
 'estimator_resid__min_impurity_decrease': 0.0,
 'estimator_resid__min_samples_leaf': 1,
 'estimator_resid__min_samples_split': 2,
 'estimator_resid__min_weight_fraction_leaf': 0.0,
 'estimator_resid__n_estimators': 100,
 'estimator_resid__n_jobs': None,
 'estimator_resid__oob_score': False,
 'estimator_resid__random_state': None,
 'estimator_resid__verbose': 0,
 'estimator_resid__warm_start': False}

further common interface points are get_config, set_config, and get_fitted_params (only fittable estimators)

1.4.2 searching for regressors and metrics ¶

as first-class citizens, all objects in skpro are indexed via the registry utility all_objects.

To find probabilistic supervised regressors, use all_objects with the type regressor_proba:

In [33]:

from skpro.registry import all_objects

all_objects("regressor_proba", as_dataframe=True).head()

Out[33]:

	name	object
0	BaggingRegressor	<class 'skpro.regression.ensemble.BaggingRegre...
1	BootstrapRegressor	<class 'skpro.regression.bootstrap.BootstrapRe...
2	GridSearchCV	<class 'skpro.model_selection._tuning.GridSear...
3	Pipeline	<class 'skpro.regression.compose._pipeline.Pip...
4	RandomizedSearchCV	<class 'skpro.model_selection._tuning.Randomiz...

a full list can also be found in the online API reference.

for metrics, as seen above:

In [34]:

from skpro.registry import all_objects

all_objects("metric", as_dataframe=True, return_tags="scitype:y_pred")

Out[34]:

	name	object	scitype:y_pred
0	CRPS	<class 'skpro.metrics._classes.CRPS'>	pred_proba
1	ConstraintViolation	<class 'skpro.metrics._classes.ConstraintViola...	pred_interval
2	EmpiricalCoverage	<class 'skpro.metrics._classes.EmpiricalCovera...	pred_interval
3	LinearizedLogLoss	<class 'skpro.metrics._classes.LinearizedLogLo...	pred_proba
4	LogLoss	<class 'skpro.metrics._classes.LogLoss'>	pred_proba
5	PinballLoss	<class 'skpro.metrics._classes.PinballLoss'>	pred_quantiles
6	SquaredDistrLoss	<class 'skpro.metrics._classes.SquaredDistrLoss'>	pred_proba

all tags can be printed by the all_tags utility:

In [35]:

# all tags applicable to metrics
from skpro.registry import all_tags

all_tags("metric", as_dataframe=True)

Out[35]:

	name	scitype	type	description
0	lower_is_better	metric	bool	whether lower (True) or higher (False) is better
1	scitype:y_pred	metric	str	expected input type for y_pred in performance ...

In [36]:

# all tags applicable to probabilistic regressors
from skpro.registry import all_tags

all_tags("regressor_proba", as_dataframe=True)

Out[36]:

	name	scitype	type	description
0	capability:missing	regressor_proba	bool	whether estimator supports missing values
1	capability:multioutput	regressor_proba	bool	whether estimator supports multioutput regression

filtering in search can be done with the filter_tags argument in all_objects, see docstring:

In [37]:

from skpro.registry import all_objects

# "retrieve all genuinely probabilistic loss functions"
all_objects("metric", as_dataframe=True, filter_tags={"scitype:y_pred": "pred_proba"})

Out[37]:

	name	object
0	CRPS	<class 'skpro.metrics._classes.CRPS'>
1	LinearizedLogLoss	<class 'skpro.metrics._classes.LinearizedLogLo...
2	LogLoss	<class 'skpro.metrics._classes.LogLoss'>
3	SquaredDistrLoss	<class 'skpro.metrics._classes.SquaredDistrLoss'>

2. Prediction types, metrics, benchmarking ¶

This section gives more details on:

different prediction types, including a methodological primer
the API of metrics to compare probabilistic predictions to non-probabilistic actuals
utilities for batch benchmarking of estimators and metrics

2.1 Probabilistic predictions - methodological primer ¶

readers familir with, or less interested in theory, may like to skip section 2.1

In supervised learning - probabilistic or not:

we fit estimator to i.i.d samples $(X_1, Y_1), \dots, (X_N, Y_N) \sim (X_*, Y_*)$
and want to predict $y$ given $x$ accurately, for $(x, y) \sim (X_*, Y_*)$

Let $y$ be the (true) value, for an observed feature $x$

(we consider $y$ a random variable)

Name	param	prediction/estimate of	`skpro`
point prediction		conditional expectation $\mathbb{E}[y\|x]$	`predict`
variance prediction		conditional variance $Var[y\|x]$	`predict_var`
quantile prediction	$\alpha\in (0,1)$	$\alpha$-quantile of $y\|x$	`predict_quantiles`
interval prediction	$c\in (0,1)$	$[a,b]$ s.t. $P(a\le y \le b\| x) = c$	`predict_interval`
distribution prediction		the law/distribution of $y\|x$	`predict_proba`

More formal details & intuition:¶

let's consider the toy example again

In [38]:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_new, y_train, _ = train_test_split(X, y)

In [39]:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

from skpro.regression.residual import ResidualDouble

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_new, y_train, _ = train_test_split(X, y)


reg_mean = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean)

reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_new)

a "point prediction" is a prediction/estimate of the conditional expectation $\mathbb{E}[y|x]$.\

Intuition: "out of many repetitions/worlds, this value is the arithmetic average of all observations".

In [40]:

# if y_pred_proba were *true*, here's how many repetitions would look like:

# repeating this line is "one repetition"
y_pred_proba.sample().head()

Out[40]:

	target
421	216.595619
81	122.436986
326	147.317343
105	98.226950
416	171.539410

In [41]:

many_samples = y_pred_proba.sample(100)
many_samples

Out[41]:

		target
0	421	174.503233
	81	128.299773
	326	174.547102
	105	85.523671
	416	166.025376
...	...	...
99	399	208.666436
	109	188.566325
	22	76.930673
	174	156.354496
	106	96.456699

11100 rows × 1 columns

In [42]:

# "doing many times and taking the mean" -> usual point prediction
mean_prediction = many_samples.groupby(level=1, sort=False).mean()
mean_prediction.head()

Out[42]:

	target
421	213.432031
81	126.286757
326	173.065740
105	100.126654
416	178.876668

In [43]:

# if we would do this infinity times instead of 100:
y_pred_proba.mean().head()

Out[43]:

	target
421	213.57
81	122.14
326	172.48
105	98.97
416	180.31

a "variance prediction" is a prediction/estimate of the conditional expectation $Var[y|x]$.\

Intuition: "out of many repetitions/worlds, this value is the average squared distance of the observation to the perfect point prediction".

In [44]:

# same as above - take many samples, and then compute element-wise statistics
var_prediction = many_samples.groupby(level=1, sort=False).var()
var_prediction.head()

Out[44]:

	target
421	237.798822
81	262.881415
326	275.523556
105	401.240249
416	267.851337

In [45]:

# e.g., predict_var should give the same result as infinite large sample's variance
y_pred_proba.var().head()

Out[45]:

	target
421	289.154099
81	289.154099
326	289.154099
105	289.154099
416	289.154099

a "quantile prediction", at quantile point $\alpha\in (0,1)$ is a prediction/estimate of the $\alpha$-quantile of $y'|y$, i.e., of $F^{-1}_{y|x}(\alpha)$, where $F^{-1}$ is the (generalized) inverse cdf = quantile function of the random variable y|x.\

Intuition: "out of many repetitions/worlds, a fraction of exactly $\alpha$ will have equal or smaller than this value."

an "interval prediction" or "predictive interval" with (symmetric) coverage $c\in (0,1)$ is a prediction/estimate pair of lower bound $a$ and upper bound $b$ such that $P(a\le y \le b| x) = c$ and $P(y \gneq b| x) = P(y \lneq a| x) = (1 - c) /2$.\

Intuition: "out of many repetitions/worlds, a fraction of exactly $c$ will be contained in the interval $[a,b]$, and being above is equally likely as being below".

(similar - exercise left to the reader)

a "distribution prediction" or "full probabilistic prediction" is a prediction/estimate of the distribution of $y|x$, e.g., "it's a normal distribution with mean 42 and variance 1".\

Intuition: exhaustive description of the generating mechanism of many repetitions/worlds.

note: the true distribution is unknown, and not accessible easily!

y_pred_proba is a distribution, but in general not equal to the true one!

that is, there are:

true distribution y_pred_proba_true - unknown and unknowable but estimable
y_pred_proba - our guess at y_pred_proba_true
the actual data y_true is one y_pred_proba_true.sample()

predict produces guess of y_pred_proba_true.mean()
predict_var produces guess of y_pred_proba_true.var()
predict_quantiles([0.05, 0.5, 0.95]) produces guess of y_pred_proba_true.quantiles([0.05, 0.5, 0.95])
predict_proba produces guess of y_pred_proba_true

the guesses are algorithm specific, and some algorithms are more accurate than others, given data

2.2 probabilistic metrics - details ¶

General usage pattern same as for sklearn metrics:

get some actuals and predictions
specify the metric - similar to estimator specs
plug the actuals and predictions into metric to get metric values

but: need to use dedicated metric for probabilistic predictions

ground truth: y_true samples
prediction e.g., y_predict_proba, y_predict_interval
so, match metric with type of prediction!
- metric(y_true: 2D pd.DataFrame, y_pred: proba_prediction_type) -> float

Recall methods available for all probabilistic regressors:

predict_interval produces interval predictions. Argument coverage (nominal interval coverage) must be provided.
predict_quantiles produces quantile predictions. Argument alpha (quantile values) must be provided.
predict_var produces variance predictions. Same args as predict.
predict_proba produces full distributional predictions. Same args as predict.

Name	param	prediction/estimate of	`skpro`
point prediction		conditional expectation $\mathbb{E}[y\|x]$	`predict`
variance prediction		conditional variance $Var[y\|x]$	`predict_var`
quantile prediction	$\alpha\in (0,1)$	$\alpha$-quantile of $y\|x$	`predict_quantiles`
interval prediction	$c\in (0,1)$	$[a,b]$ s.t. $P(a\le y \le b\| x) = c$	`predict_interval`
distribution prediction		the law/distribution of $y\|x$	`predict_proba`

let's produce some probabilistic predictions!

In [46]:

# 1. get some actuals and predictions
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# actuals = y_test

In [47]:

from sklearn.ensemble import RandomForestRegressor

from skpro.regression.residual import ResidualDouble

reg_mean = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean)

reg_proba.fit(X_train, y_train)

# use any of the probabilistic methods, we have seen this
y_pred_int = reg_proba.predict_interval(X_test, coverage=0.95)
y_pred_q = reg_proba.predict_quantiles(X_test, alpha=[0.05, 0.95])
y_pred_proba = reg_proba.predict_proba(X_test)

recall, all have their own output format:

In [48]:

y_pred_int  # lower/upper intervals

Out[48]:

	target
	0.95
	lower	upper
104	112.79411	180.58589
334	63.02411	130.81589
271	116.11411	183.90589
342	155.14411	222.93589
243	51.14411	118.93589
...	...	...
9	79.16411	146.95589
46	86.54411	154.33589
154	136.21411	204.00589
402	133.12411	200.91589
284	145.72411	213.51589

111 rows × 2 columns

In [49]:

y_pred_q  # quantiles

Out[49]:

	target
	0.05	0.95
104	118.243673	175.136327
334	68.473673	125.366327
271	121.563673	178.456327
342	160.593673	217.486327
243	56.593673	113.486327
...	...	...
9	84.613673	141.506327
46	91.993673	148.886327
154	141.663673	198.556327
402	138.573673	195.466327
284	151.173673	208.066327

111 rows × 2 columns

In [50]:

y_pred_proba  # sktime/skpro BaseDistribution

Out[50]:

Normal(columns=Index(['target'], dtype='object'),
       index=Index([104, 334, 271, 342, 243, 120, 183, 333, 315, 357,
       ...
       313, 111, 137, 198, 413,   9,  46, 154, 402, 284],
      dtype='int64', length=111),
       mu=array([[146.69],
       [ 96.92],
       [150.01],
       [189.04],
       [ 85.04],
       [ 91.04],
       [202.31],
       [226.03],
       [ 90.7 ],
       [153.92],
       [263.72],
       [139.12],
       [ 81.82],
       [138.16],
       [144.98],
       [ 90.28],
       [229.32],
       [200.46],
       [ 83.39],
       [196.02],...
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897],
       [17.29413897]]))

Please rerun this cell to show the HTML repr or trust the notebook.

we now need to apply a suitable metric, metric(y_test, y_pred)

IMPORTANT: sequence matters, y_test first; y_pred has very different type!

In [51]:

# 2. specify metric
# CRPS = continuous ranked probability score, for distribution predictions
from skpro.metrics import CRPS

crps = CRPS()

# 3. evaluate metric
crps(y_test, y_pred_proba)

Out[51]:

40.363339883507926

how do we find a metric that fits the prediction type?

answer: metrics are tagged

important tag: scitype:y_pred

"pred_proba" - distributional, can applied to distributions, predict_proba output
"pred_quantiles" - quantile forecast metric, can be applied to quantile predictions, interval predictions, distributional predictions
- applicable to predict_quantiles, predict_interval, predict_proba outputs
"pred_interval" - interval forecast metric, can be applied to interval predictions, distributional predictions
- applicable to predict_interval, predict_proba outputs

In [52]:

crps.get_tags()

Out[52]:

{'estimator_type': 'estimator',
 'object_type': 'metric',
 'reserved_params': ['multioutput', 'score_average'],
 'scitype:y_pred': 'pred_proba',
 'lower_is_better': True}

listing metrics with the tag, filtering for probabilistic tags:

(let's try to find a quantile prediction metric!)

In [53]:

from skpro.registry import all_objects

all_objects(
    "metric",
    as_dataframe=True,
    return_tags="scitype:y_pred",
)

Out[53]:

	name	object	scitype:y_pred
0	CRPS	<class 'skpro.metrics._classes.CRPS'>	pred_proba
1	ConstraintViolation	<class 'skpro.metrics._classes.ConstraintViola...	pred_interval
2	EmpiricalCoverage	<class 'skpro.metrics._classes.EmpiricalCovera...	pred_interval
3	LinearizedLogLoss	<class 'skpro.metrics._classes.LinearizedLogLo...	pred_proba
4	LogLoss	<class 'skpro.metrics._classes.LogLoss'>	pred_proba
5	PinballLoss	<class 'skpro.metrics._classes.PinballLoss'>	pred_quantiles
6	SquaredDistrLoss	<class 'skpro.metrics._classes.SquaredDistrLoss'>	pred_proba

PinballLoss is a quantile forecast metric:

In [54]:

from skpro.metrics import PinballLoss

pinball_loss = PinballLoss()

pinball_loss(y_test, y_pred_q)

Out[54]:

13.648123260286354

... this is by default an average (grand average, float)

averages over samples in y_pred / y_test (rows)
averages over variables (columns)
average over alpha values, quantile points

what if we don't want these averages?

variable (column) averaging is controlled by the multioutput arg.
- "raw_values" prevents averaging, "uniform_average" computes arithmetic mean.
quantile points (alpha) or coverage (coverage) is controlled by score_average arg
evaluation by row via the evaluate_by_index method
- can be useful for diagnostics or statistical tests

In [55]:

# Example 1: Pinball loss by quantile point
loss_multi = PinballLoss(score_average=False)
loss_multi(y_test, y_pred_q)

Out[55]:

0.05    13.133967
0.95    14.162279
Name: 0, dtype: float64

In [56]:

# Example 2: CRPS by test sample index
crps.evaluate_by_index(y_test, y_pred_proba)

Out[56]:

	target
104	41.946574
334	16.320990
271	14.728135
342	6.761400
243	28.452058
...	...
9	187.182827
46	59.803051
154	18.026285
402	4.063702
284	15.229778

111 rows × 1 columns

Caveat: not every metric is an average over time points, e.g., RMSE

In this case, evaluate_by_index computes jackknife pseudo-samples

(for mean statistics, jackknife pseudo-samples are equal to individual samples)

2.3 Benchmark evaluation of probabilistic regressors ¶

for quick evaluation and benchmarking,

the benchmarking.evaluate utility can be used:

In [57]:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

from skpro.benchmarking.evaluate import evaluate
from skpro.metrics import CRPS
from skpro.regression.residual import ResidualDouble

# 1. specify dataset
X, y = load_diabetes(return_X_y=True, as_frame=True)

# 2. specify estimator
estimator = ResidualDouble(LinearRegression())

# 3. specify cross-validation schema
cv = KFold(n_splits=3)

# 4. specify evaluation metric
crps = CRPS()

# 5. evaluate - run the benchmark
results = evaluate(estimator=estimator, X=X, y=y, cv=cv, scoring=crps)

# results are pd.DataFrame
# each row is one repetition of the cross-validation on one fold fit/predict/evaluate
# columns report performance, runtime, and other optional information (see docstring)
results

Out[57]:

	test_CRPS	fit_time	pred_time	len_y_train
0	31.783451	0.002778	0.001045	294
1	33.574329	0.003086	0.001094	295
2	29.909655	0.003807	0.001278	295

3. Advanced composition patterns ¶

we introduce a number of composition patterns available in skpro:

reducer-wrappers that turn sklearn regressors into probabilistic ones
pipelines of sklearn transformers with skpro regressors
tuning skpro probabilistic regressors via grid/random search, minimizing a probabilistic metric
ensembling multiple skpro probabilistic regressors

data used in this section:

In [58]:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

evaluation metric used in this section:

In [59]:

crps = CRPS()

3.1 Reducers to turn `sklearn` regressors probabilistic ¶

there are many common algorithms that turn a non-probabilistic tabular regressor probabilistic

formally, this is a type of "reduction" - of probabilistic supervised tabular to non-probabilistic supervised tabular

Examples:

predicting variance equal to training residual variance - ResidualDouble with standard settings
- or other unconditional distribution estimate for residuals
"squaring the residual" two-step prediction - ResidualDouble
boostrap prediction intervals - BootstrapRegressor
conformal prediction intervals - contributions appreciated :-)
natural gradient boosting aka NGBoost - contributions appreciated :-)

3.1.1 constant variance prediction ¶

In [60]:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# estimator specification - use any sklearn regressor for reg_mean
reg_mean = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, cv=KFold(5))
# cv is used to estimate out-of-sample residual variance via 5-fold CV
# note - in-sample predictions will usually underestimate the variance!

# fit and predict
reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_test)

# evaluate
crps(y_test, y_pred_proba)

Out[60]:

36.25453698363302

In [61]:

from skpro.utils.plotting import plot_crossplot_interval

plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)

Out[61]:

<Axes: xlabel='Correct label $y_i$', ylabel='Prediction interval $\\widehat{y}_i$'>

3.1.2 two-step residual prediction ¶

In [62]:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# estimator specification - use any sklearn regressor for reg_mean and reg_resid
reg_mean = RandomForestRegressor()
reg_resid = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, estimator_resid=reg_resid, cv=KFold(5))
# cv is used to estimate out-of-sample residual variance via 5-fold CV

# fit and predict
reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_test)

# evaluate
crps(y_test, y_pred_proba)

Out[62]:

36.860503015336164

In [63]:

from skpro.utils.plotting import plot_crossplot_interval

plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)

Out[63]:

<Axes: xlabel='Correct label $y_i$', ylabel='Prediction interval $\\widehat{y}_i$'>

3.1.3 bootstrap prediction intervals ¶

In [64]:

from sklearn.linear_model import LinearRegression

from skpro.regression.bootstrap import BootstrapRegressor

# estimator specification - use any sklearn regressor for reg_mean
reg_mean = LinearRegression()
reg_proba = BootstrapRegressor(reg_mean, n_bootstrap_samples=100)

# fit and predict
reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_test)

# evaluate
crps(y_test, y_pred_proba)

Out[64]:

43.13652636195608

In [65]:

from skpro.utils.plotting import plot_crossplot_interval

plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)

Out[65]:

<Axes: xlabel='Correct label $y_i$', ylabel='Prediction interval $\\widehat{y}_i$'>

3.2 Pipelines of `skpro` regressor and `sklearn` transformers ¶

skpro regressors can be pipelined with sklearn transformers, using the skpro pipeline.

This ensure presence of predict_proba etc in the pipeline object.

The syntax is exactly the same as for sklearn's pipeline.

In [66]:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [67]:

from sklearn.impute import SimpleImputer as Imputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

from skpro.regression.compose import Pipeline
from skpro.regression.residual import ResidualDouble

# estimator specification
reg_mean = LinearRegression()
reg_proba = ResidualDouble(reg_mean)

# pipeline is specified as a list of tuples (name, estimator)
pipe = Pipeline(
    steps=[
        ("imputer", Imputer()),  # an sklearn transformer
        ("scaler", MinMaxScaler()),  # an sklearn transformer
        ("regressor", reg_proba),  # an skpro regressor
    ]
)

In [68]:

pipe

Out[68]:

Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', MinMaxScaler()),
                ('regressor', ResidualDouble(estimator=LinearRegression()))])

Please rerun this cell to show the HTML repr or trust the notebook.

In [69]:

# the pipeline behaves as any skpro regressor
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X=X_test)
y_pred_proba = pipe.predict_proba(X=X_test)

the pipeline provides the familiar nested get_params, set_params interface:

nested parameters are keyed componentname__parametername

In [70]:

pipe.get_params()

Out[70]:

{'steps': [('imputer', SimpleImputer()),
  ('scaler', MinMaxScaler()),
  ('regressor', ResidualDouble(estimator=LinearRegression()))],
 'imputer': SimpleImputer(),
 'scaler': MinMaxScaler(),
 'regressor': ResidualDouble(estimator=LinearRegression()),
 'imputer__add_indicator': False,
 'imputer__copy': True,
 'imputer__fill_value': None,
 'imputer__missing_values': nan,
 'imputer__strategy': 'mean',
 'imputer__verbose': 'deprecated',
 'scaler__clip': False,
 'scaler__copy': True,
 'scaler__feature_range': (0, 1),
 'regressor__cv': None,
 'regressor__distr_loc_scale_name': None,
 'regressor__distr_params': None,
 'regressor__distr_type': 'Normal',
 'regressor__estimator': LinearRegression(),
 'regressor__estimator_resid': None,
 'regressor__min_scale': 1e-10,
 'regressor__residual_trafo': 'absolute',
 'regressor__use_y_pred': False,
 'regressor__estimator__copy_X': True,
 'regressor__estimator__fit_intercept': True,
 'regressor__estimator__n_jobs': None,
 'regressor__estimator__normalize': 'deprecated',
 'regressor__estimator__positive': False}

pipelines can also be created via simple lists of estimators,

in this case names are generated automatically:

In [71]:

# pipeline is specified as a list of tuples (name, estimator)
pipe = Pipeline(
    steps=[
        Imputer(),  # an sklearn transformer
        MinMaxScaler(),  # an sklearn transformer
        reg_proba,  # an skpro regressor
    ]
)

3.3 Tuning of `skpro` regressors via grid and random search ¶

skpro provides grid and random search tuners to tune arbitrary probabilistic regressors,

using probabilistic metrics. Besides this, they function as the sklearn tuners do.

In [72]:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [73]:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

from skpro.metrics import CRPS
from skpro.model_selection import GridSearchCV
from skpro.regression.residual import ResidualDouble

# cross-validation specification for tuner
cv = KFold(n_splits=3)

# estimator to be tuned
estimator = ResidualDouble(LinearRegression())

# tuning grid - do we fit an intercept in the linear regression?
param_grid = {"estimator__fit_intercept": [True, False]}

# metric to be optimized
crps_metric = CRPS()

# specification of the grid search tuner
gscv = GridSearchCV(
    estimator=estimator,
    param_grid=param_grid,
    cv=cv,
    scoring=crps_metric,
)

In [74]:

gscv

Out[74]:

GridSearchCV(cv=KFold(n_splits=3, random_state=None, shuffle=False),
             estimator=ResidualDouble(estimator=LinearRegression()),
             param_grid={'estimator__fit_intercept': [True, False]},
             scoring=CRPS())

Please rerun this cell to show the HTML repr or trust the notebook.

the grid search tuner behaves like any skpro probabilistic regressor:

In [75]:

gscv.fit(X_train, y_train)
y_pred = gscv.predict(X_test)
y_pred_proba = gscv.predict_proba(X_test)

random search is similar, except that instead of a grid a parameter sampler should be specified:

In [76]:

from skpro.model_selection import RandomizedSearchCV

# only difference to GridSearchCV is the param_distributions argument

# specification of the random search parameter sampler
param_distributions = {"estimator__fit_intercept": [True, False]}

# specification of the random search tuner
rscv = RandomizedSearchCV(
    estimator=estimator,
    param_distributions=param_distributions,
    cv=cv,
    scoring=crps_metric,
)

3.4 Bagging/mixture ensemble of probabilistic regressors ¶

Classical bagging does the following, for a wrapped estimator:

In fit:

subsample rows and/or columns of X, y to X_subs, y_subs
fit clone of wrapped estimator to X_subs, y_subs
Repeat 1-2 n_estimators times, store that many fitted clones.

In predict, for X_test:

for all fitted clones, obtain predictions on X_test - these are distributions
return the uniform mixture of these distributions, per test sample

In [77]:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [78]:

from sklearn.linear_model import LinearRegression

from skpro.regression.ensemble import BaggingRegressor
from skpro.regression.residual import ResidualDouble

reg_mean = LinearRegression()
reg_proba = ResidualDouble(reg_mean)

ens = BaggingRegressor(reg_proba, n_estimators=10)
ens.fit(X_train, y_train)

y_pred = ens.predict_proba(X_test)

In [79]:

# y_pred is a mixture distribution!
str(y_pred)

Out[79]:

"Mixture(columns=Index(['target'], dtype='object'),\n        distributions=[Normal(columns=Index(['target'], dtype='object'),\n                              index=Index([367, 248, 134, 148,  21, 310,  37, 305,  23, 373,\n       ...\n       365, 204, 309, 379,   7, 440, 178, 160, 119, 441],\n      dtype='int64', length=111),\n                              mu=array([[270.91496133],\n       [227.0209857 ],\n       [153.30255219],\n       [133.24435057],\n       [108.55506881],\n       [206.51166703],\n       [153.83348539],\n       [120.5...\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484],\n       [45.96784484]]))],\n        index=Index([367, 248, 134, 148,  21, 310,  37, 305,  23, 373,\n       ...\n       365, 204, 309, 379,   7, 440, 178, 160, 119, 441],\n      dtype='int64', length=111))"

In [80]:

[type(x) for x in y_pred.distributions]

Out[80]:

[skpro.distributions.normal.Normal,
 skpro.distributions.normal.Normal,
 skpro.distributions.normal.Normal,
 skpro.distributions.normal.Normal,
 skpro.distributions.normal.Normal,
 skpro.distributions.normal.Normal,
 skpro.distributions.normal.Normal,
 skpro.distributions.normal.Normal,
 skpro.distributions.normal.Normal,
 skpro.distributions.normal.Normal]

4. Extension guide - implementing your own probabilistic regressor ¶

skpro is meant to be easily extensible, for direct contribution to skpro as well as for local/private extension with custom methods.

To get started:

Follow the "implementing estimator" developer guide
Use the probabilistic regressor template to get started

Read through the probabilistic regression extension template - this is a python file with todo blocks that mark the places in which changes need to be added.
Copy the proba regressor extension template to a local folder in your own repository (local/private extension), or to a suitable location in your clone of the skpro or affiliated repository (if contributed extension), inside skpro.regression; rename the file and update the file docstring appropriately.
Address the "todo" parts. Usually, this means: changing the name of the class, setting the tag values, specifying hyper-parameters, filling in __init__, _fit, and at least one of the probabilistic prediction methods, preferably _predict_proba (for details see the extension template). You can add private methods as long as they do not override the default public interface. For more details, see the extension template.
To test your estimator manually: import your estimator and run it in the worfklows in Section 1; then use it in the compositors in Section 3.
To test your estimator automatically: call skpro.utils.check_estimator on your estimator. You can call this on a class or object instance. Ensure you have specified test parameters in the get_test_params method, according to the extension template.

In case of direct contribution to skpro or one of its affiliated packages, additionally:

Add yourself as an author to the code, and to the CODEOWNERS for the new estimator file(s).
Create a pull request that contains only the new estimators (and their inheritance tree, if it's not just one class), as well as the automated tests as described above.
In the pull request, describe the estimator and optimally provide a publication or other technical reference for the strategy it implements.
Before making the pull request, ensure that you have all necessary permissions to contribute the code to a permissive license (BSD-3) open source project.

5. Summary ¶

skpro is a unified interface toolbox for probabilistic supervised regression, that is, for prediction intervals, quantiles, fully distributional predictions, in a tabular regression setting. The interface is fully interoperable with scikit-learn and scikit-base interface specifications.
skpro comes with rich composition functionality that allows to build complex pipelines easily, and connect easily with other parts of the open source ecosystem, such as scikit-learn and individual algorithm libraries.
skpro is easy to extend, and comes with user friendly tools to facilitate implementing and testing your own probabilistic regressors and composition principles.

Credits:¶

noteook creation: fkiraly

skpro: https://github.com/sktime/skpro/blob/main/CONTRIBUTORS.md

skpro introduction notebook¶

1. Basic probabilistic supervised regression workflows ¶

1.1 basic deployment workflow¶

1.1.1 distribution predictions¶

1.1.2 interval predictions¶

1.1.3 quantile predictions¶

1.1.4 mean and variance predictions¶

1.2 simple evaluation workflow for probabilistic predictions¶

1.3 diagnostic visualisations¶

1.4 skpro objects - scikit-base interface, searching for regressors and metrics¶

1.4.1 primer on skpro object interface ¶

1.4.2 searching for regressors and metrics ¶

2. Prediction types, metrics, benchmarking ¶

2.1 Probabilistic predictions - methodological primer ¶

More formal details & intuition:¶

2.2 probabilistic metrics - details ¶

2.3 Benchmark evaluation of probabilistic regressors ¶

3. Advanced composition patterns ¶

3.1 Reducers to turn sklearn regressors probabilistic ¶

3.1.1 constant variance prediction ¶

3.1.2 two-step residual prediction ¶

3.1.3 bootstrap prediction intervals ¶

3.2 Pipelines of skpro regressor and sklearn transformers ¶

3.3 Tuning of skpro regressors via grid and random search ¶

3.4 Bagging/mixture ensemble of probabilistic regressors ¶

4. Extension guide - implementing your own probabilistic regressor ¶

5. Summary¶

Credits:¶

1.4 `skpro` objects - `scikit-base` interface, searching for regressors and metrics¶

1.4.1 primer on `skpro` object interface ¶

3.1 Reducers to turn `sklearn` regressors probabilistic ¶

3.2 Pipelines of `skpro` regressor and `sklearn` transformers ¶

3.3 Tuning of `skpro` regressors via grid and random search ¶

5. Summary ¶