#!/usr/bin/env python
# coding: utf-8
# ## time-to-event modelling and survival prediction
# **Set-up instructions:** On binder, this should run out-of-the-box.
#
# To run locally instead, ensure that `skpro` with basic dependency requirements is installed in your python environment.
# ---
# `skpro` provides a unified interface to time-to-event prediction models, also known as survival prediction models.
#
# **Time-to-event prediction** is a form of probabilistic regression where **labels can be "censored"**, i.e., of the form "time is t or later" instead of exact observations.
#
# **Section 1** provides an overview of the basic **time-to-event prediction workflows** supported by `skpro`.
#
# **Section 2** showcases **performance metrics and benchmarking** for time-to-event prediction with censored data.
#
# **Section 3** discusses **advanced composition patterns**, including various ways to leverage `sklearn` regressors for time-to-event prediction with censored data.
#
# **Section 4** gives an introduction to how to write **custom estimators** compliant with the `skpro` interface.
# In[50]:
# hide warnings
import warnings
warnings.filterwarnings("ignore")
# ## 1. Basic survival prediction interface
# In this section:
#
# * explanation of censored time-to-event data
# * `skpro` time-to-event/survival prediction interface
# * metrics, evaluation
# ### 1.1 data representation, censoring
# Survival prediction or time-to-event prediction can be seen a generalization of probabilistic supervised learning.
#
# Each sample consists of:
#
# * a feature vector, row of a data frame
# * a label, which can be an exact time of occurrence, or a statement about "time was t or later"
# In[51]:
# simulated toy dataset, lung cancer survival times
import numpy as np
# demographics - age and smoker yes/no
age = np.random.uniform(low=20, high=100, size=50)
smoker = np.random.binomial(1, 0.3, size=50)
# actual survival time
scale = 200 / (0.5 * age + 30 * smoker)
survival = scale * np.random.weibull(1, size=50)
# patients are observed only for 5 years
# if they surviva 5 years, we know they survived 5 years, but not exact time of death
censored = survival > 5
observation = np.minimum(survival, 5)
# `skpro` represents this information in an `sklearn`-like interface:
# In[52]:
import pandas as pd
# features
X = pd.DataFrame({"age": age, "smoker": smoker})
# time of survival or censoring
y = pd.DataFrame({"time": observation})
# indicator whether event was observed or censored
# censored = 1/True, observed = 0/False
# variable names should be the same as for y
C = pd.DataFrame({"time": censored})
# In[53]:
X.head()
# In[54]:
y.head()
# In[55]:
C.head()
# ### 1.2 basic survival prediction workflow
# survival prediction is the task:
#
# * given censored time-to-event labels and features, `X`, `y`, `C`
# * learn a model that can predict `y` if it were uncensored, i.e., the true event time
# * the prediction should take the form of a survival function or probability distribution
# `skpro` survival predictors extend the interface of probabilistic regressors:
#
# * `fit(X, y, C=None)`, with `X`, `y`, `C` as above; if `C=None`, all observations are uncensored
# * `predict(X_test)` for mean survival time predictions
# * `predict_proba(X_test)` for distributional predictions
#
# Other prediction methods - `predict_interval`, `predict_quantiles`, `predict_var` - also generalize the same way.
# Because `C` is optional, and means "uncensored" if not passed, all survival prediction models can be used as supervised probabilistic regressors.
#
# Using probabilistic regressors as survival models is similarly possible, to be revisited later.
# Basic deployment workflow:
# In[56]:
from sklearn.model_selection import train_test_split
from skpro.survival.coxph import CoxPH
# step 1: data specification
# X, y, C, as above
X_train, X_new, y_train, _, C_train, _ = train_test_split(X, y, C)
# step 2: specifying the regressor
# example - Cox proportional hazards model from statsmodels
surv_model_cox = CoxPH()
# step 3: fitting the model to training data
surv_model_cox.fit(X_train, y_train, C_train)
# step 4: predicting labels on new data
# full distribution prediction
y_pred_proba_cox = surv_model_cox.predict_proba(X_new)
# In[57]:
# mean predicted survival time
y_pred_proba_cox.mean().head()
# In[58]:
# plot of survival functions
y_pred_proba_cox.iloc[range(5)].plot("surv")
# In[59]:
# plotting survival functions in one figure, smokers in red
from matplotlib.pyplot import subplots
_, ax = subplots()
for i in range(len(y_pred_proba_cox)):
ax = y_pred_proba_cox.iat[i, 0].plot("surv", ax=ax, color=["b", "r"][smoker[i]])
# ### 1.3 survival prediction with parametric predictive distribution
# example: using an accelerated failure time model with Weibull hazard
#
# same workflow, only using different model:
# In[60]:
from sklearn.model_selection import train_test_split
from skpro.survival.aft import AFTWeibull
# step 1: data specification
# X, y, C, as above
X_train, X_new, y_train, _, C_train, _ = train_test_split(X, y, C)
# step 2: specifying the regressor
# example - Cox proportional hazards model from statsmodels
surv_model_aft = AFTWeibull()
# step 3: fitting the model to training data
surv_model_aft.fit(X_train, y_train, C_train)
# step 4: predicting labels on new data
# full distribution prediction
y_pred_proba_aft = surv_model_aft.predict_proba(X_new)
# In[61]:
# plotting survival functions in one figure, smokers in red
from matplotlib.pyplot import subplots
_, ax = subplots()
for i in range(len(y_pred_proba_cox)):
ax = y_pred_proba_aft.iat[i, 0].plot(
"surv", ax=ax, color=["b", "r"][smoker[i]], x_bounds=[0, 5]
)
# hazard functions can be plotted the same way:
# In[62]:
# plot of hazard functions
y_pred_proba_aft.iloc[range(5)].plot("haz", x_bounds=[0, 5])
# In[63]:
# plotting survival functions in one figure, smokers in red
from matplotlib.pyplot import subplots
_, ax = subplots()
for i in range(len(y_pred_proba_aft)):
ax = y_pred_proba_aft.iat[i, 0].plot(
"haz", ax=ax, color=["b", "r"][smoker[i]], x_bounds=[0, 5]
)
# In[64]:
# estimated scale parameter
y_pred_proba_aft.to_df().head()
# In[65]:
# actual Weibull scale parameter to compare
# unknown in a real scenario, but we know since we simulated the data
scale[0:5]
# ### 1.4 finding time-to-event prediction models
# Time-to-event (aka survival) prediction models are probabilistic regressors.
#
# I.e., the type is `regressor_proba`, as for probabilistic regressors.
#
# All probabilistic regressors can be used with for survival prediction as above.
#
# But, only models with the `"capability:survival"` tag set to `True` use the censoring information in `C`.
#
# Other models ignore `C`, which will lead to biased predictions (see section 3 on addressing this).
#
# The `all_objects` utility can be used to retrieve time-to-event prediction models:
# In[66]:
from skpro.registry import all_objects
# time-to-event predictors have "regressor_proba" type, i.e., probabilistic regressors
# the tag "capability:survival" = True indicates models that make use of C
all_objects(
"regressor_proba", as_dataframe=True, filter_tags={"capability:survival": True}
).head()
# ### 1.5 simple evaluation workflow for time-to-event predictions
# for simple evaluation:
#
# 1. split the data into train/test set - including the censoring variable
# 2. make predictions of either type for test features
# 3. compute metric on test set, comparing test predictions to held out test observations,
# including censoring indicsator
#
# Note:
#
# * metrics will compare probabilistic prediction to tabular ground truth and
# censoring indicator
# * the metric needs to be of a compatible type, e.g., for distribution predictions
# * special survival metrics are available to take into account censoring;
# if a non-survival metric is used, the censoring indicator will be ignored
# In[67]:
from sklearn.model_selection import train_test_split
from skpro.metrics import ConcordanceHarrell
from skpro.survival.coxph import CoxPH
# step 1: data specification
X_train, X_test, y_train, y_test, C_train, C_test = train_test_split(X, y, C)
# step 2: specifying the regressor
# example - Cox proportional hazards model from statsmodels
surv_model = CoxPH()
# step 3: fitting the model to training data
surv_model.fit(X_train, y_train, C_train)
# step 4: predicting labels on new data
y_pred_proba = surv_model.predict_proba(X_test)
# step 5: specifying evaluation metric
metric = ConcordanceHarrell()
# step 6: evaluate metric, compare predictions to actuals
metric(y_test, y_pred_proba, C_true=C_test)
# how do we know that metric is a genuine survival metric?
#
# Via the `capability:survival` tag:
# In[68]:
metric.get_tags()
# capability:survival is True
# how do we find specialized survival metrics?
# In[69]:
from skpro.registry import all_objects
all_objects("metric", as_dataframe=True, filter_tags={"capability:survival": True})
# All metrics can be applied to censored data:
#
# metrics with the tag being `False` will ignore the censoring variable.
#
# This will lead to bias in general, but can be justified if there is a low amount of censoring.
# ## 2. Benchmarking time-to-event models
# The `benchmarking.evaluate` utility can be used for time-to-event models as well:
# In[70]:
from sklearn.model_selection import KFold
from skpro.benchmarking.evaluate import evaluate
from skpro.metrics import ConcordanceHarrell
from skpro.survival.coxph import CoxPH
# 1. specify dataset
# X, y, C are as above
# 2. specify estimator
estimator = CoxPH()
# 3. specify cross-validation schema
cv = KFold(n_splits=3)
# 4. specify evaluation metric
c_index = ConcordanceHarrell()
# 5. evaluate - run the benchmark
# C needs to be passed here
# this will automatically pass it on the test set to the metric
results = evaluate(estimator=estimator, X=X, y=y, C=C, cv=cv, scoring=c_index)
# results are pd.DataFrame
# each row is one repetition of the cross-validation on one fold fit/predict/evaluate
# columns report performance, runtime, and other optional information (see docstring)
results
# ## 3. Advanced composition patterns
# we introduce a number of composition patterns available in `skpro`:
#
# * reducer-wrappers that turn `sklearn` regressors into survival regressors
# * pipelines of `sklearn` transformers with `skpro` survival regressors
# * tuning `skpro` survival regressors via grid/random search, minimizing a probabilistic metric
# data used in this section:
# In[71]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, C_train, C_test = train_test_split(X, y, C)
# evaluation metric used in this section:
# In[72]:
from skpro.metrics import ConcordanceHarrell
c_harrell = ConcordanceHarrell()
# ### 3.1 Reducers to create survival regressors
# the three main reduction strategies to create survival regressors:
#
# 1. adding the capability to handle censoring information to a probabilistic supervised regressor
# 2. the above, combined with any strategy to create a probabilistic regressor from an `sklearn` (non-probabilistic) regressors
# 3. strategies that directly create a survival regressor from a tabular `sklearn` regressor
# This section shows examples for all the above.
# ### 3.1.1 adding survival capability to a probabilistic regressor
# this type of reduction strategy takes a probabilistic regressor and adds treatment for censored data.
#
# Examples:
#
# * native use - every probabilistic regressor can be used as a survival regressor, ignoring `C`
# * `FitUncensored` - fits the model on the uncensored data only
# * `ConditionUncensored` - uses `C` as a feature in fitting, and sets it to `C=0` in `Predict`
#
# The first two strategies introduce bias into the model, and should only be applied when the censoring fraction is low.
#
# Example with `ConditionUncensored`:
# In[73]:
from skpro.regression.linear import GLMRegressor
from skpro.survival.compose import ConditionUncensored
# estimator specification - use any skpro regressor, including composites
reg_proba = GLMRegressor()
# turning the regressor into a survival predictor
surv_model = ConditionUncensored(reg_proba)
# fit and predict
surv_model.fit(X_train, y_train, C_train)
y_pred_proba = surv_model.predict_proba(X_test)
# evaluate
c_harrell(y_test, y_pred_proba, C_true=C_test)
# ### 3.1.2 adding probabilistic then survival capability to `sklearn` regressor
# In[74]:
from sklearn.ensemble import RandomForestRegressor
from skpro.regression.residual import ResidualDouble
from skpro.survival.compose import ConditionUncensored
# estimator specification - use any skpro regressor, including composites
reg_mean = RandomForestRegressor()
reg_resid = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, reg_resid)
# turning the regressor into a survival predictor
surv_model = ConditionUncensored(reg_proba)
# fit and predict
surv_model.fit(X_train, y_train, C_train)
y_pred_proba = surv_model.predict_proba(X_test)
# evaluate
c_harrell(y_test, y_pred_proba, C_true=C_test)
# ### 3.1.3 bootstrap prediction intervals
# In[75]:
from sklearn.linear_model import LinearRegression
from skpro.regression.bootstrap import BootstrapRegressor
# estimator specification - use any sklearn regressor for reg_mean
reg_mean = LinearRegression()
reg_proba = BootstrapRegressor(reg_mean, n_bootstrap_samples=100)
# fit and predict
reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_test)
# evaluate
c_harrell(y_test, y_pred_proba)
# ### 3.2 Pipelines of `skpro` survival regressors and `sklearn` transformers
# `skpro` survival regressors can be pipelined with `sklearn` transformers, using the `skpro` pipeline.
#
# This ensure presence of `predict_proba` etc in the pipeline object.
#
# The syntax is exactly the same as for `sklearn`'s pipeline,
# or the `skpro` pipeline with non-survival regressors
# In[76]:
from sklearn.impute import SimpleImputer as Imputer
from sklearn.preprocessing import MinMaxScaler
from skpro.regression.compose import Pipeline
from skpro.survival.coxph import CoxPH
# estimator specification
surv_model = CoxPH()
# pipeline is specified as a list of tuples (name, estimator)
pipe = Pipeline(
steps=[
("imputer", Imputer()), # an sklearn transformer
("scaler", MinMaxScaler()), # an sklearn transformer
("regressor", surv_model), # an skpro regressor
]
)
# In[77]:
pipe
# In[78]:
# the pipeline behaves as any skpro regressor
pipe.fit(X_train, y_train, C_train)
y_pred = pipe.predict(X=X_test)
y_pred_proba = pipe.predict_proba(X=X_test)
# the pipeline provides the familiar nested `get_params`, `set_params` interface:
#
# nested parameters are keyed `componentname__parametername`
# In[79]:
pipe.get_params()
# pipelines can also be created via simple lists of estimators,
#
# in this case names are generated automatically:
# In[80]:
# pipeline is specified as a list of tuples (name, estimator)
pipe = Pipeline(
steps=[
Imputer(), # an sklearn transformer
MinMaxScaler(), # an sklearn transformer
surv_model, # an skpro regressor
]
)
# ### 3.3 Tuning of `skpro` regressors via grid and random search
# `skpro` provides grid and random search tuners to tune arbitrary probabilistic regressors,
#
# using probabilistic metrics.
#
# Survival metrics can be used to tune survival regressors, same as non-survival regressors.
# In[81]:
from sklearn.model_selection import KFold
from skpro.metrics import ConcordanceHarrell
from skpro.model_selection import GridSearchCV
from skpro.survival.coxph import CoxPH
# cross-validation specification for tuner
cv = KFold(n_splits=3)
# estimator to be tuned
estimator = CoxPH()
# tuning grid - partial likelihood or elastic net penalty?
param_grid = {"method": ["lpl", "elastic_net"]}
# metric to be optimized
c_harrell = ConcordanceHarrell()
# specification of the grid search tuner
gscv = GridSearchCV(
estimator=estimator,
param_grid=param_grid,
cv=cv,
scoring=c_harrell,
)
# In[82]:
gscv
# the grid search tuner behaves like any `skpro` survival regressor:
# In[83]:
gscv.fit(X_train, y_train, C_train)
y_pred = gscv.predict(X_test)
y_pred_proba = gscv.predict_proba(X_test)
# random search is similar, except that instead of a grid a parameter sampler should be specified:
# In[84]:
from skpro.model_selection import RandomizedSearchCV
# only difference to GridSearchCV is the param_distributions argument
# specification of the random search parameter sampler
param_distributions = {"method": ["lpl", "elastic_net"]}
# specification of the random search tuner
rscv = RandomizedSearchCV(
estimator=estimator,
param_distributions=param_distributions,
cv=cv,
scoring=c_harrell,
)
# ## 4. Extension guide - implementing your own time-to-event regressor
#
# `skpro` is meant to be easily extensible, for direct contribution to `skpro` as well as for local/private extension with custom methods.
#
# To get started:
#
# * Follow the ["implementing estimator" developer guide](https://skpro.readthedocs.io/en/stable/developer_guide/add_estimators.html)
# * Use the [survival regressor template](https://github.com/sktime/skpro/blob/main/extension_templates/survival.py) to get started
#
# 1. Read through the [survival regression extension template](https://github.com/sktime/skpro/blob/main/extension_templates/survival.py) - this is a `python` file with `todo` blocks that mark the places in which changes need to be added.
# 2. Copy the proba regressor extension template to a local folder in your own repository (local/private extension), or to a suitable location in your clone of the `skpro` or affiliated repository (if contributed extension), inside `skpro.survival`; rename the file and update the file docstring appropriately.
# 3. Address the "todo" parts. Usually, this means: changing the name of the class, setting the tag values, specifying hyper-parameters, filling in `__init__`, `_fit`, and at least one of the probabilistic prediction methods, preferably `_predict_proba` (for details see the extension template). You can add private methods as long as they do not override the default public interface. For more details, see the extension template.
# 4. To test your estimator manually: import your estimator and run it in the workflows in Section 1; then use it in the compositors in Section 3.
# 5. To test your estimator automatically: call `skpro.utils.check_estimator` on your estimator. You can call this on a class or object instance. Ensure you have specified test parameters in the `get_test_params` method, according to the extension template.
#
# In case of direct contribution to `skpro` or one of its affiliated packages, additionally:
#
# * Add yourself as an author and/or a maintainer for the new estimator file(s), via `"authors"` and `"maintainers"` tag.
# * Create a pull request that contains only the new estimators (and their inheritance tree, if it's not just one class), as well as the automated tests as described above.
# * In the pull request, describe the estimator and optimally provide a publication or other technical reference for the strategy it implements.
# * Before making the pull request, ensure that you have all necessary permissions to contribute the code to a permissive license (BSD-3) open source project.
# ## 5. Summary
#
# * `skpro` is a unified interface toolbox for probabilistic supervised regression, that is, for prediction intervals, quantiles, fully distributional predictions, in a tabular regression setting. The interface is fully interoperable with `scikit-learn` and `scikit-base` interface specifications.
#
# * `skpro` comes with rich composition functionality that allows to build complex pipelines easily, and connect easily with other parts of the open source ecosystem, such as `scikit-learn` and individual algorithm libraries.
#
# * `skpro` is easy to extend, and comes with user friendly tools to facilitate implementing and testing your own probabilistic regressors and composition principles.
# ---
#
# ### Credits:
#
# noteook creation: fkiraly
#
# skpro: https://github.com/sktime/skpro/blob/main/CONTRIBUTORS.md