Bootstrapping does not really seem superior to simply looking at the times_seen parameter for mutations, and model averaging is the preferred approach. So this notebook is no longer tested or included in the docs and may be obsolete.
The previous section describes fitting a single model. But we may also want to have confidence estimates for the fit. We can do that via bootstrapping the data set.
Note, however, that our current recommendation is to use experimental replicates (see the later example on averaging models) rather than bootstrapping whenever possible, as experimental replicates often provide more robust estimates of error/variation than bootstrapping a single data set.
The overall recommended workflow is to first fit models to all the data to determine the number of epitopes, etc. Then once the desired fitting parameters are determined, you can bootstrap to get confidence on predictions.
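To make the resampling concrete, here is a minimal sketch of one bootstrap sample of a variant data set. The toy data frame and column values are hypothetical; the key idea is that barcodes (variants) are drawn with replacement, and all rows (e.g., all concentrations) for each drawn barcode are kept together:

```python
import pandas as pd

# Toy stand-in for a variant table, with two barcodes measured at two
# concentrations each (hypothetical values, just for illustration).
df = pd.DataFrame(
    {
        "barcode": ["AA", "AA", "AC", "AC"],
        "concentration": [0.25, 1, 0.25, 1],
        "prob_escape": [0.9, 0.5, 0.2, 0.05],
    }
)

# One bootstrap sample: draw barcodes with replacement, then keep all
# rows belonging to each drawn barcode.
barcodes = df["barcode"].drop_duplicates()
sampled = barcodes.sample(n=len(barcodes), replace=True, random_state=0)
boot_df = pd.concat(
    [df[df["barcode"] == b] for b in sampled], ignore_index=True
)
```

Repeating this draw-and-refit procedure for each bootstrap sample is what gives the spread of estimates used for the confidence intervals below.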
The first step is just to fit a Polyclonal model to all the data we are using.
We proceed as in the previous notebook for our RBD example, but first shrink the data set to just 10,000 variants to introduce more "error" and better illustrate the bootstrapping.
We will call this model fit to all the data we are using the "root" model as it's used as the starting point (root) for the subsequent bootstrapping. Note that data (which we will bootstrap) are attached to this pre-fit model:
# NBVAL_IGNORE_OUTPUT
import numpy
import pandas as pd
from plotnine import *
import polyclonal
# read data
noisy_data = (
pd.read_csv("RBD_variants_escape_noisy.csv", na_filter=None)
.query('library == "avg3muts"')
.query("concentration in [0.25, 1, 4]")
)
# just keep some variants to make fitting "noisier"
n_keep = 10000
barcodes_to_keep = (
noisy_data["barcode"].drop_duplicates().sample(n_keep, random_state=1).tolist()
)
noisy_data = noisy_data.query("barcode in @barcodes_to_keep")
# make and fit the root Polyclonal object with all the data we are using
root_poly = polyclonal.Polyclonal(
data_to_fit=noisy_data,
n_epitopes=3,
data_mut_escape_overlap="fill_to_data",
)
opt_res = root_poly.fit(
logfreq=200,
reg_escape_weight=0.01,
reg_uniqueness2_weight=0,
)
# First fitting site-level model.
# Starting optimization of 522 parameters at Sat Feb 11 13:21:17 2023.
    step  time_sec    loss  fit_loss  reg_escape  reg_spread  reg_spatial  reg_uniqueness  reg_uniqueness2  reg_activity
       0   0.02725  2336.7    2335.3           0           0            0               0                0        1.3149
     200     4.463  404.62     393.9      4.9364           0            0               0                0        5.7827
     400    8.9117  391.84    378.82      7.2693           0            0               0                0        5.7481
     600    14.154  389.29    375.83      7.6232           0            0               0                0        5.8309
     800     18.24  388.57    375.18      7.5455           0            0               0                0        5.8444
     896    20.168  388.55    375.13      7.5847           0            0               0                0        5.8429
# Successfully finished at Sat Feb 11 13:21:37 2023.
# Starting optimization of 5796 parameters at Sat Feb 11 13:21:37 2023.
    step  time_sec    loss  fit_loss  reg_escape  reg_spread  reg_spatial  reg_uniqueness  reg_uniqueness2  reg_activity
       0  0.022857   487.5       394      87.656  1.2484e-29            0               0                0        5.8429
     200    4.7855   164.5    90.501      54.697      13.393            0               0                0        5.9093
     400    10.349  137.01    83.114      34.726      13.355            0               0                0        5.8132
     600    15.457   130.8     81.21      30.145      13.622            0               0                0        5.8191
     800    20.864   126.2    79.626      26.891      13.894            0               0                0        5.7891
    1000    26.314  124.55    79.001      25.838      13.917            0               0                0        5.795
    1200    32.369   122.7    78.128      25.254      13.536            0               0                0        5.779
    1400    37.446  121.67    77.746      25.017       13.13            0               0                0        5.7724
    1600    42.602  120.94    77.305      24.876      12.985            0               0                0        5.7707
    1800    47.634  120.61    77.224      24.705      12.907            0               0                0        5.773
    2000    52.373  120.46    77.154      24.611      12.922            0               0                0        5.7708
    2200    57.217  120.17    77.047      24.569      12.787            0               0                0        5.767
    2278    59.502  120.16    77.039      24.567      12.785            0               0                0        5.7665
# Successfully finished at Sat Feb 11 13:22:36 2023.
To create the bootstrapped models, we initialize a PolyclonalBootstrap, here using just 5 samples for speed (for real analyses, you may want on the order of 20 to 100 bootstrap samples to get good error estimates).
Note it is important that the root model you are using has already been fit to the data!
Note also that there is an n_threads option that specifies how many threads to use for the bootstrapping: by default it is -1 (use all available CPUs), but set it to another number if you want to limit CPU usage:
n_bootstrap_samples = 5
bootstrap_poly = polyclonal.PolyclonalBootstrap(
root_polyclonal=root_poly,
n_bootstrap_samples=n_bootstrap_samples,
)
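The -1 convention for n_threads can be sketched as follows; resolve_n_threads is a hypothetical helper written here just to illustrate the convention, not code from the polyclonal package:

```python
import os


def resolve_n_threads(n_threads):
    """Resolve an n_threads argument following the convention described
    above: -1 means use all available CPUs, any other positive number is
    used as-is. This mirrors the stated behavior; it is not library code."""
    return os.cpu_count() if n_threads == -1 else n_threads
```

So passing n_threads=2 to PolyclonalBootstrap would cap the bootstrap fitting at two parallel worker threads.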
Now fit the bootstrapped models:
# NBVAL_IGNORE_OUTPUT
import time
start = time.time()
print(f"Starting fitting bootstrap models at {time.asctime()}")
n_fit, n_failed = bootstrap_poly.fit_models(
reg_escape_weight=0.01,
reg_uniqueness2_weight=0,
)
print(f"Fitting took {time.time() - start:.3g} seconds, finished at {time.asctime()}")
assert n_failed == 0 and n_fit == n_bootstrap_samples
Starting fitting bootstrap models at Sat Feb 11 13:22:39 2023 Fitting took 34.3 seconds, finished at Sat Feb 11 13:23:13 2023
We can get the resulting measurements for the epitope activities and mutation effects both per-replicate and summarized across replicates (mean, median, standard deviation).
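The summarization step is just an aggregation over the per-replicate values. Here is a minimal pandas sketch (with hypothetical escape values) of how a per-replicate table collapses to mean/median/standard-deviation columns, analogous to what the summary properties below return:

```python
import pandas as pd

# Hypothetical per-replicate escape values for two mutations across
# three bootstrap replicates (values invented for illustration).
reps = pd.DataFrame(
    {
        "mutation": ["N331A"] * 3 + ["N331D"] * 3,
        "bootstrap_replicate": [1, 2, 3] * 2,
        "escape": [0.1, 0.2, 0.3, -0.1, 0.0, 0.1],
    }
)

# Collapse across replicates: one row per mutation with summary stats.
summary = (
    reps.groupby("mutation")
    .agg(
        escape_mean=("escape", "mean"),
        escape_median=("escape", "median"),
        escape_std=("escape", "std"),
    )
    .reset_index()
)
```

The real summaries also track columns such as n_models and the fraction of replicates with a value for each mutation, but the core operation is this group-and-aggregate.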
Epitope activities. You can see that we cannot really resolve the class 1 epitope here, as it has negative activity. This is probably because we are using a smaller data set than the full simulated one fit in the prior example, so there is not enough data to effectively resolve this epitope. When you get a result like this, you should probably re-fit the model with just two epitopes:
# NBVAL_IGNORE_OUTPUT
display(bootstrap_poly.activity_wt_df.round(1))
bootstrap_poly.activity_wt_barplot()
|   | epitope | activity_mean | activity_median | activity_std |
|---|---------|---------------|-----------------|--------------|
| 0 | 1 | 2.5 | 2.5 | 0.0 |
| 1 | 2 | 3.5 | 3.5 | 0.0 |
| 2 | 3 | -0.1 | -0.1 | 0.0 |
Mutation escape values for each replicate:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_df_replicates.round(1).head()
|   | epitope | site | wildtype | mutant | mutation | escape | times_seen | bootstrap_replicate |
|---|---------|------|----------|--------|----------|--------|------------|---------------------|
| 0 | 1 | 331 | N | A | N331A | 0.0 | 14 | 1 |
| 1 | 1 | 331 | N | D | N331D | -0.1 | 11 | 1 |
| 2 | 1 | 331 | N | E | N331E | -0.0 | 15 | 1 |
| 3 | 1 | 331 | N | F | N331F | 0.0 | 14 | 1 |
| 4 | 1 | 331 | N | G | N331G | 0.0 | 21 | 1 |
Mutation escape values summarized across replicates.
Note the frac_bootstrap_replicates
column has the fraction of bootstrap replicates with a value for this mutation:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_df.round(1).head(n=3)
|   | epitope | site | wildtype | mutant | mutation | escape_mean | escape_median | escape_std | n_models | times_seen | frac_models |
|---|---------|------|----------|--------|----------|-------------|---------------|------------|----------|------------|-------------|
| 0 | 1 | 331 | N | A | N331A | -0.1 | -0.2 | 0.1 | 5 | 15.0 | 1.0 |
| 1 | 1 | 331 | N | D | N331D | 0.1 | -0.0 | 0.3 | 5 | 9.8 | 1.0 |
| 2 | 1 | 331 | N | E | N331E | 0.0 | -0.0 | 0.1 | 5 | 13.0 | 1.0 |
We can plot the mutation escape values across replicates.
The dropdown selects the statistic shown in the heatmap (mean or median), and mouseovers give details on points.
Here we initialize the times_seen slider at 2, so initially only escape values for mutations observed in at least 2 variants are shown; this threshold can then be adjusted interactively:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_plot(
addtl_slider_stats={"times_seen": 2}, per_model_tooltip=False
)
Site summaries of mutation escape values for replicates:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_site_summary_df_replicates().round(1).head()
|   | epitope | site | wildtype | mean | total positive | max | min | total negative | n mutations | bootstrap_replicate |
|---|---------|------|----------|------|----------------|-----|-----|----------------|-------------|---------------------|
| 0 | 1 | 331 | N | 0.1 | 2.3 | 0.9 | -0.3 | -0.8 | 16 | 1 |
| 1 | 1 | 332 | I | 0.1 | 2.9 | 0.7 | -0.2 | -0.3 | 19 | 1 |
| 2 | 1 | 333 | T | -0.0 | 0.9 | 0.6 | -0.4 | -1.4 | 18 | 1 |
| 3 | 1 | 334 | N | 0.0 | 2.0 | 1.0 | -0.6 | -2.0 | 18 | 1 |
| 4 | 1 | 335 | L | 0.0 | 1.5 | 0.3 | -0.5 | -1.3 | 19 | 1 |
Site summaries of mutation escape values summarized (e.g., averaged) across replicates.
Note that the metric
column now indicates a different row for each site-summary metric type, which is then summarized by its mean, median, and standard deviation.
We calculate these statistics only over mutations observed in at least 2 variants:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_site_summary_df(min_times_seen=2).round(1).head()
|   | epitope | site | wildtype | metric | escape_mean | escape_median | escape_std | n_models | frac_models | n mutations |
|---|---------|------|----------|--------|-------------|---------------|------------|----------|-------------|-------------|
| 0 | 1 | 331 | N | max | 0.7 | 0.7 | 0.1 | 5 | 1.0 | 16.0 |
| 1 | 1 | 331 | N | mean | 0.1 | 0.1 | 0.1 | 5 | 1.0 | 16.0 |
| 2 | 1 | 331 | N | min | -0.2 | -0.3 | 0.1 | 5 | 1.0 | 16.0 |
| 3 | 1 | 331 | N | total negative | -0.7 | -0.8 | 0.4 | 5 | 1.0 | 16.0 |
| 4 | 1 | 331 | N | total positive | 2.7 | 2.5 | 0.7 | 5 | 1.0 | 16.0 |
We can also use the bootstrapped models to predict the escape properties of new variants.
To illustrate, first get the data we used to fit the model along with the true IC90s for these variants:
true_data = noisy_data[["aa_substitutions", "IC90"]].drop_duplicates()
Next we get the IC90s for each variant and replicate:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.icXX_replicates(true_data, x=0.9, col="predicted_IC90").round(1).head(
n=3
)
|   | aa_substitutions | IC90 | predicted_IC90 | bootstrap_replicate |
|---|------------------|------|----------------|---------------------|
| 0 |  | 0.1 | 0.1 | 1 |
| 1 | G339Q | 0.1 | 0.1 | 1 |
| 2 | D427E | 0.1 | 0.1 | 1 |
More often, we want to summarize across replicates with the mean, median, and standard deviation:
# NBVAL_IGNORE_OUTPUT
pred_ic90s = bootstrap_poly.icXX(true_data, x=0.9, col="predicted_IC90", max_c=50)
pred_ic90s.round(1).head(n=3)
|   | aa_substitutions | IC90 | mean_predicted_IC90 | median_predicted_IC90 | std_predicted_IC90 | n_models | frac_models |
|---|------------------|------|---------------------|-----------------------|--------------------|----------|-------------|
| 0 |  | 0.1 | 0.1 | 0.1 | 0.0 | 5 | 1.0 |
| 1 | A344E | 0.1 | 0.1 | 0.1 | 0.0 | 5 | 1.0 |
| 2 | A344E A363I T385S L517V | 0.2 | 0.2 | 0.1 | 0.0 | 5 | 1.0 |
Plot the mean predictions against actual IC90s:
# NBVAL_IGNORE_OUTPUT
p = (
ggplot(pred_ic90s)
+ aes("IC90", "mean_predicted_IC90")
+ geom_point()
+ scale_x_log10()
+ scale_y_log10()
)
_ = p.draw()
Predictions are fairly well correlated with actual values:
numpy.log(pred_ic90s[["IC90", "mean_predicted_IC90"]]).corr().round(2)
|   | IC90 | mean_predicted_IC90 |
|---|------|---------------------|
| IC90 | 1.00 | 0.95 |
| mean_predicted_IC90 | 0.95 | 1.00 |
They are better correlated if we look only at variants whose mutations were each seen in many variants:
numpy.log(
bootstrap_poly.root_polyclonal.filter_variants_by_seen_muts(
pred_ic90s,
min_times_seen=20,
)[["IC90", "mean_predicted_IC90"]]
).corr().round(2)
|   | IC90 | mean_predicted_IC90 |
|---|------|---------------------|
| IC90 | 1.00 | 0.95 |
| mean_predicted_IC90 | 0.95 | 1.00 |
We can also directly get the predicted probabilities of escape for variants at different concentrations per bootstrap replicate:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.prob_escape_replicates(true_data, concentrations=[1, 4]).round(2).head()
|   | aa_substitutions | IC90 | concentration | predicted_prob_escape | bootstrap_replicate |
|---|------------------|------|---------------|-----------------------|---------------------|
| 0 |  | 0.08 | 1 | 0.00 | 1 |
| 1 | A344E | 0.12 | 1 | 0.00 | 1 |
| 2 | A344E A363I T385S L517V | 0.22 | 1 | 0.00 | 1 |
| 3 | A344E K356Y S373F V445P | 0.37 | 1 | 0.03 | 1 |
| 4 | A344E K378L G381R Q474G H519R R346K | 0.13 | 1 | 0.00 | 1 |
Or summarized across replicates:
# NBVAL_IGNORE_OUTPUT
pred_prob_escape = bootstrap_poly.prob_escape(true_data, concentrations=[1, 4])
pred_prob_escape.round(2).head()
|   | aa_substitutions | IC90 | mean_predicted_prob_escape | median_predicted_prob_escape | std_predicted_prob_escape | n_models | frac_models |
|---|------------------|------|----------------------------|------------------------------|---------------------------|----------|-------------|
| 0 |  | 0.08 | 0.00 | 0.00 | 0.00 | 10 | 2.0 |
| 1 | A344E | 0.12 | 0.00 | 0.00 | 0.00 | 10 | 2.0 |
| 2 | A344E A363I T385S L517V | 0.22 | 0.00 | 0.00 | 0.00 | 10 | 2.0 |
| 3 | A344E K356Y S373F V445P | 0.37 | 0.02 | 0.02 | 0.02 | 10 | 2.0 |
| 4 | A344E K378L G381R Q474G H519R R346K | 0.13 | 0.00 | 0.00 | 0.00 | 10 | 2.0 |