Bootstrapping does not really seem superior to simply looking at the times_seen parameter for mutations, and model averaging is the preferred approach. So this notebook is no longer tested or included in the docs and may be obsolete.
The previous section describes fitting a single model. But we may also want to have confidence estimates for the fit. We can do that via bootstrapping the data set.
Note, however, that our current recommendation is to use experimental replicates (see the later example on averaging models) rather than bootstrapping whenever possible, as experimental replicates often provide more robust estimates of error/variation than bootstrapping a single data set.
The overall recommended workflow is to first fit models to all the data to determine the number of epitopes, etc. Then once the desired fitting parameters are determined, you can bootstrap to get confidence on predictions.
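To make the resampling concrete, here is a minimal sketch of one bootstrap sample of a variant data set. The toy data frame and column values are hypothetical; the key idea is that barcodes (variants) are drawn with replacement, and all rows (e.g., all concentrations) for each drawn barcode are kept together:

```python
import pandas as pd

# Toy stand-in for a variant table, with two barcodes measured at two
# concentrations each (hypothetical values, just for illustration).
df = pd.DataFrame(
    {
        "barcode": ["AA", "AA", "AC", "AC"],
        "concentration": [0.25, 1, 0.25, 1],
        "prob_escape": [0.9, 0.5, 0.2, 0.05],
    }
)

# One bootstrap sample: draw barcodes with replacement, then keep all
# rows belonging to each drawn barcode.
barcodes = df["barcode"].drop_duplicates()
sampled = barcodes.sample(n=len(barcodes), replace=True, random_state=0)
boot_df = pd.concat(
    [df[df["barcode"] == b] for b in sampled], ignore_index=True
)
```

Repeating this draw-and-refit procedure for each bootstrap sample is what gives the spread of estimates used for the confidence intervals below.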
The first step is just to fit a Polyclonal model to all the data we are using.
We proceed as in the previous notebook for our RBD example, but first shrink the data set to just 10,000 variants to introduce more "error" and better illustrate the bootstrapping.
We will call this model fit to all the data we are using the "root" model as it's used as the starting point (root) for the subsequent bootstrapping. Note that data (which we will bootstrap) are attached to this pre-fit model:
# NBVAL_IGNORE_OUTPUT
import numpy
import pandas as pd
from plotnine import *
import polyclonal
# read data
noisy_data = (
pd.read_csv("RBD_variants_escape_noisy.csv", na_filter=None)
.query('library == "avg3muts"')
.query("concentration in [0.25, 1, 4]")
)
# just keep some variants to make fitting "noisier"
n_keep = 10000
barcodes_to_keep = (
noisy_data["barcode"].drop_duplicates().sample(n_keep, random_state=1).tolist()
)
noisy_data = noisy_data.query("barcode in @barcodes_to_keep")
# make and fit the root Polyclonal object with all the data we are using
root_poly = polyclonal.Polyclonal(
data_to_fit=noisy_data,
n_epitopes=3,
data_mut_escape_overlap="fill_to_data",
)
opt_res = root_poly.fit(
logfreq=200,
reg_escape_weight=0.01,
reg_uniqueness2_weight=0,
)
# First fitting site-level model.
# Starting optimization of 522 parameters at Sat Feb 11 13:21:17 2023.
    step  time_sec    loss  fit_loss  reg_escape  reg_spread  reg_spatial  reg_uniqueness  reg_uniqueness2  reg_activity
       0   0.02725  2336.7    2335.3           0           0            0               0                0        1.3149
     200     4.463  404.62     393.9      4.9364           0            0               0                0        5.7827
     400    8.9117  391.84    378.82      7.2693           0            0               0                0        5.7481
     600    14.154  389.29    375.83      7.6232           0            0               0                0        5.8309
     800     18.24  388.57    375.18      7.5455           0            0               0                0        5.8444
     896    20.168  388.55    375.13      7.5847           0            0               0                0        5.8429
# Successfully finished at Sat Feb 11 13:21:37 2023.
# Starting optimization of 5796 parameters at Sat Feb 11 13:21:37 2023.
    step  time_sec    loss  fit_loss  reg_escape  reg_spread  reg_spatial  reg_uniqueness  reg_uniqueness2  reg_activity
       0  0.022857   487.5       394      87.656  1.2484e-29            0               0                0        5.8429
     200    4.7855   164.5    90.501      54.697      13.393            0               0                0        5.9093
     400    10.349  137.01    83.114      34.726      13.355            0               0                0        5.8132
     600    15.457   130.8     81.21      30.145      13.622            0               0                0        5.8191
     800    20.864   126.2    79.626      26.891      13.894            0               0                0        5.7891
    1000    26.314  124.55    79.001      25.838      13.917            0               0                0        5.795
    1200    32.369   122.7    78.128      25.254      13.536            0               0                0        5.779
    1400    37.446  121.67    77.746      25.017       13.13            0               0                0        5.7724
    1600    42.602  120.94    77.305      24.876      12.985            0               0                0        5.7707
    1800    47.634  120.61    77.224      24.705      12.907            0               0                0        5.773
    2000    52.373  120.46    77.154      24.611      12.922            0               0                0        5.7708
    2200    57.217  120.17    77.047      24.569      12.787            0               0                0        5.767
    2278    59.502  120.16    77.039      24.567      12.785            0               0                0        5.7665
# Successfully finished at Sat Feb 11 13:22:36 2023.
To create the bootstrapped models, we initialize a PolyclonalBootstrap, here using just 5 samples for speed (for real analyses, you may want on the order of 20 to 100 bootstrap samples to get good error estimates).
Note it is important that the root model you are using has already been fit to the data!
Note also that there is an n_threads option that specifies how many threads to use for the bootstrapping: by default it is -1 (use all available CPUs), but set it to another number if you want to limit CPU usage:
n_bootstrap_samples = 5
bootstrap_poly = polyclonal.PolyclonalBootstrap(
root_polyclonal=root_poly,
n_bootstrap_samples=n_bootstrap_samples,
)
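The -1 convention for n_threads can be sketched as follows; resolve_n_threads is a hypothetical helper written here just to illustrate the convention, not code from the polyclonal package:

```python
import os


def resolve_n_threads(n_threads):
    """Resolve an n_threads argument following the convention described
    above: -1 means use all available CPUs, any other positive number is
    used as-is. This mirrors the stated behavior; it is not library code."""
    return os.cpu_count() if n_threads == -1 else n_threads
```

So passing n_threads=2 to PolyclonalBootstrap would cap the bootstrap fitting at two parallel worker threads.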
Now fit the bootstrapped models:
# NBVAL_IGNORE_OUTPUT
import time
start = time.time()
print(f"Starting fitting bootstrap models at {time.asctime()}")
n_fit, n_failed = bootstrap_poly.fit_models(
reg_escape_weight=0.01,
reg_uniqueness2_weight=0,
)
print(f"Fitting took {time.time() - start:.3g} seconds, finished at {time.asctime()}")
assert n_failed == 0 and n_fit == n_bootstrap_samples
Starting fitting bootstrap models at Sat Feb 11 13:22:39 2023 Fitting took 34.3 seconds, finished at Sat Feb 11 13:23:13 2023
We can get the resulting measurements for the epitope activities and mutation effects both per-replicate and summarized across replicates (mean, median, standard deviation).
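The summarization step is just an aggregation over the per-replicate values. Here is a minimal pandas sketch (with hypothetical escape values) of how a per-replicate table collapses to mean/median/standard-deviation columns, analogous to what the summary properties below return:

```python
import pandas as pd

# Hypothetical per-replicate escape values for two mutations across
# three bootstrap replicates (values invented for illustration).
reps = pd.DataFrame(
    {
        "mutation": ["N331A"] * 3 + ["N331D"] * 3,
        "bootstrap_replicate": [1, 2, 3] * 2,
        "escape": [0.1, 0.2, 0.3, -0.1, 0.0, 0.1],
    }
)

# Collapse across replicates: one row per mutation with summary stats.
summary = (
    reps.groupby("mutation")
    .agg(
        escape_mean=("escape", "mean"),
        escape_median=("escape", "median"),
        escape_std=("escape", "std"),
    )
    .reset_index()
)
```

The real summaries also track columns such as n_models and the fraction of replicates with a value for each mutation, but the core operation is this group-and-aggregate.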
Epitope activities. You can see that we cannot really resolve the class 1 epitope here, as it has negative activity. This is probably because we are using a smaller data set than the full simulated one fit in the prior example, so there is not enough data to effectively resolve this epitope. When you get a result like this, you should probably re-fit the model with just two epitopes:
# NBVAL_IGNORE_OUTPUT
display(bootstrap_poly.activity_wt_df.round(1))
bootstrap_poly.activity_wt_barplot()
|   | epitope | activity_mean | activity_median | activity_std |
|---|---------|---------------|-----------------|--------------|
| 0 | 1 | 2.5 | 2.5 | 0.0 |
| 1 | 2 | 3.5 | 3.5 | 0.0 |
| 2 | 3 | -0.1 | -0.1 | 0.0 |
Mutation escape values for each replicate:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_df_replicates.round(1).head()
|   | epitope | site | wildtype | mutant | mutation | escape | times_seen | bootstrap_replicate |
|---|---------|------|----------|--------|----------|--------|------------|---------------------|
| 0 | 1 | 331 | N | A | N331A | 0.0 | 14 | 1 |
| 1 | 1 | 331 | N | D | N331D | -0.1 | 11 | 1 |
| 2 | 1 | 331 | N | E | N331E | -0.0 | 15 | 1 |
| 3 | 1 | 331 | N | F | N331F | 0.0 | 14 | 1 |
| 4 | 1 | 331 | N | G | N331G | 0.0 | 21 | 1 |
Mutation escape values summarized across replicates.
Note the frac_bootstrap_replicates
column has the fraction of bootstrap replicates with a value for this mutation:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_df.round(1).head(n=3)
|   | epitope | site | wildtype | mutant | mutation | escape_mean | escape_median | escape_std | n_models | times_seen | frac_models |
|---|---------|------|----------|--------|----------|-------------|---------------|------------|----------|------------|-------------|
| 0 | 1 | 331 | N | A | N331A | -0.1 | -0.2 | 0.1 | 5 | 15.0 | 1.0 |
| 1 | 1 | 331 | N | D | N331D | 0.1 | -0.0 | 0.3 | 5 | 9.8 | 1.0 |
| 2 | 1 | 331 | N | E | N331E | 0.0 | -0.0 | 0.1 | 5 | 13.0 | 1.0 |
We can plot the mutation escape values across replicates.
The dropdown selects the statistic shown in the heatmap (mean or median), and mouseovers give details on points.
Here we initialize the times_seen slider at 2, so initially only escape values for mutations observed in at least 2 variants are shown; this threshold can then be adjusted interactively:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_plot(
addtl_slider_stats={"times_seen": 2}, per_model_tooltip=False
)
Site summaries of mutation escape values for replicates:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_site_summary_df_replicates().round(1).head()
|   | epitope | site | wildtype | mean | total positive | max | min | total negative | n mutations | bootstrap_replicate |
|---|---------|------|----------|------|----------------|-----|-----|----------------|-------------|---------------------|
| 0 | 1 | 331 | N | 0.1 | 2.3 | 0.9 | -0.3 | -0.8 | 16 | 1 |
| 1 | 1 | 332 | I | 0.1 | 2.9 | 0.7 | -0.2 | -0.3 | 19 | 1 |
| 2 | 1 | 333 | T | -0.0 | 0.9 | 0.6 | -0.4 | -1.4 | 18 | 1 |
| 3 | 1 | 334 | N | 0.0 | 2.0 | 1.0 | -0.6 | -2.0 | 18 | 1 |
| 4 | 1 | 335 | L | 0.0 | 1.5 | 0.3 | -0.5 | -1.3 | 19 | 1 |
Site summaries of mutation escape values summarized (e.g., averaged) across replicates.
Note that the metric
column now indicates a different row for each site-summary metric type, which is then summarized by its mean, median, and standard deviation.
We calculate these statistics only over mutations observed in at least 2 variants:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_site_summary_df(min_times_seen=2).round(1).head()
|   | epitope | site | wildtype | metric | escape_mean | escape_median | escape_std | n_models | frac_models | n mutations |
|---|---------|------|----------|--------|-------------|---------------|------------|----------|-------------|-------------|
| 0 | 1 | 331 | N | max | 0.7 | 0.7 | 0.1 | 5 | 1.0 | 16.0 |
| 1 | 1 | 331 | N | mean | 0.1 | 0.1 | 0.1 | 5 | 1.0 | 16.0 |
| 2 | 1 | 331 | N | min | -0.2 | -0.3 | 0.1 | 5 | 1.0 | 16.0 |
| 3 | 1 | 331 | N | total negative | -0.7 | -0.8 | 0.4 | 5 | 1.0 | 16.0 |
| 4 | 1 | 331 | N | total positive | 2.7 | 2.5 | 0.7 | 5 | 1.0 | 16.0 |
We can also use the bootstrapped models to predict the escape properties of new variants.
To illustrate, first get the data we used to fit the model along with the true IC90s for these variants:
true_data = noisy_data[["aa_substitutions", "IC90"]].drop_duplicates()
Next we get the IC90s for each variant and replicate:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.icXX_replicates(true_data, x=0.9, col="predicted_IC90").round(1).head(
n=3
)
|   | aa_substitutions | IC90 | predicted_IC90 | bootstrap_replicate |
|---|------------------|------|----------------|---------------------|
| 0 |  | 0.1 | 0.1 | 1 |
| 1 | G339Q | 0.1 | 0.1 | 1 |
| 2 | D427E | 0.1 | 0.1 | 1 |
More often, we want to summarize across replicates with the mean, median, and standard deviation:
# NBVAL_IGNORE_OUTPUT
pred_ic90s = bootstrap_poly.icXX(true_data, x=0.9, col="predicted_IC90", max_c=50)
pred_ic90s.round(1).head(n=3)
|   | aa_substitutions | IC90 | mean_predicted_IC90 | median_predicted_IC90 | std_predicted_IC90 | n_models | frac_models |
|---|------------------|------|---------------------|-----------------------|--------------------|----------|-------------|
| 0 |  | 0.1 | 0.1 | 0.1 | 0.0 | 5 | 1.0 |
| 1 | A344E | 0.1 | 0.1 | 0.1 | 0.0 | 5 | 1.0 |
| 2 | A344E A363I T385S L517V | 0.2 | 0.2 | 0.1 | 0.0 | 5 | 1.0 |
Plot the mean predictions against actual IC90s:
# NBVAL_IGNORE_OUTPUT
p = (
ggplot(pred_ic90s)
+ aes("IC90", "mean_predicted_IC90")
+ geom_point()
+ scale_x_log10()
+ scale_y_log10()
)
_ = p.draw()
Predictions are fairly well correlated with actual values:
numpy.log(pred_ic90s[["IC90", "mean_predicted_IC90"]]).corr().round(2)
|   | IC90 | mean_predicted_IC90 |
|---|------|---------------------|
| IC90 | 1.00 | 0.95 |
| mean_predicted_IC90 | 0.95 | 1.00 |
They are better correlated if we look only at variants whose mutations were each seen in many variants:
numpy.log(
bootstrap_poly.root_polyclonal.filter_variants_by_seen_muts(
pred_ic90s,
min_times_seen=20,
)[["IC90", "mean_predicted_IC90"]]
).corr().round(2)
|   | IC90 | mean_predicted_IC90 |
|---|------|---------------------|
| IC90 | 1.00 | 0.95 |
| mean_predicted_IC90 | 0.95 | 1.00 |
We can also directly get the predicted probabilities of escape for variants at different concentrations per bootstrap replicate:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.prob_escape_replicates(true_data, concentrations=[1, 4]).round(2).head()
|   | aa_substitutions | IC90 | concentration | predicted_prob_escape | bootstrap_replicate |
|---|------------------|------|---------------|-----------------------|---------------------|
| 0 |  | 0.08 | 1 | 0.00 | 1 |
| 1 | A344E | 0.12 | 1 | 0.00 | 1 |
| 2 | A344E A363I T385S L517V | 0.22 | 1 | 0.00 | 1 |
| 3 | A344E K356Y S373F V445P | 0.37 | 1 | 0.03 | 1 |
| 4 | A344E K378L G381R Q474G H519R R346K | 0.13 | 1 | 0.00 | 1 |
Or summarized across replicates:
# NBVAL_IGNORE_OUTPUT
pred_prob_escape = bootstrap_poly.prob_escape(true_data, concentrations=[1, 4])
pred_prob_escape.round(2).head()
|   | aa_substitutions | IC90 | mean_predicted_prob_escape | median_predicted_prob_escape | std_predicted_prob_escape | n_models | frac_models |
|---|------------------|------|----------------------------|------------------------------|---------------------------|----------|-------------|
| 0 |  | 0.08 | 0.00 | 0.00 | 0.00 | 10 | 2.0 |
| 1 | A344E | 0.12 | 0.00 | 0.00 | 0.00 | 10 | 2.0 |
| 2 | A344E A363I T385S L517V | 0.22 | 0.00 | 0.00 | 0.00 | 10 | 2.0 |
| 3 | A344E K356Y S373F V445P | 0.37 | 0.02 | 0.02 | 0.02 | 10 | 2.0 |
| 4 | A344E K378L G381R Q474G H519R R346K | 0.13 | 0.00 | 0.00 | 0.00 | 10 | 2.0 |