#!/usr/bin/env python
# coding: utf-8

# # Bootstrapping model fits
#
# **Bootstrapping does not really seem superior to simply looking at the *times_seen* parameter for mutations. Model averaging is the way to go. So this notebook is no longer tested or included in the docs, and may be obsolete.**
#
# The previous section describes fitting a single model.
# But we may also want confidence estimates for the fit.
# We can get those by bootstrapping the data set.
#
# Note, however, that our current recommendation is to use experimental replicates (see the later example on *Averaging models*) rather than bootstrapping whenever possible, as experimental replicates often provide more robust estimates of error/variation than bootstrapping a single data set.
#
# The overall recommended workflow is to first fit models to all the data to determine the number of epitopes, etc.
# Then, once the desired fitting parameters are determined, you can bootstrap to get confidence on predictions.

# ## Get model fit to the data
# The first step is to fit a `Polyclonal` model to all the data we are using.
# We proceed as in the previous notebook for our RBD example, but first shrink the data set to just 10,000 variants to provide more "error" that better illustrates the bootstrapping.
#
# We call this model, fit to all the data we are using, the "root" model, as it serves as the starting point (root) for the subsequent bootstrapping.
# Note that the data (which we will bootstrap) are attached to this pre-fit model:

# In[1]:


# NBVAL_IGNORE_OUTPUT

import numpy

import pandas as pd

from plotnine import *

import polyclonal

# read data
noisy_data = (
    pd.read_csv("RBD_variants_escape_noisy.csv", na_filter=None)
    .query('library == "avg3muts"')
    .query("concentration in [0.25, 1, 4]")
)

# just keep some variants to make fitting "noisier"
n_keep = 10000
barcodes_to_keep = (
    noisy_data["barcode"].drop_duplicates().sample(n_keep, random_state=1).tolist()
)
noisy_data = noisy_data.query("barcode in @barcodes_to_keep")

# make and fit the root Polyclonal object with all the data we are using
root_poly = polyclonal.Polyclonal(
    data_to_fit=noisy_data,
    n_epitopes=3,
    data_mut_escape_overlap="fill_to_data",
)

opt_res = root_poly.fit(
    logfreq=200,
    reg_escape_weight=0.01,
    reg_uniqueness2_weight=0,
)
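# Conceptually, each bootstrap sample resamples the variants (barcodes) with replacement from the data attached to the root model, then refits a model to that resampled data.
# `PolyclonalBootstrap` (used below) handles this internally; purely for illustration, here is a minimal sketch of the resampling idea using pandas.
# This is **not** the library's actual implementation, and the details (such as whether sampling is stratified by concentration) are assumptions:

# In[ ]:


# hedged sketch of bootstrap resampling, for illustration only: the real
# resampling is done internally by PolyclonalBootstrap, and these details
# are assumptions rather than the library's actual implementation
rng = numpy.random.default_rng(1)

unique_barcodes = noisy_data["barcode"].unique()

# draw barcodes with replacement; a barcode drawn k times contributes its
# measurements (at all concentrations) k times to the bootstrap sample
sampled_barcodes = rng.choice(unique_barcodes, size=len(unique_barcodes), replace=True)

boot_data = pd.DataFrame({"barcode": sampled_barcodes}).merge(
    noisy_data, on="barcode", how="left"
)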
# ## Create and fit bootstrapped models
# To create the bootstrapped models, we initialize a `PolyclonalBootstrap`, here using just 5 samples for speed (for real analyses, you may want on the order of 20 to 100 bootstrap samples to get good error estimates).
# Note that it is important that the root model you are using has already been fit to the data!
# Note also that there is an `n_threads` option which specifies how many threads should be used for the bootstrapping: by default it is -1 (use all available CPUs), but set it to another number if you want to limit CPU usage:

# In[2]:


n_bootstrap_samples = 5

bootstrap_poly = polyclonal.PolyclonalBootstrap(
    root_polyclonal=root_poly,
    n_bootstrap_samples=n_bootstrap_samples,
)

# Now fit the bootstrapped models:

# In[3]:


# NBVAL_IGNORE_OUTPUT

import time

start = time.time()
print(f"Starting fitting bootstrap models at {time.asctime()}")

n_fit, n_failed = bootstrap_poly.fit_models(
    reg_escape_weight=0.01,
    reg_uniqueness2_weight=0,
)

print(f"Fitting took {time.time() - start:.3g} seconds, finished at {time.asctime()}")

assert n_failed == 0 and n_fit == n_bootstrap_samples

# ## Look at summarized results
# We can get the resulting measurements for the epitope activities and mutation effects both per replicate and summarized across replicates (mean, median, standard deviation).
#
# ### Epitope activities
# You can see that we cannot really resolve the class 1 epitope here, as it has negative activity.
# This is probably because we are using a smaller data set than the full simulated data set fit in the prior example, so there isn't enough data to effectively resolve this epitope.
# When you get a result like this, you should probably re-fit the model with just two epitopes:

# In[4]:


# NBVAL_IGNORE_OUTPUT
display(bootstrap_poly.activity_wt_df.round(1))

bootstrap_poly.activity_wt_barplot()

# ### Mutation escape values
# Mutation escape values for each replicate:

# In[5]:


# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_df_replicates.round(1).head()

# Mutation escape values summarized across replicates.
# Note that the `frac_bootstrap_replicates` column gives the fraction of bootstrap replicates with a value for each mutation:

# In[6]:


# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_df.round(1).head(n=3)

# We can plot the mutation escape values across replicates.
# The dropdown selects the statistic shown in the heatmap (mean or median), and mouseovers give details on points.
# Here we set `times_seen=2` in the slider so that initially only escape values for mutations observed in at least 2 variants are shown; this can be adjusted via the slider:

# In[7]:


# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_plot(
    addtl_slider_stats={"times_seen": 2}, per_model_tooltip=False
)

# ### Site summaries of mutation escape
# Site summaries of mutation escape values for each replicate:

# In[8]:


# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_site_summary_df_replicates().round(1).head()

# Site summaries of mutation escape values summarized (e.g., averaged) across replicates.
# Note that the `metric` column now indicates a different row for each site-summary metric type, each of which is then summarized by its mean, median, and standard deviation.
# We calculate these statistics only over mutations observed in at least 2 variants:

# In[9]:


# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_site_summary_df(min_times_seen=2).round(1).head()
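# The mean/median/std summaries above come directly from `PolyclonalBootstrap`.
# If you want a different summary, such as approximate percentile confidence intervals, you can compute it yourself from the per-replicate values.
# Below is a hedged sketch; the column names (`epitope`, `mutation`, `escape`) are assumptions, so check `bootstrap_poly.mut_escape_df_replicates.columns` first, and note that with only 5 bootstrap samples such intervals are very crude:

# In[ ]:


# hedged sketch: approximate 95% bootstrap percentile intervals per mutation;
# the column names used here are assumptions, so inspect the data frame
# before relying on this, and use many more samples for meaningful intervals
escape_ci = (
    bootstrap_poly.mut_escape_df_replicates.groupby(["epitope", "mutation"])["escape"]
    .quantile([0.025, 0.975])
    .unstack()
    .rename(columns={0.025: "escape_lower", 0.975: "escape_upper"})
    .reset_index()
)

escape_ci.head()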
# ## Predictions of ICXXs and probabilities of escape
# We can also use the bootstrapped models to predict the escape properties of new variants.
#
# To illustrate, first get the data we used to fit the model, along with the true IC90s for these variants:

# In[10]:


true_data = noisy_data[["aa_substitutions", "IC90"]].drop_duplicates()

# Next we get the IC90s for each variant and replicate:

# In[11]:


# NBVAL_IGNORE_OUTPUT
bootstrap_poly.icXX_replicates(true_data, x=0.9, col="predicted_IC90").round(1).head(
    n=3
)

# More often, we want to summarize across replicates with the mean / median / standard deviation:

# In[12]:


# NBVAL_IGNORE_OUTPUT
pred_ic90s = bootstrap_poly.icXX(true_data, x=0.9, col="predicted_IC90", max_c=50)

pred_ic90s.round(1).head(n=3)

# Plot the mean predictions against the actual IC90s:

# In[13]:


# NBVAL_IGNORE_OUTPUT
p = (
    ggplot(pred_ic90s)
    + aes("IC90", "mean_predicted_IC90")
    + geom_point()
    + scale_x_log10()
    + scale_y_log10()
)

_ = p.draw()

# The predictions are fairly well correlated with the actual values:

# In[14]:


numpy.log(pred_ic90s[["IC90", "mean_predicted_IC90"]]).corr().round(2)

# They are better correlated if we look only at variants whose mutations were each seen in at least 20 variants:

# In[15]:


numpy.log(
    bootstrap_poly.root_polyclonal.filter_variants_by_seen_muts(
        pred_ic90s,
        min_times_seen=20,
    )[["IC90", "mean_predicted_IC90"]]
).corr().round(2)

# We can also directly get the predicted probabilities of escape for variants at different concentrations, per bootstrap replicate:

# In[16]:


# NBVAL_IGNORE_OUTPUT
bootstrap_poly.prob_escape_replicates(true_data, concentrations=[1, 4]).round(2).head()

# Or summarized across replicates:

# In[17]:


# NBVAL_IGNORE_OUTPUT
pred_prob_escape = bootstrap_poly.prob_escape(true_data, concentrations=[1, 4])

pred_prob_escape.round(2).head()
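# As a final hedged sanity check, we could plot the summarized predicted probabilities of escape against the true IC90s (carried over from `true_data`): at a given concentration, variants with larger IC90s should tend to have a higher probability of escape.
# The summary column name `mean_prob_escape` is an assumption, so inspect `pred_prob_escape.columns` before running this:

# In[ ]:


# NBVAL_IGNORE_OUTPUT
# hedged sketch: the "mean_prob_escape" column name is an assumption, so
# check pred_prob_escape.columns first; each panel is one concentration
p = (
    ggplot(pred_prob_escape)
    + aes("IC90", "mean_prob_escape")
    + geom_point(alpha=0.2)
    + scale_x_log10()
    + facet_wrap("~ concentration")
)

_ = p.draw()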