#!/usr/bin/env python
# coding: utf-8

# # Informative distances and summary statistics

# Approximate Bayesian computation (ABC) relies on the efficient comparison of relevant features in simulated and observed data, via distance metrics and potentially summary statistics. Separately, methods have been developed to adaptively scale-normalize the distance metric, and to semi-automatically derive informative, low-dimensional summary statistics.
# 
# In the notebook on "Adaptive distances" we demonstrated how distances adjusting weights to normalize scales are beneficial for heterogeneous, including outlier-corrupted, data. However, when parts of the data are uninformative, it is desirable to further concentrate the analysis on informative data points. Various methods have been develoepd to capture information of data on parameters in a low-dimensional summary statistics representation, see e.g. [Blum et al. 2013](https://doi.org/10.1214/12-STS406) for a review. A particular approach constructs summary statistics as outputs of regression models of parameters on data, see the seminal work by [Fearnhead and Prangle 2012](https://doi.org/10.1111/j.1467-9868.2011.01010.x). In this notebook, we illustrate the use of regression methods to construct informative summary statistics and sensitivity distance weights in pyABC.

# In[1]:


# install if not done yet
get_ipython().system('pip install pyabc[plotly] --quiet')


# In[1]:


import numpy as np
import scipy as sp
import tempfile
import matplotlib.pyplot as plt
from functools import partial
import logging
from IPython.display import SVG,display

import pyabc
from pyabc.distance import *
from pyabc.predictor import *
from pyabc.sumstat import *
from pyabc.util import EventIxs, ParTrafo, dict2arrlabels

pyabc.settings.set_figure_params("pyabc")  # for beautified plots

# for debugging
for logger in ["ABC.Distance", "ABC.Predictor", "ABC.Sumstat"]:
    logging.getLogger(logger).setLevel(logging.DEBUG)


# ## Simple illustration example

# To illustrate informativeness of data points, we consider a simple test problem. It consists of a single model output $y_1$ informative of parameter $p_1$, and uninformative model outputs $y_2$.

# In[2]:


# problem definition

sigmas = {"p1": 0.1}

def model(p):
    return {"y1": p["p1"] + 1 + sigmas["p1"] * np.random.normal(),
            "y2": 2 + 0.1 * np.random.normal(size=3)}

gt_par = {"p1": 3}

data = {"y1": gt_par["p1"] + 1, "y2": 2 * np.ones(shape=3)}

prior_bounds = {"p1": (0, 10)}

prior = pyabc.Distribution(
    **{
        key: pyabc.RV("uniform", lb, ub - lb)
        for key, (lb, ub) in prior_bounds.items()
    },
)


# We employ three approaches to perform inference on this problem.
# 
# Firstly, we use a distance adaptively scale-normalizing all statistics by their respective in-sample median absolute deviations (MAD), as introduced in the "Adaptive distances" notebook ("L1+Ada.+MAD").
# 
# Secondly, we employ an approach similar to [Fearnhead and Prangle 2012](https://doi.org/10.1111/j.1467-9868.2011.01010.x), using a linear regression model, trained after 40% of the total sample budget, as summary statistic, with a simple L1 distance ("L1+StatLR").
# 
# Thirdly, we complement the MAD scale-normalizing weights by sensitivity weights, derived via normalized sensitivities of a linear regression model trained similarly to the second approach. This method thus accounts for informativeness by re-weighting of model outputs, without explicitly employing a low-dimensional summary statistics representation.

# In[3]:


# analysis definition

pop_size = 100
total_sims = 3000
fit_sims = 0.4 * total_sims

YPredictor = LinearPredictor
#YPredictor = MLPPredictor

distances = {
    "L1+Ada.+MAD": AdaptivePNormDistance(
        p=1,
        # adaptive scale normalization
        scale_function=mad,
    ),
    "L1+StatLR": PNormDistance(
        p=1,
        # regression-based summary statistics
        sumstat=PredictorSumstat(
            # regression model used
            predictor=YPredictor(),
            # when to fit the regression model
            fit_ixs=EventIxs(sims=fit_sims),
        ),
    ),
    "L1+Ada.+MAD+SensiLR": InfoWeightedPNormDistance(
        p=1,
        # adaptive scale normalization
        scale_function=mad,
        # regression model used to define sensitivity weights
        predictor=YPredictor(),
        # when to fit the regression model
        fit_info_ixs=EventIxs(sims=fit_sims),
    ),
}

colors = {
    distance_id: f"C{i}"
    for i, distance_id in enumerate(distances)
}


# We perform the analysis using all above distance functions and summary statistics. Additionally, below we specify various logging files to capture relevant information for further analysis.

# In[4]:


# runs

db_file = tempfile.mkstemp(suffix=".db")[1]

scale_log_file = tempfile.mkstemp()[1]
info_log_file = tempfile.mkstemp()[1]
info_sample_log_file = tempfile.mkstemp()[1]

hs = []
for distance_id, distance in distances.items():
    print(distance_id)
    if isinstance(distance, AdaptivePNormDistance):
        distance.scale_log_file = f"{scale_log_file}_{distance_id}.json"
    if isinstance(distance, InfoWeightedPNormDistance):
        distance.info_log_file = f"{info_log_file}_{distance_id}.json"
        distance.info_sample_log_file = f"{info_sample_log_file}_{distance_id}"

    abc = pyabc.ABCSMC(model, prior, distance, population_size=pop_size)
    h = abc.new(db="sqlite:///" + db_file, observed_sum_stat=data)
    abc.run(max_total_nr_simulations=total_sims)
    hs.append(h)


# The comparison of the obtained posterior approximations with the true posterior reveals that L1+Ada.+MAD gave a worse fit compared to the other approaches. This is because it only applies scale-normalization, but does not account for informativeness of data, and thus spends a lot of time on fitting $y_2$.

# In[5]:


# plot ABC posterior approximations

fig, axes = plt.subplots(ncols=len(prior_bounds), figsize=(6, 4))
if len(prior_bounds) == 1:
    axes = [axes]

# plot ground truth

def unnorm_1d_normal_pdf(p, y_obs, sigma, p_to_y = None):
    """Non-normalized 1-d normal density.
    
    Parameters
    ----------
    p: Parameter to evaluate at.
    y_obs: Observed data.
    sigma: Noise standard deviation.
    p_to_y: Function to deterministically transform p to simulated data.
    
    Returns
    -------
    pd: Probability density/densities at p.
    """
    if p_to_y is None:
        p_to_y = lambda p: p
    y = p_to_y(p)
    pd = np.exp(- (y - y_obs)**2 / (2 * sigma**2))
    return pd

for i_par, par in enumerate(gt_par.keys()):
    # define parameter-simulation transformation
    p_to_y = lambda p: p+1
    # observed data corresponding to parameter
    y_obs = p_to_y(gt_par[par])
    # bounds
    xmin, xmax = prior_bounds[par]
    # standard deviation
    sigma = sigmas[par]

    # pdf as function of only p
    pdf = partial(
        unnorm_1d_normal_pdf, y_obs=y_obs, sigma=sigma, p_to_y=p_to_y,
    )

    # integrate density
    norm = sp.integrate.quad(pdf, xmin, xmax)[0]

    # plot density
    xs = np.linspace(xmin, xmax, 300)
    axes[i_par].plot(
        xs, pdf(xs) / norm, linestyle="dashed",
        color="grey",label="ground truth",
    )

# plot ABC approximations

for i_par, par in enumerate(prior_bounds.keys()):
    for distance_id, h in zip(distances.keys(), hs):
        pyabc.visualization.plot_kde_1d_highlevel(
            h, x=par, xname=par,
            xmin=prior_bounds[par][0], xmax=prior_bounds[par][1],
            ax=axes[i_par], label=distance_id,
            numx=500,
        )

# prettify
for ax in axes[1:]:
    ax.set_ylabel(None)
fig.tight_layout(rect=(0, 0.1, 1, 1))
axes[-1].legend()


# Via the log files, we can further examine the employed weights, firstly scale-normalizing weights based on MAD, and secondly sensitivity weights quantifying informativeness. Indeed, while both L1+Ada.+MAD and L1+Ada.+MAD+SensiLR assign large weights to $y_2$, the additional sensitivity weights employed by L1+Ada.+MAD+SensiLR counteract this by assigning a large weight to $y_1$.

# In[6]:


# plot weights

fig, axes = plt.subplots(nrows=2, ncols=len(gt_par), figsize=(4, 8))

# scale weights

scale_distance_ids = [
    distance_id for distance_id in distances.keys()
    if "Ada." in distance_id and "Stat" not in distance_id
]
scale_log_files = []
for i_dist, distance_id in enumerate(scale_distance_ids):
    scale_log_files.append(f"{scale_log_file}_{distance_id}.json")

pyabc.visualization.plot_distance_weights(
    scale_log_files,
    labels=scale_distance_ids,
    colors=[colors[distance_id] for distance_id in scale_distance_ids],
    xlabel="Model output",
    title="Scale weights",
    ax=axes[0],
    keys=dict2arrlabels(data, keys=data.keys()),
)

# info weights

info_distance_ids = [
    distance_id for distance_id in distances.keys()
    if "Sensi" in distance_id
]
info_log_files = []
for i_dist, distance_id in enumerate(info_distance_ids):
    info_log_files.append(f"{info_log_file}_{distance_id}.json")

pyabc.visualization.plot_distance_weights(
    info_log_files,
    labels=info_distance_ids,
    colors=[colors[distance_id] for distance_id in info_distance_ids],
    xlabel="Model output",
    title="Sensitivity weights",
    ax=axes[1],
    keys=dict2arrlabels(data, keys=data.keys()),
)

fig.tight_layout()


# To further understand the employed sensitivity matrix, we can visualize the connections between model outputs and parameters (in this case a single one) via a "Sankey" flow diagram. In this case, this gives no further information beyond the above weight diagram.

# In[7]:


# plot flow diagram

fig = pyabc.visualization.plot_sensitivity_sankey(
    info_sample_log_file=f"{info_sample_log_file}_L1+Ada.+MAD+SensiLR",
    t=f"{info_log_file}_L1+Ada.+MAD+SensiLR.json",
    h=hs[-1],
    predictor=LinearPredictor(),
)

# here just showing a non-interactive plot to reduce storage
img_file = tempfile.mkstemp(suffix=(".svg"))[1]
fig.write_image(img_file)
display(SVG(img_file))


# ## Challenging problem features

# Now, we turn to a slightly more challenging problem:
# 
# * $y_1\sim\mathcal{N}(\theta_1,0.1^2)$ is informative of $\theta_1$, with a relatively wide corresponding prior $\theta_1\sim U[-7, 7]$,
# * $y_2\sim\mathcal{N}(\theta_2,100^2)$ is informative of $\theta_2$, with corresponding prior $\theta_2\sim U[-700, 700]$,
# * $y_3\sim\mathcal{N}(\theta_3, 4 \cdot 100^2)^{\otimes 4}$ is a four-dimensional vector informative of $\theta_3$, with corresponding prior $\theta_3\sim U[-700, 700]$,
# * $y_4\sim\mathcal{N}(\theta_4^2, 0.1^2)$ is informative of $\theta_4$, with corresponding symmetric prior $\theta_4\sim U[-1, 1]$, however is quadratic in the parameter, resulting in a bimodal posterior distribution for $y_{\text{obs},4}\neq 0$,
# * $y_5\sim\mathcal{N}(0, 10)^{\otimes 10}$ is an uninformative 10-dimensional vector.
# 
# The problem encompasses multiple challenging features established methods have problems with:
# 
# * A substantial part of the data, $y_5$, is uninformative, such that approaches not accounting for informativeness of data may converge slower.
# * Both data and parameters are on different scales, such that approaches comparing data, or, via regression-based summary statistics, parameters, without normalization may be biased towards large-scale variables. Further, e.g. the prior of $\theta_1$ is relatively wide, such that pre-calibrated weighting is sub-optimal, as discussed in \citet{Prangle2017}.
# * $y_4$ is quadratic in $\theta_4$ and symmetric over the prior, such that the inverse first-order regression approaches as in FearnheadPra2012} cannot capture a meaningful relationship.
# * While the distributions of $y_3$ and $y_4$ are such that the posterior distributions of $\theta_3$, $\theta_4$ are identical, in solely scale-normalized approaches such as [Fearnhead and Prangle 2012](https://doi.org/10.1111/j.1467-9868.2011.01010.x), the impact of $y_4$ on the distance value is roughly four times as high as that of $y_3$, leading to potentially uneven convergence.

# In[8]:


# problem definition

sigmas = {"p1": 1e-1, "p2": 1e2, "p3": 1e2, "p4": 1e-1}

def model(p):
    return {
        "y1": p["p1"] + sigmas["p1"] * np.random.normal(),
        "y2": p["p2"] + sigmas["p2"] * np.random.normal(),
        "y3": p["p3"] + np.sqrt(4 * sigmas["p3"]**2) * np.random.normal(size=4),
        "y4": p["p4"] ** 2 + sigmas["p4"] * np.random.normal(),
        "y5": 1e1 * np.random.normal(size=10),
    }

prior_bounds = {
    "p1": (-7e0, 7e0),
    "p2": (-7e2, 7e2),
    "p3": (-7e2, 7e2),
    "p4": (-1e0, 1e0),
}

prior = pyabc.Distribution(
    **{
        key: pyabc.RV("uniform", lb, ub - lb)
        for key, (lb, ub) in prior_bounds.items()
    },
)

gt_par = {"p1": 0, "p2": 0, "p3": 0, "p4": 0.5}
data = {"y1": 0, "y2": 0, "y3": 0 * np.ones(4), "y4": 0.5**2, "y5": 0 * np.ones(10)}


# To tackle these problems, we suggest to firstly consistenly employ scale normalization, both on the raw model outputs and on the level of summary statistics. Secondly, we suggest to instead of only inferring a mapping $s: y \mapsto \theta$, we target augmented parameter vectors, $s: y \mapsto \lambda(\theta)$, with e.g. $\lambda(\theta) = (\theta^1,\ldots,\theta^4)$. This practically allows to break symmetry, e.g. if only $\theta^2$ can be expressed as a function of the data. Conceptually, this further allows to obtain a more accurate description of the posterior distribution, as the summary statistics may be regarded as approximations to $s(y) = \mathbb{E}[\lambda(\theta)|y]$, using which as summary statistics preserves the corresponding posterior moments, i.e.
# 
# $$\lim_{\varepsilon\rightarrow 0}\mathbb{E}_{\pi_{\text{ABC},\varepsilon}}[\lambda(\Theta)|s(y_\text{obs})] = \mathbb{E}[\lambda(\Theta)|Y=y_\text{obs}].$$
# 
# Methods employing scale normalization, accounting for informativeness, and augmented regression targets, are L1+Ada.+MAD+StatLR+P4, which uses regression-based summary statistics, and L1+Ada.+MAD+SensiLR+P4, which uses sensitivity weights.
# For comparison, we consider L1+Ada.+MAD only normalizing scales, and L1+StatLR, using non-scale normaled summary statistics, as well as L1+Ada.+MAD+StatLR and L1+Ada.+MAD+SensiLR using only a subset of methods.

# In[9]:


# analysis definition

pop_size = 1000
total_sims = 50000
par_trafos = [lambda x: x, lambda x: x**2, lambda x: x**3, lambda x: x**4]
fit_sims = 0.4 * total_sims

YPredictor = LinearPredictor
#YPredictor = MLPPredictor

distances = {
    "L1+Ada.+MAD": AdaptivePNormDistance(
        p=1,
        scale_function=mad,
    ),
    "L1+StatLR": PNormDistance(
        p=1,
        sumstat=PredictorSumstat(
            predictor=YPredictor(normalize_features=False, normalize_labels=False),
            fit_ixs=EventIxs(sims=fit_sims),
        ),
    ),
    "L1+Ada.+MAD+StatLR": AdaptivePNormDistance(
        p=1,
        scale_function=mad,
        sumstat=PredictorSumstat(
            predictor=YPredictor(),
            fit_ixs=EventIxs(sims=fit_sims),
        ),
    ),
    "L1+Ada.+MAD+StatLR+P4": AdaptivePNormDistance(
        p=1,
        scale_function=mad,
        sumstat=PredictorSumstat(
            predictor=YPredictor(),
            fit_ixs=EventIxs(sims=fit_sims),
            par_trafo=ParTrafo(trafos=par_trafos),
        ),
    ),
    "L1+Ada.+MAD+SensiLR": InfoWeightedPNormDistance(
        p=1,
        scale_function=mad,
        predictor=YPredictor(),
        fit_info_ixs=EventIxs(sims=fit_sims),
        feature_normalization="mad",
    ),
    "L1+Ada.+MAD+SensiLR+P4": InfoWeightedPNormDistance(
        p=1,
        scale_function=mad,
        predictor=YPredictor(),
        fit_info_ixs=EventIxs(sims=fit_sims),
        feature_normalization="mad",
        par_trafo=ParTrafo(trafos=par_trafos),
    ),
}

colors = {
    distance_id: f"C{i}"
    for i, distance_id in enumerate(distances)
}


# For the analysis, we suggest the use of sufficiently large population sizes, as the process model is more complex.

# In[10]:


get_ipython().run_cell_magic('time', '', '\n# runs\n\ndb_file = tempfile.mkstemp(suffix=".db")[1]\n\nscale_log_file = tempfile.mkstemp()[1]\ninfo_log_file = tempfile.mkstemp()[1]\ninfo_sample_log_file = tempfile.mkstemp()[1]\n\nhs = []\nfor distance_id, distance in distances.items():\n    print(distance_id)\n    if isinstance(distance, AdaptivePNormDistance):\n        distance.scale_log_file = f"{scale_log_file}_{distance_id}.json"\n    if isinstance(distance, InfoWeightedPNormDistance):\n        distance.info_log_file = f"{info_log_file}_{distance_id}.json"\n        distance.info_sample_log_file = f"{info_sample_log_file}_{distance_id}"\n\n    abc = pyabc.ABCSMC(model, prior, distance, population_size=pop_size)\n    h = abc.new(db="sqlite:///" + db_file, observed_sum_stat=data)\n    abc.run(max_total_nr_simulations=total_sims)\n    hs.append(h)\n')


# While overall all approaches would benefit from a continued analysis, the approaches L1+Ada.+MAD+StatLR+P4 and L1+Ada.+MAD+SensiLR+P4 employing scale normalization, accounting for informativeness, and using augmented regression targets, approximate the true posterior distribution best.
# Using only scale normalization captures the overall dynamics, however givese large uncertainties, as unnecessary emphasis is put on $y_5$.
# Approaches only using $\theta$ as regression targets however fail to capture the dynamics of $\theta_4$, as the regression model cannot unravel a meaningful relationship between data and parameters.

# In[11]:


fig, axes = plt.subplots(ncols=len(prior_bounds), figsize=(16, 4))

# plot ground truth

for i_par, par in enumerate(gt_par.keys()):
    # define parameter-simulation transformation
    p_to_y = lambda p: p
    if par == "p4":
        p_to_y = lambda p: p**2
    # observed data corresponding to parameter
    y_obs = p_to_y(gt_par[par])
    # bounds
    xmin, xmax = prior_bounds[par]
    # standard deviation
    sigma = sigmas[par]

    # pdf as function of only p
    pdf = partial(
        unnorm_1d_normal_pdf, y_obs=y_obs, sigma=sigma, p_to_y=p_to_y,
    )

    # integrate density
    norm = sp.integrate.quad(pdf, xmin, xmax)[0]

    # plot density
    xs = np.linspace(xmin, xmax, 300)
    axes[i_par].plot(
        xs, pdf(xs) / norm, linestyle="dashed",
        color="grey",label="ground truth",
    )

# plot ABC approximations

for i_par, par in enumerate(prior_bounds.keys()):
    for distance_id, h in zip(distances.keys(), hs):
        pyabc.visualization.plot_kde_1d_highlevel(
            h, x=par, xname=par,
            xmin=prior_bounds[par][0], xmax=prior_bounds[par][1],
            ax=axes[i_par], label=distance_id,
            kde=pyabc.GridSearchCV() if par == "p4" else None,
            numx=500,
        )

# prettify
for ax in axes[1:]:
    ax.set_ylabel(None)
fig.tight_layout(rect=(0, 0.1, 1, 1))
axes[-1].legend(bbox_to_anchor=(1, -0.2), loc="upper right", ncol=len(distances)+1)


# While the scale weights accurately depict the scales the various model output types vary on, the sensitivity weights are high for $y_1$ through $y_4$, with low weights assigned to $y_5$. Very roughly, the sum of sensitivity weights for the four model outputs $y_3$ is roughly equal to e.g. the sensitivity weigh assigned to $y_2$, as desirable. However, the weights assigned are now completely homogeneous, indicating that an increased training sample or more complex regression model may be preferable.

# In[12]:


# plot weights

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(8, 8))

# scale weights

scale_distance_ids = [
    distance_id for distance_id in distances.keys()
    if "Ada." in distance_id and "Stat" not in distance_id
]
scale_log_files = []
for i_dist, distance_id in enumerate(scale_distance_ids):
    scale_log_files.append(f"{scale_log_file}_{distance_id}.json")

pyabc.visualization.plot_distance_weights(
    scale_log_files,
    labels=scale_distance_ids,
    colors=[colors[distance_id] for distance_id in scale_distance_ids],
    xlabel="Model output",
    title="Scale weights",
    ax=axes[0],
    keys=dict2arrlabels(data, keys=data.keys()),
)

# info weights

info_distance_ids = [
    distance_id for distance_id in distances.keys()
    if "Sensi" in distance_id
]
info_log_files = []
for i_dist, distance_id in enumerate(info_distance_ids):
    info_log_files.append(f"{info_log_file}_{distance_id}.json")

pyabc.visualization.plot_distance_weights(
    info_log_files,
    labels=info_distance_ids,
    colors=[colors[distance_id] for distance_id in info_distance_ids],
    xlabel="Model output",
    title="Sensitivity weights",
    ax=axes[1],
    keys=dict2arrlabels(data, keys=data.keys()),
)

fig.tight_layout()


# In[13]:


# plot flow diagram

fig = pyabc.visualization.plot_sensitivity_sankey(
    info_sample_log_file=f"{info_sample_log_file}_L1+Ada.+MAD+SensiLR+P4",
    t=f"{info_log_file}_L1+Ada.+MAD+SensiLR+P4.json",
    h=hs[-1],
    predictor=LinearPredictor(),
    par_trafo=ParTrafo(trafos=par_trafos),
    height=900,
)

# here just showing a non-interactive plot to reduce storage
img_file = tempfile.mkstemp(suffix=(".svg"))[1]
fig.write_image(img_file)
display(SVG(img_file))