#!/usr/bin/env python
# coding: utf-8

# # ECDF confidence bands
# 
# This notebook implements the ECDF confidence envelope methods of [Säilynoja et al, 2022](https://doi.org/10.1007/s11222-022-10090-6).
# The paper presents 2 approaches for adjusting the pointwise (i.e. marginal) confidence envelopes so that they contain an ECDF with the desired probability.
# One approach estimates this adjustment using Monte Carlo simulation of ECDFs from the assumed sampling distribution.
# The other uses optimization to compute the adjustment with analytic computation of the probability of an ECDF from the assumed distribution being within the confidence band.
# 
# While the paper considers the case when the target distribution is uniform, the implementations here also work for other discrete and continuous distributions, so long as the user can provide the CDF of the assumed distribution (or, when simulating, a sampling function).
# 
# ## Notes for the generalized implementations
# 
# Consider an ECDF $\hat{F}(x)$ computed from $n$ IID points $\{x_i\}_{i=1}^n$ whose sampling distribution has CDF $F(x)$.
# Let $\{z_j\}_{j=1}^J$ be points at which we will evaluate the ECDF, and let $\hat{F}_j = \hat{F}(z_j)$ and $F_j = F(z_j)$ be the ECDF and CDF, respectively, at the evaluation points.
# Furthermore, let $\hat{r}_j = n\hat{F}_j \in \{0, \ldots, n\}$ be the corresponding counts.
# 
# ### Pointwise confidence bands
# 
# By definition, if $x \sim F$, then $P(x \le z_j) = F_j$. 
# As a result, $I(x \le z_j) \sim \mathrm{Bernoulli}(F_j)$, where $I(\cdot)$ is the indicator function.
# Then, $\hat{r}_j = \sum_{i=1}^n I(x_i \le z_j) \sim \mathrm{Bin}(n, F_j)$, from which we get the marginal (i.e. pointwise) distribution for the ECDF at each evaluation point.
# The pointwise lower and upper bounds are just the symmetric $\alpha$-quantiles of this distribution scaled by $n$:
# $$
# L_j(\alpha) = \frac{1}{n} \mathrm{Bin}^{-1}\left(\frac{\alpha}{2} \mid n, F_j\right)\\
# U_j(\alpha) = \frac{1}{n} \mathrm{Bin}^{-1}\left(1 - \frac{\alpha}{2} \mid n, F_j\right)\\
# P(L_j(\alpha) \le \hat{F}_j \le U_j(\alpha)) \approx 1 - \alpha.
# $$
# All simultaneous (i.e. joint) confidence bands involve finding a $\gamma$ such that the $\gamma$-level pointwise envelope contains the ECDF with probability $1-\alpha$.
# 
# ### Simulated simultaneous confidence bands
# 
# This method proceeds just as in the paper.
# We simulate $m$ ECDFs from the supposed distribution with CDF $F$ and find the pointwise level for the tightest band that contains each ECDF.
# Then $\gamma$ is estimated as the the $\alpha$-quantile of these levels.
# The only difference between our approach and the paper's is that we don't assume the distribution is uniform on the unit interval.
# 
# ### Optimized simultaneous confidence bands
# 
# For the optimized confidence band, we follow the same logic as the paper.
# Let $I_j(\gamma)$ be set of points in the $\gamma$ confidence band of $\hat{r}_j$ at $z_j$.
# We take $\hat{r}_0 = 0$ and $I_0(\gamma) = \{0\}$.
# 
# Given the value $\hat{r}_j$, we want to determine the conditional probability of $\hat{r}_{j+1}$.
# While the paper assumes the distribution is uniform on the unit interval, we let $x \sim F$, and then
# $$P(z_j < x \le z_{j+1} \mid x > z_j) = \frac{F_{j+1} - F_j}{1 - F_j}.$$
# Furthermore, we have $n - \hat{r}_{j}$ remaining "trials".
# So the conditional distribution of the difference is then given by
# $$(\hat{r}_{j+1} - \hat{r}_j) \mid \hat{r}_{j} \sim \mathrm{Bin}\left(n - \hat{r}_j, \frac{F_{j+1} - F_j}{1 - F_j}\right),$$
# and we have the conditional probability
# $$P(\hat{r}_{j+1}=k_{j+1} \mid \hat{r}_j = k_j) = \mathrm{Bin}\left(k_{j+1} - k_j \mid n - k_j, \frac{F_{j+1} - F_j}{1 - F_j}\right).$$
# 
# This can be used to write the joint probability that the entire ECDF is contained within the $\gamma$-level envelope:
# $$p(\gamma) = P(\hat{r}_1 \in I_1(\gamma), \ldots, \hat{r}_J \in I_J(\gamma)) = \sum_{k_0 \in I_0(\gamma), k_1 \in I_1(\gamma), \ldots, k_J \in I_J(\gamma)} \prod_{j=1}^J P(\hat{r}_j = k_j \mid \hat{r}_{j-1} = k_{j-1}).$$
# 
# This looks intimidating but is essentially a sequence of matrix-vector products, where the vectors contain joint probabilities and the matrices contain conditional probabilities.
# To see this, let 
# $$v_j(\gamma) = [P(\hat{r}_1 \in I_1(\gamma), \ldots, \hat{r}_{j-1} \in I_{j-1}(\gamma), \hat{r}_j = k_j)]_{k_j \in I_j(\gamma)},$$
# which is a vector of joint probabilities of each value in the confidence interval at $z_j$ conditioned on $z_j$ being in an ECDF that has up until $z_j$ been contained within the $\gamma$-level pointwise envelope.
# Similarly, given row "indices" $k_{j+1} \in I_{j+1}(\gamma)$ and column "indices" $k_j \in I_j(\gamma)$, we define the matrix of conditional probabilities
# $$R_{j+1}(\gamma) = [P(\hat{r}_{j+1} = k_{j+1} \mid \hat{r}_j = k_j)]_{k_{j+1} \in I_{j+1}(\gamma), k_j \in I_j(\gamma)}.$$
# Then we have the recursion $v_{j+1}(\gamma) = R_{j+1}(\gamma) v_j(\gamma)$, so that $p(\gamma) = 1_{|I_J(\gamma)|}^\top R_{J}(\gamma) R_{J-1}(\gamma) \cdots R_2(\gamma) R_1(\gamma)$.
# 
# Just as in the paper, we can then optimize $\gamma \in (0, \alpha)$ to minimize $|p(\gamma) - (1 - \alpha)|$.
# 
# ## Implementations
# 
# First, we have the general purpose methods for computing an ECDF and pointwise confidence envelope.

# In[1]:


from typing import Tuple

import numpy as np
import scipy.stats


def compute_ecdf(sorted_points: np.ndarray, eval_points: np.ndarray) -> np.ndarray:
    """Evaluate ECDF of the `sorted_points` at the `eval_points`."""
    return np.searchsorted(sorted_points, eval_points, side="right") / len(
        sorted_points
    )


def get_pointwise_confidence_band(
    prob: float, num_draws: int, cdf_at_eval_points: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
    """Compute the `prob`-level pointwise confidence band."""
    prob_lower_tail = (1 - prob) / 2
    prob_upper_tail = 1 - prob_lower_tail
    prob_tails = np.array([[prob_lower_tail, prob_upper_tail]]).T
    prob_lower, prob_upper = (
        scipy.stats.binom.ppf(prob_tails, num_draws, cdf_at_eval_points) / num_draws
    )
    return prob_lower, prob_upper


# The following implements confidence band estimation using simulation.

# In[2]:


def simulate_ecdf(
    eval_points: np.ndarray, num_draws: int, rvs, random_state=None
) -> np.ndarray:
    """Simulate ECDF at the `eval_points` using the given random variable sampler"""
    sample = rvs(num_draws, random_state=random_state)
    sample.sort()
    return compute_ecdf(sample, eval_points)


def fit_pointwise_band_probability(
    cdf_at_eval_points: np.ndarray,
    ecdf_at_eval_points: np.ndarray,
    num_draws: int,
) -> float:
    """Compute the smallest marginal probability of a pointwise confidence band that
    contains the ECDF."""
    ecdf_scaled = (num_draws * ecdf_at_eval_points).astype(int)
    prob_lower_tail = np.amin(
        scipy.stats.binom.cdf(ecdf_scaled, num_draws, cdf_at_eval_points)
    )
    prob_upper_tail = np.amin(
        scipy.stats.binom.sf(ecdf_scaled - 1, num_draws, cdf_at_eval_points)
    )
    prob_pointwise = 1 - 2 * min(prob_lower_tail, prob_upper_tail)

    return prob_pointwise


def simulate_simultaneous_band_probability(
    eval_points: np.ndarray,
    cdf_at_eval_points: np.ndarray,
    num_draws: int,
    rvs,
    prob_target: float = 0.95,
    num_trials: int = 1000,
    random_state=None,
):
    """Estimate probability for simultaneous confidence band using simulation.

    This function simulates the pointwise probability needed to construct pointwise
    confidence bands that form a `prob_target`-level confidence envelope for the ECDF
    of a sample.
    """
    probs = []
    for _ in range(num_trials):
        ecdf_at_eval_points = simulate_ecdf(
            eval_points, num_draws, rvs, random_state=random_state
        )
        prob = fit_pointwise_band_probability(
            cdf_at_eval_points,
            ecdf_at_eval_points,
            num_draws,
        )
        probs.append(prob)

    prob_pointwise = np.quantile(probs, prob_target)
    return prob_pointwise


# Then confidence band estimation using optimization.

# In[3]:


from scipy.optimize import minimize_scalar


def update_interior_probabilities(
    prob_left: np.ndarray,
    interval_left: np.ndarray,
    interval_right: np.ndarray,
    cdf_left: float,
    cdf_right: float,
    n,
) -> np.ndarray:
    """Update the probability that an ECDF has been within the envelope including at the current point.

    Parameters
    ----------
    - `prob_left`: for each point in the interior at the previous point, the joint probability that
       it and all points before are in the interior.
    - `interval_left`: the set of points in the interior at the previous point.
    - `interval_right`: the set of points in the interior at the current point.
    - `cdf_left`: the CDF at the previous point.
    - `cdf_right`: the CDF at the current point.
    - `n`: the number of samples.

    Returns
    -------
    - `prob_right`: for each point in the interior at the current point, the joint probability that
       it and all previous points are in the interior.
    """
    z_tilde = (cdf_right - cdf_left) / (1 - cdf_left)
    interval_left = interval_left[:, np.newaxis]
    prob_conditional = scipy.stats.binom.pmf(
        interval_right, n - interval_left, z_tilde, loc=interval_left
    )
    prob_right = prob_left.dot(prob_conditional)
    return prob_right


def band_optimization_objective(
    prob_pointwise: float,
    cdf_at_eval_points: np.ndarray,
    num_draws: int,
    prob_target: float,
) -> float:
    """Objective function for optimizing the simultaneous confidence band probability."""
    lower, upper = get_pointwise_confidence_band(
        prob_pointwise, num_draws, cdf_at_eval_points
    )
    lower = (lower * num_draws).astype(int)
    upper = (upper * num_draws).astype(int)

    interval_left = np.zeros(1)
    cdf_left = 0
    prob_interior = np.ones(1)
    for i in range(cdf_at_eval_points.shape[0]):
        interval_right = np.arange(lower[i], upper[i] + 1)
        cdf_right = cdf_at_eval_points[i]
        prob_interior = update_interior_probabilities(
            prob_interior, interval_left, interval_right, cdf_left, cdf_right, num_draws
        )
        interval_left = interval_right
        cdf_left = cdf_right
    prob_interior = prob_interior.sum()
    return abs(prob_interior - prob_target)


def optimize_simultaneous_band_probability(
    cdf_at_eval_points: np.ndarray,
    num_draws: int,
    prob_target: float = 0.9,
) -> float:
    cdf_at_eval_points = np.unique(cdf_at_eval_points)
    objective = lambda p: band_optimization_objective(
        p,
        cdf_at_eval_points,
        num_draws,
        prob_target,
    )
    prob_pointwise = minimize_scalar(
        objective, bounds=(prob_target, 1), method="bounded"
    ).x
    return prob_pointwise


# Finally, we just wrap these in a single interface function.

# In[4]:


def confidence_band_probability(
    eval_points: np.ndarray,
    cdf_at_eval_points: np.ndarray,
    num_draws: int,
    method: str = None,
    prob_target: float = 0.95,
    **kwargs,
) -> float:
    if method is None:
        method = "optimized"

    if method == "pointwise":
        prob_pointwise = prob_target
    elif method == "optimized":
        prob_pointwise = optimize_simultaneous_band_probability(
            cdf_at_eval_points,
            num_draws,
            prob_target=prob_target,
        )
    elif method == "simulated":
        rvs = kwargs.pop("rvs")
        prob_pointwise = simulate_simultaneous_band_probability(
            eval_points,
            cdf_at_eval_points,
            num_draws,
            rvs,
            prob_target=prob_target,
            **kwargs,
        )
    else:
        raise ValueError(
            f"Unknown method {method}. Valid options are 'pointwise', 'optimized', and "
            "'simulated'."
        )

    return prob_pointwise


def ecdf_confidence_band(
    eval_points: np.ndarray,
    cdf_at_eval_points: np.ndarray,
    num_draws: int,
    prob_target: float = 0.95,
    method: str = None,
    **kwargs,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Compute the `prob`-level confidence band for the ECDF using the specified `method`."""
    prob_pointwise = confidence_band_probability(
        eval_points,
        cdf_at_eval_points,
        num_draws,
        prob_target=prob_target,
        method=method,
        **kwargs,
    )
    prob_lower, prob_upper = get_pointwise_confidence_band(
        prob_pointwise,
        num_draws,
        cdf_at_eval_points,
    )
    return prob_lower, prob_upper


# ## Checks
# 
# First, let's visually inspect the simulated (orange) and optimized(black) 99% confidence bands for a few distributions and 100 sample ECDFs.
# We expect ~1 ECDF to exit the band in each subplot.
# Since it's more visually informative, let's look at the difference ECDF (obtained by subtracting the true CDF from the ECDF and bands).

# In[5]:


import matplotlib.pyplot as plt

num_draws = 1000
num_ecdfs = 100
num_points = 100
prob_target = 0.99

fig, axs = plt.subplots(2, 2, figsize=(10, 10))
axs = axs.ravel()
random_state = np.random.RandomState(0)

dists = [
    ("Uniform(0, 1)", scipy.stats.uniform(0, 1)),
    ("Normal(0, 1)", scipy.stats.norm(0, 1)),
    ("Log-Normal(0, 1)", scipy.stats.lognorm(loc=0, s=1)),
    ("Poisson(100)", scipy.stats.poisson(100)),
]

for i, (label, dist) in enumerate(dists):
    eval_points = np.linspace(*dist.interval(0.999), num_points)
    cdf = dist.cdf(eval_points)
    lower_opt, upper_opt = ecdf_confidence_band(
        eval_points, cdf, num_draws, prob_target=prob_target, method="optimized"
    )
    lower_sim, upper_sim = ecdf_confidence_band(
        eval_points,
        cdf,
        num_draws,
        prob_target=prob_target,
        method="simulated",
        rvs=dist.rvs,
        random_state=random_state,
        num_trials=10_000,
    )
    axs[i].step(eval_points, lower_opt - cdf, color="black", where="post")
    axs[i].step(eval_points, upper_opt - cdf, color="black", where="post")

    axs[i].step(eval_points, lower_sim - cdf, color="orange", where="post")
    axs[i].step(eval_points, upper_sim - cdf, color="orange", where="post")
    for j in range(num_ecdfs):
        sample = dist.rvs(num_draws, random_state=random_state)
        sample.sort()
        ecdf = compute_ecdf(sample, eval_points)
        ecdf_diff = ecdf - cdf
        axs[i].step(eval_points, ecdf_diff, color="C0", alpha=0.1, where="post")
    axs[i].set_title(label)


# Great, they look the same! Since the optimization method is faster and more numerically stable, we'll focus on it for all subsequent checks.
# 
# Let's define some functions for numerically checking the bands.
# We'll check the optimized envelope for a log-normal distribution (it's more interesting than uniform or normal because it has heavy tails and is asymmetric).

# In[6]:


import tqdm


def check_envelope_coverage_independent_eval_points(
    dist,
    num_draws,
    prob_target,
    num_points=100,
    random_state=None,
    num_trials=10_000,
    prob_estimate=0.99,
):
    eval_points = np.linspace(*dist.interval(0.999), num_points)
    lower, upper = ecdf_confidence_band(
        eval_points,
        dist.cdf(eval_points),
        num_draws,
        prob_target=prob_target,
        method="optimized",
    )

    in_envelope = []
    for i in tqdm.tqdm(range(num_trials)):
        sample = dist.rvs(num_draws, random_state=random_state)
        sample.sort()
        sample_ecdf = compute_ecdf(sample, eval_points)
        in_envelope.append(np.all((lower <= sample_ecdf) & (sample_ecdf <= upper)))

    mean_asymptotic_dist = scipy.stats.norm(
        np.mean(in_envelope), scipy.stats.sem(in_envelope)
    )
    print(
        f"{100*prob_estimate}% confidence interval for the probability of the ECDF being in the "
        f"envelope: {mean_asymptotic_dist.interval(prob_estimate)}"
    )


def check_envelope_coverage_dependent_eval_points(
    dist,
    num_draws,
    prob_target,
    num_points=None,
    scale=1,
    random_state=None,
    num_trials=10_000,
    prob_estimate=0.99,
):
    in_envelope = []
    for i in tqdm.tqdm(range(num_trials)):
        sample = dist.rvs(num_draws, random_state=random_state)
        sample.sort()

        if num_points is None:
            eval_points = sample
        else:
            span = sample[-1] - sample[0]
            padding = span * (scale - 1) / 2
            eval_points = np.linspace(
                sample[0] - padding, sample[-1] + padding, num_points
            )

        lower, upper = ecdf_confidence_band(
            eval_points,
            dist.cdf(eval_points),
            num_draws,
            prob_target=prob_target,
            method="optimized",
        )

        sample_ecdf = compute_ecdf(sample, eval_points)
        in_envelope.append(np.all((lower <= sample_ecdf) & (sample_ecdf <= upper)))

    mean_asymptotic_dist = scipy.stats.norm(
        np.mean(in_envelope), scipy.stats.sem(in_envelope)
    )
    print(
        f"{100*prob_estimate}% confidence interval for the probability of the ECDF being in the "
        f"envelope: {mean_asymptotic_dist.interval(prob_estimate)}"
    )


dist = scipy.stats.lognorm(loc=0, s=1)
prob_target = 0.85
num_draws = 100
num_points = 100


# In the paper, they chose the evaluation points to be independent of the actual sample points.
# We'll arbitrarily set them to be uniformly spaced throughout the 99.9% central interval.
# Because this is very fast, we'll estimate with many trials to get a tighter confidence bound.

# In[7]:


check_envelope_coverage_independent_eval_points(
    dist,
    num_draws,
    prob_target,
    num_points=num_points,
    random_state=np.random.RandomState(834),
    num_trials=100_000,
)


# Let's also check a discrete distribution.

# In[8]:


check_envelope_coverage_independent_eval_points(
    scipy.stats.binom(100, 0.1),
    num_draws,
    0.85,
    num_points=num_points,
    random_state=np.random.RandomState(262),
    num_trials=100_000,
)


# Both look good! And this has the added bonus that if we need the same band for multiple plots, we can compute it once up-front, which is the expensive part, and then reuse it for all plots.
# 
# What if we chose the evaluation points to be the same as the sample points?

# In[9]:


check_envelope_coverage_dependent_eval_points(
    dist,
    num_draws,
    prob_target,
    random_state=np.random.RandomState(834),
    num_trials=10_000,
)


# The fraction of genuine ECDFs that fall in the interior is slightly lower than expected.
# So if we use the sample points as the evaluation points, we may incorrectly reject a perfectly fine ECDF.
# But it's pretty close, and at least it errs on the side of caution.
# 
# Lastly, what if we uniformly space the evaluation points between the sample extrema?

# In[10]:


check_envelope_coverage_dependent_eval_points(
    dist,
    num_draws,
    prob_target,
    num_points=num_points,
    random_state=np.random.RandomState(834),
    num_trials=10_000,
)


# 
# It's slightly better, but the interval still doesn't contain the target probability.
# 
# Lastly, what if we uniformly space the evaluation points over an interval 10% larger than the interval containing the samples?

# In[11]:


check_envelope_coverage_dependent_eval_points(
    dist,
    num_draws,
    prob_target,
    num_points=num_points,
    scale=1.1,
    random_state=np.random.RandomState(834),
    num_trials=10_000,
)


# It looks a little better, but it's hard to tell if these differences are actually meaningful or just random.