Probabilistic Numerical Methods

Probabilistic numerical methods solve numerical problems from linear algebra, optimization, quadrature and differential equations using probabilistic inference. This approach captures uncertainty arising from finite computational resources and from stochastic input.

Even though PN methods return random variables whose distribution represents this uncertainty, they do not necessarily make use of random numbers. To illustrate what PN methods are and how they work, consider the following deliberately simple numerical problem.

In [1]:
# Make inline plots vector graphics instead of raster graphics
%matplotlib inline
from IPython.display import set_matplotlib_formats

set_matplotlib_formats("pdf", "svg")

# Plotting
import matplotlib.pyplot as plt

plt.style.use("../probnum.mplstyle")

A Simple Numerical Problem

Consider the following one-dimensional optimization problem $$\underset{x \in \mathbb{R}}{\operatorname{min}} f(x) = \underset{x \in \mathbb{R}}{\operatorname{min}} \frac{1}{2} ax^2 + bx + c,$$ where $f$ is a quadratic function and we assume $a > 0$. Since $f$ is a strictly convex function there exists a unique minimum given by $$f'(x_*) = 0 \iff x_*=-\frac{b}{a}.$$

In [2]:
import numpy as np
import probnum as pn

# Random number generator -- fix a seed.
rng = np.random.default_rng(seed=1)

# Quadratic objective function
a = 2
b = -1.0
c = 3.0
x_opt = -b / a


def f(x):
    return 0.5 * a * x ** 2 + b * x + c

Now suppose we are not given access to the coefficients, but only the ability to evaluate $f$ at arbitrary points. We can then evaluate the objective three times and solve the resulting $3 \times 3$ linear system to find the parameters $a$, $b$ and $c$.
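
For instance, assuming we could evaluate $f$ exactly (the noise-free case), a minimal sketch of this recovery might look as follows; the evaluation points and the names x_eval, V, coeffs, x_star are illustrative choices, not part of the original example.

# Sketch: recover (a, b, c) from three exact evaluations of f (noise-free case).
x_eval = np.array([-1.0, 0.0, 1.0])  # arbitrary, pairwise distinct evaluation points

# Each evaluation yields one linear equation: 0.5 * a * x^2 + b * x + c = f(x).
V = np.stack([0.5 * x_eval ** 2, x_eval, np.ones_like(x_eval)], axis=1)
coeffs = np.linalg.solve(V, f(x_eval))  # recovers [a, b, c]
x_star = -coeffs[1] / coeffs[0]         # minimum at -b / a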

Introducing Noise

However, what happens if we go one step further and assume that the parameters $(a,b,c)$, and therefore the function evaluations, are corrupted by noise? This is often the case in practice, where $f$ might describe a complex physical system or depend on data. Can we still design a computationally efficient algorithm which finds the minimum?

Suppose we only have access to noisy evaluations $$\hat{y} = \hat{f}(x) = \frac{1}{2}(a + \varepsilon_a)x^2 + (b+\varepsilon_b)x + (c + \varepsilon_c)$$ of the objective function $f$ at $x$, where $\varepsilon = (\varepsilon_a, \varepsilon_b, \varepsilon_c) \sim \mathcal{N}(0, \Lambda)$ and $\Lambda \in \mathbb{R}^{3 \times 3}$ symmetric positive definite.

Remark: The $n$-dimensional analogue of this case arises for example in supervised (deep) learning. In large-scale empirical risk minimization the available training data often does not fit into memory. Therefore, during training only batches of data are considered, which induces noise on the objective function and its gradient. Here $\hat{f}$ is analogous to the empirical risk on a given batch.
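
As a rough illustration of this analogy (the dataset, loss and all names below are made up for this sketch and are not part of the example above), the risk computed on a random batch is a noisy estimate of the full empirical risk:

# Hypothetical illustration: batch risk as a noisy estimate of the full risk.
rng_demo = np.random.default_rng(0)  # separate generator, so the tutorial's random stream is untouched
data = rng_demo.normal(loc=1.0, scale=2.0, size=1000)  # synthetic "training data"

def empirical_risk(w, samples):
    # Squared-error risk of a scalar parameter w on the given samples.
    return np.mean((samples - w) ** 2)

full_risk = empirical_risk(0.0, data)
batch_risks = [empirical_risk(0.0, rng_demo.choice(data, size=32)) for _ in range(5)]
# The batch risks scatter around full_risk -- evaluating only on batches induces noise.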

In [3]:
from probnum import randvars

# Noisy objective function
Lambda = np.diag([0.1, 0.05, 0.01])
eps = randvars.Normal(np.zeros(3), Lambda)


def f_hat(x, rng, noise=None):
    if noise is None:
        noise = eps.sample(rng=rng)
    return 0.5 * (a + noise[0]) * x ** 2 + (b + noise[1]) * x + c + noise[2]
In [4]:
"""Plot objective and noisy function evaluations."""
from mpl_toolkits.mplot3d import Axes3D

# Plot setup
n_samples = 20
xx = np.linspace(x_opt - 1, x_opt + 1, 1000)

fig = plt.figure(figsize=[10, 6])
gs = fig.add_gridspec(ncols=2, nrows=2, width_ratios=[5, 1], height_ratios=[2.5, 1])

ax00 = fig.add_subplot(gs[0, 0])
ax10 = fig.add_subplot(gs[1, 0], sharex=ax00)
ax01 = fig.add_subplot(gs[0, 1], sharey=ax00)
ax11 = fig.add_subplot(gs[1, 1], projection="3d")
ylim00 = [2.25, 3.75]
ylim10 = [-0.25, 1.1]

# Function
ax00.plot(xx, f(xx), label="$f$", color="C0")

# Parameters
ax11.scatter(a, b, c, color='C0', label="$(a, b, c)$")

# Function range
ax01.scatter(ylim10[1] + ylim10[0], f(x_opt), label="$\min f(x)$")
ax01.hlines(y=f(x_opt), xmin=ylim10[0], xmax=ylim10[1] + ylim10[0], zorder=-1, color="C0")

# Domain
ax10.scatter(x_opt, 0, label="$\operatorname{argmin} f(x)$")
ax10.vlines(x_opt, ymin=0, ymax=ylim10[1], zorder=-1, color="C0")

# Samples
for i in range(n_samples):
    noise = eps.sample(rng=rng)
    x_opt_noise = -(b + noise[1]) / (a + noise[0])
    if i == 0:
        sample_label = "$\\hat{f}_i$"
        sample_min_label = "$\min \\hat{f}_i$"
        sample_argmin_label = "$\operatorname{argmin} \hat{f}_i(x)$"
    else:
        sample_label = None
        sample_argmin_label = None
        sample_min_label = None
    ax00.plot(xx, f_hat(xx, rng, noise), color="C0", alpha=0.2, label=sample_label)
    ax10.scatter(x_opt_noise, 0, color="C0", alpha=0.2, label=sample_argmin_label)
    ax01.scatter(
        ylim10[1] + ylim10[0],
        f_hat(x_opt_noise, rng, noise),
        color="C0",
        alpha=0.2,
        label=sample_min_label,
    )
    ax11.scatter(a + noise[0], b + noise[1], c + noise[2], color='C0', alpha=0.2)

# Adjust axis visibility, labels and legends
ax00.get_xaxis().set_visible(False)
ax10.get_yaxis().set_visible(False)
ax01.get_yaxis().set_visible(False)
ax01.get_xaxis().set_visible(False)
ax11.get_xaxis().set_ticklabels([])
ax11.get_yaxis().set_ticklabels([])
ax11.get_zaxis().set_ticklabels([])

ax00.set_ylim(ylim00)
ax00.set_xlim([x_opt - 1, x_opt + 1])
ax10.set_ylim(ylim10)
ax01.set_xlim(ylim10)
ax11.set_xlim([a - 1, a + 1])
ax11.set_ylim([b - 1, b + 1])
ax11.set_zlim([c - 1, c + 1])
ax00.set_ylabel("$y$")
ax10.set_xlabel("$x$")
ax11.set_xlabel("$a$", labelpad=-12)
ax11.set_ylabel("$b$", labelpad=-12)
ax11.set_zlabel("$c$", labelpad=-12)
ax00.legend(loc="upper center")
ax01.legend(loc="upper center", fontsize=10, labelspacing=0)
ax10.legend(loc="upper right", fontsize=10, labelspacing=0)

plt.show()
[Figure: the objective $f$ together with noisy evaluations $\hat{f}_i$ (top left), their minimizers (bottom left) and minimum values (right), and the corresponding noisy parameter samples $(a, b, c)$ (bottom right).]

This makes the optimization problem considerably more difficult depending on the degree of noise. Can we still come up with a computationally efficient and accurate algorithm that makes use of the information we have about the problem?

In fact, we can. We will design a custom probabilistic numerical method.

Components of PN Methods

We will now build up a probabilistic numerical method from its individual components to solve the given optimization problem. In a nutshell, a PN method consists of prior knowledge about the problem (e.g. differentiability) and a policy, which returns an action that collects information about the problem, resulting in observations. The prior belief is updated with these observations using statistical inference techniques, leading to a posterior belief over the numerical quantity in question. Finally, a stopping criterion determines when to terminate.
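
Schematically, these components fit together in a simple loop. The sketch below is purely conceptual and not probnum's API; policy, observe, update_belief and stopping_criterion are placeholder callables.

# Conceptual skeleton of a PN method (placeholder names, not probnum's API).
def probabilistic_numerical_method(prior, policy, observe, update_belief, stopping_criterion):
    belief = prior  # prior knowledge about the numerical quantity
    while not stopping_criterion(belief):
        action = policy(belief)                              # decide where / how to probe the problem
        observation = observe(action)                        # collect (possibly noisy) information
        belief = update_belief(belief, action, observation)  # statistical inference step
    return belief  # posterior belief over the quantity of interest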

Prior Knowledge

A naive strategy to solve the problem, given only the ability to evaluate the noisy function $\hat{f}$, could be to simply choose random points on the real axis and return the point with the lowest observed function value. However, this is clearly suboptimal: we might get unlucky and, simply by chance, observe a low function value at an $x$ far away from the minimum. Crucially, it also completely ignores the information we have, and collect, about the problem. For example, we know the latent function is quadratic.
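
For concreteness, such a naive strategy could be sketched as follows; the number of evaluations, the search interval and the names rng_search, x_naive are arbitrary choices made for this illustration.

# Naive baseline: evaluate the noisy objective at random points, keep the best one.
rng_search = np.random.default_rng(42)  # separate generator for this illustration
x_candidates = rng_search.uniform(low=-2.0, high=2.0, size=10)
y_candidates = np.array([f_hat(x, rng_search) for x in x_candidates])
x_naive = x_candidates[np.argmin(y_candidates)]  # may land far from the true minimum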

Further, we might have some idea of where the minimum of the function $f$ lies. This could come from experience, a physical application from which the problem arises or related optimization problems we have solved in the past.

We can combine these two forms of prior information, inherent properties of the problem and a belief over its solution, into a prior distribution over the coefficients of the quadratic function. This induces a belief over the optimal function value $f(x_*)$. Here, we assume that we are reasonably certain about the value of $a$ but less so about $b$ and $c$.

In [5]:
# Prior on parameters: a, b, c
mu = np.array([1.8, 0.0, 3.5])
Sigma = np.array([[0.1, 0.0, 0.0],
                  [0.0, 0.4, 0.0],
                  [0.0, 0.0, 0.2]])
fun_params0 = randvars.Normal(mean=mu, cov=Sigma)
x0 = -fun_params0.mean[1] / fun_params0.mean[0]

# Induced belief over optimal function value
def fun_opt0(fun_params0, x):
    x = np.asarray(x).reshape(-1, 1)
    return np.hstack((0.5 * x ** 2, x, np.ones_like(x))) @ fun_params0
In [6]:
"""Plot objective, noisy function evaluations and prior knowledge."""

# Evaluate some quantities to plot
yy = np.linspace(ylim00[0], ylim00[1], 1000)
fun_opt0_mean = fun_opt0(fun_params0, xx).mean
fun_opt0_std = np.sqrt(fun_opt0(fun_params0, xx).var)
fun0_pdf = fun_opt0(fun_params0, x0).pdf(yy)
fun0_pdf_max = np.max(fun0_pdf)

# Plot setup
fig = plt.figure(figsize=[10, 6])
gs = fig.add_gridspec(ncols=2, nrows=2, width_ratios=[5, 1], height_ratios=[2.5, 1])

ax00 = fig.add_subplot(gs[0, 0])
ax10 = fig.add_subplot(gs[1, 0], sharex=ax00)
ax01 = fig.add_subplot(gs[0, 1], sharey=ax00)
ax11 = fig.add_subplot(gs[1, 1], projection="3d")

# Function
ax00.plot(xx, f(xx), label="$f$", color="C0")
ax00.plot(xx, fun_opt0_mean, color="C1", label="$p(f)$")
ax00.fill_between(
    x=xx,
    y1=fun_opt0_mean - 2 * fun_opt0_std,
    y2=fun_opt0_mean + 2 * fun_opt0_std,
    color="C1",
    alpha=0.2,
)

# Parameters
def gaussian_credible_region_3d(mean, cov, alpha=0.95):
    """Compute the x,y,z coordinates of a 3D Gaussian credible region."""
    from scipy.stats import chi2
    
    # Determine sphere radius (https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Interval)
    dim = 3
    radius = np.sqrt(chi2.ppf(alpha, df=dim))
    
    # Cholesky decomposition
    L = np.linalg.cholesky(cov)
    
    # Sphere of the given radius parametrized in spherical coordinates
    u = np.linspace(0.0, 2.0 * np.pi, 100)
    v = np.linspace(0.0, np.pi, 100)
    x = radius * np.outer(np.cos(u), np.sin(v))
    y = radius * np.outer(np.sin(u), np.sin(v))
    z = radius * np.outer(np.ones_like(u), np.cos(v))

    # Affine transformation
    for i in range(len(x)):
        for j in range(len(x)):
            [x[i,j],y[i,j],z[i,j]] = np.dot(L, [x[i,j],y[i,j],z[i,j]]) + mean

    return x, y, z

x, y, z = gaussian_credible_region_3d(mean=fun_params0.mean, cov=fun_params0.cov)
ax11.plot_surface(x, y, z, color='C1', alpha=0.3, linewidth=1)
ax11.scatter(a, b, c, color='C0')

# Function range
ax01.scatter(ylim10[1] + ylim10[0], f(x_opt), label="$\min f(x)$")
ax01.hlines(y=f(x_opt), xmin=ylim10[0], xmax=ylim10[1] + ylim10[0], zorder=-1, color="C0")
ax01.plot(ylim10[1] + ylim10[0] - fun0_pdf / fun0_pdf_max, yy, color="C1", label="$p(f_*)$")
ax01.fill_betweenx(
    y=yy,
    x1=ylim10[1] + ylim10[0] - fun0_pdf / fun0_pdf_max,
    x2=ylim10[1] + ylim10[0],
    color="C1",
    alpha=0.1,
)

# Domain
ax10.scatter(x_opt, 0, label="$\operatorname{argmin} f(x)$")
ax10.vlines(x_opt, ymin=0, ymax=ylim10[1], zorder=-1, color="C0")
ax10.scatter(x0, 0, color="C1", label="$- \\frac{\\mathbb{E}[b]}{\\mathbb{E}[a]}$")
ax10.vlines(x0, ymin=0, ymax=ylim10[1], zorder=-1, color="C1")

# Samples
for i in range(n_samples):
    noise = eps.sample(rng=rng)
    x_opt_noise = -(b + noise[1]) / (a + noise[0])
    if i == 0:
        sample_label = "$\\hat{f}_i$"
        sample_min_label = "$\min \\hat{f}_i$"
        sample_argmin_label = "$\operatorname{argmin} \hat{f}_i(x)$"
    else:
        sample_label = None
        sample_argmin_label = None
        sample_min_label = None
    ax00.plot(xx, f_hat(xx, rng, noise), color="C0", alpha=0.2, label=sample_label)
    ax10.scatter(x_opt_noise, 0, color="C0", alpha=0.2, label=sample_argmin_label)
    ax01.scatter(
        ylim10[1] + ylim10[0],
        f_hat(x_opt_noise, rng, noise),
        color="C0",
        alpha=0.2,
        label=sample_min_label,
    )
    ax11.scatter(a + noise[0], b + noise[1], c + noise[2], color='C0', alpha=0.2)

# Adjust axis visibility, labels and legends
ax00.get_xaxis().set_visible(False)
ax10.get_yaxis().set_visible(False)
ax01.get_yaxis().set_visible(False)
ax01.get_xaxis().set_visible(False)
ax11.get_xaxis().set_ticklabels([])
ax11.get_yaxis().set_ticklabels([])
ax11.get_zaxis().set_ticklabels([])

ax00.set_ylim(ylim00)
ax00.set_xlim([x_opt - 1, x_opt + 1])
ax10.set_ylim(ylim10)
ax01.set_xlim(ylim10)
ax11.set_xlim([a - 1, a + 1])
ax11.set_ylim([b - 1, b + 1])
ax11.set_zlim([c - 1, c + 1])
ax00.set_ylabel("$y$")
ax10.set_xlabel("$x$")
ax11.set_xlabel("$a$", labelpad=-12)
ax11.set_ylabel("$b$", labelpad=-12)
ax11.set_zlabel("$c$", labelpad=-12)
ax00.legend(loc="upper center")
ax01.legend(loc="upper center", fontsize=10, labelspacing=0)
ax10.legend(loc="upper right", fontsize=10, labelspacing=0)

plt.show()
[Figure: as above, now with the prior belief $p(f)$ and its credible band over the objective (top left), the induced belief $p(f_*)$ over the minimum value (right), the prior guess $-\mathbb{E}[b]/\mathbb{E}[a]$ for the minimizer (bottom left), and a credible region over the parameters $(a, b, c)$ (bottom right).]