- Goals
- Introduction to Bayesian (i.e., probabilistic) modeling

- Materials
- Mandatory
- These lecture notes

- Optional
- Bishop pp. 68-74 (on the coin toss example)
- Ariel Caticha - 2012 - Entropic Inference and the Foundations of Physics, pp.35-44 (section 2.9, on deriving Bayes rule for updating probabilities)
- David Blei - 2014 - Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models, on the *Build-Compute-Critique-Repeat* design model.


**Problem**: We observe the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$ What is the probability that heads comes up next?

**Solution**: later in this lecture.

- Suppose that your application is to predict a future observation $x$, based on $N$ past observations $D=\{x_1,\dotsc,x_N\}$.

- The Bayesian design approach to solving this task involves four stages:

REPEAT

1. Model specification

2. Parameter estimation (i.e., learning from an observed data set using Bayesian inference)

3. Model evaluation (how good is this (trained) model?)

UNTIL model performance is satisfactory

4. Apply the model, e.g., for prediction or classification of new data

- In principle, based on the model evaluation results, you may want to re-specify your model and *repeat* the design process (a few times), until model performance is acceptable.

- Next, we discuss these four stages in a bit more detail.

- Your first task is to propose a probabilistic model ($m$) for generating the observations $x$.

- A probabilistic model $m$ consists of a joint distribution $p(x,\theta|m)$ that relates observations $x$ to model parameters $\theta$. Usually, the model is proposed in the form of a data generating distribution $p(x|\theta,m)$ and a prior $p(\theta|m)$.

*You* are responsible for choosing the data generating distribution $p(x|\theta)$ based on your physical understanding of the data generating process. (For simplicity, we drop the conditioning on $m$ from the notation.)

*You* must also choose the prior $p(\theta)$ to reflect what you know about the parameter values before you see the data $D$.

- Note that, for a given data set $D=\{x_1,x_2,\dots,x_N\}$ with *independent* observations $x_n$, the likelihood factorizes as $$ p(D|\theta) = \prod_{n=1}^N p(x_n|\theta)\,,$$ so usually you select a model for generating one observation $x_n$ and then use (in-)dependence assumptions to combine these models into a likelihood function for the model parameters.

- The likelihood and prior both contain information about the model parameters. Next, you use Bayes rule to fuse these two information sources into a posterior distribution for the parameters, $$ \underbrace{p(\theta|D) }_{\text{posterior}}= \frac{p(D|\theta) p(\theta)}{p(D)} \propto \underbrace{p(D|\theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}} $$

- Note that there is **no need for you to design some clever parameter estimation algorithm**. Bayes rule *is* the parameter estimation algorithm. The only complexity lies in the computational issues!

This "recipe" works only if the right-hand side (RHS) factors can be evaluated; the computational details can be quite challenging and this is what machine learning is about.
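The recipe can be sketched numerically. The following minimal example (a hypothetical Bernoulli coin model with a uniform prior, discretized on a grid; these modeling choices are illustrative assumptions, not part of the lecture) evaluates the RHS factors and normalizes:

```julia
# Bayes rule on a discretized parameter grid (illustrative sketch).
# Assumed model: x_n ~ Bernoulli(θ), i.i.d., with a uniform prior over θ.
θ_grid = range(0.01, stop=0.99, length=99)          # discretized parameter space
prior = fill(1/length(θ_grid), length(θ_grid))      # uniform prior p(θ)
D = [1, 0, 1, 1, 0, 0, 1]                           # observed data (1 = heads)
likelihood(θ) = prod(θ^x * (1-θ)^(1-x) for x in D)  # p(D|θ) for independent x_n
unnormalized = likelihood.(θ_grid) .* prior         # p(D|θ) p(θ)
posterior = unnormalized ./ sum(unnormalized)       # normalize to get p(θ|D)
θ_grid[argmax(posterior)]                           # posterior mode, near 4/7
```

Note that no estimation algorithm had to be designed: the posterior follows mechanically from multiplying likelihood and prior and normalizing.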

$\Rightarrow$ **Machine learning is EASY, apart from computational details :)**

- In the framework above, parameter estimation was executed by "perfect" Bayesian reasoning. So is everything settled now?

- No, there appears to be one remaining problem: how good really were our model assumptions $p(x|\theta)$ and $p(\theta)$? We want to "score" the model performance.

- Note that this question is only interesting if we have alternative models to choose from.

- Let's assume that we have a set of candidate models, say $\mathcal{M} = \{m_1,\ldots,m_K\}$, where each model $m_k$ comes with its own prior $p(\theta|m_k)$ and likelihood $p(D|\theta,m_k)$. Can we evaluate the relative performance of one model against another model from the set?

- Start again with **model specification**. *You* must now specify a prior $p(m_k)$ (next to the likelihood $p(D|\theta,m_k)$ and prior $p(\theta|m_k)$) for each of the models and then solve the desired inference problem:

$$\begin{align*} \underbrace{p(m_k|D)}_{\substack{\text{model}\\\text{posterior}}} &= \frac{p(D|m_k) p(m_k)}{p(D)} \\ &\propto p(m_k) \cdot p(D|m_k) \\ &= p(m_k)\cdot \int_\theta p(D,\theta|m_k) \,\mathrm{d}\theta\\ &= \underbrace{p(m_k)}_{\substack{\text{model}\\\text{prior}}}\cdot \underbrace{\int_\theta \underbrace{p(D|\theta,m_k)}_{\text{likelihood}} \,\underbrace{p(\theta|m_k)}_{\text{prior}}\, \mathrm{d}\theta }_{\substack{\text{evidence }p(D|m_k)\\\text{= model likelihood}}}\\ \end{align*}$$

- This procedure is called **Bayesian model comparison**, which requires that you calculate the "evidence" (= model likelihood).

- $\Rightarrow$ In a Bayesian framework, **model estimation** follows the same recipe as parameter estimation; it just works at one higher hierarchical level. Compare the required calculations: $$\begin{align*} \text{parameter estimation: } \quad p(\theta|D) &\propto p(D|\theta)\, p(\theta) \\ \text{model estimation: } \quad p(m_k|D) &\propto p(D|m_k)\, p(m_k) \end{align*}$$

- In principle, you could proceed with asking how good your choice for the candidate model set $\mathcal{M}$ was. You would have to provide a set of alternative model sets $\{\mathcal{M}_1,\mathcal{M}_2,\ldots,\mathcal{M}_M\}$ with priors $p(\mathcal{M}_m)$ for each set and compute posteriors $p(\mathcal{M}_m|D)$. And so forth ...

- With the (relative) performance evaluation scores of your model in hand, you could now re-specify your model (hopefully an improved one) and *repeat* the design process until the model performance score is acceptable.

- As an aside, in the (statistics and machine learning) literature, performance comparison between two models is often reported by the Bayes Factor, which is defined as the ratio of model evidences:
$$\begin{align*}
\underbrace{\frac{p(D|m_1)}{p(D|m_2)}}_{\text{Bayes Factor}} &= \frac{\frac{p(D,m_1)}{p(m_1)}}{\frac{p(D,m_2)}{p(m_2)}} \\
&= \frac{p(D,m_1)}{p(m_1)} \cdot \frac{p(m_2)}{p(D,m_2)} \\
&= \frac{p(m_1|D) p(D)}{p(m_1)} \cdot \frac{p(m_2)}{p(m_2|D) p(D)} \\
&= \underbrace{\frac{p(m_1|D)}{p(m_2|D)}}_{\substack{\text{posterior} \\ \text{ratio}}} \cdot \underbrace{\frac{p(m_2)}{p(m_1)}}_{\substack{\text{prior} \\ \text{ratio}}}
\end{align*}$$
- Hence, for equal model priors ($p(m_1)=p(m_2)=0.5$), the Bayes Factor equals the posterior probability ratio for the two models.
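The Bayes Factor is easy to compute for conjugate models. The sketch below uses the closed-form evidence of a beta-Bernoulli coin model, $p(D|m) = B(n+\alpha,\,N-n+\beta)/B(\alpha,\beta)$ with $B(\cdot,\cdot)$ the beta function (a standard result, not derived in these notes); the two candidate models and their hyperparameters are hypothetical choices for illustration:

```julia
# Bayes Factor for two coin models on D = {hthhtth} (N=7 tosses, n=4 heads).
# Evidence of a beta-Bernoulli model: p(D|m) = B(n+α, N-n+β) / B(α, β).
# For integer a, b: B(a, b) = (a-1)!(b-1)! / (a+b-1)!
B(a, b) = factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)
evidence(n, N, α, β) = B(n + α, N - n + β) / B(α, β)
N, n = 7, 4
ev1 = evidence(n, N, 1, 1)    # m1: uniform prior Beta(1,1) → evidence 1/280
ev2 = evidence(n, N, 5, 5)    # m2: fair-coin-favoring prior Beta(5,5)
bayes_factor = ev1 / ev2      # < 1 here: the data slightly favor m2
```

With equal model priors, this number directly reports the posterior odds of $m_1$ versus $m_2$.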

- Once we are satisfied with the evidence for a (trained) model, we can apply the model to our prediction/classification/etc task.

- Given the data $D$, our knowledge about the yet unobserved datum $x$ is captured by (everything is conditioned on the selected model) $$\begin{align*} p(x|D) &\stackrel{s}{=} \int p(x,\theta|D) \,\mathrm{d}\theta\\ &\stackrel{p}{=} \int p(x|\theta,D) p(\theta|D) \,\mathrm{d}\theta\\ &\stackrel{m}{=} \int \underbrace{p(x|\theta)}_{\text{data generation dist.}} \cdot \underbrace{p(\theta|D)}_{\text{posterior}} \,\mathrm{d}\theta\\ \end{align*}$$

- In the last equation, the simplification $p(x|\theta,D) = p(x|\theta)$ follows from our model specification. We assumed a *parametric* data generating distribution $p(x|\theta)$ with no explicit dependency on the data set $D$.

- Again, **no need to invent a special prediction algorithm**. Probability theory takes care of all that. The complexity of prediction is just computational: how to carry out the marginalization over $\theta$.

- Note that the application of the learned posterior $p(\theta|D)$ does not necessarily have to be a prediction task. We use prediction here as an example, but other applications (e.g., classification, regression, etc.) are of course also possible.

- What did we learn from $D$? Without access to $D$, we would predict new observations through $$ p(x) = \int p(x,\theta) \,\mathrm{d}\theta = \int p(x|\theta) \cdot \underbrace{p(\theta)}_{\text{prior}} \,\mathrm{d}\theta $$

- When you have a posterior $p(m_k|D)$ for the models, you don't *need* to choose one model for the prediction task. You can do prediction by **Bayesian model averaging** to utilize the predictive power of all models: $$\begin{align*} p(x|D) &= \sum_k \int p(x,\theta,m_k|D)\,\mathrm{d}\theta \\ &= \sum_k \int p(x|\theta,m_k) \,p(\theta|m_k,D)\, p(m_k|D) \,\mathrm{d}\theta \\ &= \sum_k \underbrace{p(m_k|D)}_{\substack{\text{model}\\\text{posterior}}} \cdot \int \underbrace{p(\theta|m_k,D)}_{\substack{\text{parameter}\\\text{posterior}}} \, \underbrace{p(x|\theta,m_k)}_{\substack{\text{data generating}\\\text{distribution}}} \,\mathrm{d}\theta \end{align*}$$
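Model averaging can be sketched for the coin example. The snippet below (hypothetical beta-Bernoulli models; it relies on the closed-form evidence $B(n+\alpha,N-n+\beta)/B(\alpha,\beta)$ and on the posterior mean $\frac{n+\alpha}{N+\alpha+\beta}$ for the inner integral, both standard beta-Bernoulli results) averages the per-model predictions with the model posterior as weights:

```julia
# Bayesian model averaging of p(x=h|D) over two beta-Bernoulli coin models.
# Per model k: evidence p(D|m_k) = B(n+α,N-n+β)/B(α,β), and the inner integral
# ∫ p(x=h|μ) p(μ|m_k,D) dμ equals the posterior mean (n+α)/(N+α+β).
B(a, b) = factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)
evidence(n, N, α, β) = B(n + α, N - n + β) / B(α, β)
N, n = 7, 4                                   # D = {hthhtth}
models = [(1, 1), (5, 5)]                     # hyperparameters (α, β) per model
prior_m = [0.5, 0.5]                          # equal model priors p(m_k)
ev = [evidence(n, N, α, β) for (α, β) in models]
post_m = prior_m .* ev ./ sum(prior_m .* ev)  # model posterior p(m_k|D)
pred = [(n + α) / (N + α + β) for (α, β) in models]  # per-model p(x=h|m_k,D)
p_heads = sum(post_m .* pred)                 # model-averaged prediction
```

The averaged prediction lies between the two per-model predictions, weighted by how much the data support each model.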

- Alternatively, if you do need to work with one model (e.g., due to computational resource constraints), you can for instance select the model with the largest posterior $p(m_k|D)$ and use that model for prediction. This is called **Bayesian model selection**.

- Bayesian model evidence is well known as an excellent criterion for scoring the performance of models. Here, we derive a decomposition of the model evidence that relates the evidence to other valued criteria such as **accuracy** and **model complexity**.

- Consider a model $p(x,\theta|m)$ and a data set $D = \{x_1,x_2, \ldots,x_N\}$.

- Given the data set $D$, the log-evidence for model $m$ decomposes as follows: $$ \underbrace{\log p(D|m)}_{\text{log-evidence}} = \underbrace{\int p(\theta|D,m) \log p(D|\theta,m) \,\mathrm{d}\theta}_{\text{data fit}} - \underbrace{\int p(\theta|D,m) \log \frac{p(\theta|D,m)}{p(\theta|m)} \,\mathrm{d}\theta}_{\text{complexity}} $$ (To verify, substitute Bayes rule $p(D|\theta,m) = \frac{p(\theta|D,m)\, p(D|m)}{p(\theta|m)}$ into the data fit term.)

- The first term (data fit) measures how well the model predicts the data $D$, after having learned from the data. We want this term to be large (although focusing only on this term could lead to *overfitting*).

- The second term (complexity) quantifies the amount of information that the model absorbed through learning, i.e., by moving parameter beliefs from $p(\theta|m)$ to $p(\theta|D,m)$.
- To see this, note that the *mutual information* between two variables $\theta$ and $D$ is defined as $$ I[\theta;D] = \iint p(\theta,D) \log \frac{p(\theta|D)}{p(\theta)} \,\mathrm{d}\theta\, \mathrm{d}D\,, $$ so the complexity term is the inner integral (over $\theta$) of the mutual information, evaluated for the observed data set $D$.

- The complexity term regularizes the Bayesian learning process automatically. If you prefer models with high Bayesian evidence, then you prefer models that achieve a good data fit without needing to learn much from the data set. These types of models are said to *generalize* well, since they can be applied to different data sets without specific adaptations for each data set.

- $\Rightarrow$ Bayesian learning automatically leads to models that generalize well. **There is no need for early stopping or validation data sets**. Just learn on the data set and everything behaves well.
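The identity log-evidence = data fit $-$ complexity can be checked numerically. The sketch below (a hypothetical Bernoulli coin model with a uniform prior, all integrals approximated by grid quadrature) computes both sides:

```julia
# Numerical check of: log p(D|m) = data fit − complexity,
# for a Bernoulli model with a uniform prior (grid quadrature sketch).
θ = range(1e-4, stop=1 - 1e-4, length=10_000)
Δ = step(θ)
N, n = 7, 4                                        # D = {hthhtth}
loglik(t) = n * log(t) + (N - n) * log(1 - t)      # log p(D|θ)
prior = fill(1.0, length(θ))                       # uniform prior density p(θ) = 1
post_unnorm = exp.(loglik.(θ)) .* prior
Z = sum(post_unnorm) * Δ                           # evidence p(D|m), ≈ 1/280 here
post = post_unnorm ./ Z                            # posterior density p(θ|D)
datafit = sum(post .* loglik.(θ)) * Δ              # E_posterior[log p(D|θ)]
complexity = sum(post .* log.(post ./ prior)) * Δ  # KL(posterior ‖ prior)
log(Z) ≈ datafit - complexity                      # holds up to rounding error
```

The complexity term is the KL divergence from prior to posterior, i.e., the information the model absorbed through learning.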

- The Bayesian design process provides a unified framework for the Scientific Inquiry method. We can now add equations to the design loop. (Trial design to be discussed in Intelligent Agent lesson.)

We observe the following sequence of heads ($h$) and tails ($t$) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$

What is the probability that heads comes up next? We solve this in the next slides ...

- We observe a sequence of $N$ coin tosses $D=\{x_1,\ldots,x_N\}$ with $n$ heads.

- Let us denote outcomes by $$x_k = \begin{cases} h & \text{if heads comes up} \\ t & \text{if tails} \end{cases} $$

- Assume a **Bernoulli** distributed variable $p(x_k=h|\mu)=\mu$, which leads to a **binomial** distribution for the likelihood (with $n$ the number of times heads was thrown): $$ p(D|\mu) = \prod_{k=1}^N p(x_k|\mu) = \mu^n (1-\mu)^{N-n} $$

- Assume the prior belief is governed by a **beta distribution** $$ p(\mu) = \mathrm{Beta}(\mu|\,\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \mu^{\alpha-1} (1-\mu)^{\beta-1}\,, $$ where the Gamma function is sort of a generalized factorial function. In particular, if $\alpha$ and $\beta$ are integers, then $$\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)} = \frac{(\alpha+\beta-1)!}{(\alpha-1)!\,(\beta-1)!}$$

- A *what* distribution? Yes, the **beta distribution** is a **conjugate prior** for the binomial distribution, which means that $$ \underbrace{\text{beta}}_{\text{posterior}} \propto \underbrace{\text{binomial}}_{\text{likelihood}} \times \underbrace{\text{beta}}_{\text{prior}} $$ so we get a closed-form posterior.
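The conjugacy claim admits a quick numerical sanity check (a sketch; the data and hyperparameter values are arbitrary, and integer $\alpha,\beta$ are assumed so the normalization can use factorials):

```julia
# Check: grid-normalized (binomial likelihood × beta prior) matches Beta(n+α, N-n+β).
# Beta pdf for integer shape parameters, via the factorial form of the Gamma ratio.
beta_pdf(t, a, b) = t^(a-1) * (1-t)^(b-1) * factorial(a+b-1) / (factorial(a-1) * factorial(b-1))
N, n, α, β = 7, 4, 3, 2                           # arbitrary example values
θ = range(0.001, stop=0.999, length=999)
unnorm = (θ .^ n) .* ((1 .- θ) .^ (N - n)) .* beta_pdf.(θ, α, β)
post_grid = unnorm ./ (sum(unnorm) * step(θ))     # normalized grid posterior
post_exact = beta_pdf.(θ, n + α, N - n + β)       # conjugate closed form
maximum(abs.(post_grid .- post_exact))            # small (quadrature error only)
```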

- $\alpha$ and $\beta$ are called **hyperparameters**, since they parameterize the distribution for another parameter ($\mu$). E.g., $\alpha=\beta=1$ leads to a uniform prior for $\mu$.

- (Bishop Fig. 2.2) Plots of the beta distribution $\mathrm{Beta}(\mu|a, b)$ as a function of $\mu$ for various values of the hyperparameters $a$ and $b$.

- Infer the posterior PDF over $\mu$ through Bayes rule: $$ p(\mu|D) \propto p(D|\mu)\, p(\mu) \propto \mu^{n}(1-\mu)^{N-n} \cdot \mu^{\alpha-1}(1-\mu)^{\beta-1} = \mu^{n+\alpha-1}(1-\mu)^{N-n+\beta-1}\,, $$ hence the posterior is also beta-distributed as

$$ p(\mu|D) = \mathrm{Beta}(\mu|\,n+\alpha, N-n+\beta) $$

- For simplicity, we skip the model evaluation task here and proceed to **apply** the trained model. Let's use it to predict future observations.

- Marginalize over the parameter posterior to get the predictive PDF for a new coin toss $x_\bullet$, given the data $D$: $$ p(x_\bullet=h|D) = \int_0^1 p(x_\bullet=h|\mu)\, p(\mu|D)\, \mathrm{d}\mu = \int_0^1 \mu \cdot \mathrm{Beta}(\mu|\,n+\alpha, N-n+\beta)\, \mathrm{d}\mu = \frac{n+\alpha}{N+\alpha+\beta} $$

- This result is known as **Laplace's rule of succession**.

- The above integral computes the mean of a beta distribution, which is given by $\mathbb{E}[x] = \frac{a}{a+b}$ for $x \sim \mathrm{Beta}(a,b)$; see Wikipedia.

- Finally, we're ready to solve our example problem: for $D=\{hthhtth\}$ (hence $N=7$ and $n=4$) and a uniform prior ($\alpha=\beta=1$), we get $$ p(x_\bullet=h|D) = \frac{n+\alpha}{N+\alpha+\beta} = \frac{4+1}{7+1+1} = \frac{5}{9}\,. $$

- What did we learn from the data? Before seeing any data, we think that $$p(x_\bullet=h)=\left. p(x_\bullet=h|D) \right|_{n=N=0} = \frac{\alpha}{\alpha + \beta}\,.$$

- Hence, $\alpha$ and $\beta$ are prior pseudo-counts for heads and tails respectively.

- After the $N$ coin tosses, we think that $p(x_\bullet=h|D) = \frac{n+\alpha}{N+\alpha+\beta}$.

- Note the following decomposition: $$ p(x_\bullet=h|D) = \frac{n+\alpha}{N+\alpha+\beta} = \underbrace{\frac{N}{N+\alpha+\beta}}_{\text{gain}} \cdot \underbrace{\frac{n}{N}}_{\substack{\text{data-based}\\ \text{prediction}}} + \left(1-\frac{N}{N+\alpha+\beta}\right) \cdot \underbrace{\frac{\alpha}{\alpha+\beta}}_{\substack{\text{prior-based}\\ \text{prediction}}} $$

- Note that, since $0\leq \underbrace{\frac{N}{N+\alpha+\beta}}_{\text{gain}} \lt 1$, the Bayesian prediction lies between (fuses) the prior and data-based predictions. The data plays the role of "correcting" the prior prediction.

- For large $N$, the gain goes to $1$ and $\left. p(x_\bullet=h|D)\right|_{N\rightarrow \infty} \rightarrow \frac{n}{N}$ goes to the data-based prediction (the observed relative frequency).
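The fusion of prior- and data-based predictions is easy to verify in code (a small sketch using the $D=\{hthhtth\}$ numbers and a uniform prior):

```julia
# Verify: (n+α)/(N+α+β) = gain · (n/N) + (1 − gain) · α/(α+β)
N, n, α, β = 7, 4, 1, 1                  # D = {hthhtth}, uniform Beta(1,1) prior
gain = N / (N + α + β)
data_pred = n / N                        # data-based prediction (relative frequency)
prior_pred = α / (α + β)                 # prior-based prediction
fused = gain * data_pred + (1 - gain) * prior_pred
fused ≈ (n + α) / (N + α + β)            # the Bayesian prediction, 5/9
```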

- Let's see how $p(\mu|D)$ evolves as we increase the number of coin tosses $N$. We'll use two different priors to demonstrate the effect of the prior on the posterior (set $N=0$ to inspect the prior).

In [8]:

```
using Pkg; Pkg.activate("probprog/workspace");Pkg.instantiate();
IJulia.clear_output();
```

In [9]:

```
using PyPlot, Distributions

range_grid = range(0.0, stop=1.0, length=100)
μ = 0.4
samples = rand(192) .<= μ                  # Flip 192 coins with p(heads) = μ
posterior1 = Array{Distribution}(undef, 193)
posterior2 = Array{Distribution}(undef, 193)
for N = 0:192
    n = sum(samples[1:N])                  # Count number of heads in first N flips
    posterior1[N+1] = Beta(1+n, 1+(N-n))   # Based on uniform prior Beta(1,1)
    posterior2[N+1] = Beta(5+n, 5+(N-n))   # Based on prior Beta(5,5)
end

fig = figure("Posterior distributions", figsize=(10,8))
for (i, N) in enumerate([3, 10, 50, 150])
    ax = fig.add_subplot(2, 2, i); plt.subplot(ax)
    plot(range_grid, pdf.(posterior1[N+1], range_grid), "k-")
    plot(range_grid, pdf.(posterior2[N+1], range_grid), "k--")
    xlabel(L"\mu"); ylabel(L"p(\mu|\mathcal{D})"); grid()
    title(L"p(\mu|\mathcal{D})"*" for N=$(N), n=$(sum(samples[1:N])) (real \$\\mu\$=$(μ))")
    legend(["Based on uniform prior "*L"B(1,1)", "Based on prior "*L"B(5,5)"], loc=4)
end;
```

$\Rightarrow$ With more data, the relevance of the prior diminishes!

- In the example above, Bayesian parameter estimation and prediction were tractable in closed-form. This is often not the case. We will need to approximate some of the computations.

- Recall Bayesian prediction $$ p(x|D) = \int p(x|\theta)\, p(\theta|D)\, \mathrm{d}\theta\,. $$

- If we approximate the posterior $p(\theta|D)$ by a delta function at one 'best' value $\hat\theta$, i.e., $p(\theta|D) \approx \delta(\theta - \hat\theta)$, then the predictive distribution collapses to $$ p(x|D) \approx \int p(x|\theta)\, \delta(\theta - \hat\theta)\, \mathrm{d}\theta = p(x|\hat\theta)\,. $$

- This is just the data generating distribution $p(x|\theta)$ evaluated at $\theta=\hat\theta$, which is easy to evaluate.

- The next question is how to get the parameter estimate $\hat{\theta}$? (See next slide).

**Bayes estimate** (the mean of the posterior) $$ \hat\theta_{\text{bayes}} = \int \theta\, p(\theta|D)\, \mathrm{d}\theta $$

**Maximum A Posteriori** (MAP) estimate $$ \hat \theta_{\text{map}}= \arg\max _{\theta} p\left( \theta |D \right) = \arg \max_{\theta} p\left(D |\theta \right) \, p\left(\theta \right) $$

**Maximum Likelihood** (ML) estimate $$ \hat \theta_{\text{ml}} = \arg \max_{\theta} p\left(D |\theta\right) $$

- Note that Maximum Likelihood is MAP with a uniform prior.
- ML is the most common approximation to the full Bayesian posterior.
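For the coin example, all three point estimates have closed forms under the posterior $\mathrm{Beta}(\mu|\,n+\alpha, N-n+\beta)$: the mean $\frac{n+\alpha}{N+\alpha+\beta}$, the mode $\frac{n+\alpha-1}{N+\alpha+\beta-2}$ (valid when both shape parameters exceed $1$), and the likelihood maximum $\frac{n}{N}$. A small sketch with an assumed (hypothetical) $\mathrm{Beta}(2,2)$ prior:

```julia
# Point estimates of μ under the posterior Beta(n+α, N-n+β) for the coin model.
N, n, α, β = 7, 4, 2, 2                  # D = {hthhtth}, assumed Beta(2,2) prior
μ_bayes = (n + α) / (N + α + β)          # posterior mean  → 6/11
μ_map   = (n + α - 1) / (N + α + β - 2)  # posterior mode  → 5/9
μ_ml    = n / N                          # likelihood maximum → 4/7
```

Setting $\alpha=\beta=1$ makes `μ_map` equal to `μ_ml`, consistent with "ML is MAP with a uniform prior".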

Consider the task: predict a datum $x$ from an observed data set $D$.

| | Bayesian | Maximum Likelihood |
|---|---|---|
| 1. Model Specification | Choose a model $m$ with data generating distribution $p(x\vert\theta,m)$ and parameter prior $p(\theta\vert m)$ | Choose a model $m$ with the same data generating distribution $p(x\vert\theta,m)$. No need for priors. |
| 2. Learning | Use Bayes rule to find the parameter posterior, $p(\theta\vert D) \propto p(D\vert\theta)\, p(\theta)$ | By Maximum Likelihood (ML) optimization, $\hat\theta = \arg\max_{\theta} p(D\vert\theta)$ |
| 3. Prediction | $p(x\vert D) = \int p(x\vert\theta)\, p(\theta\vert D)\,\mathrm{d}\theta$ | $p(x\vert D) = p(x\vert\hat\theta)$ |

- Maximum Likelihood (ML) is MAP with a uniform prior. MAP is sometimes called a 'penalized' ML procedure: $$ \hat \theta_{\text{map}} = \arg\max_\theta \left\{ \underbrace{\log p(D|\theta)}_{\text{log-likelihood}} + \underbrace{\log p(\theta)}_{\text{penalty}} \right\} $$

- (good!). ML works rather well if we have a lot of data, because the influence of the prior diminishes with more data.

- (good!). Computationally often do-able. A useful fact that makes the optimization easier (since $\log$ is monotonically increasing): $$ \arg\max_\theta p(D|\theta) = \arg\max_\theta \log p(D|\theta) $$

- (bad). Cannot be used for model comparison! When doing ML estimation, the Bayesian model evidence evaluates to zero, because the prior probability mass under the likelihood function goes to zero. Therefore, the Bayesian model evidence cannot be used to evaluate model performance.

$\Rightarrow$ **ML estimation is an approximation to Bayesian learning**, but for good reason a very popular learning method when faced with lots of available data.
