### Bayesian Machine Learning¶

• [1] (#) (a) Explain shortly the relation between machine learning and Bayes rule.
(b) How are Maximum a Posteriori (MAP) and Maximum Likelihood (ML) estimation related to Bayes rule and machine learning?

(a) Machine learning is inference over models (hypotheses, parameters, etc.) from a given data set. Bayes rule makes this statement precise. Let $\theta \in \Theta$ and $D$ represent a model parameter vector and the given data set, respectively. Then, Bayes rule, $$p(\theta|D) = \frac{p(D|\theta)}{p(D)} p(\theta)$$ relates the information that we have about $\theta$ before we saw the data (i.e., the distribution $p(\theta)$) to what we know after having seen the data, $p(\theta|D)$.
(b) The Maximum a Posteriori (MAP) estimate picks a value $\hat\theta$ for which the posterior distribution $p(\theta|D)$ is maximal, i.e., $$\hat\theta_{MAP} = \arg\max_\theta p(\theta|D)$$ In a sense, MAP estimation approximates Bayesian learning, since we approximated $p(\theta|D)$ by $\delta(\theta-\hat\theta_{\text{MAP}})$. Note that, by Bayes rule, $$\arg\max_\theta p(\theta|D) = \arg\max_\theta p(D|\theta)p(\theta)$$ If we further assume that prior to seeing the data all values for $\theta$ are equally likely (i.e., $p(\theta)=\text{const.}$), then the MAP estimate reduces to the Maximum Likelihood estimate, $$\hat\theta_{ML} = \arg\max_\theta p(D|\theta)$$

• [2] (#) What are the four stages of the Bayesian design approach?

(1) Model specification, (2) parameter estimation, (3) model evaluation and (4) application of the model to tasks.

• [3] (##) The Bayes estimate is a summary of a posterior distribution by a delta distribution on its mean, i.e., $$\hat \theta_{bayes} = \int \theta \, p\left( \theta |D \right) \,\mathrm{d}{\theta}$$ Proof that the Bayes estimate minimizes the expected mean-squared error, i.e., proof that $$\hat \theta_{bayes} = \arg\min_{\hat \theta} \int_\theta (\hat \theta -\theta)^2 p \left( \theta |D \right) \,\mathrm{d}{\theta}$$

To minimize the expected mean-squared error we will look for $\hat{\theta}$ that makes the gradient of the integral with respect to $\hat{\theta}$ vanish. \begin{align*} \nabla_{\hat{\theta}} \int_\theta (\hat \theta -\theta)^2 p \left( \theta |D \right) \,\mathrm{d}{\theta} &= 0 \\ \int_\theta \nabla_{\hat{\theta}} (\hat \theta -\theta)^2 p \left( \theta |D \right) \,\mathrm{d}{\theta} &= 0 \\ \int_\theta 2(\hat \theta -\theta) p \left( \theta |D \right) \,\mathrm{d}{\theta} &= 0 \\ \int_\theta \hat \theta p \left( \theta |D \right) \,\mathrm{d}{\theta} &= \int_\theta \theta p \left( \theta |D \right) \,\mathrm{d}{\theta} \\ \hat \theta \underbrace{\int_\theta p \left( \theta |D \right) \,\mathrm{d}{\theta}}_{1} &= \int_\theta \theta p \left( \theta |D \right) \,\mathrm{d}{\theta} \\ \hat \theta &= \int_\theta \theta p \left( \theta |D \right) \,\mathrm{d}{\theta} \end{align*}

• [4] (###) We make $N$ IID observations $D=\{x_1 \dots x_N\}$ and assume the following model $$x_k = A + \epsilon_k$$ where $\epsilon_k = \mathcal{N}(\epsilon_k | 0,\sigma^2)$ with known $\sigma^2=1$. We are interested in deriving an estimator for $A$.
(a) Make a reasonable assumption for a prior on $A$ and derive a Bayesian (posterior) estimate.
(b) (##) Derive the Maximum Likelihood estimate for $A$.
(c) Derive the MAP estimates for $A$.
(d) Now assume that we do not know the variance of the noise term? Describe the procedure for Bayesian estimation of both $A$ and $\sigma^2$ (No need to fully work out to closed-form estimates).

(a) Since there is no assumption on the values A can take it makes sense to assume a distribution that has support over the reals. A Gaussian prior is a good candidate. Let us assume $p(A) = \mathcal{N}(A|m_A,v_A)$. Since $p(D|A) = \prod_k \mathcal{N}(x_k|A,\sigma^2)$ is a Gaussian likelihood and $p(A)$ is a Gaussian prior, their multiplication is proportional to a Gaussian. We will work this out with the canonical parameterization of the Gaussian since it is easier to multiply Gaussians in that domain. This means the posterior $p(A|D)$ is \begin{align*} p(A|D) &\propto p(A) p(D|A) \\ &= \mathcal{N}(A|m_A,v_A) \prod_{k=1}^N \mathcal{N}(x_k|A,\sigma^2) \\ &= \mathcal{N}(A|m_A,v_A) \prod_{k=1}^N \mathcal{N}(A|x_k,\sigma^2) \\ &= \mathcal{N}_c\big(A \Bigm|\frac{m_A}{v_A},\frac{1}{v_A}\big)\prod_{k=1}^N \mathcal{N}_c\big(A\Bigm| \frac{x_k}{\sigma^2},\frac{1}{\sigma^2}\big) \\ &\propto \mathcal{N}_c\big(A \Bigm| \frac{m_A}{v_A} + \frac{1}{\sigma^2} \sum_k x_k , \frac{1}{v_A} + \frac{N}{\sigma^2} \big) \,, \end{align*} where we have made use of the fact that precision-weighted means and precisions add when multiplying Gaussians. In principle this description of the posterior completes the answer.
(b) The ML estimate can be found by \begin{align*} \nabla \log p(D|A) &=0\\ \nabla \sum_k \log \mathcal{N}(x_k|A,\sigma^2) &= 0 \\ \nabla \frac{-1}{2}\sum_k \frac{(x_k-A)^2}{\sigma^2} &=0\\ \sum_k(x_k-A) &= 0 \\ \hat{A}_{ML} = \frac{1}{N}\sum_{k=1}^N x_k \end{align*}
(c) The MAP is simply the location where the posterior has its maximum value, which for a Gaussian posterior is its mean value. We computed in (a) the precision-weighted mean, so we need to divide by precision (or multiply by variance) to get the location of the mean:
\begin{align*} \hat{A}_{MAP} &= \left( \frac{m_A}{v_A} + \frac{1}{\sigma^2} \sum_k x_k\right)\cdot \left( \frac{1}{v_A} + \frac{N}{\sigma^2} \right)^{-1} \\ &= \frac{v_A \sum_k x_k + \sigma^2 m_A}{N v_A + \sigma^2} \end{align*}
(d) A Bayesian treatment requires putting a prior on the unknown variance. The variance is constrained to be positive hence the support of the prior distribution needs to be on the positive reals. (In a multivariate case positivity needs to be extended to symmetric positive definiteness.) Choosing a conjugate prior will simplify matters greatly. In this scenerio the inverse Gamma distribution is the conjugate prior for the unknown variance. In the literature this model is called a Normal-Gamma distribution. See https://www.seas.harvard.edu/courses/cs281/papers/murphy-2007.pdf for the analytical treatment.

• [5] (##) We consider the coin toss example from the notebook and use a conjugate prior for a Bernoulli likelihood function.
(a) Derive the Maximum Likelihood estimate.
(b) Derive the MAP estimate.
(c) Do these two estimates ever coincide (if so under what circumstances)?

(a) \begin{align*} \nabla \log p(D|\mu) &= 0 \\ \nabla \left( n\log \mu + (N-n)\log(1-\mu)\right) &= 0\\ \frac{n}{\mu} - \frac{N-n}{1-\mu} &= 0 \\ \hat{\mu}_{\text{ML}} = \frac{n}{N} \end{align*}
(b) Assuming a beta prior $\mathcal{B}(\mu|\alpha,\beta)$, we can write the posterior as as \begin{align*} p(\mu|D) &\propto p(D|\mu)p(\mu) \\ &\propto \mu^n (1-\mu)^{N-n} \mu^{\alpha-1} (1-\mu)^{\beta-1} \\ &\propto \mathcal{B}(\mu|n+\alpha,N-n+\beta) \end{align*}
The MAP estimate for a beta distribution $\mathcal{B}(a,b)$ is located at $\frac{a - 1}{a+b-2}$, see wikipedia. Hence, \begin{align*} \hat{\mu}_{\text{MAP}} &= \frac{(n+\alpha)-1}{(n+\alpha) + (N-n+\beta) -2} \\ &= \frac{n+\alpha-1}{N + \alpha +\beta -2} \end{align*} (c) As $N$ gets larger, the MAP estimate approaches the ML estimate. In the limit the MAP solution converges to the ML solution.

In [ ]: