[5] (#) In the class we use the following prior for the weights:
$$\begin{equation*}
p(w|\alpha) = \mathcal{N}\left(w | 0, \alpha^{-1} I \right)
\end{equation*}$$
(a) Give some considerations for choosing a Gaussian prior for the weights.
(b) We could have chosen a prior with full (not diagonal) covariance matrix $p(w|\alpha) = \mathcal{N}\left(w | 0, \Sigma \right)$. Would that be better? Give your thoughts on that issue.
(c) Generally we choose $\alpha$ as a small positive number. Give your thoughts on that choice as opposed to choosing a large positive value. How about choosing a negative value for $\alpha$?
[6] Consider an IID data set $D=\{(x_1,y_1),\ldots,(x_N,y_N)\}$. We will model this data set by a model $$y_n =\theta^T f(x_n) + e_n\,,$$ where $f(x_n)$ is an $M$-dimensional feature vector of input $x_n$; $y_n$ is a scalar output and $e_n \sim \mathcal{N}(0,\sigma^2)$.
(a) Rewrite the model in matrix form by lumping input features in a matrix $F=[f(x_1),\ldots,f(x_N)]^T$, outputs and noise in the vectors $y=[y_1,\ldots,y_N]^T$ and $e=[e_1,\ldots,e_N]^T$, respectively.
(b) Now derive an expression for the log-likelihood $\log p(y|\,F,\theta,\sigma^2)$.
(c) Proof that the maximum likelihood estimate for the parameters is given by $$\hat\theta_{\text{ml}} = (F^TF)^{-1}F^Ty$$
(d) What is the predicted output value $y_\bullet$, given an observation $x_\bullet$ and the maximum likelihood parameters $\hat \theta_{\text{ml}}$. Work this expression out in terms of $F$, $y$ and $f(x_\bullet)$.
(e) Suppose that, before the data set $D$ was observed, we had reason to assume a prior distribution $p(\theta)=\mathcal{N}(0,\sigma_0^2)$. Derive the Maximum a posteriori (MAP) estimate $\hat \theta_{\text{map}}$.(hint: work this out in the $\log$ domain.)