The prior probabilities $p(C_1) = 0.6$ and $p(C_2) = 0.4$ are also known from experience.
(a) (##) A "Bayes Classifier" is given by
Derive the optimal Bayes classifier.
(b) (###) The probability of making the wrong decision, given $x$, is
$$
p(\text{error}|x) = \min\left\{p(C_1|x),\, p(C_2|x)\right\}.
$$
Compute the total error probability $p(\text{error})$ for the Bayes classifier in this example.
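As a numerical sanity check, here is a minimal Python sketch of how $p(\text{error})$ could be evaluated on a grid. The Gaussian class-conditional densities below are illustrative assumptions only (they are not the densities specified in this exercise); only the priors are taken from the problem statement.

```python
import numpy as np

# Hypothetical class-conditional densities, chosen purely for illustration;
# substitute the densities given in the exercise.
def p_x_given_c1(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)           # N(x | 0, 1)

def p_x_given_c2(x):
    return np.exp(-0.5 * (x - 2.0)**2) / np.sqrt(2 * np.pi)   # N(x | 2, 1)

p_c1, p_c2 = 0.6, 0.4                # priors from the problem statement
x = np.linspace(-10.0, 12.0, 20001)  # integration grid

joint1 = p_c1 * p_x_given_c1(x)      # p(x, C1)
joint2 = p_c2 * p_x_given_c2(x)      # p(x, C2)

# The Bayes classifier picks the class with the largest posterior (equivalently,
# the largest joint), so the error contribution at each x is the smaller joint.
p_error = np.trapz(np.minimum(joint1, joint2), x)
print(f"p(error) ≈ {p_error:.4f}")
```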
[2] (#) (see Bishop exercise 4.8): Using (4.57) and (4.58) (from Bishop's book), derive the result (4.65) for the posterior class probability in the two-class generative model with Gaussian densities, and verify the results (4.66) and (4.67) for the parameters $w$ and $w_0$.
[3] (###) (see Bishop exercise 4.9).
[4] (##) (see Bishop exercise 4.10).
where $\sigma(x) = 1/(1+e^{-x})$ is the logistic function. Let's introduce the shorthand notation $\mu_n=\sigma(\theta^T x_n + b)$, so that for every input $x_n$ we have a model output $\mu_n$ and an observed data output $y_n$.
(a) Express $p(y_n|x_n)$ as a Bernoulli distribution in terms of $\mu_n$ and $y_n$.
(b) Assuming furthermore that the data set is IID, show that the log-likelihood is given by
$$
L(\theta) \triangleq \log p(D|\theta) = \sum_n \left\{y_n \log \mu_n + (1-y_n)\log(1-\mu_n)\right\}
$$
(c) Prove that the derivative of the logistic function is given by
$$
\sigma^\prime(\xi) = \sigma(\xi)\cdot\left(1-\sigma(\xi)\right)
$$
(d) Show that the derivative of the log-likelihood is
$$
\nabla_\theta L(\theta) = \sum_{n=1}^N \left( y_n - \sigma(\theta^T x_n +b)\right)x_n
$$
(e) Design a gradient-ascent algorithm for maximizing $L(\theta)$ with respect to $\theta$.
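For concreteness, here is a minimal Python sketch of such a gradient-ascent scheme, using the gradient from part (d); the toy data, step size `eta`, and iteration count are illustrative assumptions, and the bias $b$ is absorbed into $\theta$ by appending a constant feature.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, eta=0.5, n_iters=2000):
    """Batch gradient ascent on L(theta); the bias b is handled by
    appending a constant 1 to every input vector."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # rows are [x_n, 1]
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        mu = sigmoid(Xb @ theta)          # model outputs mu_n
        grad = Xb.T @ (y - mu)            # gradient of L(theta), part (d)
        theta += eta * grad / len(y)      # ascent step (gradient scaled by 1/N)
    return theta

# Toy usage with made-up data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
print(fit_logistic(X, y))
```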
[2] Briefly describe the similarities and differences between the discriminative and generative approaches to classification.
[3] (Bishop ex.4.7) (#) Show that the logistic sigmoid function $\sigma(a) = \frac{1}{1+\exp(-a)}$ satisfies the property $\sigma(-a) = 1-\sigma(a)$ and that its inverse is given by $\sigma^{-1}(y) = \log\{y/(1-y)\}$.
[4] (Bishop ex.4.16) (###) Consider a binary classification problem in which each observation $x_n$ is known to belong to one of two classes, corresponding to $y_n = 0$ and $y_n = 1$. Suppose that the procedure for collecting training data is imperfect, so that training points are sometimes mislabelled. For every data point $x_n$, instead of having a value $y_n$ for the class label, we have instead a value $\pi_n$ representing the probability that $y_n = 1$. Given a probabilistic model $p(y_n = 1|x_n,\theta)$, write down the log-likelihood function appropriate to such a data set.
[5] (###) Let $X$ be a real-valued random variable with probability density
$$
p_X(x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}},\quad\text{for all $x$}.
$$
$Y$ is a real-valued random variable with conditional density
$$
p_{Y|X}(y|x) = \frac{e^{-(y-x)^2/2}}{\sqrt{2\pi}},\quad\text{for all $x$ and $y$}.
$$
(a) Give an (integral) expression for $p_Y(y)$. Do not try to evaluate the integral.
(b) Approximate $p_Y(y)$ using the Laplace approximation.
Give the detailed derivation, not just the answer.
Hint: You may use the following results.
Let
$$g(x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}}$$
and
$$h(x) = \frac{e^{-(y-x)^2/2}}{\sqrt{2\pi}}$$
for some real value $y$. Then:
$$\begin{align*}
\frac{\partial}{\partial x} g(x) &= -xg(x) \\
\frac{\partial^2}{\partial x^2} g(x) &= (x^2-1)g(x) \\
\frac{\partial}{\partial x} h(x) &= (y-x)h(x) \\
\frac{\partial^2}{\partial x^2} h(x) &= ((y-x)^2-1)h(x)
\end{align*}$$
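As a sanity check on part (b), the following Python sketch (assuming, as suggested by the hint, that $p_X = g$) carries out the Laplace recipe numerically: locate the mode $x_0$ of $\log\big(g(x)\,h(x)\big)$, evaluate the negative second derivative $A$ at $x_0$, approximate the integral by $g(x_0)\,h(x_0)\sqrt{2\pi/A}$, and compare with direct numerical integration. The value of $y$ and the integration grid are arbitrary choices for illustration.

```python
import numpy as np

def g(x):          # assumed prior density p_X(x) = N(x | 0, 1)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def h(x, y):       # conditional density p(y | x) = N(y | x, 1)
    return np.exp(-0.5 * (y - x)**2) / np.sqrt(2 * np.pi)

def laplace_p_y(y):
    """Laplace approximation of the integral of g(x) * h(x, y) over x."""
    # Mode: d/dx log[g(x) h(x, y)] = -x + (y - x) = 0  =>  x0 = y / 2
    x0 = y / 2.0
    # Curvature: -d^2/dx^2 log[g(x) h(x, y)] = 2
    A = 2.0
    return g(x0) * h(x0, y) * np.sqrt(2 * np.pi / A)

y = 1.3
x = np.linspace(-20.0, 20.0, 200001)
numeric = np.trapz(g(x) * h(x, y), x)   # brute-force evaluation of p_Y(y)
print(laplace_p_y(y), numeric)          # the two agree: the integrand is Gaussian in x
```

Under this assumption the integrand is itself Gaussian in $x$, so the Laplace approximation reproduces the exact $p_Y(y)$, a Gaussian with mean $0$ and variance $2$.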