with probabilities $p(x_{nk} = 1)=\theta_k$.
(a) Write down the probability for the $n$th observation $p(x_n|\theta)$ and derive the log-likelihood $\log p(D|\theta)$.
(b) Derive the maximum likelihood estimate for $\theta$.
See lecture notes (on class homepage).
(a) $p(x_n|\theta) = \prod_k \theta_k^{x_{nk}} \quad \text{subject to} \quad \sum_k \theta_k = 1$.
$$\ell(\theta) = \sum_k m_k \log \theta_k$$
where $m_k = \sum_n x_{nk}$.
(b) $\hat \theta_k = \frac{m_k}{N}$, the sample proportion.
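As a quick numerical illustration of the sample-proportion estimate, the following minimal sketch counts outcomes and normalizes by the number of observations. The function name `categorical_mle` and the ten die rolls are hypothetical and serve only as an example.

```python
from collections import Counter


def categorical_mle(rolls, num_classes):
    """Return the estimates m_k / N for outcomes labelled 1..num_classes."""
    counts = Counter(rolls)   # m_k: number of times outcome k was observed
    n_total = len(rolls)      # N: total number of observations
    return [counts.get(k, 0) / n_total for k in range(1, num_classes + 1)]


# Hypothetical data: ten rolls of a six-sided die.
rolls = [1, 3, 3, 6, 2, 3, 6, 1, 5, 3]
theta_hat = categorical_mle(rolls, num_classes=6)
print(theta_hat)
```

Note that an outcome that never occurs in the data (here, four eyes) receives an estimated probability of exactly zero.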
Provide an interpretation of the variables $m_k,N,\alpha_k,\sum_k\alpha_k$.
$m_k$ is the total number of occurrences of outcome $k$, i.e., the number of times we threw $k$ eyes. $\alpha_k$ is the prior pseudo-count, representing the number of observations in the $k$th class that we assume to have seen already. $\sum_k m_k = N$ is the total number of rolls and $\sum_k \alpha_k$ is the total number of prior pseudo-rolls.
$$\begin{align*}
p(x_{\bullet,k}=1|D) &= \frac{m_k + \alpha_k }{ N+ \sum_k \alpha_k} \\
&= \frac{N}{N+\sum_k \alpha_k} \frac{m_k}{N} + \frac{\sum_k \alpha_k}{N+\sum_k \alpha_k}\frac{\alpha_k}{\sum_k\alpha_k} \\
&= \underbrace{\frac{\alpha_k}{\sum_k\alpha_k}}_{\text{prior prediction}} + \underbrace{\frac{N}{N+\sum_k \alpha_k} \cdot \underbrace{\left(\frac{m_k}{N} - \frac{\alpha_k}{\sum_k\alpha_k}\right)}_{\text{prediction error}}}_{\text{data-based correction}}
\end{align*}$$
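The equivalence between the direct posterior prediction and the "prior prediction plus data-based correction" form can be checked with exact rational arithmetic. This is a minimal sketch; the function names, the counts, and the uniform pseudo-counts are assumptions chosen for illustration.

```python
from fractions import Fraction


def posterior_predictive(counts, alphas):
    """Direct form: (m_k + alpha_k) / (N + sum_k alpha_k)."""
    n_total = sum(counts)
    alpha_total = sum(alphas)
    return [Fraction(m + a, n_total + alpha_total) for m, a in zip(counts, alphas)]


def prior_plus_correction(counts, alphas):
    """Prior prediction plus weighted prediction error."""
    n_total = sum(counts)
    alpha_total = sum(alphas)
    weight = Fraction(n_total, n_total + alpha_total)  # N / (N + sum_k alpha_k)
    result = []
    for m, a in zip(counts, alphas):
        prior = Fraction(a, alpha_total)       # prior prediction
        error = Fraction(m, n_total) - prior   # prediction error
        result.append(prior + weight * error)
    return result


# Hypothetical die counts m_k and one pseudo-roll per face.
counts = [2, 1, 4, 0, 1, 2]
alphas = [1, 1, 1, 1, 1, 1]
direct = posterior_predictive(counts, alphas)
corrected = prior_plus_correction(counts, alphas)
print(direct == corrected)
```

With these numbers the unobserved face gets probability $1/16$ rather than zero, which shows how the pseudo-counts smooth the estimate.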
(a) The probability mass function of a multinomial distribution is $p(D_m|\mu) =\frac{N!}{m_1! m_2!\ldots m_K!} \,\prod_k \mu_k^{m_k}$ over the data frequencies $D_m=\{m_1,\ldots,m_K\}$ with the constraints $\sum_k \mu_k = 1$ and $\sum_k m_k=N$. Setting $N=1$, the multinomial coefficient equals 1 and we obtain $p(D_m|\mu) = \prod_k \mu_k^{m_k}$ with $\sum_k m_k=1$. The counts $m_k \in \{0,1\}$ then form a one-hot code, and the result is the categorical distribution.
(b) When $K=2$, the constraints for the categorical distribution take the form $m_2=1-m_1$ and $\mu_2=1-\mu_1$, leading to $p(D_m|\mu) = \mu_1^{m_1}(1-\mu_1)^{1-m_1}$, which is the Bernoulli distribution.
(c) Plugging $K=2$ into the multinomial distribution leads to $p(D_m|\mu) =\frac{N!}{m_1! m_2!}\mu_1^{m_1}\mu_2^{m_2}$ with the constraints $m_1+m_2=N$ and $\mu_1+\mu_2=1$. Plugging the constraints back in, we obtain $p(D_m|\mu) = \frac{N!}{m_1! (N-m_1)!}\mu_1^{m_1}\left(1-\mu_1\right)^{N-m_1}$, which is the binomial distribution.
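The special cases above can be verified numerically. The sketch below evaluates the multinomial probability mass function and compares it with the binomial form for $K=2$; the function names and the parameter values are illustrative assumptions.

```python
from math import comb, factorial, prod


def multinomial_pmf(counts, probs):
    """N! / (m_1! ... m_K!) * prod_k mu_k^{m_k}."""
    coeff = factorial(sum(counts))
    for m in counts:
        coeff //= factorial(m)  # every partial quotient is an integer
    return coeff * prod(p ** m for p, m in zip(probs, counts))


def binomial_pmf(m1, n_total, mu1):
    """N! / (m_1! (N - m_1)!) * mu_1^{m_1} (1 - mu_1)^{N - m_1}."""
    return comb(n_total, m1) * mu1 ** m1 * (1 - mu1) ** (n_total - m1)


# K = 2: the multinomial reduces to the binomial (hypothetical N = 5, mu_1 = 0.3).
n_total, mu1 = 5, 0.3
max_gap = max(
    abs(multinomial_pmf([m1, n_total - m1], [mu1, 1 - mu1]) - binomial_pmf(m1, n_total, mu1))
    for m1 in range(n_total + 1)
)

# N = 1: a one-hot count vector simply picks out the matching mu_k (categorical).
one_hot_prob = multinomial_pmf([0, 1, 0], [0.2, 0.5, 0.3])
print(max_gap, one_hot_prob)
```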
The Beta distribution is given by $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$. Define $\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \triangleq \mathcal{B}(\alpha,\beta)$, which is the normalization constant. Notice that this definition makes $\int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\mathrm{d}x = \mathcal{B}(\alpha,\beta)$. Together with $\Gamma(x+1) = x\Gamma(x)$ we can use these identities to obtain the requested statistics:
$$\begin{align*}
\mathbb{E}[x] &= \frac{1}{\mathcal{B}(\alpha,\beta)}\int_0^1 x x^{\alpha-1}(1-x)^{\beta-1}\mathrm{d}x \\
&= \frac{1}{\mathcal{B}(\alpha,\beta)}\int_0^1x^{\alpha}(1-x)^{\beta-1}\mathrm{d}x \\
&= \frac{\mathcal{B}(\alpha+1,\beta)}{\mathcal{B}(\alpha,\beta)} \\
&= \frac{\Gamma(\alpha+1)\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\alpha+\beta+1)}\\
&= \frac{\alpha \Gamma(\alpha)\Gamma(\alpha+\beta) }{(\alpha+\beta)\Gamma(\alpha)\Gamma(\alpha+\beta)}\\
&= \frac{\alpha}{\alpha+\beta} \\
\mathbb{V}[x] &= \mathbb{E}[x^2] - \mathbb{E}[x]^2 \\
&= \frac{1}{\mathcal{B}(\alpha,\beta)}\int_0^1 x^2 x^{\alpha-1}(1-x)^{\beta-1}\mathrm{d}x - \frac{\alpha^2}{(\alpha+\beta)^2} \\
&= \frac{\mathcal{B}(\alpha+2,\beta)}{\mathcal{B}(\alpha,\beta)} - \frac{\alpha^2}{(\alpha+\beta)^2} \\
&= \frac{\alpha}{\alpha+\beta}\left(\frac{\alpha+1}{\alpha+\beta+1} - \frac{\alpha}{\alpha+\beta}\right) \\
&= \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
\end{align*}$$
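The closed-form mean and variance of the Beta distribution can be cross-checked against ratios of Beta functions computed from `math.gamma`. This is only a sketch: the helper names `beta_fn` and `beta_moments` and the parameter values are assumptions for illustration.

```python
from math import gamma, isclose


def beta_fn(a, b):
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)


def beta_moments(a, b):
    """Mean and variance obtained from ratios of Beta functions."""
    mean = beta_fn(a + 1, b) / beta_fn(a, b)     # E[x]
    second = beta_fn(a + 2, b) / beta_fn(a, b)   # E[x^2]
    return mean, second - mean ** 2


a, b = 2.5, 4.0  # hypothetical parameter values
mean, var = beta_moments(a, b)
print(isclose(mean, a / (a + b)))
print(isclose(var, a * b / ((a + b) ** 2 * (a + b + 1))))
```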
If $\alpha=\beta=1$, then the Beta distribution is identical to a uniform distribution, which doesn't have a unique mode. If one of the parameters is $<1$, then the mode is at one of the edges. When both parameters are $>1$, the mode is well-defined and lies in the interior of the support $(0,1)$. Assuming the parameters are $>1$, we can evaluate the mode as $$\begin{align*} \nabla_x x^{\alpha-1}(1-x)^{\beta-1} &= 0\\ \frac{\alpha-1}{\beta-1} &= \frac{x}{1-x} \\ \alpha-1 &= x(\alpha+\beta-2) \\ \Rightarrow x_{\text{mode}} &= \frac{\alpha-1}{\alpha+\beta-2}. \end{align*}$$
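For parameters larger than 1, the closed-form mode can be compared against a brute-force search over a grid on $(0,1)$. The following is a rough sketch; the function names, the grid resolution, and the parameter values are assumptions made for illustration.

```python
def beta_unnormalized(x, a, b):
    """Unnormalized Beta density x^(a-1) (1-x)^(b-1)."""
    return x ** (a - 1) * (1 - x) ** (b - 1)


def grid_mode(a, b, steps=10_000):
    """Approximate the mode by maximizing the density over an interior grid."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda x: beta_unnormalized(x, a, b))


a, b = 3.0, 5.0  # hypothetical parameters, both > 1
closed_form = (a - 1) / (a + b - 2)
numeric = grid_mode(a, b)
print(closed_form, numeric)
```

The grid estimate can only be as accurate as the grid spacing, so the two values agree up to roughly `1 / steps`.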
In the lectures we have seen that $p(x_\bullet =1|D) = \frac{a+n}{a+b+N}$, where $a$ and $b$ are parameters of the Beta prior. The ML estimate is $\frac{n}{N}$ and the prior mean is $\frac{a}{a+b}$. To show that the prediction lies in between the ML and prior estimates, we will try to write the prediction as a convex combination of the latter two. That is, we want to solve for $\lambda$ in
$$\begin{align*}
(1-\lambda) \frac{n}{N} + \lambda\frac{a}{a+b} &= \frac{a+n}{a+b+N} \\
\lambda &= \frac{1}{1+\frac{N}{a+b}}
\end{align*}$$
Since $a,b$ and $N$ are positive, it follows that $0<\lambda <1$. This means the prediction is a convex combination of prior and ML estimates and thus lies in between the two.
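The convex-combination argument can be checked with exact fractions. In this small sketch, the function name `shrinkage_weight` and the values of $a$, $b$, $n$ and $N$ are hypothetical.

```python
from fractions import Fraction


def shrinkage_weight(a, b, n_total):
    """lambda = 1 / (1 + N / (a + b)), the weight on the prior mean."""
    return 1 / (1 + Fraction(n_total, a + b))


a, b = 2, 3          # hypothetical Beta prior parameters
n, n_total = 7, 10   # hypothetical data: 7 successes in 10 trials

lam = shrinkage_weight(a, b, n_total)
prediction = Fraction(a + n, a + b + n_total)
combination = (1 - lam) * Fraction(n, n_total) + lam * Fraction(a, a + b)
print(lam, prediction, combination)
```

As $N$ grows with $a+b$ fixed, $\lambda$ shrinks toward 0 and the prediction approaches the ML estimate.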
$$\begin{align*}
\log p(D|\theta) &= \sum_n \log \prod_k p(x_n,y_{nk}|\theta)^{y_{nk}} \\
&= \sum_{n,k} y_{nk} \log p(x_n,y_{nk}|\theta) \\
&= \sum_{n,k} y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma) + \sum_{n,k} y_{nk} \log \pi_k
\end{align*}$$
(b) Show now that the MLE of the class-conditional mean is given by $$\begin{equation*} \hat \mu_k = \frac{\sum_n y_{nk} x_n}{\sum_n y_{nk}} \end{equation*} $$
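To make the formula concrete, the following sketch evaluates $\hat\mu_k$ on a toy data set with one-hot labels. It is an illustration rather than a proof; the function name `class_means` and the data are assumptions, and every class is assumed to have at least one observation.

```python
def class_means(xs, ys):
    """Compute sum_n y_nk x_n / sum_n y_nk for each class k.

    xs: list of feature vectors x_n; ys: list of one-hot label rows y_n.
    """
    num_classes = len(ys[0])
    dim = len(xs[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, y in zip(xs, ys):
        for k, y_nk in enumerate(y):
            counts[k] += y_nk               # sum_n y_nk
            for d in range(dim):
                sums[k][d] += y_nk * x[d]   # sum_n y_nk x_n
    return [[s / counts[k] for s in sums[k]] for k in range(num_classes)]


# Hypothetical toy data: four 2-D points, two classes.
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0], [7.0, 2.0]]
ys = [[1, 0], [1, 0], [0, 1], [0, 1]]
mu_hat = class_means(xs, ys)
print(mu_hat)
```

Because $y_{nk}$ is one-hot, the estimate is simply the average of the observations assigned to class $k$.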