Prove that the marginal distribution for an observation $x_n$ evaluates to
$$ p(x_n) = \sum_{k=1}^K \pi_k \cdot \mathcal{N}\left( x_n \mid \mu_k, \Sigma_k \right) \,. $$
Prove that this algorithm minimizes the Free Energy functional $$\begin{align*} F[q,\theta] = \sum_z q(z) \log \frac{q(z)}{p(D,z|\theta)} \,. \end{align*}$$
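As an informal sanity check (not part of the exercise), the sketch below samples from the generative process $z \sim \mathrm{Cat}(\pi)$, $x \mid z \sim \mathcal{N}(\mu_z, \sigma_z^2)$ for a 1-D mixture with made-up parameter values and compares an empirical bin probability against the integral of the claimed marginal $p(x) = \sum_k \pi_k\,\mathcal{N}(x \mid \mu_k, \sigma_k^2)$ over the same bin.

```python
# Numerical check of the mixture marginal; all parameter values are made up.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.5, 0.2])        # mixture weights (sum to 1)
mu = np.array([-2.0, 0.0, 3.0])       # component means
sigma = np.array([0.5, 1.0, 0.8])     # component standard deviations

def marginal_pdf(x):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return np.sum(pi * norm.pdf(x, loc=mu, scale=sigma))

# Sample from the generative model: first z, then x given z.
z = rng.choice(3, size=200_000, p=pi)
x = rng.normal(mu[z], sigma[z])

# Compare the empirical probability of a bin with the analytic marginal.
bin_lo, bin_hi = -0.5, 0.5
empirical = np.mean((x > bin_lo) & (x < bin_hi))
analytic = quad(marginal_pdf, bin_lo, bin_hi)[0]
print(empirical, analytic)            # the two should agree closely
```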
[7] (##) Consider a model $p(x,z|\theta) = p(x|z,\theta) p(z|\theta)$ where $x$ and $z$ relate to observed and unobserved variables, respectively. Also available is an observed data set $D=\left\{x_1,x_2,\ldots,x_N\right\}$. One iteration of the EM-algorithm for estimating the parameters $\theta$ is described by ($m$ is the iteration counter) $$ \hat{\theta}^{(m+1)} := \arg \max_\theta \left(\sum_z p(z|x=D,\hat{\theta}^{(m)}) \log p(x=D,z|\theta) \right) \,. $$
(a) Apparently, in order to execute EM, we need to work out an expression for the 'responsibility' $p(z|x=D,\hat{\theta}^{(m)})$. Use Bayes rule to show how we can compute the responsibility that allows us to execute an EM step.
(b) Why do we need multiple iterations in the EM algorithm?
(c) Why can't we just use simple maximum log-likelihood to estimate parameters, as described by $$ \hat{\theta} := \arg \max_\theta \log p(x=D,z|\theta) \,? $$
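For intuition on the iteration in exercise [7], here is a sketch of a single EM step for a 1-D Gaussian mixture (an illustration only, not the worked answer to (a)-(c)); the toy data and initial parameter estimates are made up.

```python
# One EM iteration for a 1-D Gaussian mixture; data and initializations are made up.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
D = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(1, 1.0, 350)])  # observed data

pi = np.array([0.5, 0.5])       # current estimates \hat{\theta}^{(m)}
mu = np.array([-1.0, 0.5])
sigma = np.array([1.0, 1.0])

# E-step: responsibilities p(z_n = k | x_n, \hat{\theta}^{(m)}) follow from Bayes rule,
# i.e. prior pi_k times likelihood N(x_n | mu_k, sigma_k^2), normalized over k.
lik = norm.pdf(D[:, None], loc=mu, scale=sigma)   # shape (N, K)
resp = pi * lik
resp /= resp.sum(axis=1, keepdims=True)

# M-step: maximize the expected complete-data log-likelihood; for a Gaussian
# mixture this yields the familiar responsibility-weighted updates.
Nk = resp.sum(axis=0)
pi_new = Nk / len(D)
mu_new = (resp * D[:, None]).sum(axis=0) / Nk
sigma_new = np.sqrt((resp * (D[:, None] - mu_new) ** 2).sum(axis=0) / Nk)
print(pi_new, mu_new, sigma_new)
```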
[8] In a particular model with hidden variables, the log-likelihood can be worked out to the following expression: $$ L(\theta) = \sum_n \log \left(\sum_k \pi_k\,\mathcal{N}(x_n|\mu_k,\Sigma_k)\right) $$ Would you prefer gradient descent or the EM algorithm to estimate maximum likelihood values for the parameters? Explain your answer. (No need to work out the equations.)
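As a small illustration (not a hint at the intended answer), the sketch below evaluates $L(\theta)$ for a 1-D mixture with made-up values using a numerically stable log-sum-exp; note in passing that any gradient-based scheme would have to respect the constraints $\sum_k \pi_k = 1$, $\pi_k \geq 0$ and $\Sigma_k$ positive definite.

```python
# Evaluate L(theta) = sum_n log( sum_k pi_k * N(x_n | mu_k, sigma_k^2) );
# all numbers are made up, 1-D case for simplicity.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

x = np.array([-1.2, 0.3, 2.5, 0.9])   # toy observations x_n
pi = np.array([0.4, 0.6])              # mixture weights
mu = np.array([-1.0, 1.0])             # component means
sigma = np.array([0.8, 1.2])           # component standard deviations

log_terms = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)  # shape (N, K)
L = logsumexp(log_terms, axis=1).sum()
print(L)
```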