Discrete Data and the Multinomial Distribution¶

[1] (##) We consider IID data $D = \{x_1,x_2,\ldots,x_N\}$ obtained from tossing a $K$-sided die. We use a binary selection variable

$$x_{nk} \equiv \begin{cases} 1 & \text{if $x_n$ lands on $k$th face}\\ 0 & \text{otherwise} \end{cases} $$

with probabilities $p(x_{nk} = 1)=\theta_k$.
(a) Write down the probability for the $n$th observation $p(x_n|\theta)$ and derive the log-likelihood $\log p(D|\theta)$.
(b) Derive the maximum likelihood estimate for $\theta$.

See lecture notes (on class homepage).

(a) $p(x_n|\theta) = \prod_k \theta_k^{x_{nk}} \quad \text{subject to} \quad \sum_k \theta_k = 1$. $$\ell(\theta) = \sum_k m_k \log \theta_k$$ where $m_k = \sum_n x_{nk}$.
(b) $\hat \theta = \frac{m_k}{N}$, the sample proportion.

[2] (#) In the notebook, Laplace's generalized rule of succession (the probability that we throw the $k$th face at the next toss) was derived as

$$\begin{align*} p(x_{\bullet,k}=1|D) = \frac{m_k + \alpha_k }{ N+ \sum_k \alpha_k} \end{align*}$$

Provide an interpretation of the variables $m_k,N,\alpha_k,\sum_k\alpha_k$.

$m_k$ is the total number of occurances that we threw $k$ eyes, $\alpha_k$ is the prior pseudo counts representing the number of observations in the $k$th that we assume to have seen already. $\sum_k m_k = N $ is the total number of rolls and $\sum_k \alpha_k $ is the total number of prior pseudo rolls.

[3] (##) Show that Laplace's generalized rule of succession can be worked out to a prediction that is composed of a prior prediction and data-based correction term.

$$\begin{align*}

p(x_{\bullet,k}=1|D) &= \frac{m_k + \alpha_k }{ N+ \sum_k \alpha_k} \ &= \frac{N}{N+\sum_k \alpha_k} \frac{m_k}{N} + \frac{\sum_k \alpha_k}{N+\sum_k \alpha_k}\frac{\alpha_k}{\sum_k\alpha_k} \ &= \underbrace{\frac{\alpha_k}{\sum_k\alpha_k}}{\text{prior prediction}} + \underbrace{\frac{N}{N+\sum_k \alpha_k} \cdot \underbrace{\left(\frac{m_k}{N} - \frac{\alpha_k}{\sum_k\alpha_k}\right)}{\text{prediction error}}}_{\text{data-based correction}} \end{align*}$$

[4] (#) Verify that
(a) the categorial distribution is a special case of the multinomial for $N=1$.
(b) the Bernoulli is a special case of the categorial distribution for $K=2$.
(c) the binomial is a special case of the multinomial for $K=2$.

(a) The probability mass function of a multinomial distribution is $p(D_m|\mu) =\frac{N!}{m_1! m_2!\ldots m_K!} \,\prod_k \mu_k^{m_k}$ over the data frequencies $D_m=\{m_1,\ldots,m_K\}$ with the constraint that $\sum_k \mu_k = 1$ and $\sum_k m_k=N$. Setting $N=1$ we see that $p(D_m|\mu) \propto \prod_k \mu_k^{m_k}$ with $\sum_k m_k=1$, making the sample space one-hot coded given by the categorical distribution.
(b) When $K=2$, the constraint for the categorical distribution takes the form $m_1=1-m_2$ leading to $p(D_m|\mu) \propto \mu_1^{m_1}(1-\mu_1)^{1-m_1}$ which is associated with the Bernoulli distribution.
(c) Plugging $K=2$ into the multinomial distribution leads to $p(D_m|\mu) =\frac{N!}{m_1! m_2!}\mu_1^{m_1}\left(\mu_2^{m_2}\right)$ with the constraints $m_1+m_2=N$ and $\mu_1+\mu_2=1$. Then plugging the constraints back in we obtain $p(D_m|\mu) = \frac{N!}{m_1! (N-m1)!}\mu_1^{m_1}\left(1-\mu_1\right)^{N-m_1}$ as the binomial distribution.

[5] (###) Determine the mean, variance and mode of a Beta distribution.

The Beta distribution is given by $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$. Define $\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \triangleq \mathcal{B}(\alpha,\beta)$, which is the normalization constant. Notice that this definition makes $\int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\mathrm{d}x = \mathcal{B}(\alpha,\beta)$. Together with $\Gamma(x+1) = x\Gamma(x)$ we can use these identities to obtain the requested statistics:

$$\begin{align*} \mathbb{E}[x] &= \frac{1}{\mathcal{B}(\alpha,\beta)}\int_0^1 x x^{\alpha-1}(1-x)^{\beta-1}\mathrm{d}x \\ &= \frac{1}{\mathcal{B}(\alpha,\beta)}\int_0^1x^{\alpha}(1-x)^{\beta-1}\mathrm{d}x \\ &= \frac{\mathcal{B}(\alpha+1,\beta)}{\mathcal{B}(\alpha,\beta)} \\ &= \frac{\Gamma(\alpha+1)\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\alpha+\beta+1)}\\ &= \frac{\alpha \Gamma(\alpha)\Gamma(\alpha+\beta) }{(\alpha+\beta)\Gamma(\alpha)\Gamma(\alpha+\beta)}\\ &= \frac{\alpha}{\alpha+\beta} \\ \mathbb{V}[x] &= \mathbb{E}[x^2] - \mathbb{E}[x]^2 \\ &= \frac{1}{\mathcal{B}(\alpha,\beta)}\int_0^1 x^2 x^{\alpha-1}(1-x)^{\beta-1}\mathrm{d}x - \frac{\alpha^2}{(\alpha+\beta)^2} \\ &= \frac{\mathcal{B}(\alpha+2,\beta)}{\mathcal{B}(\alpha,\beta)} - \frac{\alpha^2}{(\alpha+\beta)^2} \\ &= \frac{\alpha}{\alpha+\beta}\left(\frac{\alpha+1}{\alpha+\beta+1} - \frac{\alpha}{\alpha+\beta}\right) \\ &= \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \end{align*}$$

If $\alpha=\beta$, then the Beta distribution is identical to a uniform distribution, which doesn't have a unique mode. If one of the parameters is $<1$, then the mode is at one of the edges. When both parameters are $>1$, then the mode is well-defined and is within the interior of the distribution. Assuming the parameters are $>1$ we can evaluate the mode as $$\begin{align*} \nabla_x x^{\alpha-1}(1-x)^{\beta-1} &= 0\\ \frac{\alpha-1}{\beta-1} &= \frac{x}{1-x} \\ \alpha-1 &= x(\alpha+\beta-2) \\ \Rightarrow x_{mode} &= \frac{\alpha-1}{\alpha+\beta-2}. \end{align*}$$

[6] (###) Consider a data set of binary variables $D=\{x_1,x_2,\ldots,x_N\}$ with a Bernoulli distribution $\mathrm{Ber}(x_k|\mu)$ as data generating distribution and a Beta prior for $\mu$. Assume that you make $n$ observations with $x=1$ and $N-n$ observations with $x=0$. Now consider a new draw $x_\bullet$. We are interested in computing $p(x_\bullet|D)$. Show that the mean value for $p(x_\bullet|D)$ lies in between the prior mean and Maximum Likelihood estimate.

In the lectures we have seen that $p(x_\bullet =1|D) = \frac{a+n}{a+b+N}$, where $a$ and $b$ are parameters of the Beta prior. The ML estimate is $\frac{n}{N}$ and the prior mean is $\frac{a}{a+b}$. To show that the prediction lies in between ML and prior estimate, we will try to write the prediction as a convex combination of the latter two. That is we want to solve for $\lambda$

$$\begin{align*} (1-\lambda) \frac{n}{N} + \lambda\frac{a}{a+b} &= \frac{a+n}{a+b+N} \\ \lambda &= \frac{1}{1+\frac{N}{a+b}} \end{align*}$$

Since $a,b$ and $N$ are positive, it follows that $0<\lambda <1$. This means the prediction is a convex combination of prior and ML estimates and thus lies in between the two.

[7] Consider a data set $D = \{(x_1,y_1), (x_2,y_2),\dots,(x_N,y_N)\}$ with one-hot encoding for the $K$ discrete classes, i.e., $y_{nk} = 1$ if and only if $y_n \in \mathcal{C}_k$, else $y_{nk} = 0$. Also given are the class-conditional distribution $p(x_n| y_{nk}=1,\theta) = \mathcal{N}(x_n|\mu_k,\Sigma)$ and multinomial prior $p(y_{nk}=1) = \pi_k$. .
(a) Proof that the joint log-likelihood is given by

$$\begin{equation*} \log p(D|\theta) = \sum_{n,k} y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma) + \sum_{n,k} y_{nk} \log \pi_k \end{equation*}$$

$$\begin{align*}

\log p(D|\theta) &= \sum_n \log \prod_k p(x_n,y_{nk}|\theta)^{y_{nk}} \ &= \sum_{n,k} y_{nk} \log p(x_n,y_{nk}|\theta)\ &= \sum_{n,k} y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma) + \sum_{n,k} y_{nk} \log \pi_k \end{align*}$$

(b) Show now that the MLE of the class-conditional mean is given by $$\begin{equation*} \hat \mu_k = \frac{\sum_n y_{nk} x_n}{\sum_n y_{nk}} \end{equation*} $$