# Discrete Data and the Multinomial Distribution¶

### Preliminaries¶

• Goal
• Simple Bayesian and maximum likelihood-based density estimation for discretely valued data sets
• Materials
• Mandatory
• These lecture notes
• Optional
• Bishop pp. 67-70, 74-76, 93-94

### Discrete Data: the 1-of-K Coding Scheme¶

• Consider a coin-tossing experiment with outcomes $x \in\{0,1\}$ (tail and head) and let $0\leq \mu \leq 1$ represent the probability of heads. This model can be written as a Bernoulli distribution: $$p(x|\mu) = \mu^{x}(1-\mu)^{1-x}$$
• Note that the variable $x$ acts as a (binary) selector for the tail or head probabilities. Think of this as an 'if'-statement in programming.
• Now consider a $K$-sided coin (e.g., a six-faced die (pl.: dice)). How should we encode outcomes?
• Option 1: $x \in \{1,2,\ldots,K\}$.

• E.g., for $K=6$, if the die lands on the 3rd face $\,\Rightarrow x=3$.
• Option 2: $x=(x_1,\ldots,x_K)^T$ with binary selection variables $$x_k = \begin{cases} 1 & \text{if the die landed on the $k$th face}\\ 0 & \text{otherwise} \end{cases}$$

• E.g., for $K=6$, if the die lands on the 3rd face $\,\Rightarrow x=(0,0,1,0,0,0)^T$.
• This coding scheme is called a 1-of-K or one-hot coding scheme.
• It turns out that the one-hot coding scheme is mathematically more convenient!
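The one-hot coding scheme is straightforward to implement. A minimal sketch in Python (the function name `one_hot` is our own, not from the notes):

```python
import numpy as np

def one_hot(face, K):
    """Return the 1-of-K (one-hot) encoding of a die outcome.

    face : outcome in {1, ..., K}
    K    : number of faces
    """
    x = np.zeros(K, dtype=int)
    x[face - 1] = 1
    return x

# A six-sided die landing on the 3rd face:
print(one_hot(3, 6))  # [0 0 1 0 0 0]
```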

### The Categorical Distribution¶

• Consider a $K$-sided die. We use a one-hot coding scheme. Assume the probabilities $p(x_k=1) = \mu_k$ with $\sum_k \mu_k = 1$. The data generating distribution is then (note the similarity to the Bernoulli distribution)
$$p(x|\mu) = \mu_1^{x_1} \mu_2^{x_2} \cdots \mu_K^{x_K}=\prod_{k=1}^K \mu_k^{x_k} \tag{B-2.26}$$
• This generalized Bernoulli distribution is called the categorical distribution (or sometimes the 'multi-noulli' distribution).
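Equation B-2.26 translates directly into code, since $x$ selects exactly one factor $\mu_k$ from the product. A small sketch (variable names are illustrative):

```python
import numpy as np

def categorical_pmf(x, mu):
    """Evaluate p(x|mu) = prod_k mu_k^{x_k} for a one-hot vector x (B-2.26)."""
    return float(np.prod(mu ** x))

mu = np.array([0.1, 0.1, 0.5, 0.1, 0.1, 0.1])  # a loaded six-sided die
x = np.array([0, 0, 1, 0, 0, 0])               # the die landed on the 3rd face
print(categorical_pmf(x, mu))  # 0.5, i.e. the selected mu_3
```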

### Bayesian Density Estimation for a Loaded Die¶

• Now let's proceed with Bayesian density estimation, i.e., let's learn the parameters for model $p(x|\theta)$ for an observed data set $D=\{x_1,\ldots,x_N\}$ of $N$ IID rolls of a $K$-sided die, with
$$x_{nk} = \begin{cases} 1 & \text{if the $n$th throw landed on the $k$th face}\\ 0 & \text{otherwise} \end{cases}$$

##### Model specification¶
• The data generating distribution is $$p(D|\mu) = \prod_n \prod_k \mu_k^{x_{nk}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k} \tag{B-2.29}$$ where $m_k= \sum_n x_{nk}$ is the total number of throws that landed on the $k$th face. Note that $\sum_k m_k = N$.
• This distribution depends on the observations only through the quantities $\{m_k\}$.
• We need a prior for the parameters $\mu = (\mu_1,\mu_2,\ldots,\mu_K)$. In the binary coin toss example, we used a beta distribution that was conjugate with the binomial and forced us to choose prior pseudo-counts.

• The generalization of the beta prior to the $K$ parameters $\{\mu_k\}$ is the Dirichlet distribution: $$p(\mu|\alpha) = \mathrm{Dir}(\mu|\alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\Gamma(\alpha_1)\cdots \Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k-1}$$ where $\Gamma(\cdot)$ is the Gamma function. The Gamma function can be interpreted as a generalization of the factorial function (e.g., $3! = 3\cdot 2 \cdot 1$) to the real numbers $\mathbb{R}$.

• As before for the Beta distribution in the coin toss experiment, you can interpret $\alpha_k$ as the prior number of (pseudo-)observations that the die landed on the $k$th face.
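To get a feel for the Dirichlet prior, we can draw samples of $\mu$ from it. A sketch with hypothetical pseudo-counts (here we pretend to have seen each face twice before observing any data):

```python
import numpy as np

# Hypothetical prior pseudo-counts: two pseudo-observations per face.
alpha = np.array([2., 2., 2., 2., 2., 2.])

rng = np.random.default_rng(0)
mu_samples = rng.dirichlet(alpha, size=5)  # 5 draws of mu from Dir(mu|alpha)

# Every draw lies on the simplex: mu_k >= 0 and sum_k mu_k = 1.
print(np.allclose(mu_samples.sum(axis=1), 1.0))  # True
```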

##### Inference for $\{\mu_k\}$¶
• The posterior for $\{\mu_k\}$ can be obtained through Bayes rule:
\begin{align*} p(\mu|D,\alpha) &\propto p(D|\mu) \cdot p(\mu|\alpha) \\ &\propto \prod_k \mu_k^{m_k} \cdot \prod_k \mu_k^{\alpha_k-1} \\ &= \prod_k \mu_k^{\alpha_k + m_k -1}\\ &\propto \mathrm{Dir}\left(\mu\,|\,\alpha + m \right) \tag{B-2.41} \\ &= \frac{\Gamma\left(\sum_k (\alpha_k + m_k) \right)}{\Gamma(\alpha_1+m_1) \Gamma(\alpha_2+m_2) \cdots \Gamma(\alpha_K + m_K)} \prod_{k=1}^K \mu_k^{\alpha_k + m_k -1} \end{align*}

where $m = (m_1,m_2,\ldots,m_K)^T$.

• Again, we recognize the $\alpha_k$'s as prior pseudo-counts, and the Dirichlet distribution turns out to be a conjugate prior to the categorical/multinomial:
\begin{align*} \underbrace{\text{Dirichlet}}_{\text{posterior}} &\propto \underbrace{\text{categorical}}_{\text{likelihood}} \cdot \underbrace{\text{Dirichlet}}_{\text{prior}} \end{align*}
• This is actually a generalization of the conjugate relation that we found for the binary coin toss:
\begin{align*} \underbrace{\text{Beta}}_{\text{posterior}} &\propto \underbrace{\text{binomial}}_{\text{likelihood}} \cdot \underbrace{\text{Beta}}_{\text{prior}} \end{align*}
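Thanks to conjugacy, the posterior update in B-2.41 amounts to adding observed counts to the prior pseudo-counts. A sketch with hypothetical data:

```python
import numpy as np

# Hypothetical observed counts m_k from N = 20 rolls of a six-sided die.
m = np.array([2, 3, 8, 3, 2, 2])
alpha = np.ones(6)  # uniform Dirichlet prior, alpha_k = 1

# Conjugacy (B-2.41): the posterior is again a Dirichlet, with counts added.
alpha_post = alpha + m
print(alpha_post)  # [3. 4. 9. 4. 3. 3.]
```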
##### Prediction of next toss for the loaded die¶
• Let's apply what we have learned about the loaded die to compute the probability that we throw the $k$th face at the next toss.
\begin{align*} p(x_{\bullet,k}=1|D) &= \int p(x_{\bullet,k}=1|\mu)\,p(\mu|D) \,\mathrm{d}\mu \\ &= \int \mu_k \times \mathrm{Dir}(\mu|\,\alpha+m) \,\mathrm{d}\mu \\ &= \mathrm{E}\left[ \mu_k \right] \\ &= \frac{m_k + \alpha_k }{ N+ \sum_k \alpha_k} \end{align*}
• In the above derivation, we noticed that the data generating distribution for $N$ die tosses $D=\{x_1,\ldots,x_N\}$ only depends on the data frequencies $m_k$: $$p(D|\mu) = \prod_n \underbrace{\prod_k \mu_k^{x_{nk}}}_{\text{categorical dist.}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k} \tag{B-2.29}$$
• A related distribution is the distribution over data frequency observations $D_m=\{m_1,\ldots,m_K\}$, which is called the multinomial distribution, $$p(D_m|\mu) =\frac{N!}{m_1! m_2!\ldots m_K!} \,\prod_k \mu_k^{m_k}\,.$$
• When used as a likelihood function for $\mu$, it makes no difference whether you use $p(D|\mu)$ or $p(D_m|\mu)$. Why?
• Verify for yourself that (Exercise):
• the categorical distribution is a special case of the multinomial for $N=1$.
• the Bernoulli is a special case of the categorical distribution for $K=2$.
• the binomial is a special case of the multinomial for $K=2$.
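The posterior predictive formula derived above, $p(x_{\bullet,k}=1|D) = (m_k + \alpha_k)/(N + \sum_k \alpha_k)$, can be computed directly. The counts and prior below are hypothetical:

```python
import numpy as np

def predict_next(m, alpha):
    """Posterior predictive p(x_k = 1 | D) = (m_k + alpha_k) / (N + sum_k alpha_k)."""
    return (m + alpha) / (m.sum() + alpha.sum())

m = np.array([2, 3, 8, 3, 2, 2])   # hypothetical counts from N = 20 rolls
alpha = np.ones(6)                 # uniform prior pseudo-counts
p = predict_next(m, alpha)
print(p)        # e.g. p_3 = (8 + 1) / (20 + 6) = 9/26
print(p.sum())  # the predictive probabilities sum to 1
```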

### Maximum Likelihood Estimation for the Multinomial¶

##### Maximum likelihood as a special case of Bayesian estimation¶
• We can get the maximum likelihood estimate $\hat{\mu}_k$ for $\mu_k$ based on $N$ throws of a $K$-sided die from the Bayesian framework by using a uniform prior for $\mu$ and taking the mode of the posterior for $\mu$: \begin{align*} \hat{\mu}_k &= \arg\max_{\mu_k} p(D|\mu) \\ &= \arg\max_{\mu_k} p(D|\mu)\cdot \mathrm{Uniform}(\mu) \\ &= \arg\max_{\mu_k} p(D|\mu) \cdot \left.\mathrm{Dir}(\mu|\alpha)\right|_{\alpha=(1,1,\ldots,1)} \\ &= \arg\max_{\mu_k} \left.p(\mu|D,\alpha)\right|_{\alpha=(1,1,\ldots,1)} \\ &= \arg\max_{\mu_k} \left.\mathrm{Dir}\left( \mu | m + \alpha \right)\right|_{\alpha=(1,1,\ldots,1)} \\ &= \frac{m_k}{\sum_k m_k} = \frac{m_k}{N} \end{align*} where we used the fact that the mode of the Dirichlet distribution $\mathrm{Dir}(\mu|\alpha)$ is obtained at $\mu_k = (\alpha_k-1)/(\sum_k\alpha_k - K)$.
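We can check numerically that the mode of the posterior under a uniform prior equals the sample proportion $m_k/N$. A sketch with hypothetical counts:

```python
import numpy as np

m = np.array([2, 3, 8, 3, 2, 2])  # hypothetical counts, N = 20
alpha = np.ones(6)                # uniform prior, alpha_k = 1

# Mode of the posterior Dir(mu | alpha + m): (alpha_k + m_k - 1) / (sum - K)
a_post = alpha + m
mode = (a_post - 1) / (a_post.sum() - len(a_post))

print(np.allclose(mode, m / m.sum()))  # True: MAP under a uniform prior is m_k / N
```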
##### Maximum likelihood estimation by optimizing a constrained log-likelihood¶
• Of course, we shouldn't have to go through the full Bayesian framework to get the maximum likelihood estimate. Alternatively, we can find the maximum of the likelihood directly.

• The log-likelihood for the multinomial distribution is given by

\begin{align*} \mathrm{L}(\mu) &\triangleq \log p(D_m|\mu) \propto \log \prod_k \mu_k^{m_k} = \sum_k m_k \log \mu_k \end{align*}

• When doing ML estimation, we must obey the constraint $\sum_k \mu_k = 1$, which can be accomplished by a Lagrange multiplier (see Bishop App.E). The augmented log-likelihood with Lagrange multiplier is then
$$\mathrm{L}^\prime(\mu) = \sum_k m_k \log \mu_k + \lambda \cdot \left(1 - \sum_k \mu_k \right)$$
• Setting the derivative to zero yields the sample proportion for $\mu_k$: $$\begin{equation*} \nabla_{\mu_k} \mathrm{L}^\prime = \frac{m_k } {\hat\mu_k } - \lambda \overset{!}{=} 0 \; \Rightarrow \; \hat\mu_k = \frac{m_k }{\lambda} = \frac{m_k}{N} \end{equation*}$$ where we get $\lambda = N$ from the constraint $$\begin{equation*} \sum_k \hat \mu_k = \sum_k \frac{m_k} {\lambda} = \frac{N}{\lambda} \overset{!}{=} 1 \end{equation*}$$
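As a sanity check, we can verify numerically that the sample proportion maximizes the log-likelihood $\sum_k m_k \log \mu_k$ over the simplex, by comparing it against randomly drawn candidates (a sketch with hypothetical counts):

```python
import numpy as np

def log_lik(mu, m):
    """Multinomial log-likelihood (up to a constant): sum_k m_k log mu_k."""
    return float(np.sum(m * np.log(mu)))

m = np.array([2, 3, 8, 3, 2, 2])  # hypothetical counts, N = 20
mu_hat = m / m.sum()              # sample proportion m_k / N

# Perturbed points on the simplex should never beat the ML estimate.
rng = np.random.default_rng(1)
for _ in range(100):
    mu = rng.dirichlet(np.ones(6))
    assert log_lik(mu, m) <= log_lik(mu_hat, m)
print("mu_hat maximizes the log-likelihood among the sampled points")
```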

### Recap Maximum Likelihood Estimation for Gaussian and Multinomial Distributions¶

Given $N$ IID observations $D=\{x_1,\dotsc,x_N\}$.

• For a multivariate Gaussian model $p(x_n|\theta) = \mathcal{N}(x_n|\mu,\Sigma)$, we obtain ML estimates
\begin{align} \hat \mu &= \frac{1}{N} \sum_n x_n \tag{sample mean} \\ \hat \Sigma &= \frac{1}{N} \sum_n (x_n-\hat\mu)(x_n - \hat \mu)^T \tag{sample covariance} \end{align}
• For discrete outcomes modeled by a 1-of-K categorical distribution we find
\begin{align} \hat\mu_k = \frac{1}{N} \sum_n x_{nk} \quad \left(= \frac{m_k}{N} \right) \tag{sample proportion} \end{align}
• Note the similarity for the means between discrete and continuous data.
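The recap can be sketched in a few lines of NumPy; the data below are synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

# Gaussian case: ML estimates are the sample mean and sample covariance (1/N).
X = rng.normal(loc=[1.0, -1.0], scale=1.0, size=(1000, 2))
mu_hat = X.mean(axis=0)                     # sample mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)  # sample covariance (divide by N, not N-1)

# Discrete case: the same averaging recipe on one-hot data gives m_k / N.
X_onehot = np.eye(6)[rng.integers(0, 6, size=1000)]  # 1000 one-hot die rolls
mu_k_hat = X_onehot.mean(axis=0)                     # sample proportion
```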