Discrete Data and the Multinomial Distribution


  • Goal
    • Simple Bayesian and maximum likelihood-based density estimation for discretely valued data sets
  • Materials
    • Mandatory
      • These lecture notes
    • Optional
      • Bishop pp. 67-70, 74-76, 93-94

Discrete Data: the 1-of-K Coding Scheme

  • Consider a coin-tossing experiment with outcomes $x \in\{0,1\}$ (tail and head) and let $0\leq \mu \leq 1$ represent the probability of heads. This model can written as a Bernoulli distribution: $$ p(x|\mu) = \mu^{x}(1-\mu)^{1-x} $$
    • Note that the variable $x$ acts as a (binary) selector for the tail or head probabilities. Think of this as an 'if'-statement in programming.
  • Now consider a $K$-sided coin (e.g., a six-faced die (pl.: dice)). How should we encode outcomes?
  • Option 1: $x \in \{1,2,\ldots,K\}$.

    • E.g., for $K=6$, if the die lands on the 3rd face $\,\Rightarrow x=3$.
  • Option 2: $x=(x_1,\ldots,x_K)^T$ with binary selection variables $$ x_k = \begin{cases} 1 & \text{if die landed on $k$th face}\\ 0 & \text{otherwise} \end{cases} $$

    • E.g., for $K=6$, if the die lands on the 3rd face $\,\Rightarrow x=(0,0,1,0,0,0)^T$.
    • This coding scheme is called a 1-of-K or one-hot coding scheme.
  • It turns out that the one-hot coding scheme is mathematically more convenient!

The Categorical Distribution

  • Consider a $K$-sided die. We use a one-hot coding scheme. Assume the probabilities $p(x_k=1) = \mu_k$ with $\sum_k \mu_k = 1$. The data generating distribution is then (note the similarity to the Bernoulli distribution)
$$ p(x|\mu) = \mu_1^{x_1} \mu_2^{x_2} \cdots \mu_K^{x_K}=\prod_{k=1}^K \mu_k^{x_k} \tag{B-2.26} $$
  • This generalized Bernoulli distribution is called the categorical distribution (or sometimes the 'multi-noulli' distribution).

Bayesian Density Estimation for a Loaded Die

  • Now let's proceed with Bayesian density estimation, i.e., let's learn the parameters for model $p(x|\theta)$ for an observed data set $D=\{x_1,\ldots,x_N\}$ of $N$ IID rolls of a $K$-sided die, with
$$ x_{nk} = \begin{cases} 1 & \text{if the $n$th throw landed on $k$th face}\\ 0 & \text{otherwise} \end{cases} $$

Model specification
  • The data generating PDF is $$ p(D|\mu) = \prod_n \prod_k \mu_k^{x_{nk}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k} \tag{B-2.29} $$ where $m_k= \sum_n x_{nk}$ is the total number of occurrences that we 'threw' $k$ eyes. Note that $\sum_k m_k = N$.
    • This distribution depends on the observations only through the quantities $\{m_k\}$.
  • We need a prior for the parameters $\mu = (\mu_1,\mu_2,\ldots,\mu_K)$. In the binary coin toss example, we used a beta distribution that was conjugate with the binomial and forced us to choose prior pseudo-counts.

  • The generalization of the beta prior to the $K$ parameters $\{\mu_k\}$ is the Dirichlet distribution: $$ p(\mu|\alpha) = \mathrm{Dir}(\mu|\alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\Gamma(\alpha_1)\cdots \Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k-1} $$ where $\Gamma(\dot)$ is the Gamma function. The Gamma function can be interpreted as a generalization of the factorial function (e.g., $3! = 3\cdot 2 \cdot 1$) to the real ($\mathbb{R}$) numbers.

  • As before for the Beta distribution in the coin toss experiment, you can interpret $\alpha_k$ as the prior number of (pseudo-)observations that the die landed on the $k$th face.

Inference for $\{\mu_k\}$
  • The posterior for $\{\mu_k\}$ can be obtained through Bayes rule:
$$\begin{align*} p(\mu|D,\alpha) &\propto p(D|\mu) \cdot p(\mu|\alpha) \\ &\propto \prod_k \mu_k^{m_k} \cdot \prod_k \mu_k^{\alpha_k-1} \\ &= \prod_k \mu_k^{\alpha_k + m_k -1}\\ &\propto \mathrm{Dir}\left(\mu\,|\,\alpha + m \right) \tag{B-2.41} \\ &= \frac{\Gamma\left(\sum_k (\alpha_k + m_k) \right)}{\Gamma(\alpha_1+m_1) \Gamma(\alpha_2+m_2) \cdots \Gamma(\alpha_K + m_K)} \prod_{k=1}^K \mu_k^{\alpha_k + m_k -1} \end{align*}$$

where $m = (m_1,m_2,\ldots,m_K)^T$.

  • Again, we recognize the $\alpha_k$'s as prior pseudo-counts and the Dirichlet distribution shows to be a conjugate prior to the categorical/multinomial:
$$\begin{align*} \underbrace{\text{Dirichlet}}_{\text{posterior}} &\propto \underbrace{\text{categorical}}_{\text{likelihood}} \cdot \underbrace{\text{Dirichlet}}_{\text{prior}} \end{align*}$$
  • This is actually a generalization of the conjugate relation that we found for the binary coin toss:
$$\begin{align*} \underbrace{\text{Beta}}_{\text{posterior}} &\propto \underbrace{\text{binomial}}_{\text{likelihood}} \cdot \underbrace{\text{Beta}}_{\text{prior}} \end{align*}$$
Prediction of next toss for the loaded die
  • Let's apply what we have learned about the loaded die to compute the probability that we throw the $k$th face at the next toss.
$$\begin{align*} p(x_{\bullet,k}=1|D) &= \int p(x_{\bullet,k}=1|\mu)\,p(\mu|D) \,\mathrm{d}\mu \\ &= \int_0^1 \mu_k \times \mathcal{Dir}(\mu|\,\alpha+m) \,\mathrm{d}\mu \\ &= \mathrm{E}\left[ \mu_k \right] \\ &= \frac{m_k + \alpha_k }{ N+ \sum_k \alpha_k} \end{align*}$$
  • In the above derivation, we noticed that the data generating distribution for $N$ die tosses $D=\{x_1,\ldots,x_N\}$ only depends on the data frequencies $m_k$: $$ p(D|\mu) = \prod_n \underbrace{\prod_k \mu_k^{x_{nk}}}_{\text{categorical dist.}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k} \tag{B-2.29} $$
  • A related distribution is the distribution over data frequency observations $D_m=\{m_1,\ldots,m_K\}$, which is called the multinomial distribution, $$ p(D_m|\mu) =\frac{N!}{m_1! m_2!\ldots m_K!} \,\prod_k \mu_k^{m_k}\,. $$
  • When used as a likelihood function for $\mu$, it makes no difference whether you use $p(D|\mu)$ or $p(D_m|\mu)$. Why?
  • Verify for yourself that (Exercise):
    • the categorial distribution is a special case of the multinomial for $N=1$.
    • the Bernoulli is a special case of the categorial distribution for $K=2$.
    • the binomial is a special case of the multinomial for $K=2$.

Maximum Likelihood Estimation for the Multinomial

Maximum likelihood as a special case of Bayesian estimation
  • We can get the maximum likelihood estimate $\hat{\mu}_k$for $\mu_k$ based on $N$ throws of a $K$-sided die from the Bayesian framework by using a uniform prior for $\mu$ and taking the mode of the posterior for $\mu$: $$\begin{align*} \hat{\mu}_k &= \arg\max_{\mu_k} p(D|\mu) \\ &= \arg\max_{\mu_k} p(D|\mu)\cdot \mathrm{Uniform}(\mu) \\ &= \arg\max_{\mu_k} p(D|\mu) \cdot \left.\mathrm{Dir}(\mu|\alpha)\right|_{\alpha=(1,1,\ldots,1)} \\ &= \arg\max_{\mu_k} \left.p(\mu|D,\alpha)\right|_{\alpha=(1,1,\ldots,1)} \\ &= \arg\max_{\mu_k} \left.\mathrm{Dir}\left( \mu | m + \alpha \right)\right|_{\alpha=(1,1,\ldots,1)} \\ &= \frac{m_k}{\sum_k m_k} = \frac{m_k}{N} \end{align*}$$ where we used the fact that the maximum of the Dirichlet distribution $\mathrm{Dir}(\{\alpha_1,\ldots,\alpha_K\})$ is obtained at $(\alpha_k-1)/(\sum_k\alpha_k - K)$.
Maximum likelihood estimation by optimizing a constrained log-likelihood
  • Of course, we shouldn't have to go through the full Bayesian framework to get the maximum likelihood estimate. Alternatively, we can find the maximum of the likelihood directly.

  • The log-likelihood for the multinomial distribution is given by

$$\begin{align*} \mathrm{L}(\mu) &\triangleq \log p(D_m|\mu) \propto \log \prod_k \mu_k^{m_k} = \sum_k m_k \log \mu_k \end{align*}$$

  • When doing ML estimation, we must obey the constraint $\sum_k \mu_k = 1$, which can be accomplished by a Lagrange multiplier (see Bishop App.E). The augmented log-likelihood with Lagrange multiplier is then
$$ \mathrm{L}^\prime(\mu) = \sum_k m_k \log \mu_k + \lambda \cdot \left(1 - \sum_k \mu_k \right) $$
  • Set derivative to zero yields the sample proportion for $\mu_k$ $$\begin{equation*} \nabla_{\mu_k} \mathrm{L}^\prime = \frac{m_k } {\hat\mu_k } - \lambda \overset{!}{=} 0 \; \Rightarrow \; \hat\mu_k = \frac{m_k }{N} \end{equation*}$$ where we get $\lambda$ from the constraint $$\begin{equation*} \sum_k \hat \mu_k = \sum_k \frac{m_k} {\lambda} = \frac{N}{\lambda} \overset{!}{=} 1 \end{equation*}$$

Recap Maximum Likelihood Estimation for Gaussian and Multinomial Distributions

Given $N$ IID observations $D=\{x_1,\dotsc,x_N\}$.

  • For a multivariate Gaussian model $p(x_n|\theta) = \mathcal{N}(x_n|\mu,\Sigma)$, we obtain ML estimates
$$\begin{align} \hat \mu &= \frac{1}{N} \sum_n x_n \tag{sample mean} \\ \hat \Sigma &= \frac{1}{N} \sum_n (x_n-\hat\mu)(x_n - \hat \mu)^T \tag{sample variance} \end{align}$$
  • For discrete outcomes modeled by a 1-of-K categorical distribution we find
$$\begin{align} \hat\mu_k = \frac{1}{N} \sum_n x_{nk} \quad \left(= \frac{m_k}{N} \right) \tag{sample proportion} \end{align}$$
  • Note the similarity for the means between discrete and continuous data.
In [2]:
open("../../styles/aipstyle.html") do f
    display("text/html", read(f, String))
In [ ]: