#!/usr/bin/env python
# coding: utf-8
# In[1]:
get_ipython().run_cell_magic('capture', '', '%load_ext autoreload\n%autoreload 2\n# %cd ..\nimport sys\nsys.path.append("..")\nimport statnlpbook.util as util\nutil.execute_notebook(\'language_models.ipynb\')\n# import tikzmagic\n%load_ext tikzmagic\nmatplotlib.rcParams[\'figure.figsize\'] = (10.0, 6.0)\n')
#
# $$
# \newcommand{\prob}{p}
# \newcommand{\x}{\mathbf{x}}
# \newcommand{\vocab}{V}
# \newcommand{\params}{\boldsymbol{\theta}}
# \newcommand{\param}{\theta}
# \DeclareMathOperator{\perplexity}{PP}
# \DeclareMathOperator{\argmax}{argmax}
# \newcommand{\train}{\mathcal{D}}
# \newcommand{\counts}[2]{\#_{#1}(#2) }
# $$
# # Language Models
# ## Language Models
#
# Language models calculate the **probability of seeing a sequence of words**.
# For example: how likely is the following sequence?
#
# > We're going to win bigly.
# Is it more likely than this one?
#
# > We're going to win big league.
# ## Use Cases: Machine Translation
#
# > Wir werden haushoch gewinnen
#
# translates to?
# > We will win by a mile
#
# or
#
# > We will win bigly
# ## Use Cases: Speech Recognition
#
# What did he [say](https://www.theguardian.com/us-news/video/2016/may/04/donald-trump-we-are-going-to-win-bigly-believe-me-video)?
#
# > We're going to win bigly
#
# or
#
# > We're going to win big league
# ## Use Cases: Natural Language Generation
#
# https://twitter.com/deepdrumpf
#
# Other applications?
# ## Formally
# Models the probability
#
# $$\prob(w_1,\ldots,w_d)$$
#
# of observing sequences of words \\(w_1,\ldots,w_d\\).
# Without loss of generality:
#
# \begin{align}
# \prob(w_1,\ldots,w_d) &= p(w_1) p(w_2|w_1) p(w_3|w_1, w_2) \ldots \\
# &= \prob(w_1) \prod_{i = 2}^d \prob(w_i|w_1,\ldots,w_{i-1})
# \end{align}
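# As a small illustration of the chain rule, here is a hypothetical sketch (the toy
# conditional distribution below is made up for illustration and is not part of the
# notebook's code):

def toy_cond_prob(word, history):
    # hypothetical p(word | history): every word equally likely in a toy 4-word vocabulary
    return 0.25

def sequence_probability(words, cond_prob):
    # multiply p(w_i | w_1, ..., w_{i-1}) over all positions, as in the chain rule above
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, words[:i])
    return prob

sequence_probability(["we", "will", "win", "bigly"], toy_cond_prob)  # 0.25 ** 4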
# ### Structured Prediction
#
# predict word $y=w_i$
# * conditioned on history $\x=w_1,\ldots,w_{i-1}$.
# ## N-Gram Language Models
#
# It is impossible to estimate a sensible probability for each possible history
#
# $$
# \x=w_1,\ldots,w_{i-1}
# $$
# ### Change **representation**
# Truncate the history to the last $n-1$ words:
#
# $$
# \mathbf{f}(\x)=w_{i-(n-1)},\ldots,w_{i-1}
# $$
#
# $\prob(\text{bigly}|\text{...,blah, blah, blah, we, will, win})
# = \prob(\text{bigly}|\text{we, will, win})$
# ### Unigram LM
#
# Set $n=1$:
# $$
# \prob(w_i|w_1,\ldots,w_{i-1}) = \prob(w_i).
# $$
#
# $\prob(\text{bigly}|\text{we, will, win}) = \prob(\text{bigly})$
# ## Uniform LM
# Same probability for each word in a *vocabulary* \\(\vocab\\):
#
# $$
# \prob(w_i|w_1,\ldots,w_{i-1}) = \frac{1}{|\vocab|}.
# $$
#
# $\prob(\text{big}) = \prob(\text{bigly}) = \frac{1}{|\vocab|}$
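# The `UniformLM` class used below is defined in the executed `language_models.ipynb`;
# a minimal sketch of the interface assumed here (the real implementation may differ):

class UniformLMSketch:
    """Hypothetical stand-in for UniformLM: every word in the vocabulary gets
    probability 1/|V|, and the history is ignored entirely."""
    def __init__(self, vocab):
        self.vocab = vocab
        self.order = 1  # conditions on no previous words
    def probability(self, word, *history):
        return 1.0 / len(self.vocab) if word in self.vocab else 0.0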
# Let us look at a training set and create a uniform LM from it.
# In[2]:
train[:10]
# In[3]:
vocab = set(train)
baseline = UniformLM(vocab)
sum([baseline.probability(w) for w in vocab])
# What about other words? And do the probabilities sum to one?
# ## Sampling
# * Sampling from an LM is easy and instructive
# * Usually, the better the LM, the better the samples
# Sample **incrementally**, one word at a time
# In[4]:
def sample_once(lm, history, words):
    # probability of each candidate word given the current history
    probs = [lm.probability(word, *history) for word in words]
    # draw one word according to this distribution
    return np.random.choice(words, p=probs)
# In[5]:
sample_once(baseline, [], list(baseline.vocab))
# In[6]:
def sample(lm, initial_history, amount_to_sample):
    words = list(lm.vocab)
    result = []
    result += initial_history
    for _ in range(0, amount_to_sample):
        # condition only on the last (order - 1) words
        history = result[-(lm.order - 1):]
        result.append(sample_once(lm, history, words))
    return result
# In[7]:
sample(baseline, [], 10)
# ## Evaluation
# * **Extrinsic**: how much does it improve a downstream task?
# * **Intrinsic**: how well does it model language?
# ## Intrinsic Evaluation
# **Shannon Game**: Predict the next word; you win if your prediction matches the word in the actual corpus (or you gave it high probability)
#
# > Our horrible trade agreements with [???]
#
# Formalised by ...
# ### Perplexity
# Given test sequence \\(w_1,\ldots,w_T\\), perplexity \\(\perplexity\\) is **geometric mean of inverse probabilities**:
#
# \begin{align}
# \perplexity(w_1,\ldots,w_T) &= \sqrt[T]{\frac{1}{\prob(w_1)} \frac{1}{\prob(w_2|w_1)} \ldots} \\
# &= \sqrt[T]{\prod_{i=1}^T \frac{1}{\prob(w_i|w_{i-(n-1)},\ldots,w_{i-1})}}
# \end{align}
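# The `perplexity` helper used below comes from the executed notebook; a rough,
# hypothetical sketch of how it could be computed (in log space, to avoid numerical
# underflow, and assuming the `lm.probability(word, *history)` interface used above):

import math

def perplexity_sketch(lm, data):
    # sum of log p(w_i | history); a single zero-probability word gives infinite perplexity
    log_prob_sum = 0.0
    for i in range(len(data)):
        history = data[max(0, i - (lm.order - 1)):i]
        prob = lm.probability(data[i], *history)
        if prob == 0.0:
            return float('inf')
        log_prob_sum += math.log(prob)
    # geometric mean of inverse probabilities = exp(-average log probability)
    return math.exp(-log_prob_sum / len(data))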
# ### Interpretation
#
# Consider LM where
# * at each position there are exactly **2** words with $\frac{1}{2}$ probability
# * in test sequence, one of these is always the true word
# Then
#
# * $\perplexity(w_1,\ldots,w_T) = \sqrt[T]{2 \cdot 2 \cdot\ldots} = 2$
# * Perplexity $\approx$ average number of choices
# Perplexity of uniform LM on an **unseen** test set?
# In[8]:
perplexity(baseline, test)
# Problem: model assigns **zero probability** to words not in the vocabulary.
# In[9]:
[(w,baseline.probability(w)) for w in test if w not in vocab][:3]
# ## The Long Tail
# Encountering new words is not specific to our corpus:
# * there is a long **tail** of words that appear only a few times
# * each individual tail word has low probability, but the probability of seeing *some* long-tail word is high
#
# Let us plot word frequency ranks (x-axis) against frequency (y-axis)
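# The `ranks` and `sorted_counts` arrays used below are assumed to be computed from the
# training tokens, roughly as follows (hypothetical reconstruction):

import collections
counts = collections.Counter(train)                    # word -> frequency
sorted_counts = sorted(counts.values(), reverse=True)  # frequencies, most frequent first
ranks = range(1, len(sorted_counts) + 1)               # rank 1 = most frequent word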
# In[10]:
plt.xscale('log')
plt.yscale('log')
plt.plot(ranks, sorted_counts)
# In log-space such rank vs frequency graphs are **linear**
#
# * Known as **Zipf's Law**
#
# Let $r_w$ be the rank of a word \\(w\\), and \\(f_w\\) its frequency:
#
# $$
# f_w \propto \frac{1}{r_w}.
# $$
#
# * Also true in [random text](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.8422&rep=rep1&type=pdf)
# ## Out-of-Vocabulary (OOV) Tokens
# There will always be words with zero counts in your training set.
#
# Solutions:
# * Remove unseen words from test set (bad)
# * Move probability mass to unseen words (good, discuss later)
# * Replace unseen words with an out-of-vocabulary token and estimate its probability (see the sketch below)
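# A hypothetical sketch of such a replacement helper (the notebook's `replace_OOVs`,
# used below, may differ in detail):

OOV = '[OOV]'

def replace_OOVs_sketch(vocab, words):
    # map every token that is not in the vocabulary to the OOV token
    return [word if word in vocab else OOV for word in words]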
# ### Inserting OOV Tokens
# In[11]:
replace_OOVs(baseline.vocab, test[:10])
# What happens to perplexity if training set is small?
# ### Estimate `OOV` probability
# What is the probability of seeing a word you haven't seen before?
# Consider the "words"
#
# > AA AA BB BB AA
#
# Going left to right, how often do I see new words?
# Inject `OOV` tokens to mark these "new word events"
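# A hypothetical sketch of this injection (the notebook's `inject_OOVs`, used below, may
# differ in detail): the first occurrence of each word is a "new word event" and is
# replaced by the OOV token.

def inject_OOVs_sketch(words):
    seen = set()
    result = []
    for word in words:
        if word in seen:
            result.append(word)
        else:
            result.append('[OOV]')  # first time we see this word
            seen.add(word)
    return result

inject_OOVs_sketch(["AA", "AA", "BB", "BB", "AA"])  # ['[OOV]', 'AA', '[OOV]', 'BB', 'AA']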
# In[12]:
inject_OOVs(["AA","AA","BB","BB","AA"])
# Now train on replaced data...
# In[13]:
oov_train = inject_OOVs(train)
oov_vocab = set(oov_train)
oov_test = replace_OOVs(oov_vocab, test)
oov_baseline = UniformLM(oov_vocab)
perplexity(oov_baseline,oov_test)
# What does this perplexity correspond to?
# ## Training N-Gram Language Models
# N-gram language models condition on a limited history:
#
# $$
# \prob(w_i|w_1,\ldots,w_{i-1}) = \prob(w_i|w_{i-(n-1)},\ldots,w_{i-1}).
# $$
#
# What are its parameters (continuous values that control its behaviour)?
# One parameter $\param_{w,h}$ for each pair of word $w$ and history $h=w_{i-(n-1)},\ldots,w_{i-1}$:
#
# $$
# \prob_\params(w|h) = \param_{w,h}
# $$
#
# $\prob_\params(\text{bigly}|\text{win}) = \param_{\text{bigly, win}}$
# ### Maximum Likelihood Estimate
#
# Assume training set \\(\train=(w_1,\ldots,w_d)\\)
# ### Maximum Likelihood Estimate
#
# Find \\(\params\\) that maximizes the log-likelihood of \\(\train\\):
#
# $$
# \params^* = \argmax_\params \log p_\params(\train)
# $$
#
# where
#
# $$
# \prob_\params(\train) = \ldots \prob_\params(w_i|\ldots w_{i-1}) \prob_\params(w_{i+1}|\ldots w_{i}) \ldots
# $$
#
# **Structured Prediction**: this is your continuous optimization problem!
# The maximum likelihood estimate (MLE) can be calculated in **[closed form](/notebooks/chapters/mle.ipynb)**:
# $$
# \prob_{\params^*}(w|h) = \param^*_{w,h} = \frac{\counts{\train}{h,w}}{\counts{\train}{h}}
# $$
#
# where
#
# $$
# \counts{D}{e} = \text{Count of } e \text{ in } D
# $$
#
# Many LM variants: different estimation of counts.
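# To make the closed-form MLE concrete, here is a hypothetical count-based n-gram model
# (the notebook's `NGramLM`, used below, is defined in `language_models.ipynb` and may
# differ):

import collections

class CountNGramLMSketch:
    """Hypothetical MLE n-gram model: p(w|h) = count(h, w) / count(h)."""
    def __init__(self, train, order):
        self.order = order
        self.vocab = set(train)
        self.counts = collections.defaultdict(float)  # count(h, w)
        self.norm = collections.defaultdict(float)    # count(h)
        for i in range(order - 1, len(train)):
            history = tuple(train[i - order + 1:i])
            self.counts[(history, train[i])] += 1.0
            self.norm[history] += 1.0
    def probability(self, word, *history):
        history = tuple(history[-(self.order - 1):]) if self.order > 1 else ()
        if self.norm[history] == 0.0:
            return 0.0
        return self.counts[(history, word)] / self.norm[history]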
# ## Training a Unigram Model
# Let us train a unigram model...
# What do you think the most probable words are?
#
# Remember our training set looks like this ...
# In[14]:
oov_train[10000:10010]
# In[15]:
unigram = NGramLM(oov_train,1)
plot_probabilities(unigram)
# sum([unigram.probability(w) for w in unigram.vocab])
# The unigram LM has substantially reduced (and hence better) perplexity:
# In[16]:
perplexity(unigram,oov_test)
# Its samples look (a little) more reasonable:
# In[17]:
sample(unigram, [], 10)
# ## Bigram Model
# We can do better by setting $n=2$
# In[18]:
bigram = NGramLM(oov_train,2)
plot_probabilities(bigram, ("FIND",))  # try other histories, e.g. I, FIND
# Samples should look (slightly) more fluent:
# In[19]:
" ".join(sample(laplace_bigram, ['FIND'], 30)) # try: I, FIND
# How about perplexity?
# In[20]:
perplexity(bigram,oov_test)
# In some contexts the OOV word (and others) has never been seen, hence it gets 0 probability...
# In[21]:
bigram.probability("[OOV]","money")
# ## Smoothing
#
# Maximum likelihood
# * **underestimates** the true probability of some words
# * **overestimates** the probabilities of others
#
# Solution: _smooth_ the probabilities and **move mass** from seen to unseen events.
# ### Laplace Smoothing
#
# Add **pseudo counts** to each event in the dataset
#
# $$
# \param^{\alpha}_{w,h} = \frac{\counts{\train}{h,w} + \alpha}{\counts{\train}{h} + \alpha \lvert V \rvert }
# $$
#
# Bayesian view: *maximum a posteriori* estimate under a Dirichlet prior on the parameters.
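# A small worked example of the add-alpha formula (the counts below are hypothetical):

def laplace_probability(count_hw, count_h, vocab_size, alpha):
    # add pseudo count alpha to the numerator, alpha * |V| to the denominator
    return (count_hw + alpha) / (count_h + alpha * vocab_size)

# even an unseen event now gets non-zero probability:
laplace_probability(count_hw=0, count_h=3, vocab_size=10, alpha=1.0)  # 1/13, about 0.077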
# In[22]:
laplace_bigram = LaplaceLM(bigram, 0.1)
laplace_bigram.probability("[OOV]","money")
# Perplexity should look better now:
# In[23]:
perplexity(LaplaceLM(bigram, 0.001),oov_test)
# ### Example
# Consider three events:
# In[24]:
c = ["word", "train count", "MLE", "Laplace", "Same Denominator"]
r1 = ["smally", "0", "0/3", "1/6", "0.5/3"]
r2 = ["bigly", "1", "1/3", "2/6", "1/3"]
r3 = ["tremendously", "2", "2/3", "3/6", "1.5/3"]
util.Table([r1,r2,r3], column_names=c)
# How is mass moved for Laplace Smoothing?
# Events with higher counts get penalised more!
#
# Is this consistent with how counts behave on an unseen test?
# In[25]:
util.Table(frame, column_names = ["Train Count", "Test Count", "Laplace Count"], number_format="{0:.4f}")
# The penalty is closer to being constant than to increasing with the count:
# * Test counts are usually between 0.6 and 1.4 smaller than the corresponding train counts
# * In larger datasets this difference can be a constant!
# So the "real" re-allocation looks more like subtracting a roughly constant amount from each seen count.
# ### Interpolation
# * Laplace Smoothing assigns mass **uniformly** to the words that haven't been seen in a context.
# In[26]:
laplace_bigram.probability('rhyme','man'), \
laplace_bigram.probability('of','man')
# Not all unseen words (in a context) are equal
# With **interpolation** we can do better:
# * Give more mass to words that are likely under the $(n-1)$-gram model.
# * For example, use $\prob(\text{of})$ to help estimate $\prob(\text{of} | \text{man})$.
# * Combine the $n$-gram model \\(p'\\) with a back-off $(n-1)$-gram model \\(p''\\):
#
# $$
# \prob_{\alpha}(w_i|w_{i-n+1},\ldots,w_{i-1}) = \alpha \cdot \prob'(w_i|w_{i-n+1},\ldots,w_{i-1}) + \\ (1 - \alpha) \cdot \prob''(w_i|w_{i-n+2},\ldots,w_{i-1})
# $$
#
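# A hypothetical sketch of such an interpolated model (the notebook's `InterpolatedLM`,
# used below, may differ):

class InterpolatedLMSketch:
    """Mix a main LM with a lower-order back-off LM: alpha * p' + (1 - alpha) * p''."""
    def __init__(self, main, backoff, alpha):
        self.main, self.backoff, self.alpha = main, backoff, alpha
        self.vocab = main.vocab
        self.order = main.order
    def probability(self, word, *history):
        return (self.alpha * self.main.probability(word, *history) +
                (1.0 - self.alpha) * self.backoff.probability(word, *history))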
# In[27]:
interpolated = InterpolatedLM(bigram,unigram,0.01)
interpolated.probability('rhyme','man'), \
interpolated.probability('of','man')
# Can we find a good $\alpha$ parameter? Tune on some **development set**!
# In[28]:
alphas = np.arange(0,1.1,0.1)
perplexities = [perplexity(InterpolatedLM(bigram,unigram,alpha),oov_test)
for alpha in alphas]
plt.plot(alphas,perplexities)
# ### Backoff
# * When we have counts for an event, trust these counts and not the simpler model
# * use $\prob(\text{bigly}|\text{win})$ if you have seen $(\text{win, bigly})$, not $\prob(\text{bigly})$
# * **back-off** only when no counts for a given event are available.
# ### Stupid Backoff
# Let \\(w\\) be a word and \\(h_{m}\\) a history of length \\(m\\):
#
# $$
# \prob_{\mbox{Stupid}}(w|h_{m}) =
# \begin{cases}
# \frac{\counts{\train}{h_{m},w}}{\counts{\train}{h_{m}}} & \mbox{if }\counts{\train}{h_{m},w} > 0 \\\\
# \prob_{\mbox{Stupid}}(w|h_{m-1}) & \mbox{otherwise}
# \end{cases}
# $$
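# A hypothetical sketch (the notebook's `StupidBackoff` class may differ; here the third
# constructor argument is assumed to act as a back-off weight):

class StupidBackoffSketch:
    """Use the main model's estimate when it is non-zero; otherwise fall back to the
    lower-order model scaled by a fixed weight. The result is a score, not a normalized
    probability distribution."""
    def __init__(self, main, backoff, weight):
        self.main, self.backoff, self.weight = main, backoff, weight
        self.vocab = main.vocab
        self.order = main.order
    def probability(self, word, *history):
        main_prob = self.main.probability(word, *history)
        if main_prob > 0.0:
            return main_prob
        return self.weight * self.backoff.probability(word, *history)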
# What is the problem with this model?
# In[29]:
stupid = StupidBackoff(bigram, unigram, 0.1)
sum([stupid.probability(word, 'the') for word in stupid.vocab])
# ### Absolute Discounting
# Recall that in the test data, a roughly constant amount of probability mass is taken away from each non-zero count event. Can this be captured in a smoothing algorithm?
# Yes: subtract (tunable) constant $d$ from each non-zero probability:
#
# $$
# \prob_{\mbox{Absolute}}(w|h_{m}) =
# \begin{cases}
# \frac{\counts{\train}{h_{m},w}-d}{\counts{\train}{h_{m}}} & \mbox{if }\counts{\train}{h_{m},w} > 0 \\\\
# \alpha(h_{m-1})\cdot\prob_{\mbox{Absolute}}(w|h_{m-1}) & \mbox{otherwise}
# \end{cases}
# $$
#
# $\alpha(h_{m-1})$ is a normalizer
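# A small worked example with hypothetical counts for a single history h, showing how the
# discount frees up mass that $\alpha(h_{m-1})$ then redistributes over unseen words:

counts_after_h = {"bigly": 2, "big": 1}  # count(h, w) for the seen continuations
total = sum(counts_after_h.values())     # count(h) = 3
d = 0.5                                  # the discount

discounted = {w: (c - d) / total for w, c in counts_after_h.items()}
reserved_mass = d * len(counts_after_h) / total  # mass freed for unseen continuations
discounted, reserved_mass  # ({'bigly': 0.5, 'big': 0.1667}, 0.3333) -- together they sum to 1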
# ### Unigram Backoff
#
# Assume, for example:
# * *Mos Def* is a rapper name that appears often in the data
# * *glasses* appears slightly less often
# * neither *Def* nor *glasses* has been seen in the context of the word *reading*
# Then the final-backoff unigram model might assign a higher probability to
#
# > I can't see without my reading Def
#
# than
#
# > I can't see without my reading glasses
#
# because $\prob(\text{Def}) > \prob(\text{glasses})$
# But *Def* never follows anything but *Mos*, and we can determine this by looking at the training data!
# ### Kneser-Ney Smoothing
#
# Absolute Discounting, but use as the final back-off probability the probability that a word appears after (any) word in the training set:
#
# $$
# \prob_{\mbox{KN}}(w) = \frac{\left|\{w_{-1}:\counts{\train}{w_{-1},w} > 0\} \right|}
# {\sum_{w'}\left|\{w_{-1}:\counts{\train}{w_{-1},w'} > 0\} \right|}
# $$
#
# This is the *continuation probability*
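# A hypothetical sketch of the continuation probability on a toy corpus: count the
# distinct left contexts of each word and normalize.

import collections

def continuation_probability_sketch(data, word):
    left_contexts = collections.defaultdict(set)  # word -> set of words that precede it
    for prev, cur in zip(data, data[1:]):
        left_contexts[cur].add(prev)
    total = sum(len(ctx) for ctx in left_contexts.values())
    return len(left_contexts[word]) / total

# 'Def' follows only 'Mos', so its continuation probability is low:
toy = ["Mos", "Def", "raps", "my", "glasses", "broke", "my", "reading", "glasses"]
continuation_probability_sketch(toy, "Def"), continuation_probability_sketch(toy, "glasses")  # (0.125, 0.25)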
# ## Summary
#
# * LMs model probability of sequences of words
# * Defined in terms of "next-word" distributions conditioned on history
# * N-gram models truncate history representation
# * Often trained by maximizing log-likelihood of training data and ...
# * smoothing to deal with sparsity
#
# ## Background Reading
#
# * Jurafsky & Martin, Speech and Language Processing: Chapter 4, N-Grams.
# * Bill MacCartney, Stanford NLP Lunch Tutorial: [Smoothing](http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf)