%%capture
%load_ext autoreload
%autoreload 2
# %cd ..
import sys
sys.path.append("..")
import statnlpbook.util as util
util.execute_notebook('language_models.ipynb')
# import tikzmagic
%load_ext tikzmagic
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)
A language model calculates the probability of seeing a sequence of words.
For example: how likely is the following sequence?
We're going to win bigly.
Is it more likely than this one?
We're going to win big league.
And which of these two is more likely?
We will win by a mile
or
We will win bigly
A language model defines the probability
$$\prob(w_1,\ldots,w_d)$$
of observing a sequence of words $w_1,\ldots,w_d$.
Without loss of generality:
\begin{align} \prob(w_1,\ldots,w_d) &= \prob(w_1) \prob(w_2|w_1) \prob(w_3|w_1, w_2) \ldots \\ &= \prob(w_1) \prod_{i = 2}^d \prob(w_i|w_1,\ldots,w_{i-1}) \end{align}
It is impossible to estimate a sensible probability for each history
$$ \x=w_1,\ldots,w_{i-1}, $$
so we truncate the history to the last $n-1$ words:
$$ \mathbf{f}(\x)=w_{i-(n-1)},\ldots,w_{i-1} $$
$\prob(\text{bigly}|\text{...,blah, blah, blah, we, will, win}) = \prob(\text{bigly}|\text{we, will, win})$
Set $n=1$: $$ \prob(w_i|w_1,\ldots,w_{i-1}) = \prob(w_i). $$
$\prob(\text{bigly}|\text{we, will, win}) = \prob(\text{bigly})$
Same probability for each word in a vocabulary $\vocab$:
$$ \prob(w_i|w_1,\ldots,w_{i-1}) = \frac{1}{|\vocab|}. $$
$\prob(\text{big}) = \prob(\text{bigly}) = \frac{1}{|\vocab|}$
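To make this concrete, here is a minimal sketch of what such a uniform model might look like (a hypothetical stand-in, not the book's UniformLM implementation used below):
class UniformLMSketch:
    """Uniform LM sketch: every in-vocabulary word gets probability 1/|V|."""
    def __init__(self, vocab):
        self.vocab = vocab
        self.order = 1  # the history is ignored entirely

    def probability(self, word, *history):
        return 1.0 / len(self.vocab) if word in self.vocab else 0.0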
Let us look at a training set and create a uniform LM from it.
train[:10]
['[BAR]', 'Can', "'t", 'even', 'call', 'this', 'a', 'blues', 'song', '[/BAR]']
vocab = set(train)
baseline = UniformLM(vocab)
sum([baseline.probability(w) for w in vocab])
0.9999999999999173
What about words outside the vocabulary? And do the probabilities sum to one?
Sample incrementally, one word at a time
def sample_once(lm, history, words):
    # probability of each candidate word given the current history
    probs = [lm.probability(word, *history) for word in words]
    # draw one word from this distribution
    return np.random.choice(words, p=probs)
sample_once(baseline, [], list(baseline.vocab))
'east'
def sample(lm, initial_history, amount_to_sample):
    words = list(lm.vocab)
    result = []
    result += initial_history
    for _ in range(0, amount_to_sample):
        # the next word is conditioned on the tail of the sequence so far
        history = result[-(lm.order - 1):]
        result.append(sample_once(lm, history, words))
    return result
sample(baseline, [], 10)
['sword', 'everytime', 'brutality', 'Well', 'breast', 'neck', 'alcohol', 'stiff', 'baron', 'system']
Shannon Game: predict the next word; you win if your prediction matches the word in the actual corpus (or if you gave that word high probability).
Our horrible trade agreements with [???]
This intuition is formalised by perplexity.
Given a test sequence $w_1,\ldots,w_T$, the perplexity $\perplexity$ is the geometric mean of the inverse word probabilities:
\begin{align} \perplexity(w_1,\ldots,w_T) &= \sqrt[T]{\frac{1}{\prob(w_1)} \frac{1}{\prob(w_2|w_1)} \ldots} \\ &= \sqrt[T]{\prod_{i=1}^T \frac{1}{\prob(w_i|w_{i-(n-1)},\ldots,w_{i-1})}} \end{align}
Consider, for example, an LM that assigns $\prob(w_i|w_1,\ldots,w_{i-1}) = \frac{1}{2}$ to every word. Then $\perplexity(w_1,\ldots,w_T) = \sqrt[T]{2^T} = 2$.
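A perplexity computation might look roughly like the sketch below (hypothetical; the notebook's own perplexity function, used next, may differ in how it handles the first few histories):
import math

def perplexity_sketch(lm, data):
    # accumulate log-probabilities; the geometric mean becomes an arithmetic mean in log space
    log_prob = 0.0
    for i, word in enumerate(data):
        history = data[max(0, i - (lm.order - 1)):i]
        p = lm.probability(word, *history)
        if p == 0.0:
            return float("inf")  # a single zero-probability word makes the perplexity infinite
        log_prob += math.log(p)
    return math.exp(-log_prob / len(data))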
Perplexity of uniform LM on an unseen test set?
perplexity(baseline, test)
inf
Problem: model assigns zero probability to words not in the vocabulary.
[(w,baseline.probability(w)) for w in test if w not in vocab][:3]
[('does', 0.0), ('Ceremonies', 0.0), ('Masquerading', 0.0)]
New words are not specific to our corpus.
Let us plot word frequency ranks (x-axis) against frequencies (y-axis):
# ranks and sorted_counts: word frequency ranks and corresponding counts, computed earlier
plt.xscale('log')
plt.yscale('log')
plt.plot(ranks, sorted_counts)
In log-space such rank vs. frequency graphs are approximately linear (Zipf's law).
Let $r_w$ be the rank of a word $w$, and $f_w$ its frequency:
$$ f_w \propto \frac{1}{r_w}. $$
A consequence: there will always be words with zero counts in your training set.
Solution: replace words that are not in the vocabulary with a special [OOV] token:
replace_OOVs(baseline.vocab, test[:10])
['[BAR]', 'scratched', '[/BAR]', '[BAR]', 'What', '[OOV]', 'it', 'take', '[/BAR]', '[BAR]']
What happens to perplexity if the training set is small?
OOV probability
What is the probability of seeing a word you haven't seen before?
Consider the "words"
AA AA BB BB AA
Going left to right, how often do I see new words?
Inject OOV tokens to mark these "new word events":
inject_OOVs(["AA","AA","BB","BB","AA"])
['[OOV]', 'AA', '[OOV]', 'BB', 'AA']
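A minimal sketch of what such an injection routine might do (hypothetical helper; the notebook's inject_OOVs may differ in detail): replace the first occurrence of every word with the OOV token and keep later occurrences.
def inject_OOVs_sketch(data, oov="[OOV]"):
    # the first time we see a word we emit the OOV token; afterwards, the word itself
    seen = set()
    result = []
    for word in data:
        result.append(word if word in seen else oov)
        seen.add(word)
    return result

inject_OOVs_sketch(["AA", "AA", "BB", "BB", "AA"])  # ['[OOV]', 'AA', '[OOV]', 'BB', 'AA']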
Now train on replaced data...
oov_train = inject_OOVs(train)
oov_vocab = set(oov_train)
oov_test = replace_OOVs(oov_vocab, test)
oov_baseline = UniformLM(oov_vocab)
perplexity(oov_baseline,oov_test)
1290.0000000049852
What does this perplexity correspond to? It is the size of the vocabulary after OOV injection: the uniform model assigns every word probability $1/|\vocab|$, so its perplexity is $|\vocab|$.
N-gram language models condition on a limited history:
$$ \prob(w_i|w_1,\ldots,w_{i-1}) = \prob(w_i|w_{i-(n-1)},\ldots,w_{i-1}). $$
What are its parameters (continuous values that control its behaviour)?
One parameter $\param_{w,h}$ for each word $w$ and history $h=w_{i-(n-1)},\ldots,w_{i-1}$ pair:
$$ \prob_\params(w|h) = \param_{w,h} $$
$\prob_\params(\text{bigly}|\text{win}) = \param_{\text{bigly, win}}$
Assume training set $\train=(w_1,\ldots,w_d)$
Find $\params$ that maximizes the log-likelihood of $\train$:
$$ \params^* = \argmax_\params \log \prob_\params(\train) $$
where
$$ \prob_\params(\train) = \ldots \prob_\params(w_i|\ldots w_{i-1}) \prob_\params(w_{i+1}|\ldots w_{i}) \ldots $$
Structured Prediction: this is your continuous optimization problem!
The maximum-likelihood estimate (MLE) can be calculated in closed form:
$$ \prob_{\params^*}(w|h) = \param^*_{w,h} = \frac{\counts{\train}{h,w}}{\counts{\train}{h}} $$
where
$$ \counts{D}{e} = \text{Count of } e \text{ in } D. $$
Many LM variants differ mainly in how these counts are estimated.
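As an illustration, counting-based MLE for an n-gram model could be sketched as follows (a hypothetical helper, not the book's NGramLM class used below):
from collections import defaultdict

def mle_ngram_lm(data, n):
    # collect n-gram and history counts from the training data
    ngram_counts = defaultdict(int)
    history_counts = defaultdict(int)
    for i in range(n - 1, len(data)):
        history = tuple(data[i - (n - 1):i])
        ngram_counts[history + (data[i],)] += 1
        history_counts[history] += 1
    def probability(word, *history):
        h = tuple(history)
        if history_counts[h] == 0:
            return 0.0
        return ngram_counts[h + (word,)] / history_counts[h]
    return probability
For example, `mle_ngram_lm(oov_train, 2)("home", "at")` would return $\counts{\train}{\text{at, home}} / \counts{\train}{\text{at}}$.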
Let us train a unigram model...
What do you think the most probable words are?
Remember our training set looks like this ...
oov_train[10000:10010]
['[BAR]', 'I', '[OOV]', 'at', 'home', 'but', 'I', 'gotta', 'escape', '[/BAR]']
unigram = NGramLM(oov_train,1)
plot_probabilities(unigram)
# sum([unigram.probability(w) for w in unigram.vocab])
The unigram LM has substantially reduced (and hence better) perplexity:
perplexity(unigram,oov_test)
91.4414922652717
Its samples look (a little) more reasonable:
sample(unigram, [], 10)
['[OOV]', '[OOV]', '[OOV]', '[OOV]', '[OOV]', 'your', 'ways', 'is', '[BAR]', 'life']
We can do better by setting $n=2$
bigram = NGramLM(oov_train,2)
plot_probabilities(bigram, ("FIND",))  # also try other histories, e.g. "I", or the Laplace-smoothed model below
Samples should look (slightly) more fluent:
" ".join(sample(laplace_bigram, ['FIND'], 30)) # try: I, FIND
'FIND coming music ten through [/BAR] [BAR] [/BAR] [BAR] stroke roam erased Maintain talk another same stay Listening commercial PE-DA-DEE-DA-DEE-DA-DEE-DA-DEECE Shit school passed pain great Singing sayin Hey PE album mathematics'
How about perplexity?
perplexity(bigram,oov_test)
inf
Some contexts where OOV word (and others) haven't been seen, hence 0 probability...
bigram.probability("[OOV]","money")
0.0
Maximum likelihood estimation assigns zero probability to unseen events. Solution: smooth the probabilities and move mass from seen to unseen events.
Add pseudo-counts to each event in the dataset:
$$ \param^{\alpha}_{w,h} = \frac{\counts{\train}{h,w} + \alpha}{\counts{\train}{h} + \alpha \lvert V \rvert } $$
Bayesian view: this is the maximum a posteriori estimate under a Dirichlet prior on the parameters.
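A minimal sketch of this add-$\alpha$ estimate on top of raw counts (hypothetical helper; the book's LaplaceLM used below wraps an existing n-gram model instead):
def laplace_probability(word, history, ngram_counts, history_counts, vocab, alpha=0.1):
    # add alpha pseudo-counts to the numerator and alpha * |V| to the denominator
    numerator = ngram_counts.get(history + (word,), 0) + alpha
    denominator = history_counts.get(history, 0) + alpha * len(vocab)
    return numerator / denominator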
laplace_bigram = LaplaceLM(bigram, 0.1)
laplace_bigram.probability("[OOV]","money")
0.0007692307692307692
Perplexity should look better now:
perplexity(LaplaceLM(bigram, 0.001),oov_test)
93.79512094464789
Consider three events:
c = ["word", "train count", "MLE", "Laplace", "Same Denominator"]
r1 = ["smally", "0", "0/3", "1/6", "0.5/3"]
r2 = ["bigly", "1", "1/3", "2/6", "1/3"]
r3 = ["tremendously", "2", "2/3", "3/6", "1.5/3"]
util.Table([r1,r2,r3], column_names=c)
| word | train count | MLE | Laplace | Same Denominator |
|---|---|---|---|---|
| smally | 0 | 0/3 | 1/6 | 0.5/3 |
| bigly | 1 | 1/3 | 2/6 | 1/3 |
| tremendously | 2 | 2/3 | 3/6 | 1.5/3 |
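To see where these numbers come from, assume $\alpha = 1$ and $|\vocab| = 3$: the total count is $3$, so Laplace gives $\frac{0+1}{3+3}=\frac{1}{6}$, $\frac{1+1}{3+3}=\frac{2}{6}$ and $\frac{2+1}{3+3}=\frac{3}{6}$; rescaling to the original denominator of $3$ yields the "Same Denominator" column: $\frac{0.5}{3}$, $\frac{1}{3}$ and $\frac{1.5}{3}$.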
How is mass moved for Laplace Smoothing?
Events with higher counts get penalised more!
Is this consistent with how counts behave on an unseen test?
util.Table(frame, column_names = ["Train Count", "Test Count", "Laplace Count"], number_format="{0:.4f}")
| Train Count | Test Count | Laplace Count |
|---|---|---|
| 0 | 0.0034 | 0.0050 |
| 1 | 0.4375 | 0.3019 |
| 2 | 1.1152 | 0.7546 |
| 3 | 1.6615 | 1.2125 |
| 4 | 2.6753 | 1.5728 |
| 5 | 4.1019 | 2.2806 |
| 6 | 4.6883 | 2.5625 |
| 7 | 5.3929 | 3.5774 |
The penalty is closer to a constant than to something that grows with the count; we return to this observation with absolute discounting below. Laplace smoothing has another shortcoming: it assigns the same probability to every unseen word in a given context:
laplace_bigram.probability('rhyme','man'), \
laplace_bigram.probability('of','man')
(0.0005649717514124294, 0.0005649717514124294)
Not all unseen words (in a context) are equal
With interpolation we can do better: mix the bigram estimate with a unigram estimate,
$$ \prob_{\alpha}(w|h) = \alpha \cdot \prob_{\text{bigram}}(w|h) + (1-\alpha) \cdot \prob_{\text{unigram}}(w). $$
interpolated = InterpolatedLM(bigram,unigram,0.01)
interpolated.probability('rhyme','man'), \
interpolated.probability('of','man')
(0.001156307129798903, 0.007390310786106033)
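A sketch of what such an interpolated model might compute (hypothetical helper; the notebook's InterpolatedLM may differ in its interface):
def interpolated_probability(main_lm, backoff_lm, alpha, word, *history):
    # weighted mixture of the higher-order and the lower-order (backoff) estimate
    return alpha * main_lm.probability(word, *history) + \
           (1.0 - alpha) * backoff_lm.probability(word)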
Can we find a good $\alpha$ parameter? Tune on some development set!
alphas = np.arange(0,1.1,0.1)
perplexities = [perplexity(InterpolatedLM(bigram,unigram,alpha),oov_test)
for alpha in alphas]
plt.plot(alphas,perplexities)
Let $w$ be a word and $h_{m}$ an n-gram of length $m$:
$$ \prob_{\mbox{Stupid}}(w|h_{m}) = \begin{cases} \frac{\counts{\train}{h_{m},w}}{\counts{\train}{h_{m}}} & \mbox{if }\counts{\train}{h_{m},w} > 0 \\\\ \prob_{\mbox{Stupid}}(w|h_{m-1}) & \mbox{otherwise} \end{cases} $$
What is the problem with this model?
stupid = StupidBackoff(bigram, unigram, 0.1)
sum([stupid.probability(word, 'the') for word in stupid.vocab])
1.0647115579930924
The probabilities above sum to more than one, so stupid backoff does not define a proper distribution. Recall that in the test data a roughly constant amount of probability mass is taken away from each non-zero-count event. Can this be captured in a smoothing algorithm?
Yes: subtract a (tunable) constant $d$ from each non-zero count:
$$ \prob_{\mbox{Absolute}}(w|h_{m}) = \begin{cases} \frac{\counts{\train}{h_{m},w}-d}{\counts{\train}{h_{m}}} & \mbox{if }\counts{\train}{h_{m},w} > 0 \\\\ \alpha(h_{m-1})\cdot\prob_{\mbox{Absolute}}(w|h_{m-1}) & \mbox{otherwise} \end{cases} $$
Here $\alpha(h_{m-1})$ is a normalizer that redistributes the discounted mass over the unseen words.
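For illustration (with made-up counts): suppose win appears 3 times in training, followed once each by bigly, big and tremendously, and we choose $d = 0.5$. Each seen bigram then gets $\frac{1 - 0.5}{3} = \frac{1}{6}$, and the removed mass $\frac{3 \cdot 0.5}{3} = \frac{1}{2}$ is what $\alpha$ spreads over all unseen continuations according to the lower-order model.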
Now assume that the training corpus mentions Mos Def many times, so that the unigram count of Def is higher than that of glasses.
Then the final-backoff unigram model might assign a higher probability to
I can't see without my reading Def
than
I can't see without my reading glasses
because $\prob(\text{Def}) > \prob(\text{glasses})$
But Def never follows anything but Mos, and we can determine this by looking at the training data!
Kneser-Ney smoothing: absolute discounting, but with the final backoff probability replaced by the probability that a word appears after (any) word in the training set:
$$ \prob_{\mbox{KN}}(w) = \frac{\left|\{w_{-1}:\counts{\train}{w_{-1},w} > 0\} \right|} {\sum_{w'}\left|\{w_{-1}:\counts{\train}{w_{-1},w'} > 0\} \right|} $$
This is the continuation probability.
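A sketch of how this continuation probability could be computed from bigram counts (hypothetical helper, not the book's implementation):
from collections import defaultdict

def continuation_probability(data, word):
    # for each word, collect the set of distinct words that precede it
    preceding = defaultdict(set)
    for prev, curr in zip(data, data[1:]):
        preceding[curr].add(prev)
    # number of distinct left contexts of `word`, normalised over all words
    total = sum(len(contexts) for contexts in preceding.values())
    return len(preceding[word]) / total
Under this estimate, Def would receive a low continuation probability because it is preceded only by Mos, even though its raw unigram count is high.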