%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
# %cd ..
import sys
sys.path.append("..")
import statnlpbook.util as util
import statnlpbook.sequence as seq
import matplotlib
import warnings
warnings.filterwarnings('ignore')
matplotlib.rcParams['figure.figsize'] = (8.0, 5.0)
Sequence labelling assigns a label to each element in a sequence. For example, Part-of-Speech (PoS) tagging assigns each token in a sentence its Part-of-Speech tag:
I | predict | I | won't | win | a | single | game |
---|---|---|---|---|---|---|---|
O | V | O | V | V | D | A | N |
Named Entity Recognition labels tokens as beginning (B), inside (I) or outside (O) a named entity:
Barack | Obama | was | born | in | Hawaii |
---|---|---|---|---|---|
B-PER | I-PER | O | O | O | B-LOC |
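To make the BIO scheme concrete, here is a minimal sketch of converting token-level entity spans into BIO tags; the helper to_bio and its span format (first_token, one_past_last_token, label) are hypothetical, chosen purely for illustration:

def to_bio(tokens, spans):
    # spans are hypothetical (start, end, label) token ranges, end exclusive
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

to_bio(["Barack", "Obama", "was", "born", "in", "Hawaii"],
       [(0, 2, "PER"), (5, 6, "LOC")])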
We model probability distributions over label sequences $\y$ conditioned on input sequences $\x$: $$ s_{\params}(\x,\y) = \prob_\params(\y|\x) $$
As a running example we use PoS tagging for tweets, using the Tweebank dataset:
train = seq.load_tweebank("../data/oct27.splits/oct27.train")
dev = seq.load_tweebank("../data/oct27.splits/oct27.dev")
test = seq.load_tweebank("../data/oct27.splits/oct27.test")
" ".join([w + "/" + t for w,t in zip(train[0][0],train[0][1])])
"I/O predict/V I/O won't/V win/V a/D single/A game/N I/O bet/V on/P ./, Got/V Cliff/^ Lee/^ today/N ,/, so/P if/P he/O loses/V its/L on/P me/O RT/~ @e_one/@ :/~ Texas/^ (/, cont/~ )/, http://tl.gd/6meogh/U"
Tags (such as "O", "V" and "^") are described in the Tweebank annotation guidelines.
from collections import defaultdict
import pandas as pd

# Count tag frequencies, unique words per tag, and one example context per tag.
examples = {}
counts = defaultdict(int)
words = defaultdict(set)
for x, y in train:
    for i in range(0, len(x)):
        if y[i] not in examples:
            # remember a small token window around the first occurrence of this tag
            examples[y[i]] = [x[j] + "/" + y[j] if i == j else x[j]
                              for j in range(max(i - 2, 0), min(i + 2, len(x) - 1))]
        counts[y[i]] += 1
        words[y[i]].add(x[i])
sorted_tags = sorted(counts.items(), key=lambda x: -x[1])
sorted_tags_with_examples = [(t, c, len(words[t]), " ".join(examples[t]))
                             for t, c in sorted_tags]
sorted_tags_table = pd.DataFrame(sorted_tags_with_examples,
                                 columns=['Tag', 'Count', 'Unique Words', 'Example'])
sorted_tags_table[:10]
  | Tag | Count | Unique Words | Example |
---|---|---|---|---|
0 | V | 2219 | 873 | I predict/V I |
1 | N | 2003 | 1377 | a single game/N I |
2 | , | 1715 | 84 | bet on ./, Got |
3 | P | 1252 | 126 | I bet on/P . |
4 | O | 1063 | 97 | I/O predict |
5 | ^ | 890 | 741 | . Got Cliff/^ Lee |
6 | D | 869 | 68 | won't win a/D single |
7 | A | 755 | 449 | win a single/A game |
8 | @ | 713 | 694 | me RT @e_one/@ : |
9 | R | 689 | 217 | but I still/R hate |
A fully factorised or local model:
$$ p_\params(\y|\x) = \prod_{i=1}^n p_\params(y_i|\x,i) $$
seq.draw_local_fg(7)
This is a sequence of classification models instead of a single structured model: we learn a classifier $p_\params(y\bar\x,i)$ to predict the class for sentence $\x$ at position $i$.
$$ p_\params(y\bar\x,i) = \frac{1}{Z_\x} \exp \langle \repr(\x,i),\params_y \rangle $$A simple feature representation checks for the word at the token to tag: $$ \repr_w(\x,i) = \begin{cases}1 \text{ if }x_i=w \\\\ 0 \text{ else} \end{cases} $$
def feat_1(x, i):
    return {
        'word:' + str(x[i]): 1.0
    }
local_1 = seq.LocalSequenceLabeler(feat_1, train)
We can assess the accuracy of this model on the development set.
seq.accuracy(dev, local_1.predict(dev))
0.6392286958324694
Let us look at the confusion matrix:
seq.plot_confusion_matrix(dev, local_1.predict(dev))
The confusion matrix shows that the "N" tag receives a lot of wrong counts, and that prediction of the "@" tag is a complete failure. Let us start with "@" ...
local_1.plot_lr_weights('@')
Features for specific users, such as "word=@justinbieber", do not generalise well. How can we address this?
def feat_2(x, i):
    return {
        **feat_1(x, i),
        'first_at:' + str(x[i][0:1] == '@'): 1.0
    }
local_2 = seq.LocalSequenceLabeler(feat_2, train)
seq.accuracy(dev, local_2.predict(dev))
0.7484967862326353
To confirm that these results actually stem from improved '@' prediction, let us look at the confusion matrix again.
seq.plot_confusion_matrix(dev, local_2.predict(dev))
Solved!
local_2.plot_lr_weights('@')
What other errors remain?
seq.plot_confusion_matrix(dev, local_2.predict(dev))
We look for errors with high frequency: the model often predicts "N" where the gold tag is "^" (proper noun). What do these errors look like?
util.Carousel(local_2.errors(dev[:10],
filter_guess=lambda y: y=='N',
filter_gold=lambda y: y=='^'))
players | and | his | wife | own | smash | burger |
N | & | D | N | V | ^ | ^ |
N | & | D | N | N | N | N |
first_at:False | word:smash |
1.0 | 1.0 |
-2.63 | 0.00 |
-1.53 | 0.00 |
and | his | wife | own | smash | burger |
& | D | N | V | ^ | ^ |
& | D | N | N | N | N |
first_at:False | word:burger |
1.0 | 1.0 |
-2.63 | 0.00 |
-1.53 | 0.00 |
RT | @HollywoodOompa | : | Sat | November | 6 | ill | be |
~ | @ | ~ | ^ | ^ | $ | L | V |
~ | @ | ~ | N | ^ | $ | N | V |
first_at:False | word:Sat |
1.0 | 1.0 |
-2.63 | 0.00 |
-1.53 | 0.00 |
November | 6 | ill | be | at | Nashville | center | stage | for | the |
^ | $ | L | V | P | ^ | N | N | P | D |
^ | $ | N | V | P | N | N | N | P | D |
first_at:False | word:Nashville |
1.0 | 1.0 |
-2.63 | 0.00 |
-1.53 | 0.00 |
, | Rising | . | Temperature | 56.3 | °F | . | Rain | today | 0.00 |
, | V | , | N | $ | ^ | , | N | N | $ |
, | N | , | N | N | N | , | N | N | N |
first_at:False | word:°F |
1.0 | 1.0 |
-2.63 | 0.00 |
-1.53 | 0.00 |
me | for | blowing | up | your | youtube | comment | section | . |
O | P | V | T | D | ^ | N | N | , |
O | P | N | T | D | N | N | N | , |
first_at:False | word:youtube |
1.0 | 1.0 |
-2.63 | 0.00 |
-1.53 | 0.00 |
, | but | didn't | bother | calling | Shawn | because | I'd | just | be |
, | & | V | V | V | ^ | P | L | R | V |
, | & | V | N | V | N | P | L | R | V |
first_at:False | word:Shawn |
1.0 | 1.0 |
-2.63 | 0.00 |
-1.53 | 0.00 |
Proper nouns tend to be capitalised!
def feat_3(x, i):
    return {
        **feat_2(x, i),
        'is_lower:' + str(x[i].islower()): 1.0
    }
local_3 = seq.LocalSequenceLabeler(feat_3, train, C=10)
seq.accuracy(dev, local_3.predict(dev))
0.771511507360564
This improvement indeed comes from being able to identify proper nouns when they are capitalised:
util.Carousel(local_3.errors(dev[:10],
filter_guess=lambda y: y=='N',
filter_gold=lambda y: y=='^'))
players | and | his | wife | own | smash | burger |
N | & | D | N | V | ^ | ^ |
N | & | D | N | N | N | N |
first_at:False | is_lower:True | word:smash |
1.0 | 1.0 | 1.0 |
-0.39 | -3.67 | 0.00 |
0.60 | -1.70 | 0.00 |
and | his | wife | own | smash | burger |
& | D | N | V | ^ | ^ |
& | D | N | N | N | N |
first_at:False | is_lower:True | word:burger |
1.0 | 1.0 | 1.0 |
-0.39 | -3.67 | 0.00 |
0.60 | -1.70 | 0.00 |
me | for | blowing | up | your | youtube | comment | section | . |
O | P | V | T | D | ^ | N | N | , |
O | P | N | T | D | N | N | N | , |
first_at:False | is_lower:True | word:youtube |
1.0 | 1.0 | 1.0 |
-0.39 | -3.67 | 0.00 |
0.60 | -1.70 | 0.00 |
Let us look for more problems:
seq.plot_confusion_matrix(dev, local_3.predict(dev))
Another high-frequency error is predicting "N" where the gold tag is "V". Let us inspect examples...
util.Carousel(local_3.errors(dev[:20],
filter_guess=lambda y: y=='N',
filter_gold=lambda y: y=='V'))
the | players | and | his | wife | own | smash | burger |
D | N | & | D | N | V | ^ | ^ |
D | N | & | D | N | N | N | N |
first_at:False | is_lower:True | word:own |
1.0 | 1.0 | 1.0 |
0.06 | -1.76 | -0.78 |
0.60 | -1.70 | 2.30 |
RT | @TheRealQuailman | : | Currently | laughing | at | Laker | haters | . |
~ | @ | ~ | R | V | P | ^ | N | , |
~ | @ | ~ | ^ | N | P | ^ | N | , |
first_at:False | is_lower:True | word:laughing |
1.0 | 1.0 | 1.0 |
0.06 | -1.76 | 0.00 |
0.60 | -1.70 | 0.00 |
@ShiversTheNinja | forgive | me | for | blowing | up |
@ | V | O | P | V | T |
@ | N | O | P | N | T |
first_at:False | is_lower:True | word:forgive |
1.0 | 1.0 | 1.0 |
0.06 | -1.76 | 0.00 |
0.60 | -1.70 | 0.00 |
@ShiversTheNinja | forgive | me | for | blowing | up | your | youtube | comment |
@ | V | O | P | V | T | D | ^ | N |
@ | N | O | P | N | T | D | N | N |
first_at:False | is_lower:True | word:blowing |
1.0 | 1.0 | 1.0 |
0.06 | -1.76 | 0.00 |
0.60 | -1.70 | 0.00 |
Question | : | How | CAN | you | mend | a | broken | heart | ? |
N | , | R | V | O | V | D | A | N | , |
^ | ~ | R | V | O | N | D | V | V | , |
first_at:False | is_lower:True | word:mend |
1.0 | 1.0 | 1.0 |
0.06 | -1.76 | 0.00 |
0.60 | -1.70 | 0.00 |
last | night | , | but | didn't | bother | calling | Shawn | because | I'd |
A | N | , | & | V | V | V | ^ | P | L |
A | N | , | & | V | N | V | ^ | P | L |
first_at:False | is_lower:True | word:bother |
1.0 | 1.0 | 1.0 |
0.06 | -1.76 | 0.00 |
0.60 | -1.70 | 0.00 |
are | in | ! | See | who | passed | and | who | made | the |
V | P | , | V | O | V | & | O | V | D |
V | P | , | V | O | N | & | O | V | D |
first_at:False | is_lower:True | word:passed |
1.0 | 1.0 | 1.0 |
0.06 | -1.76 | 0.00 |
0.60 | -1.70 | 0.00 |
and | watch | the | news | and | tune | out | over | some | fresh |
& | V | D | N | & | V | T | P | D | A |
& | V | D | N | & | N | T | P | D | A |
first_at:False | is_lower:True | word:tune |
1.0 | 1.0 | 1.0 |
0.06 | -1.76 | -0.78 |
0.60 | -1.70 | 2.30 |
that | , | regretfully | I | was | tied | up | , | Physed | at |
P | , | R | O | V | V | T | , | N | P |
P | , | N | O | V | N | T | , | ^ | P |
first_at:False | is_lower:True | word:tied |
1.0 | 1.0 | 1.0 |
0.06 | -1.76 | 0.00 |
0.60 | -1.70 | 0.00 |
This suggests that the word has not appeared (or has not appeared as a verb) in the training set. However, we can often tell that these words may be verbs from their endings. Let us incorporate word suffixes as features.
def feat_4(x, i):
    return {
        **feat_3(x, i),
        # suffix features: the last three and last two characters of the word
        'last_3' + x[i][-3:]: 1.0,
        'last_2' + x[i][-2:]: 1.0,
    }
local_4 = seq.LocalSequenceLabeler(feat_4, train)
seq.accuracy(dev, local_4.predict(dev))
0.7876840140991085
util.Carousel(local_4.errors(dev[:20],
filter_guess=lambda y: y=='N',
filter_gold=lambda y: y=='V' ))
the | players | and | his | wife | own | smash | burger |
D | N | & | D | N | V | ^ | ^ |
D | N | & | D | N | N | V | N |
first_at:False | is_lower:True | last_2wn | last_3own | word:own |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
-0.64 | -2.54 | -0.16 | 0.07 | -0.28 |
0.28 | -2.43 | 1.01 | -0.45 | 2.63 |
Question | : | How | CAN | you | mend | a | broken | heart | ? |
N | , | R | V | O | V | D | A | N | , |
N | ~ | R | V | O | N | D | V | V | , |
first_at:False | is_lower:True | last_2nd | last_3end | word:mend |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
-0.64 | -2.54 | 0.38 | 2.05 | 0.00 |
0.28 | -2.43 | 1.01 | 1.57 | 0.00 |
last | night | , | but | didn't | bother | calling | Shawn | because | I'd |
A | N | , | & | V | V | V | ^ | P | L |
A | N | , | & | V | N | V | N | P | L |
first_at:False | is_lower:True | last_2er | last_3her | word:bother |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
-0.64 | -2.54 | -0.49 | -1.53 | 0.00 |
0.28 | -2.43 | 2.07 | -1.21 | 0.00 |
and | watch | the | news | and | tune | out | over | some | fresh |
& | V | D | N | & | V | T | P | D | A |
& | V | D | N | & | N | T | P | D | A |
first_at:False | is_lower:True | last_2ne | last_3une | word:tune |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
-0.64 | -2.54 | 0.16 | -0.27 | -0.27 |
0.28 | -2.43 | 0.94 | 1.48 | 1.48 |
There are dependencies between consecutive labels that a local model cannot capture. For example, after a non-possessive pronoun ("O") such as "I", a verb ("V") is more likely than a noun ("N"). We can check this empirically, as sketched below.
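A quick check of this claim on Tweebank, counting which tags follow "O" in the training set (pure counting over the loaded train data):

from collections import Counter

# If the claim holds, "V" should be far more frequent than "N" after "O".
after_O = Counter(y[i + 1] for _, y in train
                  for i in range(len(y) - 1) if y[i] == 'O')
after_O.most_common(5)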
A Maximum Entropy Markov Model (MEMM) is a product of local logistic regression (aka Maximum Entropy) classifiers $\prob_\params(y_i|\x,y_{i-1},i)$. The log-linear version with access to the previous label is:
$$ p_\params(y_i|\x,y_{i-1},i) = \frac{1}{Z_{\x,y_{i-1},i}} \exp \langle \repr(\x,y_{i-1},i),\params_{y_i} \rangle $$where $Z_{\x,y_{i-1},i}=\sum_y \exp \langle \repr(\x,y_{i-1},i),\params_{y} \rangle $ is a local per-token normalisation factor.
seq.draw_transition_fg(7)
Training optimises the conditional likelihood
$$ \sum_{(\x,\y) \in \train} \log \prob_\params(\y|\x) $$which decomposes nicely: $$ \sum_{(\x,\y) \in \train} \sum_{i=1}^{|\x|} \log \prob_\params(y_i|\x,y_{i-1},i) $$This makes MEMMs easy to train: each token contributes an independent classification example, as sketched below.
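A minimal sketch of this reduction; memm_training_examples is a hypothetical helper, assuming a feature function with the signature feat(x, i, hist) used further below:

def memm_training_examples(data, feat):
    # One classification example per token; the *gold* previous label is the history.
    examples = []
    for x, y in data:
        for i in range(len(x)):
            hist = [y[i - 1]] if i > 0 else ["PAD"]
            examples.append((feat(x, i, hist), y[i]))
    return examples

Each resulting pair (feature dict, gold label) can then be fed to an off-the-shelf logistic regression learner.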
To predict the best label sequence we find a $\y^*$ with maximal conditional probability:
$$ \y^* =\argmax_\y \prob_\params(\y|\x). $$We cannot simply choose each label in isolation because the decisions depend on each other. A simple alternative is greedy prediction: choose labels left to right, conditioning on the labels predicted so far (the use_gold_history flag will be useful later for oracle analysis):
def memm_greedy_predict(memm: seq.MEMMSequenceLabeler, data, use_gold_history=False):
    result = []
    for x, y in data:
        y_guess = []
        for i in range(0, len(x)):
            # condition on our own predictions, or on the gold labels for oracle analysis
            prediction = memm.predict_next(x, i, y_guess if not use_gold_history else y)
            y_guess.append(prediction)
        result.append(y_guess)
    return result
Let's specify a MEMM using a feature function with access to the previous label:
def memm_feat_1(x, i, hist):
    return {
        **feat_4(x, i),
        'prev_y': hist[0],
    }
memm_1 = seq.MEMMSequenceLabeler(memm_feat_1, train, order=1, C=10)
seq.accuracy(dev,memm_greedy_predict(memm_1, dev))
0.809454696247149
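As an aside, memm_greedy_predict can also condition on the gold history; comparing against this oracle gives a rough estimate of how much early greedy errors propagate to later tokens:

seq.accuracy(dev, memm_greedy_predict(memm_1, dev, use_gold_history=True))

If the oracle accuracy is clearly higher, errors made early in a sentence are hurting later predictions.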
Some of the noun-versus-verb errors are now fixed:
util.Carousel(seq.errors(dev[:20], memm_greedy_predict(memm_1, dev[:20]),
'V', 'N',model=memm_1))
the | players | and | his | wife | own | smash | burger |
D | N | & | D | N | V | ^ | ^ |
D | N | & | D | N | N | V | N |
first_at:False | is_lower:True | last_2wn | last_3own | prev_y | word:own |
1.0 | 1.0 | 1.0 | 1.0 | N | 1.0 |
-0.79 | -2.64 | -0.17 | -0.02 | 0.00 | -0.01 |
0.01 | -2.52 | 1.10 | -0.53 | 0.00 | 1.41 |
and | watch | the | news | and | tune | out | over | some | fresh |
& | V | D | N | & | V | T | P | D | A |
& | V | D | N | & | N | P | P | D | A |
first_at:False | is_lower:True | last_2ne | last_3une | prev_y | word:tune |
1.0 | 1.0 | 1.0 | 1.0 | & | 1.0 |
-0.79 | -2.64 | -0.06 | -0.01 | 0.00 | -0.01 |
0.01 | -2.52 | 0.59 | 0.78 | 0.00 | 0.78 |
For the case of verbs ('V') we observe a high weight for $f_{\text{prev\_y},\text{O}}$:
memm_1.plot_lr_weights('V',feat_filter=lambda s: s.startswith("prev_"))
The greedy search approach is an approximate solution to the $\argmax$ problem. One way to improve over greedy search is to maintain a beam of the $k$-best solutions in each step. This enables initially weaker solutions to remain in the beam and move up the ranks in later steps, in case they are more consistent with future observations.
Technically, a $k$-best beam search proceeds as follows. Let $L$ be the label set. Starting from a beam containing only the empty sequence, at each token we extend every sequence in the beam with every label in $L$, score the resulting candidates, and keep only the $k$ highest scoring ones. Note that a slightly faster version can use a priority queue (see the note after the code). In Python we can implement this algorithm like so:
def memm_beam_search(memm, x, width=2):
    # each beam entry is a pair (label prefix, cumulative log probability)
    beam = [([], 0.)]
    history = [beam]
    for i in range(0, len(x)):
        # expand every hypothesis in the beam with every possible label
        candidates = []
        for (prev, score) in beam:
            scores = memm.predict_scores(x, i, prev)
            for label_index, label_score in enumerate(scores):
                candidates.append((prev + [memm.labels()[label_index]], score + label_score))
        # keep the `width` best candidates (a priority queue would be faster, see below)
        beam = sorted(candidates, key=lambda x: -x[1])[:width]
        history.append(beam)
    return beam, history
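As noted above, the per-step top-$k$ selection can use a priority queue. A sketch of a drop-in replacement for the sorted line, using Python's heapq:

import heapq
# O(n log k) top-k selection instead of a full O(n log n) sort:
beam = heapq.nlargest(width, candidates, key=lambda c: c[1])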
def batch_predict(data, beam_predictor):
    return [beam_predictor(x)[0][0][0] for x, y in data]
seq.accuracy(dev, batch_predict(dev, lambda x: memm_beam_search(memm_1, x, 10)))
0.8129794733568318
Is this because we are already finding the solutions with highest probability, or because higher probability does not necessarily mean higher accuracy? We can test how many per-token predictions differ when comparing greedy search to a beam search of a given width, simply by calculating their accuracies relative to each other:
seq.accuracy(memm_greedy_predict(memm_1, dev), batch_predict(dev, lambda x: memm_beam_search(memm_1, x, 10)))
0.975533900062202
We notice that about 2.5% of the tokens receive different labels simply by searching for higher scoring sequences. This suggests that we frequently find higher probability sequences, but that these are not necessarily more correct. We can also calculate the average log probability of the argmax sequence using different beam sizes. Again we see a substantial difference between the scores; it is just not reflected in task accuracy:
sum([memm_beam_search(memm_1, x, 1)[0][0][1] for x,y in dev]) / len(dev)
-3.1982580986830151
sum([memm_beam_search(memm_1, x, 5)[0][0][1] for x,y in dev]) / len(dev)
-3.1442567645564075
Beam search is a generic approximate search technique that can be applied regardless of the model's structure. However, often it is also possible to do better by exploiting the dependency structure of the model: in an MEMM the probability of a label $y_i$ only depends on the previous label $y_{i-1}$. With this in mind, let us follow the beam for an example instance:
example = 56
beam, history = memm_beam_search(memm_1, dev[example][0],2)
seq.render_beam_history(history, dev[example], end=17)
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
0.00 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | -0.10 | |||||||||||||||
^ | -3.43 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | -0.73 | ||||||||||||||
A | A | -1.01 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | -1.18 | |||||||||||||
A | N | N | -1.40 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | -1.19 | ||||||||||||
A | N | N | P | -1.41 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | -1.54 | |||||||||||
A | N | N | P | N | -1.76 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | -1.54 | ||||||||||
A | N | N | P | N | , | -1.77 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | -1.70 | |||||||||
A | N | N | P | N | , | O | -1.93 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | -2.27 | ||||||||
A | N | N | P | N | , | O | R | -2.49 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | -2.89 | |||||||
A | N | N | P | N | , | O | R | N | -3.11 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | -2.89 | ||||||
A | N | N | P | N | , | O | R | N | P | -3.12 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | N | -3.51 | |||||
A | N | N | P | N | , | O | R | N | P | N | -3.73 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | N | P | -3.56 | ||||
A | N | N | P | N | , | O | R | N | P | N | P | -3.78 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | N | P | V | -3.65 | |||
A | N | N | P | N | , | O | R | N | P | N | P | V | -3.87 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | N | P | V | P | -3.84 | ||
A | N | N | P | N | , | O | R | N | P | N | P | V | P | -4.06 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | N | P | V | P | P | -3.85 | |
A | N | N | P | N | , | O | R | N | P | N | P | V | P | P | -4.07 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | N | P | V | P | P | Z | -4.64 |
A | A | N | P | N | , | O | R | N | P | N | P | V | P | P | L | -4.77 |
The search frontier, i.e. the set of most recent labels across the hypotheses, often has very little diversity. This leads to an error here: the correct label "A" for "better" falls out of the beam. We can fix this error by simply increasing the beam size to 4. You can test this above. In this case "A" barely makes it into the beam, and becomes the winning label in the next step as it fits better with the noun "way".
A second shortcoming of beam search is that it ignores the factorisation or dependency structure of the model. In this particular case labels only depend on the previous label. This means that it makes no sense to maintain more than one hypothesis with the same frontier label in the beam: one only needs to remember the highest scoring sequence with that frontier label.
Sketch of a proof: assume two hypotheses $\y$ and $\y'$ of length $l$ that end in the same label, where $\y$ has the higher log probability. Because the score of any continuation $y_{l+1}$ depends only on that last label, the log probability of $\y$ extended with $y_{l+1}$ is larger than the log probability of $\y'$ extended with $y_{l+1}$, and hence there is no need to carry around $\y'$.
The Viterbi algorithm leverages the conditional independences of the model directly. It maintains a map $\alpha_i(l)$ holding the log probability of the best label sequence over the first $i$ tokens that ends in label $l$, together with a back-pointer map $\beta_i(l)$. The algorithm initialises $\alpha_{1}(l) =\log \prob(l|\x,\text{PAD},1)$ and then updates the $\alpha$ map via the following recursion:
$$ \alpha_i(l) = \max_y \alpha_{i-1}(y) + \log \prob(l|\x,y,i) $$and in $\beta_i(l)$ we store the 'winning' $y$ from the $\max$ term.
Once we reach the end of the sequence, the resulting label sequence can be inferred by finding the label $l$ with the highest $\alpha_{|\x|}(l)$ and then back-tracking using $\beta$. It is easy to show that this algorithm returns the optimal solution to the prediction/search problem, assuming that labels only depend on the previous label. (Exercise: extend to $n$ previous labels.)
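To make the recursion concrete, here is a direct (exact) sketch of the algorithm, assuming the predict_scores_hist interface used in the next cell returns per-label log probabilities aligned with memm.labels():

import math

def viterbi(memm, x):
    labels = memm.labels()
    # alpha[i][l]: log probability of the best label sequence over x[0..i] ending in l
    # beta[i][l]: the previous label achieving alpha[i][l]
    alpha = [dict(zip(labels, memm.predict_scores_hist(x, 0, ["PAD"])))]
    beta = [{l: "PAD" for l in labels}]
    for i in range(1, len(x)):
        alpha.append({l: -math.inf for l in labels})
        beta.append({})
        for prev in labels:
            for l, s in zip(labels, memm.predict_scores_hist(x, i, [prev])):
                if alpha[i - 1][prev] + s > alpha[i][l]:
                    alpha[i][l] = alpha[i - 1][prev] + s
                    beta[i][l] = prev
    # back-track from the best final label
    best = max(alpha[-1], key=alpha[-1].get)
    y = [best]
    for i in range(len(x) - 1, 0, -1):
        y.append(beta[i][y[-1]])
    return list(reversed(y))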
Below we implement a beam version of the Viterbi algorithm. In this version we restrict the maximisation that defines $\alpha_i(l)$ to only range over the top $k$ highest scoring previous labels $y$:
from collections import defaultdict
import math

def memm_viterbi_search(memm, x, width=2):
    labels = memm.labels()
    # initialise
    alpha = [{}]
    beta = [{}]
    for label_index, label_score in enumerate(memm.predict_scores_hist(x, 0, ["PAD"])):
        label = labels[label_index]
        alpha[0][label] = label_score
        beta[0][label] = "PAD"
    # prune
    seq.prune_alpha_beta(alpha[0], beta[0], width)
    # recursion
    for i in range(1, len(x)):
        alpha.append(defaultdict(lambda: -math.inf))
        beta.append({})
        for p in alpha[i - 1].keys():
            for label_index, label_score in enumerate(memm.predict_scores_hist(x, i, [p])):
                label = labels[label_index]
                new_score = alpha[i - 1][p] + label_score
                if new_score > alpha[i][label]:
                    alpha[i][label] = new_score
                    beta[i][label] = p
        # prune
        seq.prune_alpha_beta(alpha[i], beta[i], width)
    # convert to a beam history so it can be rendered like beam search above
    history = seq.convert_alpha_beta_to_history(x, alpha, beta)
    return history[-1], history
beam, history = memm_viterbi_search(memm_1, dev[example][0],2)
seq.render_beam_history(history, dev[example], 17)
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
0.00 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | -0.10 | |||||||||||||||
^ | -3.43 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | -0.73 | ||||||||||||||
A | A | -1.01 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | -1.18 | |||||||||||||
A | N | V | -1.97 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | -1.19 | ||||||||||||
A | A | N | N | -7.50 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | -1.54 | |||||||||||
A | A | N | P | ^ | -2.57 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | -1.54 | ||||||||||
A | A | N | P | ^ | ^ | -5.18 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | -1.70 | |||||||||
A | A | N | P | N | , | D | -4.21 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | -2.27 | ||||||||
A | A | N | P | N | , | O | A | -2.77 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | -2.81 | |||||||
A | A | N | P | N | , | O | R | R | -3.68 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | -2.81 | ||||||
A | A | N | P | N | , | O | A | N | N | -9.36 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | N | -3.43 | |||||
A | A | N | P | N | , | O | A | N | P | V | -4.07 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | N | P | -3.48 | ||||
A | A | N | P | N | , | O | A | N | P | N | V | -6.94 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | N | P | V | -3.57 | |||
A | A | N | P | N | , | O | A | N | P | N | P | N | -6.19 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | N | P | V | P | -3.76 | ||
A | A | N | P | N | , | O | A | N | P | N | P | V | T | -5.42 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | N | P | V | P | P | -3.77 | |
A | A | N | P | N | , | O | A | N | P | N | P | V | P | ^ | -10.04 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | N | P | V | P | P | Z | -4.56 |
A | A | N | P | N | , | O | A | N | P | N | P | V | P | P | L | -4.69 |
Crucially, for the same beam size we now keep the correct labelling of "better" in the beam and reach a better solution, both in terms of log probability and actual accuracy.
Such improvements in log probability do not always lead to higher global accuracy, but here the accuracy improves as well:
seq.accuracy(dev, batch_predict(dev, lambda x: memm_viterbi_search(memm_1, x, 2)))
0.8140161725067385
MEMMs multiply several locally normalised transition probabilities to arrive at a sequence probability: for each token $i$ and given previous state $y_{i-1}$, the transition probabilities into the next states sum to one, $\sum_{y_i} \prob_\params(y_i|\x,y_{i-1},i) = 1$. We can sanity-check this on the trained model, as sketched below.
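A quick check, assuming (as the beam search code above suggests) that predict_scores returns per-label log probabilities:

import math
x = dev[0][0]
# for a locally normalised model the per-token label distribution sums to (almost) 1
print(sum(math.exp(s) for s in memm_1.predict_scores(x, 0, [])))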
This local normalisation makes training easy (why?), but it also leads to a problem. Consider two simple sequences "that works" and "that house". The former is a pronoun ("O") followed by a verb ("V"), the latter a determiner ("D") followed by a noun ("N"). Let us assume that
$$ \prob_\params(\text{O}|\x,\text{PAD},1) \approx \prob_\params(\text{D}|\x,\text{PAD},1) \approx 0.5 $$
meaning that at the beginning of a sentence both the determiner and pronoun label for "that" have roughly the same probability 0.5. Now assume that in the training set determiners are always followed by nouns, and pronouns always by verbs. This would mean that
$$ \prob_\params(\text{N}|\x,\text{D},i) = \prob_\params(\text{V}|\x,\text{O},i) \approx 1 $$
and hence transitions from these two states are completely independent of the observation.
Now we have $\prob_\params(\text{D N}|\, \text{that works}) \approx 0.5$ and $\prob_\params(\text{O V}|\, \text{that works}) \approx 0.5$, and the same for the input "that house". This means that once we enter the "D" or "O" state, the following observations have no effect on the sequence probability. The reason is that an MEMM requires all incoming probability mass (0.5 in the above example) to a given state (such as "D" or "O") to be distributed among the outgoing states. If there is only one possible next state, then that next state will receive all the mass, regardless of the observation. In particular, the model cannot express that in state "D" and for observation "works" all labels are impossible. More generally, states with few outgoing transitions effectively ignore their observations, and this creates a bias towards such states. This problem is known as the label bias problem. A toy numeric illustration follows below.
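A toy illustration with hand-set (not learned) probabilities matching the example above:

# Hand-set toy MEMM: start probabilities for "that", deterministic transitions.
p_start = {'D': 0.5, 'O': 0.5}                 # P(label | "that", start)
p_next = {('D', 'N'): 1.0, ('O', 'V'): 1.0}    # transitions ignore the second word

def toy_seq_prob(label_seq):
    p = p_start[label_seq[0]]
    for prev, cur in zip(label_seq, label_seq[1:]):
        p *= p_next.get((prev, cur), 0.0)
    return p

# Both label sequences get probability 0.5 regardless of the second observation:
print(toy_seq_prob(['D', 'N']), toy_seq_prob(['O', 'V']))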
Conditional Random Fields (CRFs) have been developed to overcome the label bias problem. The core problem of MEMMs is local normalisation, which CRFs replace with global normalisation. That is, instead of normalising across all possible next states $y_{i+1}$ given a current state $y_i$ and observation $\x$, the CRF normalises across all possible sequences $\y$ given observation $\x$. Formally the CRF is defined as follows:
$$ p_\params(\y|\x) = \frac{1}{Z_{\x}} \prod_{i=1}^{|\x|} \exp \langle \repr(\x,y_{i-1},i), \params_{y_i} \rangle $$where $Z_{\x}=\sum_\y \prod_{i=1}^{|\x|} \exp \langle \repr(\x,y_{i-1},i), \params_{y_i} \rangle$ is the partition function, a global normalisation constant depending on $\x$. Notably, each term $\exp \langle \repr(\x,y_{i-1},i), \params_{y_i} \rangle$ in the product can now take on values in $[0,\infty)$, as opposed to the MEMM terms in $[0,1]$.
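To make the partition function concrete, here is a brute-force sketch; the helper and its score function are hypothetical, and enumeration is exponential in sentence length, so real implementations use the forward algorithm instead:

from itertools import product
import math

def crf_partition_function(score, x, labels):
    # score(x, y_prev, y, i) stands in for <f(x, y_prev, i), theta_y>
    Z = 0.0
    for y in product(labels, repeat=len(x)):
        total = sum(score(x, y[i - 1] if i > 0 else "PAD", y[i], i)
                    for i in range(len(x)))
        Z += math.exp(total)
    return Z

# e.g. a score function that ignores everything gives Z = |labels|^len(x):
# crf_partition_function(lambda x, yp, y, i: 0.0, ["that", "works"], ['D', 'N', 'O', 'V'])  # 16.0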
The name CRF stems from the fact they correspond to Markov random fields, globally conditioned on the observation $\x$. While in this chapter we focus on cases where the dependency structure corresponds to a linear chain, CRFs are more general and encompass any graphical structure.
CRFs can be trained (as usual) by maximising the conditional log-likelihood of the data
$$ CL(\params) = \sum_{(\x,\y) \in \train} \log \prob_\params(\y|\x). $$This is substantially harder than for MEMMs because the partition function makes it impossible to break up the objective into per-token logistic regression terms; instead the objective needs to be treated on a per-sequence basis. Conceptually this is not difficult: just as for logistic regression we need to calculate the gradient of the objective, and once we have this gradient we choose a gradient descent/ascent method to optimise the function. The general CRF conditional log-likelihood is in fact a generalisation of the logistic regression objective, and hence the CRF gradient looks very similar to the gradient of logistic regression. We discuss this gradient, its derivation and calculation in more detail TODO. In this chapter we only briefly state the gradient:
$$\begin{split} \nabla_{y'} CL(\params) = \sum_{(\x,\y) \in \train} \sum^{|\x|}_i \repr(\x,y_{i-1},i)\,\delta(y_i,y') - p_\params(y',y_{i-1}|\x)\,\repr(\x,y_{i-1},i) \end{split}$$From the perspective of finding $\argmax_\y \prob_\params(\y|\x)$ we can treat the CRF just like the MEMM. They share the same factorisation/dependency structure and are just normalised differently. That is, we can again simply perform greedy search, use a beam, or apply Viterbi.
Below we train a CRF model using the same feature vectors as used for the MEMM model, and then do prediction via the Viterbi algorithm (this is the standard algorithm in most CRF libraries).
crf_1 = seq.CRFSequenceLabeler(feat_4, train)
seq.accuracy(dev, crf_1.predict(dev))
0.8206510470661414
A notable 1% point improvement over the MEMM model that essentially comes for free, in the sense that we are using exactly the same feature representation and are just changing from local to global normalisation. Finally, we compare all models on the test set:
util.Table([
["word", seq.accuracy(test, local_1.predict(test))],
["+ first @", seq.accuracy(test, local_2.predict(test))],
["+ cap", seq.accuracy(test, local_3.predict(test))],
["+ suffix", seq.accuracy(test, local_4.predict(test))],
["MEMM", seq.accuracy(test, memm_1.predict(test))],
["CRF", seq.accuracy(test, crf_1.predict(test))]
])
Model | Test accuracy |
---|---|
word | 0.6511465324384788 |
+ first @ | 0.7572706935123042 |
+ cap | 0.7760067114093959 |
+ suffix | 0.7971196868008948 |
MEMM | 0.8111017897091722 |
CRF | 0.8239653243847874 |