%%html
<script>
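// Toggle the visibility of all code cells when the button below is clicked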
function code_toggle() {
if (code_shown){
$('div.input').hide('500');
$('#toggleButton').val('Show Code')
} else {
$('div.input').show('500');
$('#toggleButton').val('Hide Code')
}
code_shown = !code_shown
}
$( document ).ready(function(){
code_shown=false;
$('div.input').hide()
});
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>
<style>
.rendered_html td {
    font-size: xx-large;
    text-align: left !important;
}
.rendered_html th {
    font-size: xx-large;
    text-align: left !important;
}
</style>
%%capture
import sys
sys.path.append("..")
import statnlpbook.util as util
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)
%load_ext tikzmagic
Background: neural MT (5 min.)
Math: attention (10 min.)
Math: self-attention (10 min.)
Background: BERT (15 min.)
Background: mBERT (5 min.)
Quiz: mBERT (5 min.)
You can't cram the meaning of a whole %&!$ing sentence into a single $&!*ing vector!
— Ray Mooney
The attention model takes as input the previous decoder state $\mathbf{s}_{t-1}^{\textrm{dec}}$ and an encoder state $\mathbf{h}_j^{\textrm{enc}}$.
The attention model produces a relevance score for each input position.
This is usually computed with a very simple feedforward neural network.
For example:
$$ f_{\mathrm{att}}(\mathbf{s}_{t-1}^{\textrm{dec}}, \mathbf{h}_j^{\textrm{enc}}) = \tanh \left( \mathbf{W}^s \mathbf{s}_{t-1}^{\textrm{dec}} + \mathbf{W}^h \mathbf{h}_j^{\textrm{enc}} \right) $$

This is called additive attention.
Another alternative:
$$ f_{\mathrm{att}}(\mathbf{s}_{t-1}^{\textrm{dec}}, \mathbf{h}_j^{\textrm{enc}}) = \frac{\left(\mathbf{s}_{t-1}^{\textrm{dec}}\right)^\intercal \mathbf{W} \mathbf{h}_j^{\textrm{enc}}} {\sqrt{d_{\mathbf{h}^{\textrm{enc}}}}} $$

This is called scaled dot-product attention.
(But many alternatives have been proposed!)
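To make the two scoring functions concrete, here is a minimal NumPy sketch; the dimensions, random parameter matrices, and variable names are illustrative assumptions, not part of the lecture material:

```python
import numpy as np

d_dec, d_enc, d_att = 4, 6, 5            # illustrative dimensions (assumption)
rng = np.random.default_rng(0)

s_dec = rng.normal(size=d_dec)            # previous decoder state s_{t-1}
h_enc = rng.normal(size=d_enc)            # one encoder state h_j

# Additive attention: tanh(W^s s + W^h h); in practice the resulting vector is
# usually projected to a scalar score with one more learned weight vector.
W_s = rng.normal(size=(d_att, d_dec))
W_h = rng.normal(size=(d_att, d_enc))
additive = np.tanh(W_s @ s_dec + W_h @ h_enc)

# Scaled dot-product attention: (s^T W h) / sqrt(d_enc), already a scalar score.
W = rng.normal(size=(d_dec, d_enc))
dot_product = (s_dec @ W @ h_enc) / np.sqrt(d_enc)
```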
Computing a context vector:
$$ \mathbf{c}_t = \sum_{i=1}^n \mathbf{\alpha}_{t,i} \mathbf{h}_i^\mathrm{enc} $$

This is the weighted combination of the input representations!
Include this context vector in the calculation of decoder's hidden state:
$$ \mathbf{s}_t^{\textrm{dec}} = f\left(\mathbf{s}_{t-1}^{\textrm{dec}}, \mathbf{y}_{t-1}^\textrm{dec}, \mathbf{c}_t\right) $$

Intuitively, this implements a mechanism of attention in the decoder: the decoder decides which parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder of the burden of having to encode all information in the source sentence into a fixed-length vector.
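A minimal NumPy sketch of one decoding step of this mechanism, assuming the scaled dot-product scoring function from above and illustrative dimensions (the decoder function $f$ itself is not implemented here):

```python
import numpy as np

n, d_enc, d_dec = 7, 6, 4                 # illustrative: 7 source positions (assumption)
rng = np.random.default_rng(1)

H_enc = rng.normal(size=(n, d_enc))       # encoder states h_1 ... h_n
s_prev = rng.normal(size=d_dec)           # previous decoder state s_{t-1}
W = rng.normal(size=(d_dec, d_enc))

# 1. Score every encoder position with scaled dot-product attention
scores = (H_enc @ W.T @ s_prev) / np.sqrt(d_enc)      # shape (n,)

# 2. Normalize the scores into attention weights alpha_{t,j} with a softmax
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# 3. Context vector c_t: weighted combination of the encoder states
c_t = alpha @ H_enc                                   # shape (d_enc,)

# 4. c_t then enters the decoder update s_t = f(s_{t-1}, y_{t-1}, c_t),
#    where f is the decoder RNN cell (not shown).
```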
We can visualize what this model learns in an attention (alignment) matrix:
$\rightarrow$ Simply concatenate all $\alpha_t$ for $1 \leq t \leq m$
In other words:
1. Use the decoder state $\color{purple}{\mathbf{s}_{t-1}^{\textrm{dec}}}$ as the query
2. Compare it to each $\color{orange}{\mathbf{W} \mathbf{h}_j^{\textrm{enc}}}$ as the key
3. Softly select a $\color{blue}{\mathbf{h}_j^{\textrm{enc}}}$ as value
$$ \mathbf{\alpha}_{t,j} = \text{softmax}\left( \frac{\left(\color{purple}{\mathbf{s}_{t-1}^{\textrm{dec}}}\right)^\intercal \color{orange}{\mathbf{W} \mathbf{h}_j^{\textrm{enc}}}} {\sqrt{d_{\mathbf{h}^{\textrm{enc}}}}} \right) \\ \mathbf{c}_t = \sum_{i=1}^n \mathbf{\alpha}_{t,i} \color{blue}{\mathbf{h}_i^\mathrm{enc}} \\ \mathbf{s}_t^{\textrm{dec}} = f\left(\mathbf{s}_{t-1}^{\textrm{dec}}, \mathbf{y}_{t-1}^\textrm{dec}, \mathbf{c}_t\right) $$

Used during decoding to attend to $\mathbf{h}^{\textrm{enc}}$, encoded by a Bi-LSTM.
More recent development:
Forget about Bi-LSTMs, because Attention is All You Need even for encoding!
Use $\mathbf{h}_i$ to create three vectors: $\color{purple}{\mathbf{q}_i}=W^q\mathbf{h}_i, \color{orange}{\mathbf{k}_i}=W^k\mathbf{h}_i, \color{blue}{\mathbf{v}_i}=W^v\mathbf{h}_i$.
$$ \mathbf{\alpha}_{i,j} = \text{softmax}\left( \frac{\color{purple}{\mathbf{q}_i}^\intercal \color{orange}{\mathbf{k}_j}} {\sqrt{d_{\mathbf{h}}}} \right) \\ \mathbf{h}_i^\prime = \sum_{j=1}^n \mathbf{\alpha}_{i,j} \color{blue}{\mathbf{v}_j} $$

In matrix form:
$$ \text{softmax}\left( \frac{\color{purple}{Q} \color{orange}{K}^\intercal} {\sqrt{d_{\mathbf{h}}}} \right) \color{blue}{V} $$

Unlike RNNs, no inherent locality bias!
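A minimal NumPy sketch of a single self-attention head in exactly this matrix form; dimensions and random parameters are illustrative assumptions:

```python
import numpy as np

n, d = 5, 8                               # illustrative: 5 tokens, model dimension 8 (assumption)
rng = np.random.default_rng(2)

H = rng.normal(size=(n, d))               # input representations h_1 ... h_n
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = H @ W_q, H @ W_k, H @ W_v       # queries, keys, values

scores = Q @ K.T / np.sqrt(d)             # (n, n) pairwise scores
alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)   # row-wise softmax

H_new = alpha @ V                         # new representation for every position
```

Every position attends to every other position in one step, which is why there is no locality bias.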
The decoder attends to the encoded input and to the partial output generated so far.
The encoder transformer is sometimes called a "bidirectional transformer".
Transformer with $L$ layers of dimension $H$ and $A$ self-attention heads (BERT-base: $L=12$, $H=768$, $A=12$; BERT-large: $L=24$, $H=1024$, $A=16$).
Other pre-trained checkpoints: https://github.com/google-research/bert
Trained on 16GB of text from Wikipedia + BookCorpus.
Predict masked words given context on both sides:
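As an illustration only (the slides point to the original google-research/bert code, not to this library), masked-word prediction can be tried with a pre-trained checkpoint via the Hugging Face `transformers` package:

```python
from transformers import pipeline

# Fill-mask pipeline with a pre-trained BERT checkpoint
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the [MASK] token from context on both sides
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```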
Conditional encoding of both sentences:
Liu et al., 2019: bigger is better.
RoBERTa is BERT trained with additionally more data (160GB of text instead of 16GB), bigger batches, longer training,
and no next-sentence-prediction task (only masked LM).
Training: 1024 GPUs for one day.
https://github.com/google-research/bert/blob/master/multilingual.md
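A small sketch of loading that multilingual checkpoint, again assuming the Hugging Face `transformers` package rather than the original codebase:

```python
from transformers import AutoModel, AutoTokenizer

# One multilingual checkpoint with a shared WordPiece vocabulary across 100+ languages
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

# The same model encodes sentences in different languages
for sentence in ["Attention is all you need.",
                 "Aufmerksamkeit ist alles, was man braucht."]:
    inputs = tokenizer(sentence, return_tensors="pt")
    outputs = model(**inputs)
    print(sentence, outputs.last_hidden_state.shape)
```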
Monolingual BERT models also exist for many individual languages (CamemBERT, BERTje, Nordic BERT...).
mBERT is unreasonably effective at cross-lingual transfer!
(Results tables omitted: NER F1 and POS accuracy for zero-shot cross-lingual transfer, fine-tuning on one language and evaluating on another.)
Why? (poll)
See also K et al., 2020; Wu and Dredze, 2019.
The attention mechanism alleviates the encoding bottleneck in encoder-decoder architectures
Attention can even replace (bi)-LSTMs, giving self-attention
Transformers rely on self-attention for encoding and decoding
BERT, GPT-$n$ and other transformers are powerful pre-trained contextualized representations
Multilingual pre-trained transformers enable zero-shot cross-lingual transfer
Attention:
Transformers: