Here are some cool mini-projects you can try to dive deeper into the topic.
Pick BLEU or any other relevant metric, e.g. BLEU from nltk.translate.bleu_score (use the default parameters: 4-gram, uniform weights).
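For reference, here's a minimal sketch of computing sentence-level BLEU with NLTK (the sentences and variable names are just illustrative; plug in your model's tokenized outputs):

from nltk.translate.bleu_score import sentence_bleu

reference = ['the', 'cat', 'sat', 'on', 'the', 'mat']   # tokenized ground-truth translation
hypothesis = ['the', 'cat', 'sat', 'on', 'a', 'mat']    # tokenized model output

# default parameters: 4-gram BLEU with uniform weights (0.25, 0.25, 0.25, 0.25)
score = sentence_bleu([reference], hypothesis)
print(score)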
While self-critical training provides a large reduction of gradient variance, it has a few drawbacks:
There's a more general way of doing the same thing: learned baselines, also known as advantage actor-critic.
There are two main ways to apply that:
In both cases, you should train V(s) to minimize the squared error $(V(s) - R(s,a))^2$, with $R$ being the actual Levenshtein-based reward. You can then use $A(s,a) = R(s,a) - const(V(s))$ for the policy gradient, where $const(\cdot)$ means the baseline is treated as a constant (no gradient flows through it).
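Here's a minimal sketch of that computation in Theano, assuming you already have symbolic vectors for the predicted baseline and the actual reward (the names V_predicted and R_actual are illustrative, not part of the assignment code):

import theano.tensor as T
from theano.gradient import disconnected_grad

V = T.vector('V_predicted')   # V(s) predicted by your baseline network, shape [batch]
R = T.vector('R_actual')      # actual Levenshtein-based reward R(s,a), shape [batch]

baseline_loss = T.mean((V - R) ** 2)       # squared error used to train V(s)
advantage = R - disconnected_grad(V)       # const(V): no gradient flows into the baseline here
# use `advantage` in place of the raw reward in your policy gradient objective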
There's also one particularly interesting approach (+5 additional pts):
Some seq2seq tasks can benefit from the attention mechanism. In addition to taking the last time-step of the encoder hidden state, we can allow the decoder to peek at any time-step of its choice.
1) Modify encoder-decoder

Learn to feed the entire encoder sequence into the decoder. You can do so by sending the encoder rnn layer directly into the decoder (make sure there's no only_return_final=True on the encoder rnn layer).
class decoder:
    ...
    # input layer that will receive the full encoder rnn sequence at decode time
    encoder_rnn_input = InputLayer(encoder.rnn.output_shape,
                                   name='encoder rnn input for decoder')
    ...

# decoder Recurrence: pass the encoder sequence as a non-sequence input
rec = Recurrence(...,
                 input_nonsequences={decoder.encoder_rnn_input: encoder.rnn},
                 )
For starters, you can take its last tick (via SliceLayer) inside the decoder step and feed it as an input to make sure everything is wired correctly.
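For example, a quick sanity-check sketch (decoder.encoder_rnn_input refers to the layer defined above; how you feed the result into your cell is up to your decoder):

from lasagne.layers import SliceLayer

# take the last time-step of the encoder sequence (axis=1 is time)
last_enc_state = SliceLayer(decoder.encoder_rnn_input, indices=-1, axis=1)
# ^--shape=[batch, encoder_n_units]; feed it to your decoder cell as an extra input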
2) Implement attention mechanism
Next thing we'll need is to implement the math of attention.
The simplest way to do so is to write a custom layer. We provide a prototype and some tests below.
3) Use attention inside decoder
That's almost it! Now use AttentionLayer inside the decoder and feed its output back to the lstm/gru/rnn (see the sketch below and the code demo at the bottom).
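Here's a rough sketch of that wiring inside the decoder step, assuming agentnet-style cells as in the demo comment at the bottom (prev_cell, prev_out, usual_input and decoder.encoder_rnn_input are placeholders for your own decoder's layers):

from agentnet.memory import LSTMCell

# attention is computed from the previous decoder output and the full encoder sequence
attn = AttentionLayer(prev_out, decoder.encoder_rnn_input)
new_cell, new_out = LSTMCell(prev_cell, prev_out,
                             input_or_inputs=(usual_input, attn))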
Train the full network just like you did before adding attention.
More points will be awarded for comparing learning results with attention vs. without attention.
Bonus bonus: visualize attention vectors (>= +3 points)
The best way to make sure your attention actually works is to visualize it.
A simple way to do so is to obtain the attention vectors at each tick (the values right after the softmax, not the layer outputs) and draw them as images. The Recurrence's tracked_outputs can help you extract those intermediate values.
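A minimal plotting sketch, assuming you have collected the attention probabilities into a numpy array of shape [decoder_ticks, encoder_length] (the array below is random placeholder data):

import numpy as np
import matplotlib.pyplot as plt

attention_probs = np.random.rand(10, 20)   # placeholder; use your real attention values

plt.imshow(attention_probs, interpolation='nearest', cmap='hot')
plt.xlabel('encoder time-step')
plt.ylabel('decoder time-step')
plt.colorbar()
plt.show()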
import numpy as np
import theano,lasagne
import theano.tensor as T
from lasagne import init
from lasagne.layers import *
class AttentionLayer(MergeLayer):
    def __init__(self, decoder_h, encoder_rnn):
        # sanity checks
        assert len(decoder_h.output_shape) == 2, "please feed decoder 1-step activation as first param"
        assert len(encoder_rnn.output_shape) == 3, "please feed full encoder rnn sequence as second param"

        self.decoder_num_units = decoder_h.output_shape[-1]
        self.encoder_num_units = encoder_rnn.output_shape[-1]

        # register both inputs with lasagne; must be called before add_param is usable
        MergeLayer.__init__(self, [decoder_h, encoder_rnn], name="attention")

        # Here you should initialize all trainable parameters.
        #
        # use this syntax:
        self.add_param(spec=init.Normal(std=0.01),  # or other initializer
                       shape=<shape tuple>,
                       name='<param name here>')
    def get_output_shape_for(self, input_shapes, **kwargs):
        """return matrix of shape [batch_size, encoder num units]"""
        return (None, self.encoder_num_units)
    def get_output_for(self, inputs, **kwargs):
        """
        takes (decoder_h, encoder_seq)
        decoder_h has shape [batch_size, decoder num_units]
        encoder_seq has shape [batch_size, sequence_length, encoder num_units]

        returns attention output: matrix of shape [batch_size, encoder num units]

        please read the comments carefully before you start implementing
        """
        decoder_h, encoder_seq = inputs

        # get symbolic batch-size / seq length. Also don't forget self.decoder_num_units above
        batch_size, seq_length, _ = tuple(encoder_seq.shape)

        # here's a recommended step-by-step guide for the attention mechanism.
        # You are free to ignore it altogether if you so wish.

        # we repeat decoder activations to align them with the encoder sequence
        decoder_h_repeated = <cast decoder_h into [batch, seq_length, decoder_num_units]
                              by repeating it seq_length times>
                             <use T.repeat and maybe some reshape>
        # ^--shape=[batch,seq_length,decoder_n_units]

        encoder_and_decoder_together = <concatenate repeated decoder and encoder over the last axis>
        # ^--shape=[batch,seq_length,enc_n_units+dec_n_units]

        # here we flatten the tensor to simplify further computation
        encoder_and_decoder_flat = T.reshape(encoder_and_decoder_together,
                                             (-1, encoder_and_decoder_together.shape[-1]))
        # ^--shape=[batch*seq_length,enc_n_units+dec_n_units]

        # here you use encoder_and_decoder_flat and some learned weights to predict attention logits
        # don't apply softmax yet
        <your code here>
        attention_logits_flat = <logits to be used as attention weights>
        # ^--shape=[batch*seq_length,1]

        # here we reshape flat logits back into correct form
        assert attention_logits_flat.ndim == 2
        attention_logits = attention_logits_flat.reshape((batch_size, seq_length))
        # ^--shape=[batch,seq_length]

        # here we apply softmax :)
        attention = T.nnet.softmax(attention_logits)
        # ^--shape=[batch,seq_length]

        # here we compute the output as an attention-weighted sum of encoder states
        output = (attention[:, :, None] * encoder_seq).sum(axis=1)  # sum over seq_length
        # ^--shape=[batch,enc_n_units]

        return output
# demo code
from numpy.random import randn

dec_h_prev = InputLayer((None, 50), T.constant(randn(5, 50)), name='decoder h mock')
enc = InputLayer((None, None, 32), T.constant(randn(5, 20, 32)), name='encoder sequence mock')

attention = AttentionLayer(dec_h_prev, enc)

# now you can use attention as an additional input to your decoder
# LSTMCell(prev_cell, prev_out, input_or_inputs=(usual_input, attention))

# sanity check
demo_output = get_output(attention).eval()
print 'actual shape:', demo_output.shape
assert demo_output.shape == (5, 32)
assert np.all(np.isfinite(demo_output))