Here are some cool mini-projects you can try to dive deeper into the topic.
Pick BLEU or any other relevant metric, e.g. BLEU from nltk.translate.bleu_score (use the default parameters: 4-gram, uniform weights).
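For reference, here's a minimal sketch of computing sentence-level BLEU with NLTK (the sentences and variable names are just illustrative; plug in your model's tokenized outputs):

from nltk.translate.bleu_score import sentence_bleu

reference = ['the', 'cat', 'sat', 'on', 'the', 'mat']   # tokenized ground-truth translation
hypothesis = ['the', 'cat', 'sat', 'on', 'a', 'mat']    # tokenized model output

# default parameters: 4-gram BLEU with uniform weights (0.25, 0.25, 0.25, 0.25)
score = sentence_bleu([reference], hypothesis)
print(score)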
While self-critical training provides a large reduction of gradient variance, it has a few drawbacks:
There's a more general way of doing the same thing: learned baselines, also known as advantage actor-critic.
There are two main ways to apply that:
In both cases, you should train V(s) to minimize the squared error $(V(s) - R(s,a))^2$, with $R$ being the actual Levenshtein-based reward. You can then use $A(s,a) = R(s,a) - const(V(s))$ for the policy gradient, where $const(\cdot)$ means the baseline is treated as a constant (no gradient flows through it).
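Here's a minimal sketch of that computation in Theano, assuming you already have symbolic vectors for the predicted baseline and the actual reward (the names V_predicted and R_actual are illustrative, not part of the assignment code):

import theano.tensor as T
from theano.gradient import disconnected_grad

V = T.vector('V_predicted')   # V(s) predicted by your baseline network, shape [batch]
R = T.vector('R_actual')      # actual Levenshtein-based reward R(s,a), shape [batch]

baseline_loss = T.mean((V - R) ** 2)       # squared error used to train V(s)
advantage = R - disconnected_grad(V)       # const(V): no gradient flows into the baseline here
# use `advantage` in place of the raw reward in your policy gradient objective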
There's also one particularly interesting approach (+5 additional pts):
Some seq2seq tasks can benefit from the attention mechanism. In addition to taking the last time-step of the encoder hidden state, we can allow the decoder to peek at any time-step of its choice.
1) Modify encoder-decoder

Learn to feed the entire encoder sequence into the decoder. You can do so by sending the encoder rnn layer directly into the decoder (make sure there's no only_return_final=True on the encoder rnn layer).
class decoder:
    ...
    # input layer that will receive the full encoder rnn sequence at decode time
    encoder_rnn_input = InputLayer(encoder.rnn.output_shape,
                                   name='encoder rnn input for decoder')
    ...

# decoder Recurrence: pass the encoder sequence as a non-sequence input
rec = Recurrence(...,
                 input_nonsequences={decoder.encoder_rnn_input: encoder.rnn},
                 )
For starters, you can take its last tick (via SliceLayer) inside the decoder step and feed it as an input to make sure everything is wired correctly.
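For example, a quick sanity-check sketch (decoder.encoder_rnn_input refers to the layer defined above; how you feed the result into your cell is up to your decoder):

from lasagne.layers import SliceLayer

# take the last time-step of the encoder sequence (axis=1 is time)
last_enc_state = SliceLayer(decoder.encoder_rnn_input, indices=-1, axis=1)
# ^--shape=[batch, encoder_n_units]; feed it to your decoder cell as an extra input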
2) Implement attention mechanism
Next thing we'll need is to implement the math of attention.
The simplest way to do so is to write a custom layer. We provide a prototype and some tests below.
3) Use attention inside decoder
That's almost it! Now use AttentionLayer inside the decoder and feed its output back to the lstm/gru/rnn (see the sketch below and the code demo at the bottom).
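Here's a rough sketch of that wiring inside the decoder step, assuming agentnet-style cells as in the demo comment at the bottom (prev_cell, prev_out, usual_input and decoder.encoder_rnn_input are placeholders for your own decoder's layers):

from agentnet.memory import LSTMCell

# attention is computed from the previous decoder output and the full encoder sequence
attn = AttentionLayer(prev_out, decoder.encoder_rnn_input)
new_cell, new_out = LSTMCell(prev_cell, prev_out,
                             input_or_inputs=(usual_input, attn))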
Train the full network just like you did before adding attention.
More points will be awarded for comparing learning results with attention vs. without attention.
Bonus bonus: visualize attention vectors (>= +3 points)
The best way to make sure your attention actually works is to visualize it.
A simple way to do so is to obtain the attention vectors at each tick (the values right after the softmax, not the layer outputs) and draw them as images. The Recurrence's tracked_outputs can help you extract those intermediate values.
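A minimal plotting sketch, assuming you have collected the attention probabilities into a numpy array of shape [decoder_ticks, encoder_length] (the array below is random placeholder data):

import numpy as np
import matplotlib.pyplot as plt

attention_probs = np.random.rand(10, 20)   # placeholder; use your real attention values

plt.imshow(attention_probs, interpolation='nearest', cmap='hot')
plt.xlabel('encoder time-step')
plt.ylabel('decoder time-step')
plt.colorbar()
plt.show()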
import numpy as np
import theano,lasagne
import theano.tensor as T
from lasagne import init
from lasagne.layers import *
class AttentionLayer(MergeLayer):
    def __init__(self, decoder_h, encoder_rnn):
        # sanity checks
        assert len(decoder_h.output_shape) == 2, "please feed decoder 1-step activation as first param"
        assert len(encoder_rnn.output_shape) == 3, "please feed full encoder rnn sequence as second param"

        self.decoder_num_units = decoder_h.output_shape[-1]
        self.encoder_num_units = encoder_rnn.output_shape[-1]

        # register both inputs with lasagne; must be called before add_param is usable
        MergeLayer.__init__(self, [decoder_h, encoder_rnn], name="attention")

        # Here you should initialize all trainable parameters.
        #
        # use this syntax:
        self.add_param(spec=init.Normal(std=0.01),  # or other initializer
                       shape=<shape tuple>,
                       name='<param name here>')
    def get_output_shape_for(self, input_shapes, **kwargs):
        """return matrix of shape [batch_size, encoder num units]"""
        return (None, self.encoder_num_units)
    def get_output_for(self, inputs, **kwargs):
        """
        takes (decoder_h, encoder_seq)
        decoder_h has shape [batch_size, decoder num_units]
        encoder_seq has shape [batch_size, sequence_length, encoder num_units]

        returns attention output: matrix of shape [batch_size, encoder num units]

        please read the comments carefully before you start implementing
        """
        decoder_h, encoder_seq = inputs

        # get symbolic batch-size / seq length. Also don't forget self.decoder_num_units above
        batch_size, seq_length, _ = tuple(encoder_seq.shape)

        # here's a recommended step-by-step guide for the attention mechanism.
        # You are free to ignore it altogether if you so wish.

        # we repeat decoder activations to align them with the encoder sequence
        decoder_h_repeated = <cast decoder_h into [batch, seq_length, decoder_num_units]
                              by repeating it seq_length times>
                             <use T.repeat and maybe some reshape>
        # ^--shape=[batch,seq_length,decoder_n_units]

        encoder_and_decoder_together = <concatenate repeated decoder and encoder over the last axis>
        # ^--shape=[batch,seq_length,enc_n_units+dec_n_units]

        # here we flatten the tensor to simplify further computation
        encoder_and_decoder_flat = T.reshape(encoder_and_decoder_together,
                                             (-1, encoder_and_decoder_together.shape[-1]))
        # ^--shape=[batch*seq_length,enc_n_units+dec_n_units]

        # here you use encoder_and_decoder_flat and some learned weights to predict attention logits
        # don't apply softmax yet
        <your code here>
        attention_logits_flat = <logits to be used as attention weights>
        # ^--shape=[batch*seq_length,1]

        # here we reshape flat logits back into correct form
        assert attention_logits_flat.ndim == 2
        attention_logits = attention_logits_flat.reshape((batch_size, seq_length))
        # ^--shape=[batch,seq_length]

        # here we apply softmax :)
        attention = T.nnet.softmax(attention_logits)
        # ^--shape=[batch,seq_length]

        # here we compute the output as an attention-weighted sum of encoder states
        output = (attention[:, :, None] * encoder_seq).sum(axis=1)  # sum over seq_length
        # ^--shape=[batch,enc_n_units]

        return output
# demo code
from numpy.random import randn

dec_h_prev = InputLayer((None, 50), T.constant(randn(5, 50)), name='decoder h mock')
enc = InputLayer((None, None, 32), T.constant(randn(5, 20, 32)), name='encoder sequence mock')

attention = AttentionLayer(dec_h_prev, enc)

# now you can use attention as an additional input to your decoder
# LSTMCell(prev_cell, prev_out, input_or_inputs=(usual_input, attention))

# sanity check
demo_output = get_output(attention).eval()
print 'actual shape:', demo_output.shape
assert demo_output.shape == (5, 32)
assert np.all(np.isfinite(demo_output))