A language model (LM) computes a probability for a sequence of words:
$$p(\langle w_{1}, \ldots, w_{d} \rangle)$$
Language models are useful in a myriad of NLP tasks involving text generation, e.g. machine translation and speech recognition.
In $n$-gram language models, the probability $p(w_{1}, \ldots, w_{d})$ of observing the sentence $\langle w_{1}, \ldots, w_{d} \rangle$ is approximated as:
$$ \begin{aligned} p(w_{1}, \ldots, w_{d}) & = \prod_{i=1}^{d} p(w_{i} \mid w_{1}, \ldots, w_{i - 1}) \\ & \approx \prod_{i=1}^{d} p(w_{i} \mid w_{i - (n - 1)}, \ldots, w_{i - 1}) \\ & \approx \prod_{i=1}^{d} \frac{\text{count}(w_{i - (n - 1)}, \ldots, w_{i})}{\text{count}(w_{i - (n - 1)}, \ldots, w_{i - 1})} \end{aligned} $$
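To make the count-based estimate concrete, here is a minimal bigram sketch in plain Python; the toy corpus and the `bigram_prob` helper are illustrative assumptions, not part of the original slides.

from collections import Counter

# Toy corpus (illustrative assumption).
corpus = "natural language processing is fun . language processing is hard .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # count(prev, word) / count(prev), per the MLE formula above.
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("language", "processing"))  # 1.0: here "language" is always followed by "processing"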
Example with a bigram ($n = 2$) language model:
$$ \begin{aligned} p(\langle \text{Natural}, & \text{Language}, \text{Processing} \rangle) \approx \\ & p(\text{Natural}){}\cdot{}p(\text{Language} \mid \text{Natural}) \\ & {}\cdot{}p(\text{Processing} \mid \text{Language}) \end{aligned} $$
%%tikz -l arrows -s 1000,400 -sc 0.65
\newcommand{\lstm}{
\definecolor{nice-red}{HTML}{E41A1C}
\definecolor{nice-orange}{HTML}{FF7F00}
\definecolor{nice-yellow}{HTML}{FFC020}
\definecolor{nice-green}{HTML}{4DAF4A}
\definecolor{nice-blue}{HTML}{377EB8}
\definecolor{nice-purple}{HTML}{984EA3}
%lstm first step
%lstm module box
\draw[line width=3pt, color=black!50] (0,0) rectangle (3,3);
}
\lstm
\node[] at (0.5,-1.25) {$\mathbf{x}_t$};
\node[] at (-1.5,2) {$\mathbf{h}_{t-1}$};
\node[] at (4.25,2) {$\mathbf{h}_t$};
\node[] at (2.5,5) {$\hat{\mathbf{y}}_t$};
\draw[ultra thick, ->, >=stealth'] (0.5,-0.75) -- (0.5,0);
\draw[ultra thick, ->, >=stealth'] (-0.75,2) -- (0,2);
\draw[ultra thick, ->, >=stealth'] (3,2) -- (3.75,2);
\draw[ultra thick, ->, >=stealth'] (2.5,3) -- (2.5,4.75);
\path[line width=3pt, ->, >=stealth', color=nice-red] (4, 2.5) edge[bend right=0, in=-110, out=-70] (-1.75, 2.5);
\node[] at (1.5,2) {$f_\theta(\mathbf{x}_t, \mathbf{h}_{t-1})$};
Consider the following sentence:
$$\langle w_{1}, \ldots, w_{t - 1}, w_{t}, w_{t + 1}, \ldots, w_{d} \rangle$$
At each time step $t$, the hidden state $\mathbf{h}_t$ and output $\hat{\mathbf{y}}_t$ are given by:
$$ \begin{aligned} \mathbf{x}_{t} & = \text{encode}(w_{t}) \in \mathbb{R}^{d_{e}}\\ \mathbf{h}_t & = \sigma(\mathbf{W}^h \mathbf{h}_{t-1}+ \mathbf{W}^x \mathbf{x}_t) \in \mathbb{R}^{d_{h}}\\ \hat{\mathbf{y}}_{t} & = \text{softmax}(\mathbf{W}^o \mathbf{h}_{t}) \in \mathbb{R}^{|V|} \\ \end{aligned} $$
where $\hat{\mathbf{y}}_{t} \in [0, 1]^{|V|}$ is a probability distribution over words in $V$.
The probability that the next word $w_{t + 1}$ is the $j$-th word in the vocabulary is given by:
$$p(w_{t + 1} = w_{j} \mid w_{1}, \ldots, w_{t}) = \hat{\mathbf{y}}_{t, j}$$
Consider the word sequence $\text{encode}(\text{Natural}, \text{Language}, \text{Processing}) \rightarrow (\mathbf{x}_{1}, \mathbf{x}_{2}, \mathbf{x}_{3})$.
Reminder: $\mathbf{h}_t = \sigma(\mathbf{W}^h \mathbf{h}_{t-1} + \mathbf{W}^x \mathbf{x}_t)$
$$ \begin{aligned} \mathbf{h}_1 = \sigma(\mathbf{W}^h \mathbf{h}_{0} + \mathbf{W}^x \mathbf{x}_1) &\;& \hat{\mathbf{y}}_{1} = \text{softmax}(\mathbf{W}^o \mathbf{h}_{1}) \\ \mathbf{h}_2 = \sigma(\mathbf{W}^h \mathbf{h}_{1} + \mathbf{W}^x \mathbf{x}_2) &\;& \hat{\mathbf{y}}_{2} = \text{softmax}(\mathbf{W}^o \mathbf{h}_{2}) \\ \mathbf{h}_3 = \sigma(\mathbf{W}^h \mathbf{h}_{2} + \mathbf{W}^x \mathbf{x}_3) &\;& \hat{\mathbf{y}}_{3} = \text{softmax}(\mathbf{W}^o \mathbf{h}_{3}) \\ \end{aligned} $$
Since $\hat{\mathbf{y}}_{t}$ predicts the word following $w_{t}$:
$$p(\text{Natural}, \text{Language}, \text{Processing}) = p(\text{Natural}) \; \hat{\mathbf{y}}_{1, [\text{Language}]} \; \hat{\mathbf{y}}_{2, [\text{Processing}]}$$
Recall that $\hat{\mathbf{y}}_{t} \in [0, 1]^{|V|}$ is a probability distribution over the vocabulary $V$.
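A minimal numpy sketch of this unrolled forward pass may help; the dimensions, the random parameters, and the embedding-lookup implementation of $\text{encode}$ below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d_e, d_h, V = 8, 16, 100  # embedding size, hidden size, vocabulary size (assumptions)

E = rng.normal(size=(V, d_e))         # encode(w) is a row lookup in an embedding matrix
W_h = rng.normal(size=(d_h, d_h)) * 0.1
W_x = rng.normal(size=(d_h, d_e)) * 0.1
W_o = rng.normal(size=(V, d_h)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

h = np.zeros(d_h)                     # h_0
for w in [3, 14, 15]:                 # word ids, e.g. for (Natural, Language, Processing)
    x = E[w]                          # x_t = encode(w_t)
    h = sigmoid(W_h @ h + W_x @ x)    # h_t
    y_hat = softmax(W_o @ h)          # distribution over the next word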
We can train an RNN by minimizing the cross-entropy loss, predicting words instead of classes:
$$ J_{t} = - \sum_{i = 1}^{|V|} \mathbf{y}_{t, i} \log \hat{\mathbf{y}}_{t, i}, \quad \text{where} \quad \mathbf{y}_{t, i} = \begin{cases} 1 & \text{if the target word at step $t$ is $w_{i}$,}\\ 0 & \text{otherwise.} \end{cases} $$
For evaluation, we use the negative average log-probability over the corpus:
$$J = - \frac{1}{T} \sum_{t = 1}^{T} \sum_{j = 1}^{|V|} \mathbf{y}_{t, j} \log \hat{\mathbf{y}}_{t, j} = \frac{1}{T} \sum_{t = 1}^{T} J_{t}$$
or, alternatively, the perplexity:
$$PP(w_1,\ldots,w_T) = \sqrt[T]{\prod_{i = 1}^T \frac{1}{p(w_i \mid w_{1}, \ldots, w_{i-1})}}$$
Note that $PP = \exp(J)$ when $J$ is computed with the natural logarithm.
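As a quick sanity check, a minimal numpy sketch of the corpus-level loss and perplexity, assuming `probs` holds a model's probability for each target word (the values below are made up):

import numpy as np

# Assumed per-word probabilities p(w_i | w_1, ..., w_{i-1}) from some model.
probs = np.array([0.2, 0.5, 0.1, 0.4])

J = -np.mean(np.log(probs))   # negative average log-probability
PP = np.exp(J)                # perplexity = exp(J)
print(J, PP)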
Recurrent Neural Networks are extremely powerful and flexible. Seq2Seq models are composed of two RNNs: an encoder, which reads the input sequence into a vector $\mathbf{v}$, and a decoder, which generates the output sequence conditioned on $\mathbf{v}$.
Seq2Seq models are widely popular in, e.g., machine translation, summarization, and dialogue systems:
%%tikz -l arrows -s 1000,400 -sc 0.65
\definecolor{nice-red}{HTML}{E41A1C}
\definecolor{nice-orange}{HTML}{FF7F00}
\definecolor{nice-yellow}{HTML}{FFC020}
\definecolor{nice-green}{HTML}{4DAF4A}
\definecolor{nice-blue}{HTML}{377EB8}
\definecolor{nice-purple}{HTML}{984EA3}
\newcommand{\lstm}{
%lstm first step
%lstm module box
\draw[line width=3pt, color=black!50] (0,0) rectangle (3,3);
\draw[ultra thick, ->, >=stealth'] (0.5,-0.75) -- (0.5,0);
\draw[ultra thick, ->, >=stealth'] (-0.75,2) -- (0,2);
\draw[ultra thick, ->, >=stealth'] (3,2) -- (3.75,2);
\draw[ultra thick, ->, >=stealth'] (2.5,3) -- (2.5,3.75);
}
%\lstm
%\node[] at (0.5,-1.25) {$\mathbf{x}_t$};
%\node[] at (-1.5,2) {$\mathbf{h}_{t-1}$};
%\node[] at (4.25,2) {$\mathbf{h}_t$};
%\node[] at (2.5,5) {$\mathbf{h}_t$};
%\path[line width=3pt, ->, >=stealth', color=nice-blue] (4, 2.5) edge[bend right=0, in=-110, out=-70] (-1.75, 2.5);
%\node[] at (1.5,2) {$f_\theta(\mathbf{x}_t, \mathbf{h}_{t-1})$};
\foreach \x/\w in {0/I, 1/like, 2/neural, 3/networks} {
\begin{scope}[shift={(\x*3.75,0)}]
\lstm
\node[font=\LARGE, text height=1.5ex, color=nice-red] at (0.5,-1.5) {\bf\w};
\end{scope}
}
\foreach \x/\w/\t in {4/EOS/Ich, 5/Ich/mag, 6/mag/neuronale, 7/neuronale/Netze, 8/Netze/EOS} {
\begin{scope}[shift={(\x*3.75,0)}]
\lstm
\node[font=\LARGE, text height=1.5ex] at (0.5,-1.5) {\bf\w};
\node[font=\LARGE, text height=1.5ex, color=nice-blue] at (2.5,4.5) {\bf\t};
\end{scope}
}
\node[font=\Huge, color=nice-red] at (16.5,1.5) {$\mathbf{v}$};
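A schematic numpy sketch of the encoder/decoder loop in the figure above; the `step` cell, the embedding matrix `E`, and greedy argmax decoding are illustrative assumptions (a real Seq2Seq model would use an LSTM or GRU cell and beam search).

import numpy as np

def step(x, h, W_h, W_x):
    # One plain RNN step (stand-in for the LSTM cells in the figure).
    return np.tanh(W_h @ h + W_x @ x)

def seq2seq_greedy(src_ids, E, W_h, W_x, W_o, eos_id, max_len=20):
    h = np.zeros(W_h.shape[0])
    for w in src_ids:                 # encoder: compress the source into v = h
        h = step(E[w], h, W_h, W_x)
    out, w = [], eos_id               # decoder starts from EOS, conditioned on v
    for _ in range(max_len):
        h = step(E[w], h, W_h, W_x)
        w = int(np.argmax(W_o @ h))   # greedily pick the most probable next word
        if w == eos_id:
            break
        out.append(w)
    return out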
Why are RNNs hard to train? The same matrix $\mathbf{W}^{h}$ is multiplied in at each time step during forward propagation, so backpropagation involves repeated products of $\mathbf{W}^{h}$: the norm of the gradient may either tend to 0 (vanish) or grow too large (explode).
As a consequence, words from far-away time steps are hardly taken into account when training to predict the next word.
Example: an RNN is then likely to, e.g., place a near-uniform probability distribution over the nouns in $V$, and a low probability everywhere else, regardless of the distant context.
This is an issue for language modeling, question answering, and many other tasks.
Several solutions in the literature:
Clip the norm of the gradient to a threshold (gradient clipping; see the sketch after this list)
[Pascanu et al. 2013]
Use $\text{ReLU}(x) = \max(0, x)$ (Rectified Linear Units) or similar non-linearities instead of $\text{sigmoid}(x)$ or $\text{tanh}(x)$
[Glorot et al. 2011].
Clever Initialization of the Transition Matrix ($\mathbf{W}^h = \mathbf{I}$)
[Socher et al. 2013, Le et al. 2015].
Use different recurrent models that favour backpropagation
LSTM [Hochreiter and Schmidhuber, 1997], GRU [Chung et al. 2014].
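A minimal sketch of gradient clipping by global norm, in the spirit of Pascanu et al. 2013; the threshold value is an arbitrary assumption:

import numpy as np

def clip_gradient(grads, threshold=5.0):
    # Rescale the whole gradient if its global norm exceeds the threshold.
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads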
%%tikz -l arrows -s 1000,400 -sc 0.65
\newcommand{\lstm}{
\definecolor{nice-red}{HTML}{E41A1C}
\definecolor{nice-orange}{HTML}{FF7F00}
\definecolor{nice-yellow}{HTML}{FFC020}
\definecolor{nice-green}{HTML}{4DAF4A}
\definecolor{nice-blue}{HTML}{377EB8}
\definecolor{nice-purple}{HTML}{984EA3}
%lstm first step
%lstm module box
\draw[line width=3pt, color=black!50] (-6,-3) rectangle (1.5,5.25);
\draw[ultra thick] (0,0) rectangle (1,2);
%memory ct
\draw[ultra thick, color=nice-purple, fill=nice-purple!10] (0,0) rectangle (1,2);
%non-linearities
\foreach \w/\h/\color in {-2/4.25/nice-blue,-2/1/nice-red,-2/-1/nice-green,0.5/-2/nice-yellow,0.5/3/black} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, yshift=-0.5cm, color=\color] plot [domain=-0.3:0.3](\x, {(0.8/(1+exp(-15*\x))+0.1)});
\draw[ultra thick, color=\color] (0,0) circle (0.5cm);
\end{scope}
}
%tanh
\draw[thick, color=black] (0.25,3) -- (0.75,3);
\draw[thick, color=nice-yellow] (0.25,-2) -- (0.75,-2);
%component-wise multiplications
\foreach \w/\h in {-1/1,0.5/-1,0.5/4.25} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, color=black] (0,0) circle (0.05cm);
\draw[ultra thick, color=black] (0,0) circle (0.5cm);
\end{scope}
}
%vector concat
\begin{scope}[shift={(-4,1)},scale=0.5]
\draw[ultra thick,yshift=0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick,yshift=-0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick] (0,0) circle (0.5cm);
\end{scope}
\foreach \fx/\fy/\tx/\ty in {
-5/-3.5/-5/0.85, %xt
-5/0.85/-4.2/0.85,
-6.5/4.25/-5/4.25, %ht1
-5/4.25/-5/1.15,
-5/1.15/-4.2/1.15,
-3.75/1/-3/1, %H
-3/4.25/-3/-2,
-3/-2/0.25/-2, %i
0.5/-1.75/0.5/-1.25,
-3/-1/-2.25/-1, %it
-1.75/-1/0.25/-1,
-3/1/-2.25/1, %ft
-1.75/1/-1.25/1,
-0.75/1/0/1,
-3/4.25/-2.25/4.25, %ot
-1.75/4.25/0.25/4.25,
0.5/2/0.5/2.75, %ct
-5.5/2/-5.1/2, %ct1
-5.5/2/-5.5/1,
-6.5/1/-5.5/1,
-4.9/2/-3.1/2,
-2.9/2/-1/2,
-1/2/-1/1.25
} {
\draw[ultra thick] (\fx,\fy) -- (\tx,\ty);
}
\foreach \fx/\fy/\tx/\ty in {
0.5/-0.75/0.5/0, %it
-0.75/1/0/1, %ft
1/1/2.25/1,
0.5/3.25/0.5/4,
0.75/4.25/2.25/4.25, %ht
0.5/4.5/0.5/6
} {
\draw[->, >=stealth', ultra thick] (\fx,\fy) -- (\tx,\ty);
}
}
%\begin{scope}[scale=0.8]
%\foreach \d in {0,1} {
%\foreach \t in {0,1,2,3,4} {
%\begin{scope}[shift={(\t*8.5+\d*5.5,\d*9.5)}]
% \lstm
%\end{scope}
%}
%}
%\end{scope}
\lstm
%annotations
\node[] at (-5,-3.75) {$\mathbf{x}_t$};
\node[anchor=east] at (-6.5,4.25) {$\mathbf{h}_{t-1}$};
\node[anchor=east] at (-6.5,1) {$\mathbf{c}_{t-1}$};
\node[] at (0.5,6.25) {$\mathbf{h}_t$};
\node[anchor=west] at (2.25,4.25) {$\mathbf{h}_t$};
\node[anchor=west] at (2.25,1) {$\mathbf{c}_t$};
\node[xshift=0.4cm,yshift=0.25cm] at (-4,1) {$\mathbf{H}_t$};
\node[xshift=0.35cm,yshift=0.25cm] at (-2,-1) {$\mathbf{i}_t$};
\node[xshift=0.35cm,yshift=0.25cm] at (-2,1) {$\mathbf{f}_t$};
\node[xshift=0.35cm,yshift=0.25cm] at (-2,4.25) {$\mathbf{o}_t$};
%dummy node for left alignment
\node[] at (17,0) {};
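For reference, the update equations the diagram depicts, with $\mathbf{H}_t$ the concatenation of $\mathbf{x}_t$ and $\mathbf{h}_{t-1}$ and $\odot$ element-wise multiplication (the exact parametrization of the gate weights $\mathbf{W}_{i}, \mathbf{W}_{f}, \mathbf{W}_{o}, \mathbf{W}_{c}$, with biases omitted, is our assumption, following the standard LSTM formulation):
$$ \begin{aligned} \mathbf{H}_t & = [\mathbf{x}_t; \mathbf{h}_{t-1}] \\ \mathbf{i}_t & = \sigma(\mathbf{W}_{i} \mathbf{H}_t) \\ \mathbf{f}_t & = \sigma(\mathbf{W}_{f} \mathbf{H}_t) \\ \mathbf{o}_t & = \sigma(\mathbf{W}_{o} \mathbf{H}_t) \\ \mathbf{c}_t & = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(\mathbf{W}_{c} \mathbf{H}_t) \\ \mathbf{h}_t & = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \end{aligned} $$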
%%tikz -l arrows -s 1000,400 -sc 0.65
\newcommand{\lstm}{
\definecolor{nice-red}{HTML}{E41A1C}
\definecolor{nice-orange}{HTML}{FF7F00}
\definecolor{nice-yellow}{HTML}{FFC020}
\definecolor{nice-green}{HTML}{4DAF4A}
\definecolor{nice-blue}{HTML}{377EB8}
\definecolor{nice-purple}{HTML}{984EA3}
%lstm first step
%lstm module box
\draw[line width=3pt, color=black!50] (-6,-3) rectangle (1.5,5.25);
\draw[ultra thick] (0,0) rectangle (1,2);
%memory ct
\draw[ultra thick, color=nice-purple, fill=nice-purple!10] (0,0) rectangle (1,2);
%non-linearities
\foreach \w/\h/\color in {-2/4.25/nice-blue,-2/1/nice-red,-2/-1/nice-green,0.5/-2/nice-yellow,0.5/3/black} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, yshift=-0.5cm, color=\color] plot [domain=-0.3:0.3](\x, {(0.8/(1+exp(-15*\x))+0.1)});
\draw[ultra thick, color=\color] (0,0) circle (0.5cm);
\end{scope}
}
%tanh
\draw[thick, color=black] (0.25,3) -- (0.75,3);
\draw[thick, color=nice-yellow] (0.25,-2) -- (0.75,-2);
%component-wise multiplications
\foreach \w/\h in {-1/1,0.5/-1,0.5/4.25} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, color=black] (0,0) circle (0.05cm);
\draw[ultra thick, color=black] (0,0) circle (0.5cm);
\end{scope}
}
%vector concat
\begin{scope}[shift={(-4,1)},scale=0.5]
\draw[ultra thick,yshift=0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick,yshift=-0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick] (0,0) circle (0.5cm);
\end{scope}
\foreach \fx/\fy/\tx/\ty in {
-5/-3.5/-5/0.85, %xt
-5/0.85/-4.2/0.85,
-6.5/4.25/-5/4.25, %ht1
-5/4.25/-5/1.15,
-5/1.15/-4.2/1.15,
-3.75/1/-3/1, %H
-3/4.25/-3/-2,
-3/-2/0.25/-2, %i
0.5/-1.75/0.5/-1.25,
-3/-1/-2.25/-1, %it
-1.75/-1/0.25/-1,
-3/1/-2.25/1, %ft
-1.75/1/-1.25/1,
-0.75/1/0/1,
-3/4.25/-2.25/4.25, %ot
-1.75/4.25/0.25/4.25,
0.5/2/0.5/2.75, %ct
-5.5/2/-5.1/2, %ct1
-5.5/2/-5.5/1,
-6.5/1/-5.5/1,
-4.9/2/-3.1/2,
-2.9/2/-1/2,
-1/2/-1/1.25
} {
\draw[ultra thick] (\fx,\fy) -- (\tx,\ty);
}
\foreach \fx/\fy/\tx/\ty in {
0.5/-0.75/0.5/0, %it
-0.75/1/0/1, %ft
1/1/2.25/1,
0.5/3.25/0.5/4,
0.75/4.25/2.25/4.25, %ht
0.5/4.5/0.5/6
} {
\draw[->, >=stealth', ultra thick] (\fx,\fy) -- (\tx,\ty);
}
}
\begin{scope}[scale=0.8]
\foreach \d in {0} {
\foreach \t/\word in {0/A,1/wedding,2/party,3/taking,4/pictures} {
\node[font=\Huge, anchor=west] at (\t*8.5-5.75,-4.5) {$\mathbf{v}$\_\word};
\begin{scope}[shift={(\t*8.5+\d*5.5,\d*9.5)}]
\lstm
\end{scope}
}
}
\end{scope}
\node[font=\Huge, anchor=west] at (27,5.75) {$\mathbf{v}$\_Sentence};
%dummy node for left alignment
\node[] at (17,0) {};
%%tikz -l arrows -s 1000,400 -sc 0.65
\definecolor{nice-red}{HTML}{E41A1C}
\definecolor{nice-orange}{HTML}{FF7F00}
\definecolor{nice-yellow}{HTML}{FFC020}
\definecolor{nice-green}{HTML}{4DAF4A}
\definecolor{nice-blue}{HTML}{377EB8}
\definecolor{nice-purple}{HTML}{984EA3}
\newcommand{\lstm}{
%lstm first step
%lstm module box
\draw[line width=3pt, color=black!50] (-6,-3) rectangle (1.5,5.25);
\draw[ultra thick] (0,0) rectangle (1,2);
%memory ct
\draw[ultra thick, color=nice-purple, fill=nice-purple!10] (0,0) rectangle (1,2);
%non-linearities
\foreach \w/\h/\color in {-2/4.25/nice-blue,-2/1/nice-red,-2/-1/nice-green,0.5/-2/nice-yellow,0.5/3/black} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, yshift=-0.5cm, color=\color] plot [domain=-0.3:0.3](\x, {(0.8/(1+exp(-15*\x))+0.1)});
\draw[ultra thick, color=\color] (0,0) circle (0.5cm);
\end{scope}
}
%tanh
\draw[thick, color=black] (0.25,3) -- (0.75,3);
\draw[thick, color=nice-yellow] (0.25,-2) -- (0.75,-2);
%component-wise multiplications
\foreach \w/\h in {-1/1,0.5/-1,0.5/4.25} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, color=black] (0,0) circle (0.05cm);
\draw[ultra thick, color=black] (0,0) circle (0.5cm);
\end{scope}
}
%vector concat
\begin{scope}[shift={(-4,1)},scale=0.5]
\draw[ultra thick,yshift=0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick,yshift=-0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick] (0,0) circle (0.5cm);
\end{scope}
\foreach \fx/\fy/\tx/\ty in {
-5/-3.5/-5/0.85, %xt
-5/0.85/-4.2/0.85,
-6.5/4.25/-5/4.25, %ht1
-5/4.25/-5/1.15,
-5/1.15/-4.2/1.15,
-3.75/1/-3/1, %H
-3/4.25/-3/-2,
-3/-2/0.25/-2, %i
0.5/-1.75/0.5/-1.25,
-3/-1/-2.25/-1, %it
-1.75/-1/0.25/-1,
-3/1/-2.25/1, %ft
-1.75/1/-1.25/1,
-0.75/1/0/1,
-3/4.25/-2.25/4.25, %ot
-1.75/4.25/0.25/4.25,
0.5/2/0.5/2.75, %ct
-5.5/2/-5.1/2, %ct1
-5.5/2/-5.5/1,
-6.5/1/-5.5/1,
-4.9/2/-3.1/2,
-2.9/2/-1/2,
-1/2/-1/1.25
} {
\draw[ultra thick] (\fx,\fy) -- (\tx,\ty);
}
\foreach \fx/\fy/\tx/\ty in {
0.5/-0.75/0.5/0, %it
-0.75/1/0/1, %ft
1/1/2.25/1,
0.5/3.25/0.5/4,
0.75/4.25/2.25/4.25, %ht
0.5/4.5/0.5/6
} {
\draw[->, >=stealth', ultra thick] (\fx,\fy) -- (\tx,\ty);
}
}
\begin{scope}[scale=0.8]
\foreach \d in {0} {
\foreach \t/\word in {0/A,1/wedding,2/party,3/taking,4/pictures} {
\node[font=\Huge, anchor=west] at (\t*8.5-5.75,-4.5) {$\mathbf{v}$\_\word};
\begin{scope}[shift={(\t*8.5+\d*5.5,\d*9.5)}]
\lstm
\end{scope}
}
}
\end{scope}
\node[font=\Huge, anchor=west] at (27,5.75) {$\mathbf{v}$\_Sentence};
\draw[line width=10pt, color=nice-red, opacity=0.8] (27.6,5) -- (27.6,0.75);
\draw[line width=10pt, color=nice-red, opacity=0.8] (27.5,0.75) -- (3,0.75);
\draw[->, >=stealth', line width=10pt, color=nice-red, opacity=0.8] (2.75,0.75) -- (2.75,-3);
%dummy node for left alignment
\node[] at (17,0) {};
RNN vs. LSTM gradients on the input matrix $\mathbf{W}^x$
%%html
<center>
<video controls autoplay loop>
<source src="rnn-figures/vanishing.mp4" type="video/mp4">
</video>
</center>
%%tikz -l arrows -s 1100,500 -sc 0.65
\definecolor{nice-red}{HTML}{E41A1C}
\definecolor{nice-orange}{HTML}{FF7F00}
\definecolor{nice-yellow}{HTML}{FFC020}
\definecolor{nice-green}{HTML}{4DAF4A}
\definecolor{nice-blue}{HTML}{377EB8}
\definecolor{nice-purple}{HTML}{984EA3}
\newcommand{\lstm}{
%lstm first step
%lstm module box
\draw[line width=3pt, color=black!50] (-6,-3) rectangle (1.5,5.25);
\draw[ultra thick] (0,0) rectangle (1,2);
%memory ct
\draw[ultra thick, color=nice-purple, fill=nice-purple!10] (0,0) rectangle (1,2);
%non-linearities
\foreach \w/\h/\color in {-2/4.25/nice-blue,-2/1/nice-red,-2/-1/nice-green,0.5/-2/nice-yellow,0.5/3/black} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, yshift=-0.5cm, color=\color] plot [domain=-0.3:0.3](\x, {(0.8/(1+exp(-15*\x))+0.1)});
\draw[ultra thick, color=\color] (0,0) circle (0.5cm);
\end{scope}
}
%tanh
\draw[thick, color=black] (0.25,3) -- (0.75,3);
\draw[thick, color=nice-yellow] (0.25,-2) -- (0.75,-2);
%component-wise multiplications
\foreach \w/\h in {-1/1,0.5/-1,0.5/4.25} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, color=black] (0,0) circle (0.05cm);
\draw[ultra thick, color=black] (0,0) circle (0.5cm);
\end{scope}
}
%vector concat
\begin{scope}[shift={(-4,1)},scale=0.5]
\draw[ultra thick,yshift=0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick,yshift=-0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick] (0,0) circle (0.5cm);
\end{scope}
\foreach \fx/\fy/\tx/\ty in {
-5/-3.5/-5/0.85, %xt
-5/0.85/-4.2/0.85,
-6.5/4.25/-5/4.25, %ht1
-5/4.25/-5/1.15,
-5/1.15/-4.2/1.15,
-3.75/1/-3/1, %H
-3/4.25/-3/-2,
-3/-2/0.25/-2, %i
0.5/-1.75/0.5/-1.25,
-3/-1/-2.25/-1, %it
-1.75/-1/0.25/-1,
-3/1/-2.25/1, %ft
-1.75/1/-1.25/1,
-0.75/1/0/1,
-3/4.25/-2.25/4.25, %ot
-1.75/4.25/0.25/4.25,
0.5/2/0.5/2.75, %ct
-5.5/2/-5.1/2, %ct1
-5.5/2/-5.5/1,
-6.5/1/-5.5/1,
-4.9/2/-3.1/2,
-2.9/2/-1/2,
-1/2/-1/1.25
} {
\draw[ultra thick] (\fx,\fy) -- (\tx,\ty);
}
\foreach \fx/\fy/\tx/\ty in {
0.5/-0.75/0.5/0, %it
-0.75/1/0/1, %ft
1/1/2.25/1,
0.5/3.25/0.5/4,
0.75/4.25/2.25/4.25, %ht
0.5/4.5/0.5/6
} {
\draw[->, >=stealth', ultra thick] (\fx,\fy) -- (\tx,\ty);
}
}
\begin{scope}[scale=0.8]
\foreach \d in {0,1,2} {
\foreach \t/\word in {0/A,1/wedding,2/party,3/taking,4/pictures} {
\node[font=\Huge, anchor=west] at (\t*8.5-5.75,-4.5) {$\mathbf{v}$\_\word};
\begin{scope}[shift={(\t*8.5+\d*5.5,\d*9.5)}]
\lstm
\end{scope}
}
}
\end{scope}
\node[font=\Huge, anchor=west] at (34,20.75) {$\mathbf{v}$\_Sentence};
\draw[line width=10pt, color=nice-red, opacity=0.8] (36.4,16) -- (36.4,20);
\draw[line width=10pt, color=nice-red, opacity=0.8] (25.25,16) -- (36.5,16);
\draw[line width=10pt, color=nice-red, opacity=0.8] (25.25,8.5) -- (25.25,16);
\draw[line width=10pt, color=nice-red, opacity=0.8] (14,8.5) -- (25.25,8.5);
\draw[line width=10pt, color=nice-red, opacity=0.8] (14,8.5) -- (14,0.75);
\draw[line width=10pt, color=nice-red, opacity=0.8] (14,0.75) -- (3,0.75);
\draw[->, >=stealth', line width=10pt, color=nice-red, opacity=0.8] (2.75,0.75) -- (2.75,-3);
%dummy node for left alignment
\node[] at (17,0) {};
RNNs are Turing-complete [Siegelmann, 1995]: given the proper parameters, they can simulate arbitrary programs.
Learning to Execute [Zaremba and Sutskever, 2014]
import tensorflow as tf

# lstm_size, batch_size, words_in_dataset, out_w, out_b, loss and
# target_words are assumed to be defined elsewhere.
lstm = tf.nn.rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
probabilities = []
loss_val = 0.0
for batch_of_words in words_in_dataset:
    # The state is updated after processing each batch of words.
    output, state = lstm(batch_of_words, state)
    # The output is used to make next-word predictions.
    scores = tf.matmul(output, out_w) + out_b
    probabilities.append(tf.nn.softmax(scores))
    loss_val += loss(probabilities, target_words)
Problem: for word-level classification, we may need to incorporate information from both the left and the right context of a word.
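A bidirectional RNN addresses this by running one RNN forward and one backward over the sequence and concatenating their hidden states; a minimal numpy sketch, where the `step` cell and its parameters are illustrative assumptions:

import numpy as np

def step(x, h, W_h, W_x):
    return np.tanh(W_h @ h + W_x @ x)

def birnn(xs, params_fwd, params_bwd, d_h):
    h_f, h_b = np.zeros(d_h), np.zeros(d_h)
    fwd, bwd = [], []
    for x in xs:                      # left-to-right pass
        h_f = step(x, h_f, *params_fwd)
        fwd.append(h_f)
    for x in reversed(xs):            # right-to-left pass
        h_b = step(x, h_b, *params_bwd)
        bwd.append(h_b)
    bwd.reverse()
    # Each word representation now sees both its left and right context.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]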