A language model (LM) computes a probability for a sequence of words:
$$p(\langle w_{1}, \ldots, w_{d} \rangle)$$
Language models are useful in a myriad of NLP tasks involving text generation, e.g. machine translation and speech recognition.
In $n$-gram language models, the probability $p(w_{1}, \ldots, w_{d})$ of observing the sentence $\langle w_{1}, \ldots, w_{d} \rangle$ is approximated as:
$$ \begin{aligned} p(w_{1}, \ldots, w_{d}) & = \prod_{i=1}^{d} p(w_{i} \mid w_{1}, \ldots, w_{i - 1}) \\ & \approx \prod_{i=1}^{d} p(w_{i} \mid w_{i - (n - 1)}, \ldots, w_{i - 1}) \\ & \approx \prod_{i=1}^{d} \frac{\text{count}(w_{i - (n - 1)}, \ldots, w_{i})}{\text{count}(w_{i - (n - 1)}, \ldots, w_{i - 1})} \end{aligned} $$
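To make the count-based estimate concrete, here is a minimal bigram sketch in plain Python; the toy corpus and the `bigram_prob` helper are illustrative assumptions, not part of the original slides.

from collections import Counter

# Toy corpus (illustrative assumption).
corpus = "natural language processing is fun . language processing is hard .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # count(prev, word) / count(prev), per the MLE formula above.
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("language", "processing"))  # 1.0: here "language" is always followed by "processing"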
Example with a bigram ($n = 2$) language model:
$$ \begin{aligned} p(\langle \text{Natural}, & \text{Language}, \text{Processing} \rangle) \approx \\ & p(\text{Natural}){}\cdot{}p(\text{Language} \mid \text{Natural}) \\ & {}\cdot{}p(\text{Processing} \mid \text{Language}) \end{aligned} $$
%%tikz -l arrows -s 1000,400 -sc 0.65
\newcommand{\lstm}{
\definecolor{nice-red}{HTML}{E41A1C}
\definecolor{nice-orange}{HTML}{FF7F00}
\definecolor{nice-yellow}{HTML}{FFC020}
\definecolor{nice-green}{HTML}{4DAF4A}
\definecolor{nice-blue}{HTML}{377EB8}
\definecolor{nice-purple}{HTML}{984EA3}
%lstm first step
%lstm module box
\draw[line width=3pt, color=black!50] (0,0) rectangle (3,3);
}
\lstm
\node[] at (0.5,-1.25) {$\mathbf{x}_t$};
\node[] at (-1.5,2) {$\mathbf{h}_{t-1}$};
\node[] at (4.25,2) {$\mathbf{h}_t$};
\node[] at (2.5,5) {$\hat{\mathbf{y}}_t$};
\draw[ultra thick, ->, >=stealth'] (0.5,-0.75) -- (0.5,0);
\draw[ultra thick, ->, >=stealth'] (-0.75,2) -- (0,2);
\draw[ultra thick, ->, >=stealth'] (3,2) -- (3.75,2);
\draw[ultra thick, ->, >=stealth'] (2.5,3) -- (2.5,4.75);
\path[line width=3pt, ->, >=stealth', color=nice-red] (4, 2.5) edge[bend right=0, in=-110, out=-70] (-1.75, 2.5);
\node[] at (1.5,2) {$f_\theta(\mathbf{x}_t, \mathbf{h}_{t-1})$};
Consider the following sentence:
$$\langle w_{1}, \ldots, w_{t - 1}, w_{t}, w_{t + 1}, \ldots, w_{d} \rangle$$
At each time step $t$, the hidden state $\mathbf{h}_t$ and output $\hat{\mathbf{y}}_t$ are given by:
$$ \begin{aligned} \mathbf{x}_{t} & = \text{encode}(w_{t}) \in \mathbb{R}^{d_{e}}\\ \mathbf{h}_t & = \sigma(\mathbf{W}^h \mathbf{h}_{t-1}+ \mathbf{W}^x \mathbf{x}_t) \in \mathbb{R}^{d_{h}}\\ \hat{\mathbf{y}}_{t} & = \text{softmax}(\mathbf{W}^o \mathbf{h}_{t}) \in \mathbb{R}^{|V|} \\ \end{aligned} $$
where $\hat{\mathbf{y}}_{t} \in [0, 1]^{|V|}$ is a probability distribution over words in $V$.
The probability that the next word $w_{t + 1}$ is the $j$-th word in the vocabulary is given by:
$$p(w_{t + 1} = w_{j} \mid w_{1}, \ldots, w_{t}) = \hat{\mathbf{y}}_{t, j}$$
Consider the word sequence $\text{encode}(\text{Natural}, \text{Language}, \text{Processing}) \rightarrow (\mathbf{x}_{1}, \mathbf{x}_{2}, \mathbf{x}_{3})$.
Reminder: $\mathbf{h}_t = \sigma(\mathbf{W}^h \mathbf{h}_{t-1} + \mathbf{W}^x \mathbf{x}_t)$
$$ \begin{aligned} \mathbf{h}_1 = \sigma(\mathbf{W}^h \mathbf{h}_{0} + \mathbf{W}^x \mathbf{x}_1) &\;& \hat{\mathbf{y}}_{1} = \text{softmax}(\mathbf{W}^o \mathbf{h}_{1}) \\ \mathbf{h}_2 = \sigma(\mathbf{W}^h \mathbf{h}_{1} + \mathbf{W}^x \mathbf{x}_2) &\;& \hat{\mathbf{y}}_{2} = \text{softmax}(\mathbf{W}^o \mathbf{h}_{2}) \\ \mathbf{h}_3 = \sigma(\mathbf{W}^h \mathbf{h}_{2} + \mathbf{W}^x \mathbf{x}_3) &\;& \hat{\mathbf{y}}_{3} = \text{softmax}(\mathbf{W}^o \mathbf{h}_{3}) \\ \end{aligned} $$
Since $\hat{\mathbf{y}}_{t}$ predicts the word following $w_{t}$:
$$p(\text{Natural}, \text{Language}, \text{Processing}) = p(\text{Natural}) \; \hat{\mathbf{y}}_{1, [\text{Language}]} \; \hat{\mathbf{y}}_{2, [\text{Processing}]}$$
Recall that $\hat{\mathbf{y}}_{t} \in [0, 1]^{|V|}$ is a probability distribution over the vocabulary $V$.
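A minimal numpy sketch of this unrolled forward pass may help; the dimensions, the random parameters, and the embedding-lookup implementation of $\text{encode}$ below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d_e, d_h, V = 8, 16, 100  # embedding size, hidden size, vocabulary size (assumptions)

E = rng.normal(size=(V, d_e))         # encode(w) is a row lookup in an embedding matrix
W_h = rng.normal(size=(d_h, d_h)) * 0.1
W_x = rng.normal(size=(d_h, d_e)) * 0.1
W_o = rng.normal(size=(V, d_h)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

h = np.zeros(d_h)                     # h_0
for w in [3, 14, 15]:                 # word ids, e.g. for (Natural, Language, Processing)
    x = E[w]                          # x_t = encode(w_t)
    h = sigmoid(W_h @ h + W_x @ x)    # h_t
    y_hat = softmax(W_o @ h)          # distribution over the next word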
We can train an RNN by minimizing the cross-entropy loss, predicting words instead of classes:
$$ J_{t} = - \sum_{i = 1}^{|V|} \mathbf{y}_{t, i} \log \hat{\mathbf{y}}_{t, i}, \quad \text{where} \quad \mathbf{y}_{t, i} = \begin{cases} 1 & \text{if the target word at step $t$ is $w_{i}$,}\\ 0 & \text{otherwise.} \end{cases} $$
For evaluation, we use the negative average log-probability over the corpus:
$$J = - \frac{1}{T} \sum_{t = 1}^{T} \sum_{j = 1}^{|V|} \mathbf{y}_{t, j} \log \hat{\mathbf{y}}_{t, j} = \frac{1}{T} \sum_{t = 1}^{T} J_{t}$$
or, alternatively, the perplexity:
$$PP(w_1,\ldots,w_T) = \sqrt[T]{\prod_{i = 1}^T \frac{1}{p(w_i \mid w_{1}, \ldots, w_{i-1})}}$$
Note that $PP = \exp(J)$ when $J$ is computed with the natural logarithm.
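As a quick sanity check, a minimal numpy sketch of the corpus-level loss and perplexity, assuming `probs` holds a model's probability for each target word (the values below are made up):

import numpy as np

# Assumed per-word probabilities p(w_i | w_1, ..., w_{i-1}) from some model.
probs = np.array([0.2, 0.5, 0.1, 0.4])

J = -np.mean(np.log(probs))   # negative average log-probability
PP = np.exp(J)                # perplexity = exp(J)
print(J, PP)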
Recurrent Neural Networks are extremely powerful and flexible. Seq2Seq models are composed of two RNNs: an encoder, which reads the input sequence into a vector $\mathbf{v}$, and a decoder, which generates the output sequence conditioned on $\mathbf{v}$.
Seq2Seq models are widely popular in, e.g., machine translation, summarization, and dialogue systems:
%%tikz -l arrows -s 1000,400 -sc 0.65
\definecolor{nice-red}{HTML}{E41A1C}
\definecolor{nice-orange}{HTML}{FF7F00}
\definecolor{nice-yellow}{HTML}{FFC020}
\definecolor{nice-green}{HTML}{4DAF4A}
\definecolor{nice-blue}{HTML}{377EB8}
\definecolor{nice-purple}{HTML}{984EA3}
\newcommand{\lstm}{
%lstm first step
%lstm module box
\draw[line width=3pt, color=black!50] (0,0) rectangle (3,3);
\draw[ultra thick, ->, >=stealth'] (0.5,-0.75) -- (0.5,0);
\draw[ultra thick, ->, >=stealth'] (-0.75,2) -- (0,2);
\draw[ultra thick, ->, >=stealth'] (3,2) -- (3.75,2);
\draw[ultra thick, ->, >=stealth'] (2.5,3) -- (2.5,3.75);
}
%\lstm
%\node[] at (0.5,-1.25) {$\mathbf{x}_t$};
%\node[] at (-1.5,2) {$\mathbf{h}_{t-1}$};
%\node[] at (4.25,2) {$\mathbf{h}_t$};
%\node[] at (2.5,5) {$\mathbf{h}_t$};
%\path[line width=3pt, ->, >=stealth', color=nice-blue] (4, 2.5) edge[bend right=0, in=-110, out=-70] (-1.75, 2.5);
%\node[] at (1.5,2) {$f_\theta(\mathbf{x}_t, \mathbf{h}_{t-1})$};
\foreach \x/\w in {0/I, 1/like, 2/neural, 3/networks} {
\begin{scope}[shift={(\x*3.75,0)}]
\lstm
\node[font=\LARGE, text height=1.5ex, color=nice-red] at (0.5,-1.5) {\bf\w};
\end{scope}
}
\foreach \x/\w/\t in {4/EOS/Ich, 5/Ich/mag, 6/mag/neuronale, 7/neuronale/Netze, 8/Netze/EOS} {
\begin{scope}[shift={(\x*3.75,0)}]
\lstm
\node[font=\LARGE, text height=1.5ex] at (0.5,-1.5) {\bf\w};
\node[font=\LARGE, text height=1.5ex, color=nice-blue] at (2.5,4.5) {\bf\t};
\end{scope}
}
\node[font=\Huge, color=nice-red] at (16.5,1.5) {$\mathbf{v}$};
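A schematic numpy sketch of the encoder/decoder loop in the figure above; the `step` cell, the embedding matrix `E`, and greedy argmax decoding are illustrative assumptions (a real Seq2Seq model would use an LSTM or GRU cell and beam search).

import numpy as np

def step(x, h, W_h, W_x):
    # One plain RNN step (stand-in for the LSTM cells in the figure).
    return np.tanh(W_h @ h + W_x @ x)

def seq2seq_greedy(src_ids, E, W_h, W_x, W_o, eos_id, max_len=20):
    h = np.zeros(W_h.shape[0])
    for w in src_ids:                 # encoder: compress the source into v = h
        h = step(E[w], h, W_h, W_x)
    out, w = [], eos_id               # decoder starts from EOS, conditioned on v
    for _ in range(max_len):
        h = step(E[w], h, W_h, W_x)
        w = int(np.argmax(W_o @ h))   # greedily pick the most probable next word
        if w == eos_id:
            break
        out.append(w)
    return out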
Why are RNNs hard to train? The same matrix $\mathbf{W}^{h}$ is multiplied in at each time step during forward propagation, so backpropagation involves repeated products of $\mathbf{W}^{h}$: the norm of the gradient may either tend to 0 (vanish) or grow too large (explode).
As a consequence, words from far-away time steps are hardly taken into account when training to predict the next word.
Example: an RNN is then likely to, e.g., place a near-uniform probability distribution over the nouns in $V$, and a low probability everywhere else, regardless of the distant context.
This is an issue for language modeling, question answering, and many other tasks.
Several solutions in the literature:
Clip the norm of the gradient to a threshold (gradient clipping; see the sketch after this list)
[Pascanu et al. 2013]
Use $\text{ReLU}(x) = \max(0, x)$ (Rectified Linear Units) or similar non-linearities instead of $\text{sigmoid}(x)$ or $\text{tanh}(x)$
[Glorot et al. 2011].
Clever Initialization of the Transition Matrix ($\mathbf{W}^h = \mathbf{I}$)
[Socher et al. 2013, Le et al. 2015].
Use different recurrent models that favour backpropagation
LSTM [Hochreiter and Schmidhuber, 1997], GRU [Chung et al. 2014].
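A minimal sketch of gradient clipping by global norm, in the spirit of Pascanu et al. 2013; the threshold value is an arbitrary assumption:

import numpy as np

def clip_gradient(grads, threshold=5.0):
    # Rescale the whole gradient if its global norm exceeds the threshold.
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads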
%%tikz -l arrows -s 1000,400 -sc 0.65
\newcommand{\lstm}{
\definecolor{nice-red}{HTML}{E41A1C}
\definecolor{nice-orange}{HTML}{FF7F00}
\definecolor{nice-yellow}{HTML}{FFC020}
\definecolor{nice-green}{HTML}{4DAF4A}
\definecolor{nice-blue}{HTML}{377EB8}
\definecolor{nice-purple}{HTML}{984EA3}
%lstm first step
%lstm module box
\draw[line width=3pt, color=black!50] (-6,-3) rectangle (1.5,5.25);
\draw[ultra thick] (0,0) rectangle (1,2);
%memory ct
\draw[ultra thick, color=nice-purple, fill=nice-purple!10] (0,0) rectangle (1,2);
%non-linearities
\foreach \w/\h/\color in {-2/4.25/nice-blue,-2/1/nice-red,-2/-1/nice-green,0.5/-2/nice-yellow,0.5/3/black} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, yshift=-0.5cm, color=\color] plot [domain=-0.3:0.3](\x, {(0.8/(1+exp(-15*\x))+0.1)});
\draw[ultra thick, color=\color] (0,0) circle (0.5cm);
\end{scope}
}
%tanh
\draw[thick, color=black] (0.25,3) -- (0.75,3);
\draw[thick, color=nice-yellow] (0.25,-2) -- (0.75,-2);
%component-wise multiplications
\foreach \w/\h in {-1/1,0.5/-1,0.5/4.25} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, color=black] (0,0) circle (0.05cm);
\draw[ultra thick, color=black] (0,0) circle (0.5cm);
\end{scope}
}
%vector concat
\begin{scope}[shift={(-4,1)},scale=0.5]
\draw[ultra thick,yshift=0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick,yshift=-0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick] (0,0) circle (0.5cm);
\end{scope}
\foreach \fx/\fy/\tx/\ty in {
-5/-3.5/-5/0.85, %xt
-5/0.85/-4.2/0.85,
-6.5/4.25/-5/4.25, %ht1
-5/4.25/-5/1.15,
-5/1.15/-4.2/1.15,
-3.75/1/-3/1, %H
-3/4.25/-3/-2,
-3/-2/0.25/-2, %i
0.5/-1.75/0.5/-1.25,
-3/-1/-2.25/-1, %it
-1.75/-1/0.25/-1,
-3/1/-2.25/1, %ft
-1.75/1/-1.25/1,
-0.75/1/0/1,
-3/4.25/-2.25/4.25, %ot
-1.75/4.25/0.25/4.25,
0.5/2/0.5/2.75, %ct
-5.5/2/-5.1/2, %ct1
-5.5/2/-5.5/1,
-6.5/1/-5.5/1,
-4.9/2/-3.1/2,
-2.9/2/-1/2,
-1/2/-1/1.25
} {
\draw[ultra thick] (\fx,\fy) -- (\tx,\ty);
}
\foreach \fx/\fy/\tx/\ty in {
0.5/-0.75/0.5/0, %it
-0.75/1/0/1, %ft
1/1/2.25/1,
0.5/3.25/0.5/4,
0.75/4.25/2.25/4.25, %ht
0.5/4.5/0.5/6
} {
\draw[->, >=stealth', ultra thick] (\fx,\fy) -- (\tx,\ty);
}
}
%\begin{scope}[scale=0.8]
%\foreach \d in {0,1} {
%\foreach \t in {0,1,2,3,4} {
%\begin{scope}[shift={(\t*8.5+\d*5.5,\d*9.5)}]
% \lstm
%\end{scope}
%}
%}
%\end{scope}
\lstm
%annotations
\node[] at (-5,-3.75) {$\mathbf{x}_t$};
\node[anchor=east] at (-6.5,4.25) {$\mathbf{h}_{t-1}$};
\node[anchor=east] at (-6.5,1) {$\mathbf{c}_{t-1}$};
\node[] at (0.5,6.25) {$\mathbf{h}_t$};
\node[anchor=west] at (2.25,4.25) {$\mathbf{h}_t$};
\node[anchor=west] at (2.25,1) {$\mathbf{c}_t$};
\node[xshift=0.4cm,yshift=0.25cm] at (-4,1) {$\mathbf{H}_t$};
\node[xshift=0.35cm,yshift=0.25cm] at (-2,-1) {$\mathbf{i}_t$};
\node[xshift=0.35cm,yshift=0.25cm] at (-2,1) {$\mathbf{f}_t$};
\node[xshift=0.35cm,yshift=0.25cm] at (-2,4.25) {$\mathbf{o}_t$};
%dummy node for left alignment
\node[] at (17,0) {};
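For reference, the update equations the diagram depicts, with $\mathbf{H}_t$ the concatenation of $\mathbf{x}_t$ and $\mathbf{h}_{t-1}$ and $\odot$ element-wise multiplication (the exact parametrization of the gate weights $\mathbf{W}_{i}, \mathbf{W}_{f}, \mathbf{W}_{o}, \mathbf{W}_{c}$, with biases omitted, is our assumption, following the standard LSTM formulation):
$$ \begin{aligned} \mathbf{H}_t & = [\mathbf{x}_t; \mathbf{h}_{t-1}] \\ \mathbf{i}_t & = \sigma(\mathbf{W}_{i} \mathbf{H}_t) \\ \mathbf{f}_t & = \sigma(\mathbf{W}_{f} \mathbf{H}_t) \\ \mathbf{o}_t & = \sigma(\mathbf{W}_{o} \mathbf{H}_t) \\ \mathbf{c}_t & = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(\mathbf{W}_{c} \mathbf{H}_t) \\ \mathbf{h}_t & = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \end{aligned} $$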
%%tikz -l arrows -s 1000,400 -sc 0.65
\newcommand{\lstm}{
\definecolor{nice-red}{HTML}{E41A1C}
\definecolor{nice-orange}{HTML}{FF7F00}
\definecolor{nice-yellow}{HTML}{FFC020}
\definecolor{nice-green}{HTML}{4DAF4A}
\definecolor{nice-blue}{HTML}{377EB8}
\definecolor{nice-purple}{HTML}{984EA3}
%lstm first step
%lstm module box
\draw[line width=3pt, color=black!50] (-6,-3) rectangle (1.5,5.25);
\draw[ultra thick] (0,0) rectangle (1,2);
%memory ct
\draw[ultra thick, color=nice-purple, fill=nice-purple!10] (0,0) rectangle (1,2);
%non-linearities
\foreach \w/\h/\color in {-2/4.25/nice-blue,-2/1/nice-red,-2/-1/nice-green,0.5/-2/nice-yellow,0.5/3/black} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, yshift=-0.5cm, color=\color] plot [domain=-0.3:0.3](\x, {(0.8/(1+exp(-15*\x))+0.1)});
\draw[ultra thick, color=\color] (0,0) circle (0.5cm);
\end{scope}
}
%tanh
\draw[thick, color=black] (0.25,3) -- (0.75,3);
\draw[thick, color=nice-yellow] (0.25,-2) -- (0.75,-2);
%component-wise multiplications
\foreach \w/\h in {-1/1,0.5/-1,0.5/4.25} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, color=black] (0,0) circle (0.05cm);
\draw[ultra thick, color=black] (0,0) circle (0.5cm);
\end{scope}
}
%vector concat
\begin{scope}[shift={(-4,1)},scale=0.5]
\draw[ultra thick,yshift=0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick,yshift=-0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick] (0,0) circle (0.5cm);
\end{scope}
\foreach \fx/\fy/\tx/\ty in {
-5/-3.5/-5/0.85, %xt
-5/0.85/-4.2/0.85,
-6.5/4.25/-5/4.25, %ht1
-5/4.25/-5/1.15,
-5/1.15/-4.2/1.15,
-3.75/1/-3/1, %H
-3/4.25/-3/-2,
-3/-2/0.25/-2, %i
0.5/-1.75/0.5/-1.25,
-3/-1/-2.25/-1, %it
-1.75/-1/0.25/-1,
-3/1/-2.25/1, %ft
-1.75/1/-1.25/1,
-0.75/1/0/1,
-3/4.25/-2.25/4.25, %ot
-1.75/4.25/0.25/4.25,
0.5/2/0.5/2.75, %ct
-5.5/2/-5.1/2, %ct1
-5.5/2/-5.5/1,
-6.5/1/-5.5/1,
-4.9/2/-3.1/2,
-2.9/2/-1/2,
-1/2/-1/1.25
} {
\draw[ultra thick] (\fx,\fy) -- (\tx,\ty);
}
\foreach \fx/\fy/\tx/\ty in {
0.5/-0.75/0.5/0, %it
-0.75/1/0/1, %ft
1/1/2.25/1,
0.5/3.25/0.5/4,
0.75/4.25/2.25/4.25, %ht
0.5/4.5/0.5/6
} {
\draw[->, >=stealth', ultra thick] (\fx,\fy) -- (\tx,\ty);
}
}
\begin{scope}[scale=0.8]
\foreach \d in {0} {
\foreach \t/\word in {0/A,1/wedding,2/party,3/taking,4/pictures} {
\node[font=\Huge, anchor=west] at (\t*8.5-5.75,-4.5) {$\mathbf{v}$\_\word};
\begin{scope}[shift={(\t*8.5+\d*5.5,\d*9.5)}]
\lstm
\end{scope}
}
}
\end{scope}
\node[font=\Huge, anchor=west] at (27,5.75) {$\mathbf{v}$\_Sentence};
%dummy node for left alignment
\node[] at (17,0) {};
%%tikz -l arrows -s 1000,400 -sc 0.65
\definecolor{nice-red}{HTML}{E41A1C}
\definecolor{nice-orange}{HTML}{FF7F00}
\definecolor{nice-yellow}{HTML}{FFC020}
\definecolor{nice-green}{HTML}{4DAF4A}
\definecolor{nice-blue}{HTML}{377EB8}
\definecolor{nice-purple}{HTML}{984EA3}
\newcommand{\lstm}{
%lstm first step
%lstm module box
\draw[line width=3pt, color=black!50] (-6,-3) rectangle (1.5,5.25);
\draw[ultra thick] (0,0) rectangle (1,2);
%memory ct
\draw[ultra thick, color=nice-purple, fill=nice-purple!10] (0,0) rectangle (1,2);
%non-linearities
\foreach \w/\h/\color in {-2/4.25/nice-blue,-2/1/nice-red,-2/-1/nice-green,0.5/-2/nice-yellow,0.5/3/black} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, yshift=-0.5cm, color=\color] plot [domain=-0.3:0.3](\x, {(0.8/(1+exp(-15*\x))+0.1)});
\draw[ultra thick, color=\color] (0,0) circle (0.5cm);
\end{scope}
}
%tanh
\draw[thick, color=black] (0.25,3) -- (0.75,3);
\draw[thick, color=nice-yellow] (0.25,-2) -- (0.75,-2);
%component-wise multiplications
\foreach \w/\h in {-1/1,0.5/-1,0.5/4.25} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, color=black] (0,0) circle (0.05cm);
\draw[ultra thick, color=black] (0,0) circle (0.5cm);
\end{scope}
}
%vector concat
\begin{scope}[shift={(-4,1)},scale=0.5]
\draw[ultra thick,yshift=0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick,yshift=-0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick] (0,0) circle (0.5cm);
\end{scope}
\foreach \fx/\fy/\tx/\ty in {
-5/-3.5/-5/0.85, %xt
-5/0.85/-4.2/0.85,
-6.5/4.25/-5/4.25, %ht1
-5/4.25/-5/1.15,
-5/1.15/-4.2/1.15,
-3.75/1/-3/1, %H
-3/4.25/-3/-2,
-3/-2/0.25/-2, %i
0.5/-1.75/0.5/-1.25,
-3/-1/-2.25/-1, %it
-1.75/-1/0.25/-1,
-3/1/-2.25/1, %ft
-1.75/1/-1.25/1,
-0.75/1/0/1,
-3/4.25/-2.25/4.25, %ot
-1.75/4.25/0.25/4.25,
0.5/2/0.5/2.75, %ct
-5.5/2/-5.1/2, %ct1
-5.5/2/-5.5/1,
-6.5/1/-5.5/1,
-4.9/2/-3.1/2,
-2.9/2/-1/2,
-1/2/-1/1.25
} {
\draw[ultra thick] (\fx,\fy) -- (\tx,\ty);
}
\foreach \fx/\fy/\tx/\ty in {
0.5/-0.75/0.5/0, %it
-0.75/1/0/1, %ft
1/1/2.25/1,
0.5/3.25/0.5/4,
0.75/4.25/2.25/4.25, %ht
0.5/4.5/0.5/6
} {
\draw[->, >=stealth', ultra thick] (\fx,\fy) -- (\tx,\ty);
}
}
\begin{scope}[scale=0.8]
\foreach \d in {0} {
\foreach \t/\word in {0/A,1/wedding,2/party,3/taking,4/pictures} {
\node[font=\Huge, anchor=west] at (\t*8.5-5.75,-4.5) {$\mathbf{v}$\_\word};
\begin{scope}[shift={(\t*8.5+\d*5.5,\d*9.5)}]
\lstm
\end{scope}
}
}
\end{scope}
\node[font=\Huge, anchor=west] at (27,5.75) {$\mathbf{v}$\_Sentence};
\draw[line width=10pt, color=nice-red, opacity=0.8] (27.6,5) -- (27.6,0.75);
\draw[line width=10pt, color=nice-red, opacity=0.8] (27.5,0.75) -- (3,0.75);
\draw[->, >=stealth', line width=10pt, color=nice-red, opacity=0.8] (2.75,0.75) -- (2.75,-3);
%dummy node for left alignment
\node[] at (17,0) {};
RNN vs. LSTM gradients on the input matrix $\mathbf{W}^x$
%%html
<center>
<video controls autoplay loop>
<source src="rnn-figures/vanishing.mp4" type="video/mp4">
</video>
</center>
%%tikz -l arrows -s 1100,500 -sc 0.65
\definecolor{nice-red}{HTML}{E41A1C}
\definecolor{nice-orange}{HTML}{FF7F00}
\definecolor{nice-yellow}{HTML}{FFC020}
\definecolor{nice-green}{HTML}{4DAF4A}
\definecolor{nice-blue}{HTML}{377EB8}
\definecolor{nice-purple}{HTML}{984EA3}
\newcommand{\lstm}{
%lstm first step
%lstm module box
\draw[line width=3pt, color=black!50] (-6,-3) rectangle (1.5,5.25);
\draw[ultra thick] (0,0) rectangle (1,2);
%memory ct
\draw[ultra thick, color=nice-purple, fill=nice-purple!10] (0,0) rectangle (1,2);
%non-linearities
\foreach \w/\h/\color in {-2/4.25/nice-blue,-2/1/nice-red,-2/-1/nice-green,0.5/-2/nice-yellow,0.5/3/black} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, yshift=-0.5cm, color=\color] plot [domain=-0.3:0.3](\x, {(0.8/(1+exp(-15*\x))+0.1)});
\draw[ultra thick, color=\color] (0,0) circle (0.5cm);
\end{scope}
}
%tanh
\draw[thick, color=black] (0.25,3) -- (0.75,3);
\draw[thick, color=nice-yellow] (0.25,-2) -- (0.75,-2);
%component-wise multiplications
\foreach \w/\h in {-1/1,0.5/-1,0.5/4.25} {
\begin{scope}[shift={(\w,\h)},scale=0.5]
\draw[ultra thick, color=black] (0,0) circle (0.05cm);
\draw[ultra thick, color=black] (0,0) circle (0.5cm);
\end{scope}
}
%vector concat
\begin{scope}[shift={(-4,1)},scale=0.5]
\draw[ultra thick,yshift=0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick,yshift=-0.2cm] (0,0) circle (0.05cm);
\draw[ultra thick] (0,0) circle (0.5cm);
\end{scope}
\foreach \fx/\fy/\tx/\ty in {
-5/-3.5/-5/0.85, %xt
-5/0.85/-4.2/0.85,
-6.5/4.25/-5/4.25, %ht1
-5/4.25/-5/1.15,
-5/1.15/-4.2/1.15,
-3.75/1/-3/1, %H
-3/4.25/-3/-2,
-3/-2/0.25/-2, %i
0.5/-1.75/0.5/-1.25,
-3/-1/-2.25/-1, %it
-1.75/-1/0.25/-1,
-3/1/-2.25/1, %ft
-1.75/1/-1.25/1,
-0.75/1/0/1,
-3/4.25/-2.25/4.25, %ot
-1.75/4.25/0.25/4.25,
0.5/2/0.5/2.75, %ct
-5.5/2/-5.1/2, %ct1
-5.5/2/-5.5/1,
-6.5/1/-5.5/1,
-4.9/2/-3.1/2,
-2.9/2/-1/2,
-1/2/-1/1.25
} {
\draw[ultra thick] (\fx,\fy) -- (\tx,\ty);
}
\foreach \fx/\fy/\tx/\ty in {
0.5/-0.75/0.5/0, %it
-0.75/1/0/1, %ft
1/1/2.25/1,
0.5/3.25/0.5/4,
0.75/4.25/2.25/4.25, %ht
0.5/4.5/0.5/6
} {
\draw[->, >=stealth', ultra thick] (\fx,\fy) -- (\tx,\ty);
}
}
\begin{scope}[scale=0.8]
\foreach \d in {0,1,2} {
\foreach \t/\word in {0/A,1/wedding,2/party,3/taking,4/pictures} {
\node[font=\Huge, anchor=west] at (\t*8.5-5.75,-4.5) {$\mathbf{v}$\_\word};
\begin{scope}[shift={(\t*8.5+\d*5.5,\d*9.5)}]
\lstm
\end{scope}
}
}
\end{scope}
\node[font=\Huge, anchor=west] at (34,20.75) {$\mathbf{v}$\_Sentence};
\draw[line width=10pt, color=nice-red, opacity=0.8] (36.4,16) -- (36.4,20);
\draw[line width=10pt, color=nice-red, opacity=0.8] (25.25,16) -- (36.5,16);
\draw[line width=10pt, color=nice-red, opacity=0.8] (25.25,8.5) -- (25.25,16);
\draw[line width=10pt, color=nice-red, opacity=0.8] (14,8.5) -- (25.25,8.5);
\draw[line width=10pt, color=nice-red, opacity=0.8] (14,8.5) -- (14,0.75);
\draw[line width=10pt, color=nice-red, opacity=0.8] (14,0.75) -- (3,0.75);
\draw[->, >=stealth', line width=10pt, color=nice-red, opacity=0.8] (2.75,0.75) -- (2.75,-3);
%dummy node for left alignment
\node[] at (17,0) {};
RNNs are Turing-complete [Siegelmann, 1995]: given the proper parameters, they can simulate arbitrary programs.
Learning to Execute [Zaremba and Sutskever, 2014]
import tensorflow as tf

# lstm_size, batch_size, words_in_dataset, out_w, out_b, loss and
# target_words are assumed to be defined elsewhere.
lstm = tf.nn.rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
probabilities = []
loss_val = 0.0
for batch_of_words in words_in_dataset:
    # The state is updated after processing each batch of words.
    output, state = lstm(batch_of_words, state)
    # The output is used to make next-word predictions.
    scores = tf.matmul(output, out_w) + out_b
    probabilities.append(tf.nn.softmax(scores))
    loss_val += loss(probabilities, target_words)
Problem: for word-level classification, we may need to incorporate information from both the left and the right context of a word.
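A bidirectional RNN addresses this by running one RNN forward and one backward over the sequence and concatenating their hidden states; a minimal numpy sketch, where the `step` cell and its parameters are illustrative assumptions:

import numpy as np

def step(x, h, W_h, W_x):
    return np.tanh(W_h @ h + W_x @ x)

def birnn(xs, params_fwd, params_bwd, d_h):
    h_f, h_b = np.zeros(d_h), np.zeros(d_h)
    fwd, bwd = [], []
    for x in xs:                      # left-to-right pass
        h_f = step(x, h_f, *params_fwd)
        fwd.append(h_f)
    for x in reversed(xs):            # right-to-left pass
        h_b = step(x, h_b, *params_bwd)
        bwd.append(h_b)
    bwd.reverse()
    # Each word representation now sees both its left and right context.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]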