%%capture
%load_ext autoreload
%load_ext tikzmagic
%autoreload 2
import sys
sys.path.append("..")
import numpy as np
#reveal configuration
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
    'theme': 'white',
    'transition': 'none',
    'controls': 'false',
    'progress': 'true',
})
%%html
<style>
.red { color: #E41A1C; }
.orange { color: #FF7F00 }
.yellow { color: #FFC020 }
.green { color: #4DAF4A }
.blue { color: #377EB8; }
.purple { color: #984EA3 }
h1 {
color: #377EB8;
}
ctb_global_show div.ctb_hideshow.ctb_show {
display: inline;
}
div.tabContent {
padding: 0px;
background: #ffffff;
border: 0px;
}
.left {
float: left;
width: 50%;
vertical-align: text-top;
}
.right {
margin-left: 50%;
vertical-align: text-top;
}
.small {
zoom: 0.9;
-ms-zoom: 0.9;
-webkit-zoom: 0.9;
-moz-transform: scale(0.9,0.9);
-moz-transform-origin: left center;
}
.verysmall {
zoom: 0.75;
-ms-zoom: 0.75;
-webkit-zoom: 0.75;
-moz-transform: scale(0.75,0.75);
-moz-transform-origin: left center;
}
.tiny {
zoom: 0.6;
-ms-zoom: 0.6;
-webkit-zoom: 0.6;
-moz-transform: scale(0.6,0.6);
-moz-transform-origin: left center;
}
.rendered_html blockquote {
border-left-width: 0px;
padding: 15px;
margin: 0px;
width: 100%;
}
.rendered_html th {
padding: 0.5em;
border: 0px;
}
.rendered_html td {
padding: 0.25em;
border: 0px;
}
/* for reveal */
.aside .controls, .reveal .controls {
display: none !important;
width: 0px !important;
height: 0px !important;
}
.rise-enabled .reveal .slide-number {
right: 25px;
bottom: 25px;
font-size: 200%;
color: #377EB8;
}
.rise-enabled .reveal .progress span {
background: #377EB8;
}
.present .top {
position: fixed !important;
top: 0 !important;
}
.present .rendered_html * + p, .present .rendered_html p, .present .rendered_html * + br, .present .rendered_html br {
margin: 0.5em 0;
}
.present tr, .present td {
border: 0px;
padding: 0.35em;
}
.present th {
border: 1px;
}
.present .prompt {
min-width: 0px !important;
transition-duration: 0s !important;
}
.prompt {
min-width: 0px !important;
transition-duration: 0s !important;
}
.rise-enabled .cell li {
line-height: 135%;
}
</style>
Graphical Models, Structured Prediction, Probabilistic Inference, Feature Engineering
Relation Extraction, Matrix Factorization, Representation Learning
Representation Learning, Deep Learning
Change of notation: $$ s_\params(\x,y) \in \mathbb{R} $$ becomes $$ f_\params(\x)_y \in \mathbb{R} $$
where $f_\params(\x) \in \mathbb{R}^{|\Ys|}$ represents the scores for each possible solution $y$
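Concretely (a made-up three-label example, purely for illustration): the model now returns one vector of scores, and the score of any particular solution $y$ is read off by indexing into it.
labels = ["positive", "negative", "neutral"]  # a made-up solution space Y
scores = [1.3, -0.2, 0.7]                     # f_theta(x): one score per candidate y
y = 2
score_of_y = scores[y]                        # f_theta(x)_y, a single real number
y_hat = scores.index(max(scores))             # prediction: the highest-scoring y
print(labels[y_hat], score_of_y)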
A function $\mathcal{L}$ that, given a model $f_\theta$, an input $x$ and a gold output $y$, measures how far the prediction is from the truth, for example:
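One standard choice, and the one used for training further below (`tf.square(target_z - mlp_z)`), is the squared error:
$$ \mathcal{L}(f_\theta, x, y) = \left(f_\theta(x) - y\right)^2 $$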
Goal: find parameters $\theta$ of the model $f_\theta$ that minimize the loss function $\mathcal{L}$
$f_\theta: \mathbb{R}^4 \to \mathbb{R}^2$
import tensorflow as tf
seed = 0
#input
input_sz = 3
output_sz = 1
x = tf.placeholder(tf.float32, shape=[input_sz, 1])  # column input vector
#parameters
W = tf.Variable(tf.random_uniform([output_sz,input_sz], -0.1, 0.1, seed=seed))
b = tf.Variable(tf.zeros(output_sz))
#f_theta
z = tf.nn.sigmoid(tf.matmul(W,x) + b) #sigmoid(Wx + b)
sess = tf.Session()
sess.run(tf.global_variables_initializer()) #initialize W and b
sess.run(W)
array([[-0.07982747, 0.09403337, 0.06975283]], dtype=float32)
sess.run(b)
array([ 0.], dtype=float32)
Forward: $\mathbf{z} = f_\theta(\mathbf{x})$
sess.run(z, feed_dict={x: [[-5.5],[2.0],[-0.5]]})
array([[ 0.64387923]], dtype=float32)
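To make explicit what this run computes, here is a small numpy re-implementation of the same forward pass; it is only a sketch, with `W_np` and `b_np` copied from the values printed by `sess.run(W)` and `sess.run(b)` above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

W_np = np.array([[-0.07982747, 0.09403337, 0.06975283]])  # copied from sess.run(W)
b_np = np.array([[0.0]])                                  # copied from sess.run(b)
x_np = np.array([[-5.5], [2.0], [-0.5]])                  # same input as fed above

z_np = sigmoid(W_np @ x_np + b_np)  # sigmoid(Wx + b), exactly as in the TF graph
print(z_np)                         # roughly 0.6439, matching the output above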
Backward: $\partial\mathbf{W},\partial\mathbf{b},\partial\mathbf{x}$ given upstream gradient $\partial\mathbf{z}$
sess.run(tf.global_variables_initializer())
gradz = [[0.1]]
grad = tf.gradients(z,[W, b, x], grad_ys=gradz)
sess.run(grad, feed_dict={x: [[-5.5],[2.0],[-0.5]]})
[array([[-0.13708647,  0.04984963, -0.01246241]], dtype=float32),
 array([ 0.02492481], dtype=float32),
 array([[ 0.00034354],
        [-0.00093437],
        [-0.00204336]], dtype=float32)]
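For intuition, the same backward pass can be written out with the chain rule: with $\mathbf{a} = \mathbf{W}\mathbf{x} + \mathbf{b}$ and $\mathbf{z} = \sigma(\mathbf{a})$, the upstream gradient is first multiplied by the local sigmoid derivative $\mathbf{z}(1-\mathbf{z})$ and then distributed to $\mathbf{W}$, $\mathbf{b}$ and $\mathbf{x}$. Continuing the numpy sketch from the forward pass (the exact numbers differ from the output above, since re-running the initializer yields new parameter values):
grad_z = np.array([[0.1]])           # upstream gradient, as in grad_ys above
grad_a = grad_z * z_np * (1 - z_np)  # push through the sigmoid: dL/da
grad_W = grad_a @ x_np.T             # dL/dW, shape (1, 3)
grad_b = grad_a                      # dL/db
grad_x = W_np.T @ grad_a             # dL/dx, shape (3, 1)
print(grad_W, grad_b, grad_x, sep="\n")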
#input
x = tf.placeholder(tf.float32, shape=[5,1])
#parameters
W1 = tf.Variable(tf.random_uniform([3,5], seed=seed))
b1 = tf.Variable(tf.zeros([3,1]))
W2 = tf.Variable(tf.random_uniform([3,3], seed=seed))
b2 = tf.Variable(tf.zeros([3,1]))
W3 = tf.Variable(tf.random_uniform([1,3], seed=seed))
b3 = tf.Variable(tf.zeros([1,1]))
#model
h1 = tf.nn.sigmoid(tf.matmul(W1,x) + b1)
h2 = tf.nn.sigmoid(tf.matmul(W2,h1) + b2)
mlp_z = tf.matmul(W3,h2) + b3
sess.run(tf.global_variables_initializer())
x_value = [[-5.5], [2.0], [-0.5], [2.0], [4.0]]
sess.run(mlp_z, feed_dict={x: x_value})
array([[ 1.35592151]], dtype=float32)
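Written out, the model evaluated above composes two sigmoid layers with a final linear layer:
$$ \mathbf{h}_1 = \sigma(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1), \quad \mathbf{h}_2 = \sigma(\mathbf{W}_2\mathbf{h}_1 + \mathbf{b}_2), \quad z = \mathbf{W}_3\mathbf{h}_2 + \mathbf{b}_3 $$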
target_z = tf.constant([[1.0]]) # what the output should be
loss = tf.square(target_z - mlp_z) # the loss function
optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
opt_op = optimizer.minimize(loss) # the TF operation that performs optimisation steps
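For reference, Adagrad scales each parameter's step by its accumulated squared gradients (this is the textbook update rule, not something defined in this notebook):
$$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\sum_{\tau \le t} g_\tau^2} + \epsilon}\, g_t $$
where $g_t$ is the gradient of the loss at step $t$ and $\eta$ is the learning rate (0.1 above).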
sess.run(tf.global_variables_initializer())
for epoch in range(5):
    _, loss_value = sess.run([opt_op, loss], feed_dict={x: x_value})
    print(loss_value)  # loss after each update step
[[ 0.07483862]]
[[ 0.00129254]]
[[ 7.06945139e-06]]
[[ 3.01882075e-08]]
[[ 1.25567112e-10]]
It learned!
sess.run(mlp_z, feed_dict={x: x_value})
array([[ 0.99999923]], dtype=float32)