In the early days of artificial intelligence, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers - problems that can be described by a list of formal, mathematical rules. The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images.

Goodfellow et al. 2016, Deep Learning

- object recognition
- speech recognition
- speech *generation*
- labeling images

- What are the correct pixel values for a "bike" feature?
- race bike, mountain bike, e-bike?
- pixels in the shadow may be *much* darker
- what if bike is mostly obscured by rider standing in front?

... a layer at a time

- hierarchical feature extraction (start simple, end complex)
- function composition (see http://colah.github.io/posts/2015-09-NN-Types-FP/)

- neuroscientific motivation
- linear models

- binary output (0 or 1)
- neurons may have inhibiting (negative) and excitatory (positive) inputs
- each neuron has a threshold that has to be surpassed by the sum of activations for the neuron to get active (output 1)
- if just one inhibitory input is active, the neuron will not activate
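The rules above can be sketched as a small function. This is only an illustration: the function name and argument layout are made up here, and the threshold is treated as surpassed once the summed excitation reaches it (`>=`), which is an assumption.

```python
def mcculloch_pitts(excitatory, inhibitory, threshold):
    """Binary threshold unit with absolute inhibition (illustrative sketch).

    excitatory, inhibitory: lists of 0/1 input activations.
    Returns 1 if the neuron fires, else 0.
    """
    # A single active inhibitory input vetoes the neuron.
    if any(inhibitory):
        return 0
    # Assumption: reaching the threshold counts as surpassing it.
    return 1 if sum(excitatory) >= threshold else 0


pairs = [(a, b) for a in (0, 1) for b in (0, 1)]
# Logical AND: both inputs must be active to reach threshold 2.
and_outputs = [mcculloch_pitts([a, b], [], threshold=2) for a, b in pairs]
# Logical OR: one active input suffices.
or_outputs = [mcculloch_pitts([a, b], [], threshold=1) for a, b in pairs]
# An active inhibitory input blocks firing regardless of excitation.
vetoed = mcculloch_pitts([1, 1], [1], threshold=1)
```

With only the threshold changed, the same unit computes AND or OR, which is why these neurons were read as logic gates.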

- compute linear combination of inputs
- return +1 if result is positive, -1 if result is negative
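A minimal sketch of such a unit, together with the classic mistake-driven perceptron update. The function names, the learning rate, and the choice to map a result of exactly zero to -1 are assumptions made for illustration.

```python
def perceptron_predict(x, w, b):
    """Return +1 or -1 from the sign of the linear combination w.x + b."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Assumption: a result of exactly 0 is treated as negative.
    return 1 if activation > 0 else -1


def perceptron_train(samples, labels, epochs=20, lr=1.0):
    """Perceptron learning rule: adjust weights only on misclassified samples."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if perceptron_predict(x, w, b) != y:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


X = [[0, 0], [0, 1], [1, 0], [1, 1]]

# AND is linearly separable, so training finds a separating line.
y_and = [-1, -1, -1, 1]
w_and, b_and = perceptron_train(X, y_and)
and_predictions = [perceptron_predict(x, w_and, b_and) for x in X]

# XOR is not linearly separable, so at least one point stays misclassified.
y_xor = [-1, 1, 1, -1]
w_xor, b_xor = perceptron_train(X, y_xor)
xor_predictions = [perceptron_predict(x, w_xor, b_xor) for x in X]
```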

- Perceptrons can only solve linearly separable problems
- Big loss of interest in neural networks

- distributed representations
- backpropagation gets popular

Several "origins" in different fields, see e.g.

Henry J. Kelley (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10), 947-954.

Arthur E. Bryson (1961, April). A gradient method for optimizing multi-stage allocation processes. In Proceedings of the Harvard Univ. Symposium on digital computers and their applications.

Paul Werbos (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.

Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536.

- Only applicable in case of supervised learning
- Doesn't scale well to multiple layers (as they thought *at the time*)
- Can converge to poor local minima (as they thought *at the time*)

- everything starts with: Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7), 1527-1554.
- deep neural networks can be trained efficiently, if the weights are initialized intelligently
- return of backpropagation

So why the ~~hype~~ success *now*?

It is true that some skill is required to get good performance from a deep learning algorithm. Fortunately, the amount of skill required reduces as the amount of training data increases. The learning algorithms reaching human performance on complex tasks today are nearly identical to the learning algorithms that struggled to solve toy problems in the 1980s [...].

Goodfellow et al. 2016, Deep Learning

As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples.

Goodfellow et al. 2016, Deep Learning

thanks to faster/better

- hardware (CPUs, GPUs)
- network infrastructure
- software implementations

Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years.

Goodfellow et al. 2016, Deep Learning

- deep networks consistently win prestigious competitions (e.g., ImageNet)
- deep learning solves increasingly complex problems (e.g., sequence-to-sequence learning)
- deep learning has started to fuel *other research areas*

and most importantly: Deep learning is *highly profitable*

Deep learning is now used by many top technology companies including Google, Microsoft, Facebook, IBM, Baidu, Apple, Adobe, Netflix, NVIDIA and NEC.

Goodfellow et al. 2016, Deep Learning

So “multi-layer” neural networks do not use the perceptron learning procedure.

They should never have been called multi-layer perceptrons.

Geoffrey Hinton, Neural Networks for Machine Learning Lec. 3

We want to predict

- 0 from [0,0]
- 0 from [1,1]
- 1 from [0,1]
- 1 from [1,0]

$f(\mathbf{x}; \mathbf{w}, b) = \mathbf{x}^T\mathbf{w} + b$

- with Mean Squared Error cost (MSE), this leads to: $\mathbf{w}=0, b=0.5$
- mapping every point to 0.5!
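A quick numerical check of this result, as a sketch: plain gradient descent on the MSE of the linear model over the four XOR points. The starting point, step size and iteration count are arbitrary choices made here.

```python
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR targets


def predict(x, w, b):
    """Linear model f(x; w, b) = x^T w + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def mse_gradient(w, b):
    """Gradient of the mean squared error with respect to w and b."""
    n = len(X)
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    for x, t in zip(X, y):
        err = predict(x, w, b) - t
        for k in range(2):
            grad_w[k] += 2 * err * x[k] / n
        grad_b += 2 * err / n
    return grad_w, grad_b


# Gradient descent from an arbitrary starting point.
w, b = [0.3, -0.2], 0.0
for _ in range(5000):
    gw, gb = mse_gradient(w, b)
    w = [wi - 0.1 * gi for wi, gi in zip(w, gw)]
    b -= 0.1 * gb

predictions = [predict(x, w, b) for x in X]
```

The weights shrink to zero and the bias settles at 0.5, so all four inputs get the same output.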

$f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T (\mathbf{W}^T\mathbf{x} + \mathbf{c}) + b$

Design matrix: $\mathbf{X} = \begin{bmatrix}0 & 0 \\0 & 1\\1 & 0 \\1 & 1\end{bmatrix}$

Parameters: $\mathbf{W} = \begin{bmatrix}1 & 1 \\1 & 1\end{bmatrix}$, $\mathbf{c} = \begin{bmatrix}0 \\ -1 \end{bmatrix}$, $\mathbf{w} = \begin{bmatrix}1 \\ -2 \end{bmatrix}$, $b = 0$

Input to hidden layer: $\mathbf{X}\mathbf{W} = \begin{bmatrix}0 & 0 \\1 & 1\\1 & 1 \\2 & 2\end{bmatrix}$, add $\mathbf{c}$ to every row ==> $\begin{bmatrix}0 & -1 \\1 & 0\\1 & 0 \\2 & 1\end{bmatrix}$

$f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T \max(0, \mathbf{W}^T\mathbf{x} + \mathbf{c}) + b$

Output of rectified linear transformation: $\begin{bmatrix}0 & 0 \\1 & 0\\1 & 0 \\2 & 1\end{bmatrix}$

The remaining hidden-to-output transformation is linear, but the classes are already linearly separable.
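The worked example can be replayed in a few lines. This sketch uses the parameter values given above, with the output bias taken to be $b = 0$; the function names are illustrative.

```python
W = [[1, 1], [1, 1]]   # input-to-hidden weights
c = [0, -1]            # hidden biases
w = [1, -2]            # hidden-to-output weights
b = 0                  # output bias, taken to be 0 here


def relu(z):
    return max(0, z)


def hidden_layer(x):
    """Rectified hidden activations max(0, W^T x + c)."""
    return [relu(sum(W[i][j] * x[i] for i in range(2)) + c[j]) for j in range(2)]


def xor_net(x):
    """Full network: w^T max(0, W^T x + c) + b."""
    h = hidden_layer(x)
    return sum(wj * hj for wj, hj in zip(w, h)) + b


X = [[0, 0], [0, 1], [1, 0], [1, 1]]
hidden = [hidden_layer(x) for x in X]
outputs = [xor_net(x) for x in X]
```

`hidden` reproduces the rectified matrix shown above, and `outputs` is the XOR truth table.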

- Like other machine learning algorithms, neural networks learn by *minimizing a cost function*.
- Cost functions in neural networks are normally not convex, so they cannot be minimized in closed form.
- The solution is to do gradient descent.

- Minimize squared error $f(\mathbf{\hat{\beta}}) = \frac{1}{2}||\mathbf{X\hat{\beta} - y}||^2_2$
- Closed form: solve the **normal equations** $\mathbf{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$

- Alternatively, follow the gradient: $\nabla_{\hat{\beta}} f(\mathbf{\hat{\beta}})= \mathbf{X}^T\mathbf{X}\mathbf{\hat{\beta}} - \mathbf{X^Ty}$
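Both routes can be compared on a toy problem. The sketch below fits a line to four made-up points, once with the normal equations (a 2x2 system solved by hand) and once by following the gradient $\mathbf{X}^T\mathbf{X}\mathbf{\hat{\beta}} - \mathbf{X^Ty}$. The data, step size and iteration count are assumptions.

```python
# Toy data generated from y = 1 + 2x; the first column of X is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

# Precompute X^T X (2x2) and X^T y (length 2).
XtX = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
Xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(2)]

# Closed form: solve the normal equations with the explicit 2x2 inverse.
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
beta_closed = [
    (XtX[1][1] * Xty[0] - XtX[0][1] * Xty[1]) / det,
    (XtX[0][0] * Xty[1] - XtX[1][0] * Xty[0]) / det,
]

# Gradient descent: repeatedly step against X^T X beta - X^T y.
beta_gd = [0.0, 0.0]
learning_rate = 0.05
for _ in range(5000):
    grad = [XtX[i][0] * beta_gd[0] + XtX[i][1] * beta_gd[1] - Xty[i] for i in range(2)]
    beta_gd = [bi - learning_rate * gi for bi, gi in zip(beta_gd, grad)]
```

Both estimates agree on intercept 1 and slope 2. For neural networks only the second route is available.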

How about a net with several layers?

- basically, just the chain rule: $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$
- chained over several layers:
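A minimal sketch of the chain rule applied across two layers, using a scalar toy network (tanh hidden unit, linear output, squared error). The network shape and all numbers are invented for illustration. The analytic gradients are checked against central finite differences.

```python
import math


def forward(x, w1, w2):
    h = math.tanh(w1 * x)  # layer 1
    y = w2 * h             # layer 2
    return h, y


def loss(x, t, w1, w2):
    _, y = forward(x, w1, w2)
    return 0.5 * (y - t) ** 2


def backward(x, t, w1, w2):
    """Backpropagation: multiply local derivatives from the loss back to each weight."""
    h, y = forward(x, w1, w2)
    dL_dy = y - t                  # dL/dy
    dL_dw2 = dL_dy * h             # dL/dy * dy/dw2
    dL_dh = dL_dy * w2             # dL/dy * dy/dh
    dh_dz = 1.0 - h ** 2           # derivative of tanh at z = w1 * x
    dL_dw1 = dL_dh * dh_dz * x     # dL/dh * dh/dz * dz/dw1
    return dL_dw1, dL_dw2


x, t, w1, w2 = 0.5, 1.0, 0.8, -1.2
grad_w1, grad_w2 = backward(x, t, w1, w2)

# Numerical check with central differences.
eps = 1e-6
num_w1 = (loss(x, t, w1 + eps, w2) - loss(x, t, w1 - eps, w2)) / (2 * eps)
num_w2 = (loss(x, t, w1, w2 + eps) - loss(x, t, w1, w2 - eps)) / (2 * eps)
```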

- the *loss* (or *cost*) function indicates the cost incurred from a false prediction / misclassification
- probably the best-known loss function in machine learning is **mean squared error**: $\frac{1}{n} \sum_{i=1}^{n}{(\hat{y}_i - y_i)^2}$
- most of the time, in deep learning we use **cross entropy**: $- \sum_j{t_j \log(y_j)}$

This is the negative log probability of the right answer.
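Both losses are short enough to write out directly. In this sketch the function names and example values are illustrative; the cross entropy assumes one-hot targets `t` and predicted class probabilities `y`.

```python
import math


def mean_squared_error(y_hat, y):
    """Average of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


def cross_entropy(targets, probs):
    """-sum_j t_j log(y_j); terms with t_j = 0 contribute nothing and are skipped."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)


# Predicting 0.5 everywhere on XOR targets gives an MSE of 0.25.
mse = mean_squared_error([0.5, 0.5, 0.5, 0.5], [0, 1, 1, 0])

# With a one-hot target, only the probability of the right class matters.
ce = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
```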

- purpose of activation function: introduce nonlinearity (see above)
- for a long time, the sigmoid (logistic) activation function was used a lot: $y = \frac{1}{1 + e^{-z}}$
- now *rectified linear units* (ReLUs) are preferred: $y = \max(0, z)$
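The two activation functions and their derivatives, as a small sketch with illustrative function names. One commonly cited reason for the switch is visible in the derivatives: the sigmoid's gradient becomes tiny for large inputs, while the ReLU's stays at 1 for any positive input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    # Convention chosen here: the derivative at exactly 0 is taken as 0.
    return 1.0 if z > 0 else 0.0


saturated = sigmoid_grad(10.0)   # close to zero: the sigmoid has saturated
unsaturated = relu_grad(10.0)    # stays at 1 for any positive input
```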

- conventional feedforward networks need equally sized input (images for example normally aren't!)
- convolution operation extracts image features

- Strictly, the operation shown above (and implemented in most DL libraries) is not convolution, but cross-correlation
- 1-dimensional discrete convolution: $s(t) = (x * w)(t) = \sum_a{x(a) w(t-a)}$
- 2-dimensional convolution: $S(i,j) = (I * K)(i,j) = \sum_m \sum_n{I(m,n)K(i-m,j-n)}$
- 2-dimensional cross-correlation: $S(i,j) = (I \star K)(i,j) = \sum_m \sum_n{I(i+m,j+n)K(m,n)}$

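```
pkg load signal  # Octave: xcorr2 is provided by the signal package
A = [1,2,3;4,5,6;7,8,9] # input "image"
# padded input matrix, for easier visualization
A_padded = [zeros(1,size(A,2)+2); [zeros(size(A,1),1), A, zeros(size(A,1),1)]; zeros(1,size(A,2)+2)]
B = [1,0;0,0] # kernel
# real convolution (the kernel is flipped before sliding)
C_full = conv2(A,B,'full') # default, 4x4 result
C_same = conv2(A,B,'same') # 3x3 result, same size as A
C_valid = conv2(A,B,'valid') # 2x2 result, no padding
# cross-correlation (no kernel flip)
XC = xcorr2(A,B)
```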
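The difference between the two operations can also be sketched in plain Python, using the same `A` and `B` as the example above and restricted to the "valid" region. The function names are illustrative. Convolution is implemented as cross-correlation with the kernel flipped in both directions.

```python
def cross_correlate2d_valid(image, kernel):
    """S(i,j) = sum_m sum_n I(i+m, j+n) K(m,n), only where the kernel fits fully."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n] for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def flip(kernel):
    """Reverse the kernel along both axes."""
    return [row[::-1] for row in kernel[::-1]]


def convolve2d_valid(image, kernel):
    """True convolution equals cross-correlation with the flipped kernel."""
    return cross_correlate2d_valid(image, flip(kernel))


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0], [0, 0]]

xcorr = cross_correlate2d_valid(A, B)  # picks the top-left pixel of each window
conv = convolve2d_valid(A, B)          # picks the bottom-right pixel of each window
```

For a symmetric kernel the two results coincide. For a learned kernel the distinction does not matter in practice, since the network can just as well learn the flipped weights.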

- Edge enhance: $\begin{bmatrix}0 & 0 & 0\\-1 & 1 & 0\\0 & 0 & 0\end{bmatrix}$, edge detect: $\begin{bmatrix}0 & 1 & 0\\1 & -4 & 1\\0 & 1 & 0\end{bmatrix}$

- Blur: $\frac{1}{9}\begin{bmatrix}1 & 1 & 1\\1 & 1 & 1\\1 & 1 & 1\end{bmatrix}$, sharpen: $\begin{bmatrix}0 & -1 & 0\\-1 & 5 & -1\\0 & -1 & 0\end{bmatrix}$
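A small sketch of what two of these kernels do, applied to made-up 4x4 images with a sliding-window sum over the valid region. The sharpen kernel sums to 1, so it leaves a flat region unchanged. The edge-detect kernel sums to 0, so it responds only where intensity changes.

```python
def apply_kernel(image, kernel):
    """Slide the kernel over the image (valid region only) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n] for m in range(kh) for n in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
EDGE_DETECT = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

flat = [[7] * 4 for _ in range(4)]           # uniform gray image
step = [[0, 0, 10, 10] for _ in range(4)]    # vertical edge in the middle

sharpen_flat = apply_kernel(flat, SHARPEN)
edge_flat = apply_kernel(flat, EDGE_DETECT)
edge_step = apply_kernel(step, EDGE_DETECT)
```

Both kernels are symmetric, so convolution and cross-correlation give the same result here.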

- predict next word given preceding ones
- based on statistical properties of the distribution of sequences of words

- n-gram/count-based (e.g., Latent Semantic Analysis)
- predictive (neural network language models, e.g., word2vec)

- choose n-gram size n
- estimate the probability $P(w_{t+1}|w_1,...,w_{t-2},w_{t-1},w_t)$ by ignoring context beyond the last $n-1$ words: divide the count of the full n-gram by the count of its $(n-1)$-word context
- e.g., with bigrams: $P(w_{t+1}|w_t) = \frac{count(w_t,w_{t+1})}{count(w_t)}$
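The bigram estimate can be sketched with two counters. The corpus and function name are made up. As a small deviation from the formula, the denominator counts only words that have a successor, so that the probabilities for each context sum to one.

```python
from collections import Counter


def bigram_probability(tokens, prev_word, next_word):
    """Estimate P(next_word | prev_word) from raw counts."""
    # Count context words only where a following word exists.
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    if context_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, next_word)] / context_counts[prev_word]


tokens = "the cat sat on the mat the cat ran".split()
p_cat = bigram_probability(tokens, "the", "cat")   # "the cat" occurs 2 out of 3 times
p_mat = bigram_probability(tokens, "the", "mat")   # "the mat" occurs 1 out of 3 times
p_dog = bigram_probability(tokens, "the", "dog")   # unseen bigram
```

The zero probability for the unseen bigram shows why count-based models need smoothing once the vocabulary grows.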

- choose a context size n, as in ngrams
- map each word $w_{t-i}$ in the $(n-1)$-word context to an associated d-dimensional feature vector $C_{w_{t-i}}$
- predict next word using standard NN architecture with tanh (hidden layer) and softmax (output layer) activation functions
- train network to maximize log likelihood $L(\theta)=\sum_t{\log P(w_t|w_{t-n+1}, ..., w_{t-1})}$ using stochastic gradient descent
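A forward-pass sketch of a model of the kind described above: embedding lookup, concatenation, tanh hidden layer, softmax output, and the log likelihood of a token sequence. The vocabulary, layer sizes and random initialization are assumptions. Biases and the training loop are omitted to keep it short.

```python
import math
import random

random.seed(0)
vocab = ["<s>", "jane", "said", "hi", "to", "john"]
V, d, n, hidden_size = len(vocab), 3, 3, 4  # context = n - 1 = 2 words


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


C = rand_matrix(V, d)                      # one d-dimensional feature vector per word
H = rand_matrix(hidden_size, (n - 1) * d)  # context-to-hidden weights
U = rand_matrix(V, hidden_size)            # hidden-to-output weights


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    top = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(context_ids):
    """P(w_t | context): concatenate embeddings, apply tanh, then softmax."""
    x = [value for i in context_ids for value in C[i]]
    h = [math.tanh(a) for a in matvec(H, x)]
    return softmax(matvec(U, h))


def log_likelihood(token_ids):
    """Sum of log P(w_t | w_{t-n+1}, ..., w_{t-1}) over the sequence."""
    total = 0.0
    for t in range(n - 1, len(token_ids)):
        probs = next_word_distribution(token_ids[t - n + 1:t])
        total += math.log(probs[token_ids[t]])
    return total


sentence = [vocab.index(word) for word in ["<s>", "jane", "said", "hi", "to", "john"]]
probs = next_word_distribution(sentence[:2])
ll = log_likelihood(sentence)
```

Training would adjust `C`, `H` and `U` by stochastic gradient ascent on `ll`.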

Mikolov et al (2013a). Efficient estimation of word representations in vector space. arXiv:1301.3781.

- Continuous Bag of Words (CBOW)
- Skip-Gram
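The two architectures differ mainly in how training examples are formed: CBOW predicts a word from its surrounding context, while Skip-Gram predicts each context word from the center word. The sketch below builds both kinds of examples. The function names and window size are illustrative.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def cbow_examples(tokens, window):
    """(context words, target word): predict the center from its context."""
    return [(context_window(tokens, i, window), center) for i, center in enumerate(tokens)]


def skipgram_pairs(tokens, window):
    """(center word, context word): predict each context word from the center."""
    return [
        (center, context_word)
        for i, center in enumerate(tokens)
        for context_word in context_window(tokens, i, window)
    ]


tokens = ["jane", "said", "hi"]
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_pairs(tokens, window=1)
```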

- cannot process sequential data with "normal" feedforward networks
- in NLP, the n-gram approach cannot handle long-term relationships

Jane walked into the room. John walked in too. It was late in the day, and everyone was walking home after a long day at work. Jane said hi to ___

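```
import tensorflow as tf

# Assumes num_classes, state_size, init_state and rnn_inputs are defined earlier,
# and that 'W' and 'b' already exist in the 'rnn_cell' scope (hence reuse=True).
def rnn_cell(rnn_input, state):
    with tf.variable_scope('rnn_cell', reuse=True):
        W = tf.get_variable('W', [num_classes + state_size, state_size])
        b = tf.get_variable('b', [state_size], initializer=tf.constant_initializer(0.0))
    # TensorFlow >= 1.0 argument order: tf.concat(values, axis)
    return tf.tanh(tf.matmul(tf.concat([rnn_input, state], 1), W) + b)

# Unroll the cell over the input sequence, threading the state through.
state = init_state
rnn_outputs = []
for rnn_input in rnn_inputs:
    state = rnn_cell(rnn_input, state)
    rnn_outputs.append(state)
final_state = rnn_outputs[-1]
```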

from: http://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html
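The same recurrence can be sketched without TensorFlow. Here the single weight matrix applied to the concatenated input and state is split into `W_x` and `W_h`, which is equivalent. All weights and inputs are made-up toy values.

```python
import math


def rnn_step(x, h, W_x, W_h, b):
    """One recurrent step: h_new = tanh(W_x x + W_h h + b)."""
    return [
        math.tanh(
            sum(wx * xi for wx, xi in zip(W_x[k], x))
            + sum(wh * hi for wh, hi in zip(W_h[k], h))
            + b[k]
        )
        for k in range(len(b))
    ]


W_x = [[0.5, -0.3], [0.1, 0.8]]   # input-to-state weights
W_h = [[0.2, 0.0], [-0.4, 0.6]]   # state-to-state weights
b = [0.0, 0.0]

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # a sequence of three input vectors
state = [0.0, 0.0]                             # initial state
outputs = []
for x in inputs:
    # The same weights are reused at every time step.
    state = rnn_step(x, state, W_x, W_h, b)
    outputs.append(state)
final_state = outputs[-1]
```

Because the state is fed back in, each output depends on every earlier input, which a fixed-size feedforward network cannot express.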

```
class BasicRNNCell(RNNCell):
"""The most basic RNN cell."""
def __init__(self, num_units, input_size=None, activation=tanh):
self._num_units = num_units
self._activation = activation
@property
def state_size(self):
return self._num_units
def __call__(self, inputs, state, scope=None):
"""Most basic RNN: output = new_state = act(W * input + U * state + B)."""
with vs.variable_scope(scope or "basic_rnn_cell"):
output = self._activation(
_linear([inputs, state], self._num_units, True, scope=scope))
return output, output
```

```
class GRUCell(RNNCell):
"""Gated Recurrent Unit cell (cf. http://arxiv.org/abs/1406.1078)."""
def __call__(self, inputs, state, scope=None):
"""Gated recurrent unit (GRU) with nunits cells."""
with vs.variable_scope(scope or "gru_cell"):
with vs.variable_scope("gates"): # Reset gate and update gate.
# We start with bias of 1.0 to not reset and not update.
r, u = array_ops.split(value=_linear([inputs, state],
2 * self._num_units,
True,
1.0,
scope=scope),
num_or_size_splits=2,
axis=1)
r, u = sigmoid(r), sigmoid(u)
with vs.variable_scope("candidate"):
c = self._activation(_linear([inputs, r * state],
self._num_units,
True, scope=scope))
new_h = u * state + (1 - u) * c
return new_h, new_h
```

```
class BasicLSTMCell(RNNCell):
def __call__(self, inputs, state, scope=None):
with vs.variable_scope(scope or "basic_lstm_cell"):
c, h = array_ops.split(1, 2, state)
concat = _linear([inputs, h], 4 * self._num_units, True, scope=scope)
# i = input_gate, j = new_input, f = forget_gate, o = output_gate
i, j, f, o = array_ops.split(1, 4, concat)
new_c = (c * sigmoid(f + self._forget_bias) + sigmoid(i) * self._activation(j))
new_h = self._activation(new_c) * sigmoid(o)
new_state = array_ops.concat_v2([new_c, new_h], 1)
return new_h, new_state
```

- first RNN encodes the input, second decodes the output
- applications: e.g., machine translation, and more generally any task that maps one sequence to another

- Andrej Karpathy, Li Fei-Fei: Deep Visual-Semantic Alignments for Generating Image Descriptions
- combining CNNs, bidirectional RNNs, and multimodal embeddings
- Demo

"If you can express your computation as a data flow graph, you can use TensorFlow."

- represent computations as graphs
- nodes are operations
- edges are *Tensors* (multidimensional arrays) input to/output from operations
- to make anything happen, execute the graph in a *Session*
- a *Session* places and runs a graph on a *Device* (GPU, CPU)

char-rnn demo

(based on https://github.com/sherjilozair/char-rnn-tensorflow)