import numpy as np
from tqdm import tqdm
from sklearn.metrics import confusion_matrix
from seaborn import despine
import seaborn as sns
sns.set_style("ticks")
sns.set_context("talk")
from IPython.display import Image
import matplotlib.pyplot as plt
%matplotlib inline
The goal of this notebook is to introduce convolutional neural networks: a neural network architecture that is (loosely) inspired by the biological visual system and used to process image (or video) data (e.g., to identify the objects depicted in an image).
You will further get a basic introduction to Tensorflow and Keras, two of the most widely used deep learning libraries.
If you stick around, you will also learn a bit about Google's deep dream algorithm at the end of the notebook.
In this notebook, we will be using the Fashion-MNIST dataset.
This dataset contains 70,000 28x28-pixel grey-scale images of clothing items, spanning a total of 10 different categories.
We can load the dataset with the fashion_mnist.load_data() function from tensorflow.keras.datasets (more details on Tensorflow and Keras will follow later).
from tensorflow.keras import datasets
# load Fashion MNIST data
(train_images, train_labels), (test_images, test_labels) = datasets.fashion_mnist.load_data()
# define class names
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
The dataset is already divided into distinct training and test datasets, containing 60,000 and 10,000 images, respectively:
train_images.shape, test_images.shape
((60000, 28, 28), (10000, 28, 28))
To make these images easily digestible for an artificial neural network, we will normalize the pixel values to a range between 0 and 1:
# Normalize pixel values to be between 0 and 1 (and add a trailing channel dimension)
train_images, test_images = np.expand_dims(train_images / 255.0, -1), np.expand_dims(test_images / 255.0, -1)
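Note that np.expand_dims adds a trailing channel dimension (of size 1) to the image arrays; the convolution layers that we define below expect their input in this (height, width, channels) format:
train_images.shape, test_images.shape
((60000, 28, 28, 1), (10000, 28, 28, 1))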
fig, ax = plt.subplots(1,1,figsize=(6,6), dpi=50)
ax.hist(train_images.ravel())
ax.set_xlabel('Pixel value')
ax.set_ylabel('Frequency')
despine(ax=ax)
Ok, the normalization worked. The high frequency of 0 values reflects the consistent black background of each image.
Let's take a look at a few of the training examples:
fig, axs = plt.subplots(5,5,figsize=(12,12), dpi=100)
axs = axs.ravel()
for i, ax in enumerate(axs):
ax.imshow(train_images[i,...,0], cmap='gray')
ax.set_title(class_names[train_labels[i]])
ax.set_xticks([])
ax.set_yticks([])
fig.tight_layout()
fig.savefig('figures/Figure-2-0_Fashion-MNIST.png', dpi=600)
A very simple way to classify this dataset would be to flatten each sample into a vector of 784 values (28x28) and to then feed these vectors to a fully-connected (or dense) artificial neural network, as we did with the handwritten digits in the previous notebook (1-Neural-Networks-Backpropagation.ipynb).
However, there is another common type of artificial neural network architecture that was specifically designed for this type of computer vision problem:
These networks are called convolutional neural networks (or CNNs) and are (very) loosely inspired by our knowledge of the neurobiological processes underlying the vision system.
On a superficial level, the brain processes visual information in the following steps (see the figure below):
First, a projection of the image is registered on the retina. The retina is composed of various types of neurons, each sensitive to specific characteristics of the perceived image. Each neuron further has a receptive field, which restricts its sensitivity to a specific area of the input image.
These retinal neurons are connected to the optic nerve, which sends their output signals to the visual system of the brain.
The visual system encompasses multiple brain regions (such as the primary visual cortex), which each seem to respond to different patterns in (or aspects of) the visual input.
Importantly, lower-level brain regions of the visual system (which are closer to the visual input) are generally more sensitive to basic properties of the input (such as orientations, colors, or contrasts), while higher-level brain regions generally respond to more complex patterns (famous examples are the fusiform face area and the parahippocampal place area, which specifically respond to faces and scenes, respectively).
Image(filename='materials/images/free-use/Human-Vision.jpg')
Adapted from: Kubilius, Jonas (2017): Ventral visual stream. figshare. Figure. https://doi.org/10.6084/m9.figshare.106794.v3
Convolution kernels (or filters) represent the core building block of a convolutional neural network. Conceptually, each kernel mimics a neuron of the visual system and responds to a specific pattern in the visual input.
Computationally, each kernel $k$ is a small square matrix (typically of size 3x3 or 5x5).
To obtain an activation map for a kernel (indicating whether the pattern to which the kernel responds is present at each location of the input), we perform a spatial convolution between the kernel and the image.
To do this, we simply move the kernel over the image and compute a weighted sum between the values of the kernel and the values of the underlying image at each location. Similar to a Perceptron, a bias term $b$ is added to the weighted sum and the result is passed through a non-linear activation function ($\phi$). Typically, convolution kernels use the rectified linear unit (ReLU) activation function.
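Written out, the value of the activation map $a$ at position $(i, j)$, for a $K \times K$ kernel $k$ with bias $b$ applied to an input image $x$, is:
$a_{i,j} = \phi\left(\sum_{u=1}^{K} \sum_{v=1}^{K} k_{u,v} \cdot x_{i+u-1,\, j+v-1} + b\right)$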
Image(filename='materials/images/free-use/Spatial-Convolution.png')
Importantly, the distance between successive applications of the kernel to the input image is called the stride. A stride of 1 pixel creates an activation map of (almost) the same size as the input image (exactly the same size if the image is zero-padded at its borders), while larger strides reduce the size of the activation map:
Image(filename='materials/images/free-use/Stride.png')
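More precisely, for the 'valid' convolution used throughout this notebook (no zero-padding at the image borders), an $N \times N$ input, a $K \times K$ kernel, and a stride of $S$ pixels yield an activation map of size $\lfloor (N - K) / S \rfloor + 1$ along each dimension. For example, convolving a 28x28 Fashion-MNIST image with a 5x5 kernel at a stride of 1 yields a 24x24 activation map (as we will see in the model summary below).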
Let's try this ourselves with some data:
def spatial_convolution_2D(image, kernel, bias=0):
    """spatial 2D convolution between an image and
    a (square) kernel with stride size 1
    """
    m, n = kernel.shape[:2]
    if m != n:
        raise ValueError('kernel must be square')
    yin, xin = image.shape[:2]
    # with stride 1 and no padding, the output shrinks by (kernel size - 1)
    yout = yin - m + 1
    xout = xin - m + 1
    convolved_image = np.zeros((yout, xout))
    for i in range(yout):
        for j in range(xout):
            # weighted sum between the kernel and the underlying image patch
            convolved_image[i, j] = np.sum(image[i:i+m, j:j+m] * kernel) + bias
    return convolved_image
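As a quick sanity check (an illustrative example that is not part of the original notebook), convolving one of the 28x28 training images with a simple 3x3 averaging ("blur") kernel at a stride of 1 should yield a 26x26 activation map (28 - 3 + 1 = 26):
# convolve the first training image with a 3x3 averaging kernel
blur_kernel = np.ones((3, 3)) / 9.0
activation_map = spatial_convolution_2D(train_images[0, ..., 0], blur_kernel)
activation_map.shape  # (26, 26)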
Here, we will be using a few exemplary edge-detection kernels (the first two are based on the Sobel operator):
# a kernel that is sensitive to vertical edges:
vertical_edge_kernel = np.zeros((3,3))
vertical_edge_kernel[:,0] = 1
vertical_edge_kernel[:,2] = -1
vertical_edge_kernel[1,0] = 2
vertical_edge_kernel[1,2] = -2
vertical_edge_kernel
array([[ 1.,  0., -1.],
       [ 2.,  0., -2.],
       [ 1.,  0., -1.]])
# a kernel that is sensitive to horizontal edges:
horizontal_edge_kernel = np.zeros((3,3))
horizontal_edge_kernel[0] = 1
horizontal_edge_kernel[2] = -1
horizontal_edge_kernel[0,1] = 2
horizontal_edge_kernel[2,1] = -2
horizontal_edge_kernel
array([[ 1.,  2.,  1.],
       [ 0.,  0.,  0.],
       [-1., -2., -1.]])
# a kernel that is sensitive to all edges:
edge_detection_kernel = -1.0 * np.ones((3,3))
edge_detection_kernel[1,1] = 8
edge_detection_kernel
array([[-1., -1., -1.],
       [-1.,  8., -1.],
       [-1., -1., -1.]])
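To build an intuition for why these kernels act as edge detectors, consider a toy 3x3 patch with a bright left column and a dark remainder (an illustrative example that is not part of the original notebook): the vertical-edge kernel responds strongly to it, while the horizontal-edge kernel does not respond at all.
# a toy patch containing a vertical edge (bright left column, dark rest)
vertical_edge_patch = np.zeros((3, 3))
vertical_edge_patch[:, 0] = 1
np.sum(vertical_edge_patch * vertical_edge_kernel)    # strong response: 4.0
np.sum(vertical_edge_patch * horizontal_edge_kernel)  # no response: 0.0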
What kind of activation maps do we get when we convolve these three kernels with a few examples of our dataset?
# make figure
fig, axs = plt.subplots(5,4,figsize=(12,12), dpi=80)
# drop axis in upper left corner
axs[0,0].remove()
# plot kernels
axs[0,1].imshow(vertical_edge_kernel, cmap='gray')
axs[0,2].imshow(horizontal_edge_kernel, cmap='gray')
axs[0,3].imshow(edge_detection_kernel, cmap='gray')
# plot activation maps
for i in range(1,5):
# input image
axs[i,0].imshow(train_images[i,...,0], cmap='gray')
# vertical
convolved_img = spatial_convolution_2D(train_images[i,...,0], vertical_edge_kernel)
axs[i,1].imshow(convolved_img, cmap='gray')
# horizontal
convolved_img = spatial_convolution_2D(train_images[i,...,0], horizontal_edge_kernel)
axs[i,2].imshow(convolved_img, cmap='gray')
# edge
convolved_img = spatial_convolution_2D(train_images[i,...,0], edge_detection_kernel)
axs[i,3].imshow(convolved_img, cmap='gray')
# remove ticks
for ax in axs.ravel():
ax.set_xticks([])
ax.set_yticks([])
# label axes
axs[1,0].set_title('Input')
axs[0,1].set_title('Vertical')
axs[0,2].set_title('Horizontal')
axs[0,3].set_title('Edge')
# save
fig.tight_layout()
fig.savefig('figures/Figure-2-1_Edge-Detection-Kernels.png', dpi=600)
As expected, the activation maps highlight the characteristics of the image that each kernel is sensitive to: vertical edges, horizontal edges, all edges!
At its core, a CNN is nothing more than a sequence of convolution layers, each composed of a stack of convolution kernels!
Importantly, the kernels of each layer are applied to the activation maps resulting from the previous layer. This trick allows higher-level convolution kernels (which are deeper into the network) to learn very abstracted features, based on the activation maps of the preceding lower-level convolution kernels (for more details on this, see this amazing paper: https://distill.pub/2018/building-blocks/).
Image(filename='materials/images/free-use/Convolutional-Neural-Network.png')
Classical CNN architectures contain one more type of kernel, which is used to down-sample the activation maps.
These kernels are called pooling kernels.
Two classical pooling kernels are average and max pooling.
They do nothing other than return the average or maximum value of the input in their receptive field.
Importantly, they are moved over the input in non-overlapping steps: a pooling kernel of size 2x2 is therefore moved over the input in steps of 2 pixels, thereby down-sampling the input by a factor of 2:
Image(filename='materials/images/free-use/Pooling-Kernels.png')
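To make the pooling operation concrete, here is a minimal NumPy sketch of non-overlapping 2x2 max and average pooling (an illustration only; pool_2d is a hypothetical helper and not how Keras implements its pooling layers):
# minimal sketch of 2x2 max / average pooling with non-overlapping steps
def pool_2d(activation_map, pool_size=2, mode='max'):
    y, x = activation_map.shape
    # crop so that the map is evenly divisible by the pool size
    y, x = y - y % pool_size, x - x % pool_size
    patches = activation_map[:y, :x].reshape(y // pool_size, pool_size,
                                             x // pool_size, pool_size)
    if mode == 'max':
        return patches.max(axis=(1, 3))
    return patches.mean(axis=(1, 3))

# example: pooling a 4x4 map down to 2x2
example_map = np.arange(16).reshape(4, 4)
pool_2d(example_map, mode='max')      # array([[ 5,  7], [13, 15]])
pool_2d(example_map, mode='average')  # array([[ 2.5,  4.5], [10.5, 12.5]])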
Enough with the theory, let's build our first CNN!
To do this, we will be using one of the most widely used deep learning libraries: Tensorflow.
Tensorflow is a Python library that makes it comparatively easy to build and train state-of-the-art artificial neural network architectures.
Specifically, we will be using its high-level Keras API.
In Keras, a deep learning model can be specified with the Sequential model class. Calling models.Sequential() initiates an empty model.
We can then sequentially add layers to this model from Keras' layers module. Here, we will be focusing on the Conv2D layer type (representing a 2D convolution layer) as well as on the AveragePooling2D and Dense layer types. Dense layers represent fully-connected artificial neural network layers, such as the ones that we used in the previous notebook (1-Neural-Networks-Backpropagation.ipynb), each containing a set of individual Perceptrons.
For an overview of all available layer types, see the Keras layers documentation.
Here, we will re-create the classical LeNet-5 architecture.
Note that this architecture, in spite of its historical importance, is outdated! Nowadays, one would probably use smaller kernels, ReLU activations, and more convolution layers (instead of the three fully-connected layers added at the end; see below).
LeNet-5 includes three dense layers at the end of the network, with 120, 84, and 10 neurons, respectively (see the code below). To allow for the addition of these layers, we need to flatten the activation maps resulting from the last convolution layer, such that each sample is represented as a single row vector that can be processed by the dense layers:
import tensorflow as tf
from tensorflow.keras import layers, models, losses, Model
# initialize a sequential model
model = models.Sequential()
# add our first convolution layer
model.add(layers.Conv2D(filters=6, # this layer contains 6 kernels
kernel_size=(5, 5), # each being 5 x 5 values large
activation='tanh', # and activated through a tanh activation function (as in the original LeNet-5)
input_shape=(28, 28, 1))) # this is the shape of the input
# now lets add a 2x2 average pooling layer:
model.add(layers.AveragePooling2D(pool_size=(2,2)))
# and then another convolution layer:
model.add(layers.Conv2D(filters=16, # this time containing 16 kernels
kernel_size=(5, 5), # each again with a size of 5x5 values
activation='tanh'))
# another 2x2 average pooling layer:
model.add(layers.AveragePooling2D(pool_size=(2,2)))
# and add dense output layers
model.add(layers.Flatten()) # flatten the activation maps of the last convolution layer
model.add(layers.Dense(120, activation='tanh'))
model.add(layers.Dense(84, activation='tanh'))
model.add(layers.Dense(10, activation='softmax')) # one neuron for each of the 10 classes in our dataset
Ok, let's take a look at our full model by calling model.summary():
model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= conv2d (Conv2D) (None, 24, 24, 6) 156 _________________________________________________________________ average_pooling2d (AveragePo (None, 12, 12, 6) 0 _________________________________________________________________ conv2d_1 (Conv2D) (None, 8, 8, 16) 2416 _________________________________________________________________ average_pooling2d_1 (Average (None, 4, 4, 16) 0 _________________________________________________________________ flatten (Flatten) (None, 256) 0 _________________________________________________________________ dense (Dense) (None, 120) 30840 _________________________________________________________________ dense_1 (Dense) (None, 84) 10164 _________________________________________________________________ dense_2 (Dense) (None, 10) 850 ================================================================= Total params: 44,426 Trainable params: 44,426 Non-trainable params: 0 _________________________________________________________________
This is looking good.
We can see that the size of the activation maps is decreasing as the data is passed through the network.
We can also see that the activation maps of each convolution layer are stacked along the last dimension (look at the "Output Shape" column).
The flattened activation maps of our last convolution layer have a size of 256 values (4 x 4 x 16) and the output layer contains 10 neurons (one for each class).
Overall, this model has almost 45,000 trainable parameters - mostly resulting from the first dense layer (which alone accounts for 256 x 120 weights + 120 biases = 30,840 parameters)!
Now that we have built our model, let's train it:
In Keras, this requires two steps:
First, we compile the model by calling model.compile. Here, Keras builds a computational graph for the model in the background, automatically specifying all gradient computations (which we previously needed to specify by hand) and initializing all the weights.
In this step, we also specify the loss function that we want to minimize during training. Similar to our previous examples, we are using the cross entropy loss for multiple classes.
We further define the optimizer that we want to use to minimize our loss function: In our previous examples, we used a vanilla version of stochastic gradient descent. There exist, however, more sophisticated optimizers (for an overview, see here and here); We will use RMSprop (the default in Keras).
Lastly, we tell Keras to also track the predictive accuracy of our model during training (in addition to the loss; this simply means that Keras repeatedly computes and stores the accuracy on the training and validation data during training; we do not optimize the predictive accuracy directly).
np.random.seed(1312)
tf.random.set_seed(1312)
model.compile(optimizer='rmsprop',
# note: since the final layer already applies a softmax, its outputs are probabilities rather than logits, so from_logits=False would be the technically matching setting here (the model nevertheless trains)
loss=losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy']) # tell Keras to also track the predictive accuracy of our model during training
Second, we specify everything else that is missing to actually perform stochastic gradient descent: e.g., the training and validation datasets, the batch size, and number of training epochs that we want to perform.
In Keras, an epoch is defined as one entire iteration over the training dataset; with 60,000 training images and a batch size of 32, each epoch therefore consists of 60000 / 32 = 1875 gradient-descent steps (the step counter you will see in the training output below):
np.random.seed(1312)
tf.random.set_seed(1312)
history = model.fit(x=train_images,
y=train_labels,
epochs=10,
batch_size=32,
validation_data=(test_images, test_labels))
Epoch 1/10
1875/1875 [==============================] - 9s 5ms/step - loss: 1.6905 - accuracy: 0.7790 - val_loss: 1.6464 - val_accuracy: 0.8192
Epoch 2/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6288 - accuracy: 0.8347 - val_loss: 1.6279 - val_accuracy: 0.8347
Epoch 3/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6133 - accuracy: 0.8492 - val_loss: 1.6183 - val_accuracy: 0.8445
Epoch 4/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6032 - accuracy: 0.8593 - val_loss: 1.6119 - val_accuracy: 0.8520
Epoch 5/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5945 - accuracy: 0.8679 - val_loss: 1.6154 - val_accuracy: 0.8456
Epoch 6/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5883 - accuracy: 0.8741 - val_loss: 1.6059 - val_accuracy: 0.8546
Epoch 7/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5826 - accuracy: 0.8793 - val_loss: 1.6055 - val_accuracy: 0.8568
Epoch 8/10
1875/1875 [==============================] - 8s 4ms/step - loss: 1.5781 - accuracy: 0.8838 - val_loss: 1.5991 - val_accuracy: 0.8625
Epoch 9/10
1875/1875 [==============================] - 8s 4ms/step - loss: 1.5734 - accuracy: 0.8888 - val_loss: 1.5932 - val_accuracy: 0.8676
Epoch 10/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5699 - accuracy: 0.8921 - val_loss: 1.5887 - val_accuracy: 0.8733
Let's take a look at the training statistics:
# setup figure
fig, axs = plt.subplots(1,3,figsize=(20,6))
# plot training and test accuracy
axs[0].plot(history.history['accuracy'], label='Training data')
axs[0].plot(history.history['val_accuracy'], label = 'Test data')
axs[0].set_xlabel('Training epoch')
axs[0].set_ylabel('Accuracy')
axs[0].set_ylim([0, 1])
axs[0].legend(loc='lower right')
despine(ax=axs[0])
# plot confusion matrix for training and test datasets
for i, (label, X, y) in enumerate(zip(['Training', 'Test'],
[train_images, test_images],
[train_labels, test_labels])):
y_pred = model.predict(X).argmax(axis=1)
acc = np.mean(y_pred == y)
axs[1+i].set_title('{} data\nMean Acc.: {}%'.format(label, np.round(acc*100, 2)))
conf_mat = confusion_matrix(y, y_pred, normalize='true')
sns.heatmap(np.round(conf_mat, 2), annot=True,
ax=axs[1+i], vmin=0, vmax=1,
annot_kws={'fontsize': 14})
if i == 0:
axs[1+i].set_ylabel('True label')
axs[1+i].set_xlabel('Predicted label')
# save figure
fig.tight_layout()
fig.savefig('figures/Figure-2-2_Training-Stats.png', dpi=600)
Great! Our model correctly classifies more than 85% of the images in the test dataset (the exact value may vary between machines due to differences in random initialization).
Note, however, that it is also slightly overfitting the training dataset (as indicated by the slightly higher predictive accuracy on the training data).
Below you will find a copy of the code that we just ran to build and train our CNN.
Try adapting the specifications of the individual layers (e.g., reduce or increase the number of kernels, change the size of the kernels (e.g., to 3x3 or 7x7), or change their activation function) and see how this changes the performance of the model.
You can also try out different optimizers (e.g., 'adam' or 'SGD'), batch sizes, and numbers of training epochs.
Lastly, you can also try to remove the pooling layers from the model and instead increase the stride of the convolution layers that follow them; an increased stride has a similar down-sampling effect as a pooling layer!
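For example, such a variant could look like the following sketch (an illustration of the idea, assuming both convolution layers take over the down-sampling with a stride of 2; strided_model is just an illustrative name and not part of the original notebook):
# hypothetical variant: no pooling layers, down-sampling via strided convolutions
strided_model = models.Sequential()
strided_model.add(layers.Conv2D(filters=6, kernel_size=(5, 5), strides=(2, 2),
                                activation='tanh', input_shape=(28, 28, 1)))
strided_model.add(layers.Conv2D(filters=16, kernel_size=(5, 5), strides=(2, 2),
                                activation='tanh'))
strided_model.add(layers.Flatten())  # 4 x 4 x 16 = 256 values, as in the pooled model
strided_model.add(layers.Dense(120, activation='tanh'))
strided_model.add(layers.Dense(84, activation='tanh'))
strided_model.add(layers.Dense(10, activation='softmax'))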
Be reasonable: try to avoid very large numbers of kernels per layer, more than 10 layers in the network, or very large batch sizes. Otherwise you might crash the notebook that you are currently running; chances are that you are running it with very limited resources on the Jupyter Binder servers.
# 1. Model Specification:
# ----------------
# initialize a sequential model
model = models.Sequential()
# add our first convolution layer
model.add(layers.Conv2D(filters=6, # this layer contains 6 kernels
kernel_size=(5, 5), # each being 5 x 5 values large
activation='tanh', # and activated through a tanh activation function (as in the original LeNet-5)
input_shape=(28, 28, 1))) # this is the shape of the input
# now lets add a 2x2 average pooling layer:
model.add(layers.AveragePooling2D(pool_size=(2,2)))
# and then another convolution layer:
model.add(layers.Conv2D(filters=16, # this time containing 16 kernels
kernel_size=(5, 5), # each again with a size of 5x5 values
activation='tanh'))
# another 2x2 average pooling layer:
model.add(layers.AveragePooling2D(pool_size=(2,2)))
# and add dense output layers
model.add(layers.Flatten()) # flatten the activation maps of the last convolution layer
model.add(layers.Dense(120, activation='tanh'))
model.add(layers.Dense(84, activation='tanh'))
model.add(layers.Dense(10, activation='softmax')) # one neuron for each of the 10 classes in our dataset
# ----------------
# 2. Compiling the model:
# ----------------
np.random.seed(1312)
tf.random.set_seed(1312)
model.compile(optimizer='rmsprop',
loss=losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
# ----------------
# 3. Training:
# ----------------
np.random.seed(1312)
tf.random.set_seed(1312)
history = model.fit(x=train_images,
y=train_labels,
epochs=10,
batch_size=32,
validation_data=(test_images, test_labels))
# ----------------
# 4. Plot the results
# ----------------
# setup the figure
fig, axs = plt.subplots(1,3,figsize=(20,6))
# plot training and test accuracy
axs[0].plot(history.history['accuracy'], label='Training data')
axs[0].plot(history.history['val_accuracy'], label = 'Test data')
axs[0].set_xlabel('Training epoch')
axs[0].set_ylabel('Accuracy')
axs[0].set_ylim([0, 1])
axs[0].legend(loc='lower right')
despine(ax=axs[0])
# plot confusion matrix for training and test datasets
for i, (label, X, y) in enumerate(zip(['Training', 'Test'],
[train_images, test_images],
[train_labels, test_labels])):
y_pred = model.predict(X).argmax(axis=1)
acc = np.mean(y_pred == y)
axs[1+i].set_title('{} data\nMean Acc.: {}%'.format(label, np.round(acc*100, 2)))
conf_mat = confusion_matrix(y, y_pred, normalize='true')
sns.heatmap(np.round(conf_mat, 2), annot=True,
ax=axs[1+i], vmin=0, vmax=1,
annot_kws={'fontsize': 14})
if i == 0:
axs[1+i].set_ylabel('True label')
axs[1+i].set_xlabel('Predicted label')
# save figure
fig.tight_layout()
# ----------------
Epoch 1/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.7112 - accuracy: 0.7562 - val_loss: 1.6506 - val_accuracy: 0.8135
Epoch 2/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6352 - accuracy: 0.8279 - val_loss: 1.6332 - val_accuracy: 0.8285
Epoch 3/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6183 - accuracy: 0.8438 - val_loss: 1.6253 - val_accuracy: 0.8343
Epoch 4/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6066 - accuracy: 0.8554 - val_loss: 1.6181 - val_accuracy: 0.8418
Epoch 5/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6001 - accuracy: 0.8618 - val_loss: 1.6327 - val_accuracy: 0.8277
Epoch 6/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5914 - accuracy: 0.8707 - val_loss: 1.6035 - val_accuracy: 0.8573
Epoch 7/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5850 - accuracy: 0.8766 - val_loss: 1.6000 - val_accuracy: 0.8603
Epoch 8/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5793 - accuracy: 0.8827 - val_loss: 1.5979 - val_accuracy: 0.8640
Epoch 9/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5757 - accuracy: 0.8863 - val_loss: 1.5916 - val_accuracy: 0.8703
Epoch 10/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5719 - accuracy: 0.8897 - val_loss: 1.5856 - val_accuracy: 0.8756
One last bonus for those of you who want to dig a bit deeper into the mechanics of convolutional neural networks:
Remember that I previously said that the convolution kernels respond to increasingly complex patterns in the data as you move further down into the network?
We can actually generate images that give us an insight into these patterns: to do this, we will use the deep dream algorithm. The code for this is buried in the deep_dream.py script, which you can find in the helpers directory next to this notebook.
The basic idea of deep dream is very simple: We can turn the gradient descent procedure around, by optimizing an input image instead of the network parameters. Simply put, we are trying to find the image that maximally activates a specific kernel of the network (by performing gradient ascent).
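To give you an idea of what a single gradient-ascent step could look like, here is a minimal sketch (an illustration only, not the exact implementation in deep_dream.py; dream_step and its arguments are hypothetical names):
# minimal sketch of one deep-dream gradient-ascent step
def dream_step(dream_model, image, step_size=0.01):
    image = tf.convert_to_tensor(image, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(image)
        # activations of the kernel (or layer) that we want to maximize
        activations = dream_model(tf.expand_dims(image, axis=0))
        loss = tf.reduce_mean(activations)
    # gradient of the mean activation with respect to the input image
    gradients = tape.gradient(loss, image)
    # normalize the gradients to stabilize the update
    gradients /= tf.math.reduce_std(gradients) + 1e-8
    # gradient ASCENT: we add (rather than subtract) the gradient to the image
    return image + step_size * gradients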
For this analysis, we will use the VGG16 architecture, which was pre-trained on the ImageNet dataset.
vgg16 = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
vgg16.summary()
Model: "vgg16" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_1 (InputLayer) [(None, None, None, 3)] 0 _________________________________________________________________ block1_conv1 (Conv2D) (None, None, None, 64) 1792 _________________________________________________________________ block1_conv2 (Conv2D) (None, None, None, 64) 36928 _________________________________________________________________ block1_pool (MaxPooling2D) (None, None, None, 64) 0 _________________________________________________________________ block2_conv1 (Conv2D) (None, None, None, 128) 73856 _________________________________________________________________ block2_conv2 (Conv2D) (None, None, None, 128) 147584 _________________________________________________________________ block2_pool (MaxPooling2D) (None, None, None, 128) 0 _________________________________________________________________ block3_conv1 (Conv2D) (None, None, None, 256) 295168 _________________________________________________________________ block3_conv2 (Conv2D) (None, None, None, 256) 590080 _________________________________________________________________ block3_conv3 (Conv2D) (None, None, None, 256) 590080 _________________________________________________________________ block3_pool (MaxPooling2D) (None, None, None, 256) 0 _________________________________________________________________ block4_conv1 (Conv2D) (None, None, None, 512) 1180160 _________________________________________________________________ block4_conv2 (Conv2D) (None, None, None, 512) 2359808 _________________________________________________________________ block4_conv3 (Conv2D) (None, None, None, 512) 2359808 _________________________________________________________________ block4_pool (MaxPooling2D) (None, None, None, 512) 0 _________________________________________________________________ block5_conv1 (Conv2D) (None, None, None, 512) 2359808 _________________________________________________________________ block5_conv2 (Conv2D) (None, None, None, 512) 2359808 _________________________________________________________________ block5_conv3 (Conv2D) (None, None, None, 512) 2359808 _________________________________________________________________ block5_pool (MaxPooling2D) (None, None, None, 512) 0 ================================================================= Total params: 14,714,688 Trainable params: 14,714,688 Non-trainable params: 0 _________________________________________________________________
We will then run the deep dream algorithm for three kernels (0, 10, and 35) from each of the layers 'block1_conv1', 'block2_conv2', 'block3_conv3', and 'block4_conv3'.
Importantly, I have pre-computed the output of this analysis. This analysis is computationally expensive, so I would not recommend running it on the Jupyter Binder Servers.
Instead, you can run it at home on your personal machine, if you like.
To do this, simply set the switch run_analysis to True.
run_analysis = False
if run_analysis:
# load deep dream
from helpers.deep_dream import run_deep_dream
# set random seed
np.random.seed(2135)
tf.random.set_seed(2135)
# setup figure
fig, axs = plt.subplots(3,4,figsize=(20,15), dpi=200)
# define random input image
random_input = np.random.uniform(0, 255, (200, 200, 3)).astype(int)  # the builtin int replaces the deprecated np.int
# iterate over the four selected VGG16 layers
for i, layer_name in enumerate(['block1_conv1', 'block2_conv2', 'block3_conv3', 'block4_conv3']):
# Create the feature extraction model
layer_output = vgg16.get_layer(layer_name).output
for j, k in enumerate([0,10,35]):
print('Processing layer: {}, kernel: {}'.format(layer_name, k))
dream_model = tf.keras.Model(inputs=vgg16.input,
outputs=layer_output[...,k])
# run deep dream
img = run_deep_dream(dream_model, random_input,
show_progress=False)
# plot
axs[j,i].imshow(img.numpy())
axs[j,i].set_title('Layer: {} (Kernel: {})'.format(layer_name, k))
axs[j,i].set_xticks([])
axs[j,i].set_yticks([])
despine(ax=axs[j,i])
# save
fig.tight_layout()
fig.savefig('figures/Figure-2-4_VGG16-Deep-Dream-Kernels.png', dpi=600)
Image(filename='figures/Figure-2-4_VGG16-Deep-Dream-Kernels.png')
We can see that the patterns that maximally excite the individual kernels become increasingly abstract and complex the deeper we travel into the network (from left to right in the figure). Each row corresponds to one of the three kernels that we selected from each layer.
Super cool, no?
Instead of maximizing the output of a single kernel, we could also maximize the output of entire layers:
run_analysis = False
if run_analysis:
# import deep dream
from helpers.deep_dream import run_deep_dream
# set random seed
np.random.seed(2135)
tf.random.set_seed(2135)
# setup figure
fig, axs = plt.subplots(1,4,figsize=(20,6), dpi=200)
# define random input image
random_input = np.random.uniform(0, 255, (200, 200, 3)).astype(int)  # the builtin int replaces the deprecated np.int
# iterate over the four selected VGG16 layers
for i, layer_name in enumerate(['block1_conv1', 'block2_conv2', 'block3_conv3', 'block4_conv3']):
# Create the feature extraction model
layer_output = vgg16.get_layer(layer_name).output
dream_model = tf.keras.Model(inputs=vgg16.input,
outputs=layer_output)
# run deep dream
img = run_deep_dream(dream_model, random_input,
show_progress=False)
# plot
axs[i].imshow(img.numpy())
axs[i].set_title('Layer: {}'.format(layer_name))
axs[i].set_xticks([])
axs[i].set_yticks([])
despine(ax=axs[i])
# save
fig.tight_layout()
fig.savefig('figures/Figure-2-4_VGG16-Deep-Dream-Layers.png', dpi=600)
Image(filename='figures/Figure-2-4_VGG16-Deep-Dream-Layers.png')
As expected, entire layers respond to even more complex patterns; so much so that the last layer that we are plotting (on the far right) seems to be sensitive to very specific features (such as eyes?).