import numpy as np
from tqdm import tqdm
from sklearn.metrics import confusion_matrix
from seaborn import despine
import seaborn as sns
sns.set_style("ticks")
sns.set_context("talk")
from IPython.display import Image
import matplotlib.pyplot as plt
%matplotlib inline
The goal of this notebook is to introduce convolutional neural networks: a neural network architecture that is (loosely) inspired by the biological visual system and used to process image (or video) data (e.g., to identify the objects depicted in an image).
You will further get a basic introduction to Tensorflow and Keras, two of the most widely used deep learning libraries.
If you stick around, you will also learn a bit about Google's deep dream algorithm at the end of the notebook.
In this notebook, we will be using the Fashion-MNIST dataset.
This dataset contains 70,000 28x28-pixel grey-scale images of clothing items, spanning a total of 10 different categories.
We can load the dataset with the fashion_mnist.load_data() function from tensorflow.keras.datasets (more details on Tensorflow and Keras will follow later).
from tensorflow.keras import datasets
# load Fashion MNIST data
(train_images, train_labels), (test_images, test_labels) = datasets.fashion_mnist.load_data()
# define class names
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
The dataset is already divided into distinct training and test datasets, containing 60,000 and 10,000 images, respectively:
train_images.shape, test_images.shape
((60000, 28, 28), (10000, 28, 28))
To make these images easily digestible for an artificial neural network, we will normalize the pixel values to a range between 0 and 1:
# Normalize pixel values to be between 0 and 1 (and add a trailing channel dimension)
train_images, test_images = np.expand_dims(train_images / 255.0, -1), np.expand_dims(test_images / 255.0, -1)
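Note that np.expand_dims adds a trailing channel dimension (of size 1) to the image arrays; the convolution layers that we define below expect their input in this (height, width, channels) format:
train_images.shape, test_images.shape
((60000, 28, 28, 1), (10000, 28, 28, 1))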
fig, ax = plt.subplots(1,1,figsize=(6,6), dpi=50)
ax.hist(train_images.ravel())
ax.set_xlabel('Pixel value')
ax.set_ylabel('Frequency')
despine(ax=ax)
Ok, the normalization worked. The high frequency of 0 values reflects the consistent black background of each image.
Let's take a look at a few of the training examples:
fig, axs = plt.subplots(5,5,figsize=(12,12), dpi=100)
axs = axs.ravel()
for i, ax in enumerate(axs):
ax.imshow(train_images[i,...,0], cmap='gray')
ax.set_title(class_names[train_labels[i]])
ax.set_xticks([])
ax.set_yticks([])
fig.tight_layout()
fig.savefig('figures/Figure-2-0_Fashion-MNIST.png', dpi=600)
A very simple way to classify this dataset would be to flatten each sample into a vector of 784 values (28x28) and to then feed these vectors to a fully-connected (or dense) artificial neural network, as we did with the handwritten digits in the previous notebook (1-Neural-Networks-Backpropagation.ipynb).
However, there is another common type of artificial neural network architecture that was specifically designed for this type of computer vision problem:
These networks are called convolutional neural networks (or CNNs) and are (very) loosely inspired by our knowledge of the neurobiological processes underlying the vision system.
On a superficial level, the brain processes visual information in the following steps (see the figure below):
First, a projection of the image is registered on the retina. The retina is composed of various types of neurons, each sensitive to specific characteristics of the perceived image. Each neuron further has a receptive field, which restricts its sensitivity to a specific area of the input image.
These retinal neurons are connected to the optic nerve, which sends their output signals to the visual system of the brain.
The visual system encompasses multiple brain regions (such as the primary visual cortex), which each seem to respond to different patterns in (or aspects of) the visual input.
Importantly, lower-level brain regions of the visual system (which are closer to the visual input) are generally more sensitive to basic properties of the input (such as orientations, colors, or contrasts), while higher-level brain regions generally respond to more complex patterns (famous examples are the fusiform face area and the parahippocampal place area, which specifically respond to faces and scenes, respectively).
Image(filename='materials/images/free-use/Human-Vision.jpg')
Adapted from: Kubilius, Jonas (2017): Ventral visual stream. figshare. Figure. https://doi.org/10.6084/m9.figshare.106794.v3
Convolution kernels (or filters) represent the core building block of a convolutional neural network. Conceptually, each kernel mimics a neuron of the visual system and responds to a specific pattern in the visual input.
Computationally, each kernel $k$ is a small square matrix (typically of size 3x3 or 5x5).
To obtain an activation map for a kernel (indicating whether the pattern to which the kernel responds is present at each location of the input), we perform a spatial convolution between the kernel and the image.
To do this, we simply move the kernel over the image and compute a weighted sum between the values of the kernel and the values of the underlying image at each location. Similar to a Perceptron, a bias term $b$ is added to the weighted sum and the result is passed through a non-linear activation function ($\phi$). Typically, convolution kernels use the rectified linear unit (ReLU) activation function.
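Written out, the value of the activation map $a$ at position $(i, j)$, for a $K \times K$ kernel $k$ with bias $b$ applied to an input image $x$, is:
$a_{i,j} = \phi\left(\sum_{u=1}^{K} \sum_{v=1}^{K} k_{u,v} \cdot x_{i+u-1,\, j+v-1} + b\right)$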
Image(filename='materials/images/free-use/Spatial-Convolution.png')
Importantly, the distance between successive applications of the kernel to the input image is called the stride. A stride of 1 pixel creates an activation map of (almost) the same size as the input image (exactly the same size if the image is zero-padded at its borders), while larger strides reduce the size of the activation map:
Image(filename='materials/images/free-use/Stride.png')
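More precisely, for the 'valid' convolution used throughout this notebook (no zero-padding at the image borders), an $N \times N$ input, a $K \times K$ kernel, and a stride of $S$ pixels yield an activation map of size $\lfloor (N - K) / S \rfloor + 1$ along each dimension. For example, convolving a 28x28 Fashion-MNIST image with a 5x5 kernel at a stride of 1 yields a 24x24 activation map (as we will see in the model summary below).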
Let's try this ourselves with some data:
def spatial_convolution_2D(image, kernel, bias=0):
    """spatial 2D convolution between an image and
    a (square) kernel with stride size 1
    """
    m, n = kernel.shape[:2]
    if m != n:
        raise ValueError('kernel must be square')
    yin, xin = image.shape[:2]
    # with stride 1 and no padding, the output shrinks by (kernel size - 1)
    yout = yin - m + 1
    xout = xin - m + 1
    convolved_image = np.zeros((yout, xout))
    for i in range(yout):
        for j in range(xout):
            # weighted sum between the kernel and the underlying image patch
            convolved_image[i, j] = np.sum(image[i:i+m, j:j+m] * kernel) + bias
    return convolved_image
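As a quick sanity check (an illustrative example that is not part of the original notebook), convolving one of the 28x28 training images with a simple 3x3 averaging ("blur") kernel at a stride of 1 should yield a 26x26 activation map (28 - 3 + 1 = 26):
# convolve the first training image with a 3x3 averaging kernel
blur_kernel = np.ones((3, 3)) / 9.0
activation_map = spatial_convolution_2D(train_images[0, ..., 0], blur_kernel)
activation_map.shape  # (26, 26)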
Here, we will be using a few exemplary edge-detection kernels (the first two are based on the Sobel operator):
# a kernel that is sensitive to vertical edges:
vertical_edge_kernel = np.zeros((3,3))
vertical_edge_kernel[:,0] = 1
vertical_edge_kernel[:,2] = -1
vertical_edge_kernel[1,0] = 2
vertical_edge_kernel[1,2] = -2
vertical_edge_kernel
array([[ 1.,  0., -1.],
       [ 2.,  0., -2.],
       [ 1.,  0., -1.]])
# a kernel that is sensitive to horizontal edges:
horizontal_edge_kernel = np.zeros((3,3))
horizontal_edge_kernel[0] = 1
horizontal_edge_kernel[2] = -1
horizontal_edge_kernel[0,1] = 2
horizontal_edge_kernel[2,1] = -2
horizontal_edge_kernel
array([[ 1.,  2.,  1.],
       [ 0.,  0.,  0.],
       [-1., -2., -1.]])
# a kernel that is sensitive to all edges:
edge_detection_kernel = -1.0 * np.ones((3,3))
edge_detection_kernel[1,1] = 8
edge_detection_kernel
array([[-1., -1., -1.],
       [-1.,  8., -1.],
       [-1., -1., -1.]])
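To build an intuition for why these kernels act as edge detectors, consider a toy 3x3 patch with a bright left column and a dark remainder (an illustrative example that is not part of the original notebook): the vertical-edge kernel responds strongly to it, while the horizontal-edge kernel does not respond at all.
# a toy patch containing a vertical edge (bright left column, dark rest)
vertical_edge_patch = np.zeros((3, 3))
vertical_edge_patch[:, 0] = 1
np.sum(vertical_edge_patch * vertical_edge_kernel)    # strong response: 4.0
np.sum(vertical_edge_patch * horizontal_edge_kernel)  # no response: 0.0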
What kind of activation maps do we get when we convolve these three kernels with a few examples of our dataset?
# make figure
fig, axs = plt.subplots(5,4,figsize=(12,12), dpi=80)
# drop axis in upper left corner
axs[0,0].remove()
# plot kernels
axs[0,1].imshow(vertical_edge_kernel, cmap='gray')
axs[0,2].imshow(horizontal_edge_kernel, cmap='gray')
axs[0,3].imshow(edge_detection_kernel, cmap='gray')
# plot activation maps
for i in range(1,5):
# input image
axs[i,0].imshow(train_images[i,...,0], cmap='gray')
# vertical
convolved_img = spatial_convolution_2D(train_images[i,...,0], vertical_edge_kernel)
axs[i,1].imshow(convolved_img, cmap='gray')
# horizontal
convolved_img = spatial_convolution_2D(train_images[i,...,0], horizontal_edge_kernel)
axs[i,2].imshow(convolved_img, cmap='gray')
# edge
convolved_img = spatial_convolution_2D(train_images[i,...,0], edge_detection_kernel)
axs[i,3].imshow(convolved_img, cmap='gray')
# remove ticks
for ax in axs.ravel():
ax.set_xticks([])
ax.set_yticks([])
# label axes
axs[1,0].set_title('Input')
axs[0,1].set_title('Vertical')
axs[0,2].set_title('Horizontal')
axs[0,3].set_title('Edge')
# save
fig.tight_layout()
fig.savefig('figures/Figure-2-1_Edge-Detection-Kernels.png', dpi=600)
As expected, the activation maps highlight the characteristics of the image that each kernel is sensitive to: vertical edges, horizontal edges, all edges!
At its core, a CNN is nothing more than a sequence of convolution layers, each composed of a stack of convolution kernels!
Importantly, the kernels of each layer are applied to the activation maps resulting from the previous layer. This trick allows higher-level convolution kernels (which are deeper into the network) to learn very abstracted features, based on the activation maps of the preceding lower-level convolution kernels (for more details on this, see this amazing paper: https://distill.pub/2018/building-blocks/).
Image(filename='materials/images/free-use/Convolutional-Neural-Network.png')
Classical CNN architectures contain one more type of kernel, which is used to down-sample the activation maps.
These kernels are called pooling kernels.
Two classical pooling kernels are average and max pooling.
They do nothing other than return the average or maximum value of the input in their receptive field.
Importantly, they are moved over the input in non-overlapping steps: a pooling kernel of size 2x2 is therefore moved over the input in steps of 2 pixels, thereby down-sampling the input by a factor of 2:
Image(filename='materials/images/free-use/Pooling-Kernels.png')
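To make the pooling operation concrete, here is a minimal NumPy sketch of non-overlapping 2x2 max and average pooling (an illustration only; pool_2d is a hypothetical helper and not how Keras implements its pooling layers):
# minimal sketch of 2x2 max / average pooling with non-overlapping steps
def pool_2d(activation_map, pool_size=2, mode='max'):
    y, x = activation_map.shape
    # crop so that the map is evenly divisible by the pool size
    y, x = y - y % pool_size, x - x % pool_size
    patches = activation_map[:y, :x].reshape(y // pool_size, pool_size,
                                             x // pool_size, pool_size)
    if mode == 'max':
        return patches.max(axis=(1, 3))
    return patches.mean(axis=(1, 3))

# example: pooling a 4x4 map down to 2x2
example_map = np.arange(16).reshape(4, 4)
pool_2d(example_map, mode='max')      # array([[ 5,  7], [13, 15]])
pool_2d(example_map, mode='average')  # array([[ 2.5,  4.5], [10.5, 12.5]])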
Enough with the theory, let's build our first CNN!
To do this, we will be using one of the most widely used deep learning libraries: Tensorflow.
Tensorflow is a Python library that makes it comparatively easy to build and train state-of-the-art artificial neural network architectures.
Specifically, we will be using its high-level Keras API.
In Keras, a deep learning model can be specified with the Sequential model class. Calling models.Sequential() initiates an empty model.
We can then sequentially add layers to this model from Keras' layers module. Here, we will be focusing on the Conv2D layer type (representing a 2D convolution layer) as well as on the AveragePooling2D and Dense layer types. Dense layers represent fully-connected artificial neural network layers, such as the ones that we used in the previous notebook (1-Neural-Networks-Backpropagation.ipynb), each containing a set of individual Perceptrons.
For an overview of all available layer types, see the Keras layers documentation.
Here, we will re-create the classical LeNet-5 architecture.
Note that this architecture, in spite of its historical importance, is outdated! Nowadays, one would probably use smaller kernels, ReLU activations, and more convolution layers (instead of the three fully-connected layers added at the end; see below).
LeNet-5 includes three dense layers at the end of the network, with 120, 84, and 10 neurons, respectively (see the code below). To allow for the addition of these layers, we need to flatten the activation maps resulting from the last convolution layer, such that each sample is represented as a single row vector that can be processed by the dense layers:
import tensorflow as tf
from tensorflow.keras import layers, models, losses, Model
# initialize a sequential model
model = models.Sequential()
# add our first convolution layer
model.add(layers.Conv2D(filters=6, # this layer contains 6 kernels
kernel_size=(5, 5), # each being 5 x 5 values large
activation='tanh', # and activated through a tanh activation function (as in the original LeNet-5)
input_shape=(28, 28, 1))) # this is the shape of the input
# now lets add a 2x2 average pooling layer:
model.add(layers.AveragePooling2D(pool_size=(2,2)))
# and then another convolution layer:
model.add(layers.Conv2D(filters=16, # this time containing 16 kernels
kernel_size=(5, 5), # each again with a size of 5x5 values
activation='tanh'))
# another 2x2 average pooling layer:
model.add(layers.AveragePooling2D(pool_size=(2,2)))
# and add dense output layers
model.add(layers.Flatten()) # flatten the activation maps of the last convolution layer
model.add(layers.Dense(120, activation='tanh'))
model.add(layers.Dense(84, activation='tanh'))
model.add(layers.Dense(10, activation='softmax')) # one neuron for each of the 10 classes in our dataset
Ok, let's take a look at our full model by calling model.summary():
model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= conv2d (Conv2D) (None, 24, 24, 6) 156 _________________________________________________________________ average_pooling2d (AveragePo (None, 12, 12, 6) 0 _________________________________________________________________ conv2d_1 (Conv2D) (None, 8, 8, 16) 2416 _________________________________________________________________ average_pooling2d_1 (Average (None, 4, 4, 16) 0 _________________________________________________________________ flatten (Flatten) (None, 256) 0 _________________________________________________________________ dense (Dense) (None, 120) 30840 _________________________________________________________________ dense_1 (Dense) (None, 84) 10164 _________________________________________________________________ dense_2 (Dense) (None, 10) 850 ================================================================= Total params: 44,426 Trainable params: 44,426 Non-trainable params: 0 _________________________________________________________________
This is looking good.
We can see that the size of the activation maps is decreasing as the data is passed through the network.
We can also see that the activation maps of each convolution layer are stacked along the last dimension (look at the "Output Shape" column).
The flattened activation maps of our last convolution layer have a size of 256 values (4 x 4 x 16) and the output layer contains 10 neurons (one for each class).
Overall, this model has almost 45,000 trainable parameters - mostly resulting from the first dense layer (which alone accounts for 256 x 120 weights + 120 biases = 30,840 parameters)!
Now that we have built our model, let's train it:
In Keras, this requires two steps:
First, we compile the model by calling model.compile. Here, Keras builds a computational graph for the model in the background, automatically specifying all gradient computations (which we previously needed to specify by hand) and initializing all the weights.
In this step, we also specify the loss function that we want to minimize during training. Similar to our previous examples, we are using the cross entropy loss for multiple classes.
We further define the optimizer that we want to use to minimize our loss function: In our previous examples, we used a vanilla version of stochastic gradient descent. There exist, however, more sophisticated optimizers (for an overview, see here and here); We will use RMSprop (the default in Keras).
Lastly, we tell Keras to also track the predictive accuracy of our model during training (in addition to the loss; this simply means that Keras repeatedly computes and stores the accuracy on the training and validation data during training; we do not optimize the predictive accuracy directly).
np.random.seed(1312)
tf.random.set_seed(1312)
model.compile(optimizer='rmsprop',
# note: since the final layer already applies a softmax, its outputs are probabilities rather than logits, so from_logits=False would be the technically matching setting here (the model nevertheless trains)
loss=losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy']) # tell Keras to also track the predictive accuracy of our model during training
Second, we specify everything else that is missing to actually perform stochastic gradient descent: e.g., the training and validation datasets, the batch size, and number of training epochs that we want to perform.
In Keras, an epoch is defined as one entire iteration over the training dataset; with 60,000 training images and a batch size of 32, each epoch therefore consists of 60000 / 32 = 1875 gradient-descent steps (the step counter you will see in the training output below):
np.random.seed(1312)
tf.random.set_seed(1312)
history = model.fit(x=train_images,
y=train_labels,
epochs=10,
batch_size=32,
validation_data=(test_images, test_labels))
Epoch 1/10
1875/1875 [==============================] - 9s 5ms/step - loss: 1.6905 - accuracy: 0.7790 - val_loss: 1.6464 - val_accuracy: 0.8192
Epoch 2/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6288 - accuracy: 0.8347 - val_loss: 1.6279 - val_accuracy: 0.8347
Epoch 3/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6133 - accuracy: 0.8492 - val_loss: 1.6183 - val_accuracy: 0.8445
Epoch 4/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6032 - accuracy: 0.8593 - val_loss: 1.6119 - val_accuracy: 0.8520
Epoch 5/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5945 - accuracy: 0.8679 - val_loss: 1.6154 - val_accuracy: 0.8456
Epoch 6/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5883 - accuracy: 0.8741 - val_loss: 1.6059 - val_accuracy: 0.8546
Epoch 7/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5826 - accuracy: 0.8793 - val_loss: 1.6055 - val_accuracy: 0.8568
Epoch 8/10
1875/1875 [==============================] - 8s 4ms/step - loss: 1.5781 - accuracy: 0.8838 - val_loss: 1.5991 - val_accuracy: 0.8625
Epoch 9/10
1875/1875 [==============================] - 8s 4ms/step - loss: 1.5734 - accuracy: 0.8888 - val_loss: 1.5932 - val_accuracy: 0.8676
Epoch 10/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5699 - accuracy: 0.8921 - val_loss: 1.5887 - val_accuracy: 0.8733
Let's take a look at the training statistics:
# setup figure
fig, axs = plt.subplots(1,3,figsize=(20,6))
# plot training and test accuracy
axs[0].plot(history.history['accuracy'], label='Training data')
axs[0].plot(history.history['val_accuracy'], label = 'Test data')
axs[0].set_xlabel('Training epoch')
axs[0].set_ylabel('Accuracy')
axs[0].set_ylim([0, 1])
axs[0].legend(loc='lower right')
despine(ax=axs[0])
# plot confusion matrix for training and test datasets
for i, (label, X, y) in enumerate(zip(['Training', 'Test'],
[train_images, test_images],
[train_labels, test_labels])):
y_pred = model.predict(X).argmax(axis=1)
acc = np.mean(y_pred == y)
axs[1+i].set_title('{} data\nMean Acc.: {}%'.format(label, np.round(acc*100, 2)))
conf_mat = confusion_matrix(y, y_pred, normalize='true')
sns.heatmap(np.round(conf_mat, 2), annot=True,
ax=axs[1+i], vmin=0, vmax=1,
annot_kws={'fontsize': 14})
if i == 0:
axs[1+i].set_ylabel('True label')
axs[1+i].set_xlabel('Predicted label')
# save figure
fig.tight_layout()
fig.savefig('figures/Figure-2-2_Training-Stats.png', dpi=600)
Great! Our model correctly classifies more than 85% of the images in the test dataset (the exact value may vary between machines due to differences in random initialization).
Note, however, that it is also slightly overfitting the training dataset (as indicated by the slightly higher predictive accuracy on the training data).
Below you will find a copy of the code that we just ran to build and train our CNN.
Try adapting the specifications of the individual layers (e.g., reduce or increase the number of kernels, change the size of the kernels (e.g., to 3x3 or 7x7), or change their activation function) and see how this changes the performance of the model.
You can also try out different optimizers (e.g., 'adam' or 'SGD'), batch sizes, and numbers of training epochs.
Lastly, you can also try to remove the pooling layers from the model and instead increase the stride of the convolution layers that follow them; an increased stride has a similar down-sampling effect as a pooling layer!
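For example, such a variant could look like the following sketch (an illustration of the idea, assuming both convolution layers take over the down-sampling with a stride of 2; strided_model is just an illustrative name and not part of the original notebook):
# hypothetical variant: no pooling layers, down-sampling via strided convolutions
strided_model = models.Sequential()
strided_model.add(layers.Conv2D(filters=6, kernel_size=(5, 5), strides=(2, 2),
                                activation='tanh', input_shape=(28, 28, 1)))
strided_model.add(layers.Conv2D(filters=16, kernel_size=(5, 5), strides=(2, 2),
                                activation='tanh'))
strided_model.add(layers.Flatten())  # 4 x 4 x 16 = 256 values, as in the pooled model
strided_model.add(layers.Dense(120, activation='tanh'))
strided_model.add(layers.Dense(84, activation='tanh'))
strided_model.add(layers.Dense(10, activation='softmax'))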
Be reasonable: try to avoid very large numbers of kernels per layer, more than 10 layers in the network, or very large batch sizes. Otherwise you might crash the notebook that you are currently running; chances are that you are running it with very limited resources on the Jupyter Binder servers.
# 1. Model Specification:
# ----------------
# initialize a sequential model
model = models.Sequential()
# add our first convolution layer
model.add(layers.Conv2D(filters=6, # this layer contains 6 kernels
kernel_size=(5, 5), # each being 5 x 5 values large
activation='tanh', # and activated through a tanh activation function (as in the original LeNet-5)
input_shape=(28, 28, 1))) # this is the shape of the input
# now lets add a 2x2 average pooling layer:
model.add(layers.AveragePooling2D(pool_size=(2,2)))
# and then another convolution layer:
model.add(layers.Conv2D(filters=16, # this time containing 16 kernels
kernel_size=(5, 5), # each again with a size of 5x5 values
activation='tanh'))
# another 2x2 average pooling layer:
model.add(layers.AveragePooling2D(pool_size=(2,2)))
# and add dense output layers
model.add(layers.Flatten()) # flatten the activation maps of the last convolution layer
model.add(layers.Dense(120, activation='tanh'))
model.add(layers.Dense(84, activation='tanh'))
model.add(layers.Dense(10, activation='softmax')) # one neuron for each of the 10 classes in our dataset
# ----------------
# 2. Compiling the model:
# ----------------
np.random.seed(1312)
tf.random.set_seed(1312)
model.compile(optimizer='rmsprop',
loss=losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
# ----------------
# 3. Training:
# ----------------
np.random.seed(1312)
tf.random.set_seed(1312)
history = model.fit(x=train_images,
y=train_labels,
epochs=10,
batch_size=32,
validation_data=(test_images, test_labels))
# ----------------
# 4. Plot the results
# ----------------
# setup the figure
fig, axs = plt.subplots(1,3,figsize=(20,6))
# plot training and test accuracy
axs[0].plot(history.history['accuracy'], label='Training data')
axs[0].plot(history.history['val_accuracy'], label = 'Test data')
axs[0].set_xlabel('Training epoch')
axs[0].set_ylabel('Accuracy')
axs[0].set_ylim([0, 1])
axs[0].legend(loc='lower right')
despine(ax=axs[0])
# plot confusion matrix for training and test datasets
for i, (label, X, y) in enumerate(zip(['Training', 'Test'],
[train_images, test_images],
[train_labels, test_labels])):
y_pred = model.predict(X).argmax(axis=1)
acc = np.mean(y_pred == y)
axs[1+i].set_title('{} data\nMean Acc.: {}%'.format(label, np.round(acc*100, 2)))
conf_mat = confusion_matrix(y, y_pred, normalize='true')
sns.heatmap(np.round(conf_mat, 2), annot=True,
ax=axs[1+i], vmin=0, vmax=1,
annot_kws={'fontsize': 14})
if i == 0:
axs[1+i].set_ylabel('True label')
axs[1+i].set_xlabel('Predicted label')
# save figure
fig.tight_layout()
# ----------------
Epoch 1/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.7112 - accuracy: 0.7562 - val_loss: 1.6506 - val_accuracy: 0.8135
Epoch 2/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6352 - accuracy: 0.8279 - val_loss: 1.6332 - val_accuracy: 0.8285
Epoch 3/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6183 - accuracy: 0.8438 - val_loss: 1.6253 - val_accuracy: 0.8343
Epoch 4/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6066 - accuracy: 0.8554 - val_loss: 1.6181 - val_accuracy: 0.8418
Epoch 5/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.6001 - accuracy: 0.8618 - val_loss: 1.6327 - val_accuracy: 0.8277
Epoch 6/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5914 - accuracy: 0.8707 - val_loss: 1.6035 - val_accuracy: 0.8573
Epoch 7/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5850 - accuracy: 0.8766 - val_loss: 1.6000 - val_accuracy: 0.8603
Epoch 8/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5793 - accuracy: 0.8827 - val_loss: 1.5979 - val_accuracy: 0.8640
Epoch 9/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5757 - accuracy: 0.8863 - val_loss: 1.5916 - val_accuracy: 0.8703
Epoch 10/10
1875/1875 [==============================] - 7s 4ms/step - loss: 1.5719 - accuracy: 0.8897 - val_loss: 1.5856 - val_accuracy: 0.8756
One last bonus for those of you who want to dig a bit deeper into the mechanics of convolutional neural networks:
Remember that I previously said that the convolution kernels respond to increasingly complex patterns in the data as you move further down into the network?
We can actually generate images that give us an insight into these patterns: to do this, we will use the deep dream algorithm. The code for this is buried in the deep_dream.py script, which you can find in the helpers directory next to this notebook.
The basic idea of deep dream is very simple: We can turn the gradient descent procedure around, by optimizing an input image instead of the network parameters. Simply put, we are trying to find the image that maximally activates a specific kernel of the network (by performing gradient ascent).
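To give you an idea of what a single gradient-ascent step could look like, here is a minimal sketch (an illustration only, not the exact implementation in deep_dream.py; dream_step and its arguments are hypothetical names):
# minimal sketch of one deep-dream gradient-ascent step
def dream_step(dream_model, image, step_size=0.01):
    image = tf.convert_to_tensor(image, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(image)
        # activations of the kernel (or layer) that we want to maximize
        activations = dream_model(tf.expand_dims(image, axis=0))
        loss = tf.reduce_mean(activations)
    # gradient of the mean activation with respect to the input image
    gradients = tape.gradient(loss, image)
    # normalize the gradients to stabilize the update
    gradients /= tf.math.reduce_std(gradients) + 1e-8
    # gradient ASCENT: we add (rather than subtract) the gradient to the image
    return image + step_size * gradients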
For this analysis, we will use the VGG16 architecture, which was pre-trained on the ImageNet dataset.
vgg16 = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
vgg16.summary()
Model: "vgg16" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_1 (InputLayer) [(None, None, None, 3)] 0 _________________________________________________________________ block1_conv1 (Conv2D) (None, None, None, 64) 1792 _________________________________________________________________ block1_conv2 (Conv2D) (None, None, None, 64) 36928 _________________________________________________________________ block1_pool (MaxPooling2D) (None, None, None, 64) 0 _________________________________________________________________ block2_conv1 (Conv2D) (None, None, None, 128) 73856 _________________________________________________________________ block2_conv2 (Conv2D) (None, None, None, 128) 147584 _________________________________________________________________ block2_pool (MaxPooling2D) (None, None, None, 128) 0 _________________________________________________________________ block3_conv1 (Conv2D) (None, None, None, 256) 295168 _________________________________________________________________ block3_conv2 (Conv2D) (None, None, None, 256) 590080 _________________________________________________________________ block3_conv3 (Conv2D) (None, None, None, 256) 590080 _________________________________________________________________ block3_pool (MaxPooling2D) (None, None, None, 256) 0 _________________________________________________________________ block4_conv1 (Conv2D) (None, None, None, 512) 1180160 _________________________________________________________________ block4_conv2 (Conv2D) (None, None, None, 512) 2359808 _________________________________________________________________ block4_conv3 (Conv2D) (None, None, None, 512) 2359808 _________________________________________________________________ block4_pool (MaxPooling2D) (None, None, None, 512) 0 _________________________________________________________________ block5_conv1 (Conv2D) (None, None, None, 512) 2359808 _________________________________________________________________ block5_conv2 (Conv2D) (None, None, None, 512) 2359808 _________________________________________________________________ block5_conv3 (Conv2D) (None, None, None, 512) 2359808 _________________________________________________________________ block5_pool (MaxPooling2D) (None, None, None, 512) 0 ================================================================= Total params: 14,714,688 Trainable params: 14,714,688 Non-trainable params: 0 _________________________________________________________________
We will then run the deep dream algorithm for three kernels (0, 10, and 35) from each of the layers 'block1_conv1', 'block2_conv2', 'block3_conv3', and 'block4_conv3'.
Importantly, I have pre-computed the output of this analysis. This analysis is computationally expensive, so I would not recommend running it on the Jupyter Binder Servers.
Instead, you can run it at home on your personal machine, if you like.
To do this, simply set the switch run_analysis to True.
run_analysis = False
if run_analysis:
# load deep dream
from helpers.deep_dream import run_deep_dream
# set random seed
np.random.seed(2135)
tf.random.set_seed(2135)
# setup figure
fig, axs = plt.subplots(3,4,figsize=(20,15), dpi=200)
# define random input image
random_input = np.random.uniform(0, 255, (200, 200, 3)).astype(int)  # the builtin int replaces the deprecated np.int
# iterate over the four selected VGG16 layers
for i, layer_name in enumerate(['block1_conv1', 'block2_conv2', 'block3_conv3', 'block4_conv3']):
# Create the feature extraction model
layer_output = vgg16.get_layer(layer_name).output
for j, k in enumerate([0,10,35]):
print('Processing layer: {}, kernel: {}'.format(layer_name, k))
dream_model = tf.keras.Model(inputs=vgg16.input,
outputs=layer_output[...,k])
# run deep dream
img = run_deep_dream(dream_model, random_input,
show_progress=False)
# plot
axs[j,i].imshow(img.numpy())
axs[j,i].set_title('Layer: {} (Kernel: {})'.format(layer_name, k))
axs[j,i].set_xticks([])
axs[j,i].set_yticks([])
despine(ax=axs[j,i])
# save
fig.tight_layout()
fig.savefig('figures/Figure-2-4_VGG16-Deep-Dream-Kernels.png', dpi=600)
Image(filename='figures/Figure-2-4_VGG16-Deep-Dream-Kernels.png')
We can see that the patterns that maximally excite the individual kernels become increasingly abstract and complex the deeper we travel into the network (from left to right in the figure). Each row corresponds to one of the three kernels that we selected from each layer.
Super cool, no?
Instead of maximizing the output of a single kernel, we could also maximize the output of entire layers:
run_analysis = False
if run_analysis:
# import deep dream
from helpers.deep_dream import run_deep_dream
# set random seed
np.random.seed(2135)
tf.random.set_seed(2135)
# setup figure
fig, axs = plt.subplots(1,4,figsize=(20,6), dpi=200)
# define random input image
random_input = np.random.uniform(0, 255, (200, 200, 3)).astype(int)  # the builtin int replaces the deprecated np.int
# iterate over the four selected VGG16 layers
for i, layer_name in enumerate(['block1_conv1', 'block2_conv2', 'block3_conv3', 'block4_conv3']):
# Create the feature extraction model
layer_output = vgg16.get_layer(layer_name).output
dream_model = tf.keras.Model(inputs=vgg16.input,
outputs=layer_output)
# run deep dream
img = run_deep_dream(dream_model, random_input,
show_progress=False)
# plot
axs[i].imshow(img.numpy())
axs[i].set_title('Layer: {}'.format(layer_name))
axs[i].set_xticks([])
axs[i].set_yticks([])
despine(ax=axs[i])
# save
fig.tight_layout()
fig.savefig('figures/Figure-2-4_VGG16-Deep-Dream-Layers.png', dpi=600)
Image(filename='figures/Figure-2-4_VGG16-Deep-Dream-Layers.png')
As expected, entire layers respond to even more complex patterns; so much so that the last layer that we are plotting (on the far right) seems to be sensitive to very specific features (such as eyes?).