import keras
keras.__version__
Using TensorFlow backend.
'2.0.8'
This notebook contains the code samples found in Chapter 6, Section 4 of Deep Learning with Python. Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.
In Keras, you would use a 1D convnet via the Conv1D layer, which has a very similar interface to Conv2D. It takes as input 3D tensors with shape (samples, time, features) and also returns similarly-shaped 3D tensors. The convolution window is a 1D window on the temporal axis (axis 1 in the input tensor).
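To make the shape convention concrete, here is a tiny sketch (not from the book) that runs a random batch through a single Conv1D layer and prints the resulting shape:
import numpy as np
from keras.models import Sequential
from keras import layers

toy = Sequential()
# 32 filters, window of size 7, sliding along the temporal axis (axis 1)
toy.add(layers.Conv1D(32, 7, activation='relu', input_shape=(500, 128)))
x = np.random.random((10, 500, 128))  # (samples, time, features)
print(toy.predict(x).shape)  # (10, 494, 32): the time axis shrinks by window_size - 1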
Let's build a simple 2-layer 1D convnet and apply it to the IMDB sentiment classification task that you are already familiar with.
As a reminder, this is the code for obtaining and preprocessing the data:
from keras.datasets import imdb
from keras.preprocessing import sequence
max_features = 10000 # number of words to consider as features
max_len = 500 # cut texts after this number of words (among top max_features most common words)
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 500)
x_test shape: (25000, 500)
1D convnets are structured in the same way as their 2D counterparts, which you used in Chapter 5: they consist of a stack of Conv1D and MaxPooling1D layers, eventually ending in either a global pooling layer or a Flatten layer that turns the 3D outputs into 2D outputs, allowing you to add one or more Dense layers to the model for classification or regression.
One difference, though, is the fact that we can afford to use larger convolution windows with 1D convnets. Indeed, with a 2D convolution layer, a 3x3 convolution window contains 3*3 = 9 feature vectors, but with a 1D convolution layer, a convolution window of size 3 would only contain 3 feature vectors. We can thus easily afford 1D convolution windows of size 7 or 9.
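Larger windows also stay cheap in terms of parameters: a Conv1D layer has window_size * input_features * filters weights plus filters biases. Here is a quick sketch (a hypothetical helper, not from the book) that you can check against the model.summary() output further down:
def conv1d_params(window_size, input_features, filters):
    # weight kernel plus one bias per filter
    return window_size * input_features * filters + filters

print(conv1d_params(7, 128, 32))  # 28704, matching conv1d_1 in the summary below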
This is our example 1D convnet for the IMDB dataset:
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
model = Sequential()
model.add(layers.Embedding(max_features, 128, input_length=max_len))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1))
model.summary()
model.compile(optimizer=RMSprop(lr=1e-4),
loss='binary_crossentropy',
metrics=['acc'])
history = model.fit(x_train, y_train,
epochs=10,
batch_size=128,
validation_split=0.2)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 500, 128)          1280000
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 494, 32)           28704
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 98, 32)            0
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 92, 32)            7200
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33
=================================================================
Total params: 1,315,937
Trainable params: 1,315,937
Non-trainable params: 0
_________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 4s - loss: 0.7713 - acc: 0.5287 - val_loss: 0.6818 - val_acc: 0.5970
Epoch 2/10
20000/20000 [==============================] - 3s - loss: 0.6631 - acc: 0.6775 - val_loss: 0.6582 - val_acc: 0.6646
Epoch 3/10
20000/20000 [==============================] - 3s - loss: 0.6142 - acc: 0.7580 - val_loss: 0.5987 - val_acc: 0.7118
Epoch 4/10
20000/20000 [==============================] - 3s - loss: 0.5156 - acc: 0.8124 - val_loss: 0.4936 - val_acc: 0.7736
Epoch 5/10
20000/20000 [==============================] - 3s - loss: 0.4029 - acc: 0.8469 - val_loss: 0.4123 - val_acc: 0.8358
Epoch 6/10
20000/20000 [==============================] - 3s - loss: 0.3455 - acc: 0.8653 - val_loss: 0.4040 - val_acc: 0.8382
Epoch 7/10
20000/20000 [==============================] - 3s - loss: 0.3078 - acc: 0.8634 - val_loss: 0.4059 - val_acc: 0.8240
Epoch 8/10
20000/20000 [==============================] - 3s - loss: 0.2812 - acc: 0.8535 - val_loss: 0.4147 - val_acc: 0.8098
Epoch 9/10
20000/20000 [==============================] - 3s - loss: 0.2554 - acc: 0.8334 - val_loss: 0.4296 - val_acc: 0.7878
Epoch 10/10
20000/20000 [==============================] - 3s - loss: 0.2356 - acc: 0.8052 - val_loss: 0.4296 - val_acc: 0.7600
Here are our training and validation results: validation accuracy is somewhat lower than that of the LSTM we used two sections ago, but runtime is faster, both on CPU and GPU (the exact speedup will vary greatly depending on your exact configuration). At this point, we could re-train this model for the right number of epochs (around six, judging from the validation curves) and run it on the test set. This is a convincing demonstration that a 1D convnet can offer a fast, cheap alternative to a recurrent network on a word-level sentiment classification task.
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
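As noted above, we could now re-train the model for the number of epochs suggested by the validation curves and evaluate it on the test set. Here is a minimal sketch of that step (not part of the original notebook; the epoch count of 6 is simply read off the curves above):
model = Sequential()
model.add(layers.Embedding(max_features, 128, input_length=max_len))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(lr=1e-4), loss='binary_crossentropy', metrics=['acc'])
# Re-train from scratch on the full training set, then evaluate on the held-out test set.
model.fit(x_train, y_train, epochs=6, batch_size=128)
test_loss, test_acc = model.evaluate(x_test, y_test)
print('Test accuracy:', test_acc)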
Because 1D convnets process input patches independently, they are not sensitive to the order of the timesteps (beyond a local scale, the size of the convolution windows), unlike RNNs. Of course, in order to be able to recognize longer-term patterns, one could stack many convolution layers and pooling layers, resulting in upper layers that would "see" long chunks of the original inputs -- but that's still a fairly weak way to induce order-sensitivity. One way to evidence this weakness is to try 1D convnets on the temperature forecasting problem from the previous section, where order-sensitivity was key to produce good predictions. Let's see:
# We reuse the following variables defined in the last section:
# float_data, train_gen, val_gen, val_steps
import os
import numpy as np
data_dir = '/home/ubuntu/data/'
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')
f = open(fname)
data = f.read()
f.close()
lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]
float_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i, :] = values
mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std
def generator(data, lookback, delay, min_index, max_index,
              shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            # Pick random starting points within [min_index + lookback, max_index)
            rows = np.random.randint(
                min_index + lookback, max_index, size=batch_size)
        else:
            # Walk through the data sequentially, wrapping around at max_index
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
        samples = np.zeros((len(rows),
                            lookback // step,
                            data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            # The target is the temperature (column 1) `delay` timesteps ahead
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets
lookback = 1440
step = 6
delay = 144
batch_size = 128
train_gen = generator(float_data,
lookback=lookback,
delay=delay,
min_index=0,
max_index=200000,
shuffle=True,
step=step,
batch_size=batch_size)
val_gen = generator(float_data,
lookback=lookback,
delay=delay,
min_index=200001,
max_index=300000,
step=step,
batch_size=batch_size)
test_gen = generator(float_data,
lookback=lookback,
delay=delay,
min_index=300001,
max_index=None,
step=step,
batch_size=batch_size)
# This is how many steps to draw from `val_gen`
# in order to see the whole validation set:
val_steps = (300000 - 200001 - lookback) // batch_size
# This is how many steps to draw from `test_gen`
# in order to see the whole test set:
test_steps = (len(float_data) - 300001 - lookback) // batch_size
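Before building a model, it is worth sanity-checking what the generator yields. A quick sketch (not in the original notebook); the shapes in the comments assume the Jena CSV parsed above, which has 14 feature columns besides the date stamp:
samples, targets = next(train_gen)
print(samples.shape)  # (128, 240, 14): batch_size, lookback // step = 1440 // 6, features
print(targets.shape)  # (128,): one normalized temperature target per sample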
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
model = Sequential()
model.add(layers.Conv1D(32, 5, activation='relu',
input_shape=(None, float_data.shape[-1])))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
steps_per_epoch=500,
epochs=20,
validation_data=val_gen,
validation_steps=val_steps)
Epoch 1/20
500/500 [==============================] - 124s - loss: 0.4189 - val_loss: 0.4521
Epoch 2/20
500/500 [==============================] - 11s - loss: 0.3629 - val_loss: 0.4545
Epoch 3/20
500/500 [==============================] - 11s - loss: 0.3399 - val_loss: 0.4527
Epoch 4/20
500/500 [==============================] - 11s - loss: 0.3229 - val_loss: 0.4721
Epoch 5/20
500/500 [==============================] - 11s - loss: 0.3122 - val_loss: 0.4712
Epoch 6/20
500/500 [==============================] - 11s - loss: 0.3030 - val_loss: 0.4705
Epoch 7/20
500/500 [==============================] - 11s - loss: 0.2935 - val_loss: 0.4870
Epoch 8/20
500/500 [==============================] - 11s - loss: 0.2862 - val_loss: 0.4676
Epoch 9/20
500/500 [==============================] - 11s - loss: 0.2817 - val_loss: 0.4738
Epoch 10/20
500/500 [==============================] - 11s - loss: 0.2775 - val_loss: 0.4896
Epoch 11/20
500/500 [==============================] - 11s - loss: 0.2715 - val_loss: 0.4765
Epoch 12/20
500/500 [==============================] - 11s - loss: 0.2683 - val_loss: 0.4724
Epoch 13/20
500/500 [==============================] - 11s - loss: 0.2644 - val_loss: 0.4842
Epoch 14/20
500/500 [==============================] - 11s - loss: 0.2606 - val_loss: 0.4910
Epoch 15/20
500/500 [==============================] - 11s - loss: 0.2558 - val_loss: 0.5000
Epoch 16/20
500/500 [==============================] - 11s - loss: 0.2539 - val_loss: 0.4960
Epoch 17/20
500/500 [==============================] - 11s - loss: 0.2516 - val_loss: 0.4875
Epoch 18/20
500/500 [==============================] - 11s - loss: 0.2501 - val_loss: 0.4884
Epoch 19/20
500/500 [==============================] - 11s - loss: 0.2444 - val_loss: 0.5024
Epoch 20/20
500/500 [==============================] - 11s - loss: 0.2444 - val_loss: 0.4821
Here are our training and validation Mean Absolute Errors:
import matplotlib.pyplot as plt
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(loss))
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
The validation MAE stays in the 0.40s: we cannot even beat our common-sense baseline using the small convnet. Again, this is because our convnet looks for patterns anywhere in the input timeseries and has no knowledge of the temporal position of a pattern it sees (e.g. towards the beginning, towards the end, etc.). Since more recent datapoints should be interpreted differently from older datapoints in the case of this specific forecasting problem, the convnet fails to produce meaningful results here. This limitation of convnets was not an issue on IMDB, because patterns of keywords associated with a positive or a negative sentiment are informative independently of where they are found in the input sentences.
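As a reminder, the common-sense baseline from the previous section simply predicts that the temperature delay timesteps from now will equal the last temperature seen. A sketch of its evaluation, reusing val_gen and val_steps as defined above (temperature is column 1 of float_data):
def evaluate_naive_method():
    batch_maes = []
    for _ in range(val_steps):
        samples, targets = next(val_gen)
        preds = samples[:, -1, 1]  # last observed (normalized) temperature in each sample
        batch_maes.append(np.mean(np.abs(preds - targets)))
    return np.mean(batch_maes)

print(evaluate_naive_method())  # about 0.29 in the book's run, well below the convnet's ~0.45+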
One strategy to combine the speed and lightness of convnets with the order-sensitivity of RNNs is to use a 1D convnet as a preprocessing step before a RNN. This is especially beneficial when dealing with sequences that are so long that they couldn't realistically be processed with RNNs, e.g. sequences with thousands of steps. The convnet will turn the long input sequence into much shorter (downsampled) sequences of higher-level features. This sequence of extracted features then becomes the input to the RNN part of the network.
This technique is not seen very often in research papers and practical applications, possibly because it is not very well known. It is very effective and ought to be more common. Let's try it on the temperature forecasting dataset. Because this strategy allows us to manipulate much longer sequences, we could either look at data from much further back (by increasing the lookback parameter of the data generator) or look at high-resolution timeseries (by decreasing the step parameter of the generator). Here, we will choose (somewhat arbitrarily) to halve step, so that the weather data is sampled at a rate of one point per 30 minutes.
# This was previously set to 6 (one point per hour).
# Now 3 (one point per 30 min).
step = 3
lookback = 720 # Halved along with step, so each sample still contains lookback // step = 240 timesteps
delay = 144 # Unchanged
train_gen = generator(float_data,
lookback=lookback,
delay=delay,
min_index=0,
max_index=200000,
shuffle=True,
step=step)
val_gen = generator(float_data,
lookback=lookback,
delay=delay,
min_index=200001,
max_index=300000,
step=step)
test_gen = generator(float_data,
lookback=lookback,
delay=delay,
min_index=300001,
max_index=None,
step=step)
val_steps = (300000 - 200001 - lookback) // 128
test_steps = (len(float_data) - 300001 - lookback) // 128
This is our model, starting with two Conv1D layers and following up with a GRU layer:
model = Sequential()
model.add(layers.Conv1D(32, 5, activation='relu',
input_shape=(None, float_data.shape[-1])))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
model.add(layers.GRU(32, dropout=0.1, recurrent_dropout=0.5))
model.add(layers.Dense(1))
model.summary()
model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
steps_per_epoch=500,
epochs=20,
validation_data=val_gen,
validation_steps=val_steps)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv1d_6 (Conv1D)            (None, None, 32)          2272
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, None, 32)          0
_________________________________________________________________
conv1d_7 (Conv1D)            (None, None, 32)          5152
_________________________________________________________________
gru_1 (GRU)                  (None, 32)                6240
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33
=================================================================
Total params: 13,697
Trainable params: 13,697
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
500/500 [==============================] - 60s - loss: 0.3387 - val_loss: 0.3030
Epoch 2/20
500/500 [==============================] - 58s - loss: 0.3055 - val_loss: 0.2864
Epoch 3/20
500/500 [==============================] - 58s - loss: 0.2904 - val_loss: 0.2841
Epoch 4/20
500/500 [==============================] - 58s - loss: 0.2830 - val_loss: 0.2730
Epoch 5/20
500/500 [==============================] - 58s - loss: 0.2767 - val_loss: 0.2757
Epoch 6/20
500/500 [==============================] - 58s - loss: 0.2696 - val_loss: 0.2819
Epoch 7/20
500/500 [==============================] - 57s - loss: 0.2642 - val_loss: 0.2787
Epoch 8/20
500/500 [==============================] - 57s - loss: 0.2595 - val_loss: 0.2920
Epoch 9/20
500/500 [==============================] - 58s - loss: 0.2546 - val_loss: 0.2919
Epoch 10/20
500/500 [==============================] - 57s - loss: 0.2506 - val_loss: 0.2772
Epoch 11/20
500/500 [==============================] - 58s - loss: 0.2459 - val_loss: 0.2801
Epoch 12/20
500/500 [==============================] - 57s - loss: 0.2433 - val_loss: 0.2807
Epoch 13/20
500/500 [==============================] - 58s - loss: 0.2411 - val_loss: 0.2855
Epoch 14/20
500/500 [==============================] - 58s - loss: 0.2360 - val_loss: 0.2858
Epoch 15/20
500/500 [==============================] - 57s - loss: 0.2350 - val_loss: 0.2834
Epoch 16/20
500/500 [==============================] - 58s - loss: 0.2315 - val_loss: 0.2917
Epoch 17/20
500/500 [==============================] - 60s - loss: 0.2285 - val_loss: 0.2944
Epoch 18/20
500/500 [==============================] - 57s - loss: 0.2280 - val_loss: 0.2923
Epoch 19/20
500/500 [==============================] - 57s - loss: 0.2249 - val_loss: 0.2910
Epoch 20/20
500/500 [==============================] - 57s - loss: 0.2215 - val_loss: 0.2952
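Note how aggressively the convolutional front-end downsamples the sequence before the GRU sees it. A rough sketch of the arithmetic (not from the book; it assumes the 'valid' padding and default pooling stride used above):
timesteps = lookback // step          # 720 // 3 = 240 timesteps per input sample
timesteps = timesteps - 5 + 1         # after Conv1D(32, 5): 236
timesteps = (timesteps - 3) // 3 + 1  # after MaxPooling1D(3): 78
timesteps = timesteps - 5 + 1         # after the second Conv1D(32, 5): 74
print(timesteps)  # the GRU only iterates over 74 steps instead of 240
Here are the resulting training and validation losses: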
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(loss))
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
Judging from the validation loss, this setup is not quite as good as the regularized GRU alone, but it's significantly faster. It looks at the weather data at twice the temporal resolution, which in this case doesn't appear to be hugely helpful, but may be important for other datasets.
Here's what you should take away from this section:
- In the same way that 2D convnets perform well for processing visual patterns in 2D space, 1D convnets perform well for processing temporal patterns. They offer a faster alternative to RNNs on some problems, in particular NLP tasks.
- Typically, 1D convnets are structured much like their 2D equivalents from the world of computer vision: they consist of stacks of Conv1D layers and MaxPooling1D layers, eventually ending in a global pooling operation or flattening operation.
- Because RNNs are extremely expensive for processing very long sequences, but 1D convnets are cheap, it can be a good idea to use a 1D convnet as a preprocessing step before a RNN, shortening the sequence and extracting useful representations for the RNN to process.
One useful and important concept that we will not cover in these pages is that of 1D convolution with dilated kernels.
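For reference, Keras exposes this via the dilation_rate argument of Conv1D. A minimal sketch (assuming the same float_data features as above), which we won't explore further here:
# With dilation_rate=2, a window of size 5 spans (5 - 1) * 2 + 1 = 9 timesteps
# while still using only 5 weights per input feature per filter.
dilated_layer = layers.Conv1D(32, 5, dilation_rate=2, activation='relu',
                              input_shape=(None, float_data.shape[-1]))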