:label:sec_dropout
In :numref:sec_weight_decay,
we introduced the classical approach
to regularizing statistical models
by penalizing the $\ell_2$ norm of the weights.
In probabilistic terms, we could justify this technique
by arguing that we have assumed a prior belief
that weights take values from
a Gaussian distribution with mean $0$.
More intuitively, we might argue
that we encouraged the model to spread out its weights
among many features rather than depending too much
on a small number of potentially spurious associations.
Faced with more features than examples, linear models tend to overfit. But given more examples than features, we can generally count on linear models not to overfit. Unfortunately, the reliability with which linear models generalize comes at a cost: Naively applied, linear models do not take into account interactions among features. For every feature, a linear model must assign either a positive or a negative weight, ignoring context.
In traditional texts, this fundamental tension between generalizability and flexibility is described as the bias-variance tradeoff. Linear models have high bias (they can only represent a small class of functions), but low variance (they give similar results across different random samples of the data).
Deep neural networks inhabit the opposite end of the bias-variance spectrum. Unlike linear models, neural networks are not confined to looking at each feature individually. They can learn interactions among groups of features. For example, they might infer that “Nigeria” and “Western Union” appearing together in an email indicates spam but that separately they do not.
Even when we have far more examples than features, deep neural networks are capable of overfitting. In 2017, a group of researchers demonstrated the extreme flexibility of neural networks by training deep nets on randomly-labeled images. Despite the absence of any true pattern linking the inputs to the outputs, they found that the neural network optimized by SGD could label every image in the training set perfectly.
Consider what this means. If the labels are assigned uniformly at random and there are 10 classes, then no classifier can do better than 10% accuracy on holdout data. The generalization gap here is a whopping 90%. If our models are so expressive that they can overfit this badly, then when should we expect them not to overfit? The mathematical foundations for the puzzling generalization properties of deep networks remain open research questions, and we encourage the theoretically-oriented reader to dig deeper into the topic. For now, we turn to the more terrestrial investigation of practical tools that tend (empirically) to improve the generalization of deep nets.
Let us think briefly about what we
expect from a good predictive model.
We want it to perform well on unseen data.
Classical generalization theory
suggests that to close the gap between
train and test performance,
we should aim for a simple model.
Simplicity can come in the form
of a small number of dimensions.
We explored this when discussing the
monomial basis functions of linear models
in :numref:sec_model_selection.
Additionally, as we saw when discussing weight decay
($\ell_2$ regularization) in :numref:sec_weight_decay,
the (inverse) norm of the parameters
represents another useful measure of simplicity.
Another useful notion of simplicity is smoothness,
i.e., that the function should not be sensitive
to small changes to its inputs.
For instance, when we classify images,
we would expect that adding some random noise
to the pixels should be mostly harmless.
In 1995, Christopher Bishop formalized
this idea when he proved that training with input noise
is equivalent to Tikhonov regularization :cite:Bishop.1995.
This work drew a clear mathematical connection
between the requirement that a function be smooth (and thus simple),
and the requirement that it be resilient
to perturbations in the input.
Then, in 2014, Srivastava et al. :cite:Srivastava.Hinton.Krizhevsky.ea.2014
developed a clever idea for how to apply Bishop's idea
to the internal layers of the network, too.
Namely, they proposed to inject noise
into each layer of the network
before calculating the subsequent layer during training.
They realized that when training
a deep network with many layers,
injecting noise enforces smoothness just on the input-output mapping.
Their idea, called dropout, involves injecting noise while computing each internal layer during forward propagation, and it has become a standard technique for training neural networks. The method is called dropout because we literally drop out some neurons during training. Throughout training, on each iteration, standard dropout consists of zeroing out some fraction (typically 50%) of the nodes in each layer before calculating the subsequent layer.
To be clear, we are imposing our own narrative with the link to Bishop. The original paper on dropout offers intuition through a surprising analogy to sexual reproduction. The authors argue that neural network overfitting is characterized by a state in which each layer relies on a specific pattern of activations in the previous layer, calling this condition co-adaptation. Dropout, they claim, breaks up co-adaptation just as sexual reproduction is argued to break up co-adapted genes.
The key challenge then is how to inject this noise. One idea is to inject the noise in an unbiased manner so that the expected value of each layer (while fixing the others) equals the value it would have taken absent noise.
In Bishop's work, he added Gaussian noise to the inputs to a linear model: At each training iteration, he added noise sampled from a distribution with mean zero $\epsilon \sim \mathcal{N}(0,\sigma^2)$ to the input $\mathbf{x}$, yielding a perturbed point $\mathbf{x}' = \mathbf{x} + \epsilon$. In expectation, $E[\mathbf{x}'] = \mathbf{x}$.
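To make the link between input noise and weight penalties concrete, here is a short worked case. It assumes (beyond what the text states) a linear model $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$ trained with squared loss on a target $y$, and noise with independent components, $\epsilon \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$. Averaging the loss over the noise gives

$$
\begin{aligned}
E\left[(\mathbf{w}^\top (\mathbf{x} + \epsilon) - y)^2\right]
&= (\mathbf{w}^\top \mathbf{x} - y)^2 + 2 (\mathbf{w}^\top \mathbf{x} - y)\, \mathbf{w}^\top E[\epsilon] + E\left[(\mathbf{w}^\top \epsilon)^2\right] \\
&= (\mathbf{w}^\top \mathbf{x} - y)^2 + \sigma^2 \|\mathbf{w}\|^2.
\end{aligned}
$$

The cross term vanishes because $E[\epsilon] = 0$, and $E[(\mathbf{w}^\top \epsilon)^2] = \sigma^2 \|\mathbf{w}\|^2$ for independent components. Under these assumptions, training on noisy inputs matches, in expectation, training on clean inputs with an $\ell_2$ penalty of strength $\sigma^2$. This is only the simplest case, not Bishop's general result.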
In standard dropout regularization, one debiases each layer by normalizing by the fraction of nodes that were retained (not dropped out). In other words, dropout with dropout probability $p$ is applied as follows:
$$
\begin{aligned}
h' =
\begin{cases}
    0 & \text{ with probability } p \\
    \frac{h}{1-p} & \text{ otherwise}
\end{cases}
\end{aligned}
$$

By design, the expectation remains unchanged, i.e., $E[h'] = h$. Intermediate activations $h$ are replaced by a random variable $h'$ with matching expectation.
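As a quick check, the expectation can be computed directly from the two cases of the definition, assuming $0 \le p < 1$. The variance is an added observation that the text does not state, included to show how much noise the rescaling introduces:

$$
\begin{aligned}
E[h'] &= p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h, \\
\mathrm{Var}[h'] &= E[h'^2] - (E[h'])^2 = (1-p) \cdot \frac{h^2}{(1-p)^2} - h^2 = \frac{p}{1-p}\, h^2.
\end{aligned}
$$

The mean is untouched for every $p$, while the variance grows with $p$. At $p = 0.5$, for example, the variance equals $h^2$.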
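Recall the multilayer perceptron (:numref:sec_mlp)
with a hidden layer and 5 hidden units.
Its architecture is given by

$$
\begin{aligned}
\mathbf{h} & = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1), \\
\mathbf{o} & = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2, \\
\hat{\mathbf{y}} & = \mathrm{softmax}(\mathbf{o}).
\end{aligned}
$$
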
When we apply dropout to a hidden layer,
zeroing out each hidden unit with probability $p$,
the result can be viewed as a network
containing only a subset of the original neurons.
In :numref:fig_dropout2, $h_2$ and $h_5$ are removed.
Consequently, the calculation of $y$
no longer depends on $h_2$ and $h_5$
and their respective gradients also vanish
when performing backpropagation.
In this way, the calculation of the output layer
cannot be overly dependent on any
one element of $h_1, \ldots, h_5$.
:label:fig_dropout2
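The effect of removing $h_2$ and $h_5$ can be sketched in plain Java, without DJL. The class `MaskedHiddenLayer`, its method names, the single scalar output, and the example weights are all hypothetical and chosen only for illustration. The sketch computes the output from the rescaled surviving units, together with the gradient of that output with respect to each hidden unit.

```java
import java.util.Arrays;

final class MaskedHiddenLayer {
    /** Scalar output o = sum_i w[i] * h'[i], where h'[i] = h[i] / (1 - p) if kept, else 0. */
    static double output(double[] h, double[] w, boolean[] keep, double p) {
        double o = 0.0;
        for (int i = 0; i < h.length; i++) {
            if (keep[i]) {
                o += w[i] * h[i] / (1.0 - p);
            }
        }
        return o;
    }

    /** Gradient of o with respect to h[i]: w[i] / (1 - p) for kept units, 0 for dropped ones. */
    static double[] gradWrtHidden(double[] w, boolean[] keep, double p) {
        double[] grad = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            grad[i] = keep[i] ? w[i] / (1.0 - p) : 0.0;
        }
        return grad;
    }

    public static void main(String[] args) {
        double[] h = {1.0, 2.0, 3.0, 4.0, 5.0};
        double[] w = {0.5, -1.0, 0.25, 2.0, 1.0};          // illustrative weights
        boolean[] keep = {true, false, true, true, false}; // h_2 and h_5 are dropped
        double p = 0.4;                                    // illustrative dropout probability
        System.out.println("output = " + output(h, w, keep, p));
        System.out.println("grad   = " + Arrays.toString(gradWrtHidden(w, keep, p)));
    }
}
```

Changing the values of the dropped units leaves the output unchanged, and their gradient entries are exactly zero, so no learning signal reaches them on that iteration.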
Typically, we disable dropout at test time. Given a trained model and a new example, we do not drop out any nodes (and thus do not need to normalize). However, there are some exceptions. Some researchers use dropout at test time as a heuristic for estimating the *uncertainty* of neural network predictions: if the predictions agree across many different dropout masks, then we might say that the network is more confident. For now we will put off uncertainty estimation for subsequent chapters and volumes.
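Although a full treatment is deferred, the heuristic itself is simple enough to sketch. The plain-Java example below is a toy under stated assumptions: a single scalar output computed from fixed hidden activations, with hypothetical names (`McDropoutSketch`, `stochasticForward`, `meanAndStd`) that are not part of DJL. It repeats the forward pass with fresh dropout masks and reports the mean and standard deviation of the outputs. A small spread corresponds to predictions that agree across masks.

```java
import java.util.Random;

final class McDropoutSketch {
    /** One stochastic forward pass: o = sum_i w[i] * h'[i] with a freshly sampled mask. */
    static double stochasticForward(double[] h, double[] w, double p, Random rng) {
        double o = 0.0;
        for (int i = 0; i < h.length; i++) {
            // nextDouble() is uniform on [0, 1), so a unit is kept with probability 1 - p
            if (rng.nextDouble() >= p) {
                o += w[i] * h[i] / (1.0 - p); // rescale survivors by 1 / (1 - p)
            }
        }
        return o;
    }

    /** Returns {mean, standard deviation} of the output over `passes` random masks. */
    static double[] meanAndStd(double[] h, double[] w, double p, int passes, long seed) {
        if (p < 0.0 || p >= 1.0) {
            throw new IllegalArgumentException("p must lie in [0, 1)");
        }
        Random rng = new Random(seed);
        double[] outputs = new double[passes];
        double sum = 0.0;
        for (int t = 0; t < passes; t++) {
            outputs[t] = stochasticForward(h, w, p, rng);
            sum += outputs[t];
        }
        double mean = sum / passes;
        double squaredDeviations = 0.0;
        for (double o : outputs) {
            squaredDeviations += (o - mean) * (o - mean);
        }
        return new double[] {mean, Math.sqrt(squaredDeviations / passes)};
    }

    public static void main(String[] args) {
        double[] h = {1.0, 2.0, 3.0};  // illustrative hidden activations
        double[] w = {0.5, -1.0, 2.0}; // illustrative output weights
        double[] stats = meanAndStd(h, w, 0.5, 1000, 42L);
        System.out.println("mean = " + stats[0] + ", std = " + stats[1]);
    }
}
```

With `p = 0` every pass is identical and the spread collapses to zero, which recovers the usual deterministic test-time behavior.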
To implement the dropout function for a single layer, we must draw as many samples from a Bernoulli (binary) random variable as our layer has dimensions, where the random variable takes value $1$ (keep) with probability $1-p$ and $0$ (drop) with probability $p$. One easy way to implement this is to first draw samples from the uniform distribution $U[0, 1]$. Then we can keep those nodes for which the corresponding sample is greater than $p$, dropping the rest.
In the following code, we implement a dropoutLayer() function
that drops out the elements in the NDArray input X
with probability dropout,
rescaling the remainder as described above
(dividing the survivors by 1.0-dropout).
%load ../utils/djl-imports
%load ../utils/plot-utils.ipynb
%load ../utils/DataPoints.java
%load ../utils/Training.java
%load ../utils/Accumulator.java
import ai.djl.basicdataset.cv.classification.*;
import org.apache.commons.lang3.ArrayUtils;
We can test out the dropoutLayer() function on a few examples.
In the following lines of code,
we pass our input X through the dropout operation,
with probabilities 0, 0.5, and 1, respectively.
NDManager manager = NDManager.newBaseManager();
public NDArray dropoutLayer(NDArray X, float dropout) {
    // In this case, all elements are dropped out
    if (dropout == 1.0f) {
        return manager.zeros(X.getShape());
    }
    // In this case, all elements are kept
    if (dropout == 0f) {
        return X;
    }

    // Keep an element when its uniform sample exceeds the dropout probability
    NDArray mask = manager.randomUniform(0f, 1.0f, X.getShape()).gt(dropout);
    // Zero the dropped elements and rescale the survivors by 1 / (1 - dropout)
    return mask.toType(DataType.FLOAT32, false).mul(X).div(1.0f - dropout);
}
NDArray X = manager.arange(16f).reshape(2, 8);
System.out.println(dropoutLayer(X, 0));
System.out.println(dropoutLayer(X, 0.5f));
System.out.println(dropoutLayer(X, 1.0f));
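A single call with probability 0.5 shows only one random mask. To check that the rescaling really keeps the expectation unchanged, one can average over many masks. The sketch below does this with plain Java arrays instead of NDArray so that it runs without DJL. The class `DropoutExpectationCheck` and its methods are hypothetical helpers that mirror the logic of dropoutLayer() rather than call it.

```java
import java.util.Arrays;
import java.util.Random;

final class DropoutExpectationCheck {
    /** Dropout on a plain array: zero with probability p, otherwise divide by 1 - p. */
    static double[] dropout(double[] x, double p, Random rng) {
        double[] out = new double[x.length]; // initialized to zeros
        if (p == 1.0) {
            return out; // all elements are dropped out
        }
        for (int i = 0; i < x.length; i++) {
            // Keep an element when its uniform sample is greater than p
            boolean keep = p == 0.0 || rng.nextDouble() > p;
            out[i] = keep ? x[i] / (1.0 - p) : 0.0;
        }
        return out;
    }

    /** Element-wise average of dropout(x, p) over `trials` independently sampled masks. */
    static double[] averageOverMasks(double[] x, double p, int trials, long seed) {
        Random rng = new Random(seed);
        double[] mean = new double[x.length];
        for (int t = 0; t < trials; t++) {
            double[] sample = dropout(x, p, rng);
            for (int i = 0; i < x.length; i++) {
                mean[i] += sample[i] / trials;
            }
        }
        return mean;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        System.out.println("one mask : " + Arrays.toString(dropout(x, 0.5, new Random(0L))));
        System.out.println("average  : " + Arrays.toString(averageOverMasks(x, 0.5, 20000, 0L)));
    }
}
```

Each individual sample contains only zeros and doubled entries, yet the average over many masks should land close to the original input.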
Again, we work with the Fashion-MNIST dataset
introduced in :numref:sec_softmax_scratch.
We define a multilayer perceptron with
two hidden layers containing 256 outputs each.
int numInputs = 784;
int numOutputs = 10;
int numHiddens1 = 256;
int numHiddens2 = 256;
NDArray W1 = manager.randomNormal(0, 0.01f, new Shape(numInputs, numHiddens1), DataType.FLOAT32);
NDArray b1 = manager.zeros(new Shape(numHiddens1));
NDArray W2 = manager.randomNormal(0, 0.01f, new Shape(numHiddens1, numHiddens2), DataType.FLOAT32);
NDArray b2 = manager.zeros(new Shape(numHiddens2));
NDArray W3 = manager.randomNormal(0, 0.01f, new Shape(numHiddens2, numOutputs), DataType.FLOAT32);
NDArray b3 = manager.zeros(new Shape(numOutputs));
NDList params = new NDList(W1, b1, W2, b2, W3, b3);
for (NDArray param : params) {
    param.setRequiresGradient(true);
}
The model below applies dropout to the output
of each hidden layer (following the activation function).
We can set dropout probabilities for each layer separately.
A common trend is to set
a lower dropout probability closer to the input layer.
Below we set it to 0.2 and 0.5 for the first
and second hidden layer respectively.
By using the isTraining boolean variable described in :numref:sec_autograd,
we can ensure that dropout is only active during training.
float dropout1 = 0.2f;
float dropout2 = 0.5f;
public NDArray net(NDArray X, boolean isTraining) {
    X = X.reshape(-1, numInputs);
    NDArray H1 = Activation.relu(X.dot(W1).add(b1));
    if (isTraining) {
        H1 = dropoutLayer(H1, dropout1);
    }
    NDArray H2 = Activation.relu(H1.dot(W2).add(b2));
    if (isTraining) {
        H2 = dropoutLayer(H2, dropout2);
    }
    return H2.dot(W3).add(b3);
}
This is similar to the training and testing of multilayer perceptrons described previously.
int numEpochs = Integer.getInteger("MAX_EPOCH", 10);
float lr = 0.5f;
int batchSize = 256;
double[] trainLoss;
double[] testAccuracy;
double[] epochCount;
double[] trainAccuracy;
trainLoss = new double[numEpochs];
trainAccuracy = new double[numEpochs];
testAccuracy = new double[numEpochs];
epochCount = new double[numEpochs];
Loss loss = new SoftmaxCrossEntropyLoss();
FashionMnist trainIter = FashionMnist.builder()
        .optUsage(Dataset.Usage.TRAIN)
        .setSampling(batchSize, true)
        .optLimit(Long.getLong("DATASET_LIMIT", Long.MAX_VALUE))
        .build();
FashionMnist testIter = FashionMnist.builder()
        .optUsage(Dataset.Usage.TEST)
        .setSampling(batchSize, true)
        .optLimit(Long.getLong("DATASET_LIMIT", Long.MAX_VALUE))
        .build();
trainIter.prepare();
testIter.prepare();
float epochLoss = 0f;
float accuracyVal = 0f;
for (int epoch = 1; epoch <= numEpochs; epoch++) {
    // Iterate over dataset
    System.out.print("Running epoch " + epoch + "...... ");
    for (Batch batch : trainIter.getData(manager)) {
        NDArray X = batch.getData().head();
        NDArray y = batch.getLabels().head();

        try (GradientCollector gc = Engine.getInstance().newGradientCollector()) {
            NDArray yHat = net(X, true); // forward pass with dropout enabled
            NDArray lossValue = loss.evaluate(new NDList(y), new NDList(yHat));
            NDArray l = lossValue.mul(batchSize);

            epochLoss += l.sum().getFloat();
            accuracyVal += Training.accuracy(yHat, y);
            gc.backward(l); // gradient calculation
        }

        batch.close();
        Training.sgd(params, lr, batchSize); // updater
    }

    trainLoss[epoch-1] = epochLoss/trainIter.size();
    trainAccuracy[epoch-1] = accuracyVal/trainIter.size();
    epochLoss = 0f;
    accuracyVal = 0f;

    for (Batch batch : testIter.getData(manager)) {
        NDArray X = batch.getData().head();
        NDArray y = batch.getLabels().head();

        NDArray yHat = net(X, false); // forward pass with dropout disabled
        accuracyVal += Training.accuracy(yHat, y);
        batch.close(); // release the batch's resources, as in the training loop
    }

    testAccuracy[epoch-1] = accuracyVal/testIter.size();
    epochCount[epoch-1] = epoch;
    accuracyVal = 0f;
    System.out.println("Finished epoch " + epoch);
}
System.out.println("Finished training!");
String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];
Arrays.fill(lossLabel, 0, trainLoss.length, "train loss");
Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, "train acc");
Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,
        trainLoss.length + testAccuracy.length + trainAccuracy.length, "test acc");
Table data = Table.create("Data").addColumns(
    DoubleColumn.create("epochCount", ArrayUtils.addAll(epochCount, ArrayUtils.addAll(epochCount, epochCount))),
    DoubleColumn.create("loss", ArrayUtils.addAll(trainLoss, ArrayUtils.addAll(trainAccuracy, testAccuracy))),
    StringColumn.create("lossLabel", lossLabel)
);
render(LinePlot.create("", data, "epochCount", "loss", "lossLabel"),"text/html");
Using DJL, all we need to do is add a Dropout layer
after each fully-connected layer (here, following its activation),
setting the dropout probability
through the optRate option of its builder.
During training, the Dropout layer will randomly
drop out outputs of the previous layer
(or equivalently, the inputs to the subsequent layer)
according to the specified dropout probability.
When the model is not in training mode,
the Dropout layer simply passes the data through.
SequentialBlock net = new SequentialBlock();
net.add(Blocks.batchFlattenBlock(784));
net.add(Linear.builder().setUnits(256).build());
net.add(Activation::relu);
net.add(Dropout.builder().optRate(dropout1).build());
net.add(Linear.builder().setUnits(256).build());
net.add(Activation::relu);
net.add(Dropout.builder().optRate(dropout2).build());
net.add(Linear.builder().setUnits(10).build());
net.setInitializer(new NormalInitializer(0.01f), Parameter.Type.WEIGHT);
Next, we train and test the model.
Map<String, double[]> evaluatorMetrics = new HashMap<>();
Tracker lrt = Tracker.fixed(0.5f);
Optimizer sgd = Optimizer.sgd().setLearningRateTracker(lrt).build();
Loss loss = Loss.softmaxCrossEntropyLoss();
DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
        .optOptimizer(sgd) // Optimizer (SGD with a fixed learning rate)
        .optDevices(Engine.getInstance().getDevices(1)) // at most one GPU, otherwise CPU
        .addEvaluator(new Accuracy()) // Model Accuracy
        .addTrainingListeners(TrainingListener.Defaults.logging()); // Logging

try (Model model = Model.newInstance("mlp")) {
    model.setBlock(net);

    try (Trainer trainer = model.newTrainer(config)) {
        trainer.initialize(new Shape(1, 784));
        trainer.setMetrics(new Metrics());

        EasyTrain.fit(trainer, numEpochs, trainIter, testIter);

        // Collect the per-epoch training and validation metrics for plotting
        Metrics metrics = trainer.getMetrics();
        trainer.getEvaluators().stream()
            .forEach(evaluator -> {
                evaluatorMetrics.put("train_epoch_" + evaluator.getName(), metrics.getMetric("train_epoch_" + evaluator.getName()).stream()
                    .mapToDouble(x -> x.getValue().doubleValue()).toArray());
                evaluatorMetrics.put("validate_epoch_" + evaluator.getName(), metrics.getMetric("validate_epoch_" + evaluator.getName()).stream()
                    .mapToDouble(x -> x.getValue().doubleValue()).toArray());
            });
    }
}
trainLoss = evaluatorMetrics.get("train_epoch_SoftmaxCrossEntropyLoss");
trainAccuracy = evaluatorMetrics.get("train_epoch_Accuracy");
testAccuracy = evaluatorMetrics.get("validate_epoch_Accuracy");
String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];
Arrays.fill(lossLabel, 0, trainLoss.length, "test acc");
Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, "train acc");
Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,
        trainLoss.length + testAccuracy.length + trainAccuracy.length, "train loss");
Table data = Table.create("Data").addColumns(
    DoubleColumn.create("epochCount", ArrayUtils.addAll(epochCount, ArrayUtils.addAll(epochCount, epochCount))),
    DoubleColumn.create("loss", ArrayUtils.addAll(testAccuracy, ArrayUtils.addAll(trainAccuracy, trainLoss))),
    StringColumn.create("lossLabel", lossLabel)
);
render(LinePlot.create("", data, "epochCount", "loss", "lossLabel"),"text/html");