In this section, we will discuss the relationship between optimization and deep learning as well as the challenges of using optimization in deep learning. For a deep learning problem, we will usually define a loss function first. Once we have the loss function, we can use an optimization algorithm in an attempt to minimize the loss. In optimization, a loss function is often referred to as the objective function of the optimization problem. By tradition and convention, most optimization algorithms are concerned with minimization. If we ever need to maximize an objective, there is a simple solution: just flip the sign of the objective.
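As a minimal sketch of the sign flip, the plain Java snippet below maximizes a function by minimizing its negation on a grid. The class `FlipSignSketch`, the helpers `argminOnGrid` and `argmaxOnGrid`, and the toy objective are illustrative assumptions; they are not part of DJL or of the utilities loaded in this section.

```java
import java.util.function.DoubleUnaryOperator;

class FlipSignSketch {
    // Returns the grid point in [lo, hi] with the smallest objective value.
    static double argminOnGrid(DoubleUnaryOperator objective, double lo, double hi, int steps) {
        double bestX = lo;
        double bestValue = objective.applyAsDouble(lo);
        for (int i = 1; i <= steps; i++) {
            double x = lo + (hi - lo) * i / steps;
            double value = objective.applyAsDouble(x);
            if (value < bestValue) {
                bestValue = value;
                bestX = x;
            }
        }
        return bestX;
    }

    // Maximization reuses the minimizer: the maximizer of g is the minimizer of -g.
    static double argmaxOnGrid(DoubleUnaryOperator objective, double lo, double hi, int steps) {
        return argminOnGrid(x -> -objective.applyAsDouble(x), lo, hi, steps);
    }

    public static void main(String[] args) {
        // Hypothetical objective to maximize; its peak is at x = 2.
        DoubleUnaryOperator reward = x -> -(x - 2.0) * (x - 2.0) + 3.0;
        System.out.println("maximizer on grid: " + argmaxOnGrid(reward, 0.0, 4.0, 400));
    }
}
```

Only the minimizer is implemented; the maximizer is a one-line wrapper, which is why most optimization libraries get away with offering minimization alone.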
Although optimization provides a way to minimize the loss function for deep
learning, in essence, the goals of optimization and deep learning are
fundamentally different. The former is primarily concerned with minimizing an
objective whereas the latter is concerned with finding a suitable model, given a
finite amount of data. In :numref:sec_model_selection,
we discussed the difference between these two goals in detail. For instance,
training error and generalization error generally differ: since the objective
function of the optimization algorithm is usually a loss function based on the
training dataset, the goal of optimization is to reduce the training error.
However, the goal of statistical inference (and thus of deep learning) is to
reduce the generalization error. To accomplish the latter we need to pay
attention to overfitting in addition to using the optimization algorithm to
reduce the training error. We begin by importing a few libraries and the plotting utilities used for the figures in this section.
%load ../utils/djl-imports
%load ../utils/plot-utils
import org.apache.commons.lang3.ArrayUtils;
The graph below illustrates the issue in some more detail. Since we have only a finite amount of data, the minimum of the training error (the empirical risk in the plot) may be at a different location than the minimum of the expected error (or of the test error).
// Saved in Functions class for later use
// Applies func to every element of x and returns the results as a new array
public float[] callFunc(float[] x, Function<Float, Float> func) {
    float[] y = new float[x.length];
    for (int i = 0; i < x.length; i++) {
        y[i] = func.apply(x[i]);
    }
    return y;
}
Function<Float, Float> f = x -> x * (float)Math.cos(Math.PI * x);
Function<Float, Float> g = x -> f.apply(x) + 0.2f * (float)Math.cos(5 * Math.PI * x);
NDManager manager = NDManager.newBaseManager();
NDArray X = manager.arange(0.5f, 1.5f, 0.01f);
float[] x = X.toFloatArray();
float[] fx = callFunc(x, f);
float[] gx = callFunc(x, g);
String[] grouping = new String[x.length * 2];
for (int i = 0; i < x.length; i++) {
    grouping[i] = "Expected Risk";
    grouping[i + x.length] = "Empirical Risk";
}
Table data = Table.create("Data")
    .addColumns(
        FloatColumn.create("x", ArrayUtils.addAll(x, x)),
        FloatColumn.create("risk", ArrayUtils.addAll(fx, gx)),
        StringColumn.create("grouping", grouping)
    );
LinePlot.create("Risk", data, "x", "risk", "grouping");
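To make the gap between the two curves concrete, the following sketch locates their minimizers on a grid over the same interval, using plain Java doubles rather than `NDArray`. The class name `RiskMinimizerSketch` and the helper `argminOnGrid` are hypothetical names introduced here for illustration only.

```java
import java.util.function.DoubleUnaryOperator;

class RiskMinimizerSketch {
    // Same curves as f and g above, written with doubles.
    static final DoubleUnaryOperator EXPECTED_RISK = x -> x * Math.cos(Math.PI * x);
    static final DoubleUnaryOperator EMPIRICAL_RISK =
        x -> EXPECTED_RISK.applyAsDouble(x) + 0.2 * Math.cos(5 * Math.PI * x);

    // Returns the grid point in [lo, hi] with the smallest function value.
    static double argminOnGrid(DoubleUnaryOperator func, double lo, double hi, int steps) {
        double bestX = lo;
        double bestValue = func.applyAsDouble(lo);
        for (int i = 1; i <= steps; i++) {
            double x = lo + (hi - lo) * i / steps;
            double value = func.applyAsDouble(x);
            if (value < bestValue) {
                bestValue = value;
                bestX = x;
            }
        }
        return bestX;
    }

    public static void main(String[] args) {
        double xExpected = argminOnGrid(EXPECTED_RISK, 0.5, 1.5, 1000);
        double xEmpirical = argminOnGrid(EMPIRICAL_RISK, 0.5, 1.5, 1000);
        System.out.println("expected risk minimizer:  " + xExpected);
        System.out.println("empirical risk minimizer: " + xEmpirical);
    }
}
```

On this interval the empirical risk attains its minimum to the left of the expected risk minimum, so an optimizer that perfectly minimizes the training objective still does not land on the minimizer of the expected risk.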
In this chapter, we are going to focus specifically on the performance of the
optimization algorithm in minimizing the objective function, rather than a
model's generalization error. In :numref:sec_linear_regression
we distinguished between analytical solutions and numerical solutions in
optimization problems. In deep learning, most objective functions are
complicated and do not have analytical solutions. Instead, we must use numerical
optimization algorithms. The optimization algorithms below all fall into this
category.
There are many challenges in deep learning optimization. Some of the most vexing ones are local minima, saddle points and vanishing gradients. Let us have a look at a few of them.
For the objective function $f(x)$, if the value of $f(x)$ at $x$ is smaller than the values of $f(x)$ at any other points in the vicinity of $x$, then $f(x)$ could be a local minimum. If the value of $f(x)$ at $x$ is the minimum of the objective function over the entire domain, then $f(x)$ is the global minimum.
For example, given the function
$$f(x) = x \cdot \cos(\pi x) \text{ for } -1.0 \leq x \leq 2.0,$$
we can approximate the local minimum and global minimum of this function.
NDArray X = manager.arange(-1.0f, 2.0f, 0.01f);
float[] x = X.toFloatArray();
float[] fx = callFunc(x, f);
Table data = Table.create("Data")
    .addColumns(
        FloatColumn.create("x", x),
        FloatColumn.create("f(x)", fx)
    );
LinePlot.create("x * cos(pi * x)", data, "x", "f(x)");
The local minimum is at approximately (-0.27, -0.18) and the global minimum is at approximately (1.09, -1.05).
The objective function of deep learning models usually has many local optima. When the numerical solution of an optimization problem is near a local optimum, the solution obtained by the final iteration may minimize the objective function only locally, rather than globally, because the gradient of the objective function approaches or becomes zero there. Only some degree of noise might knock the parameter out of the local minimum. In fact, this is one of the beneficial properties of stochastic gradient descent, where the natural variation of gradients over minibatches is able to dislodge the parameters from local minima.
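The dependence on the starting point can be sketched with plain gradient descent on $f(x) = x \cos(\pi x)$. This is a minimal illustration under stated assumptions: the class `LocalMinimumSketch`, the learning rate of 0.05, the step count, and the two starting points are choices made here, not values from the text.

```java
class LocalMinimumSketch {
    static double f(double x) {
        return x * Math.cos(Math.PI * x);
    }

    // Analytical derivative: f'(x) = cos(pi x) - pi x sin(pi x)
    static double gradient(double x) {
        return Math.cos(Math.PI * x) - Math.PI * x * Math.sin(Math.PI * x);
    }

    // Runs a fixed number of deterministic gradient descent steps from `start`.
    static double descend(double start, double learningRate, int steps) {
        double x = start;
        for (int i = 0; i < steps; i++) {
            x -= learningRate * gradient(x);
        }
        return x;
    }

    public static void main(String[] args) {
        double fromLeft = descend(-0.5, 0.05, 200);  // assumed starting point
        double fromRight = descend(1.5, 0.05, 200);  // assumed starting point
        System.out.println("start -0.5 -> x = " + fromLeft + ", f(x) = " + f(fromLeft));
        System.out.println("start  1.5 -> x = " + fromRight + ", f(x) = " + f(fromRight));
    }
}
```

Both runs stop where the gradient is essentially zero, but the run started on the left ends with a higher objective value than the run started on the right. Without noise, nothing moves the first run out of its basin.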
Besides local minima, saddle points are another reason for gradients to vanish. A saddle point is any location where all gradients of a function vanish but which is neither a global nor a local minimum. Consider the function $f(x) = x^3$. Its first and second derivatives vanish at $x=0$. Optimization might stall at this point, even though it is not a minimum.
Function<Float, Float> cube = x -> x * x * x;
NDArray X = manager.arange(-2.0f, 2.0f, 0.01f);
float[] x = X.toFloatArray();
float[] fx = callFunc(x, cube);
Table data = Table.create("Data")
    .addColumns(
        FloatColumn.create("x", x),
        FloatColumn.create("f(x)", fx)
    );
LinePlot.create("x^3", data, "x", "f(x)");
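A small sketch of the stall on $f(x) = x^3$: gradient descent started at $x = 1$ creeps toward $x = 0$ and slows down as the gradient $3x^2$ shrinks, although $f$ keeps decreasing for negative $x$. The class `CubicStallSketch`, the learning rate, and the step count are assumptions chosen for illustration.

```java
class CubicStallSketch {
    // Gradient of f(x) = x^3
    static double gradient(double x) {
        return 3.0 * x * x;
    }

    // Runs a fixed number of gradient descent steps from `start`.
    static double descend(double start, double learningRate, int steps) {
        double x = start;
        for (int i = 0; i < steps; i++) {
            x -= learningRate * gradient(x);
        }
        return x;
    }

    public static void main(String[] args) {
        for (int steps : new int[] {10, 100, 1000}) {
            double x = descend(1.0, 0.1, steps);
            System.out.println(steps + " steps: x = " + x + ", gradient = " + gradient(x));
        }
    }
}
```

With these settings the iterate stays positive and approaches zero only at a rate of roughly one over the number of steps, so it never reaches the region $x < 0$ where the objective is lower.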
Saddle points in higher dimensions are even more insidious, as the example below shows. Consider the function $f(x, y) = x^2 - y^2$. It has its saddle point at $(0, 0)$. This is a maximum with respect to $y$ and a minimum with respect to $x$. Moreover, it looks like a saddle, which is where this mathematical property got its name.
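The behavior near this saddle can be sketched with gradient descent on $f(x, y) = x^2 - y^2$, whose gradient is $(2x, -2y)$. The class `SaddleEscapeSketch`, the learning rate of 0.1, and the starting points are assumptions made for this illustration.

```java
class SaddleEscapeSketch {
    // Runs gradient descent on f(x, y) = x^2 - y^2 and returns the final {x, y}.
    static double[] descend(double x0, double y0, double learningRate, int steps) {
        double x = x0;
        double y = y0;
        for (int i = 0; i < steps; i++) {
            double gradX = 2.0 * x;   // df/dx
            double gradY = -2.0 * y;  // df/dy
            x -= learningRate * gradX;
            y -= learningRate * gradY;
        }
        return new double[] {x, y};
    }

    public static void main(String[] args) {
        double[] onAxis = descend(1.0, 0.0, 0.1, 100);    // starts exactly on the x-axis
        double[] offAxis = descend(1.0, 1e-6, 0.1, 100);  // tiny perturbation in y
        System.out.println("on axis:  x = " + onAxis[0] + ", y = " + onAxis[1]);
        System.out.println("off axis: x = " + offAxis[0] + ", y = " + offAxis[1]);
    }
}
```

Started exactly on the $x$-axis, the iterate converges to the saddle at $(0, 0)$ and stays there. With a tiny offset in $y$, it first approaches the saddle and then moves away along the $y$ direction, where the function curves downward.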
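We assume that the input of a function is a $k$-dimensional vector and its
output is a scalar, so its Hessian matrix will have $k$ eigenvalues
(refer to :numref:sec_geometry-linear-algebraic-ops).
At a position where the function gradient is zero, the function could have a
local minimum, a local maximum, or a saddle point, depending on the signs of
these eigenvalues: if they are all positive, we have a local minimum; if they
are all negative, we have a local maximum; and if some are negative and some
are positive, we have a saddle point.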
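For the two-dimensional case, the distinction between these three outcomes can be sketched from the signs of the Hessian eigenvalues. The snippet assumes a symmetric Hessian $\begin{bmatrix} a & b \\ b & c \end{bmatrix}$ evaluated at a zero-gradient point; the class `HessianClassifierSketch`, its enum, and the tolerance are hypothetical and introduced only for illustration.

```java
class HessianClassifierSketch {
    enum Kind { LOCAL_MINIMUM, LOCAL_MAXIMUM, SADDLE_POINT, INCONCLUSIVE }

    // Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first.
    static double[] eigenvalues(double a, double b, double c) {
        double mean = (a + c) / 2.0;
        double radius = Math.hypot((a - c) / 2.0, b);
        return new double[] {mean - radius, mean + radius};
    }

    // Classifies a zero-gradient point from the signs of the Hessian eigenvalues.
    static Kind classify(double a, double b, double c) {
        double[] lambda = eigenvalues(a, b, c);
        double eps = 1e-12; // assumed tolerance for treating an eigenvalue as zero
        if (lambda[0] > eps) {
            return Kind.LOCAL_MINIMUM;  // smallest eigenvalue positive, so all are
        }
        if (lambda[1] < -eps) {
            return Kind.LOCAL_MAXIMUM;  // largest eigenvalue negative, so all are
        }
        if (lambda[0] < -eps && lambda[1] > eps) {
            return Kind.SADDLE_POINT;   // one negative and one positive
        }
        return Kind.INCONCLUSIVE;       // a zero eigenvalue: the test does not decide
    }

    public static void main(String[] args) {
        // Hessian of x^2 - y^2 is [[2, 0], [0, -2]] everywhere.
        System.out.println("x^2 - y^2 at (0, 0): " + classify(2.0, 0.0, -2.0));
        // Hessian of x^2 + y^2 is [[2, 0], [0, 2]] everywhere.
        System.out.println("x^2 + y^2 at (0, 0): " + classify(2.0, 0.0, 2.0));
    }
}
```

A zero eigenvalue is reported as inconclusive because the second-order test alone cannot decide such cases.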
For high-dimensional problems the likelihood that at least some of the eigenvalues are negative is quite high. This makes saddle points more likely than local minima. We will discuss some exceptions to this situation in the next section when introducing convexity. In short, convex functions are those where the eigenvalues of the Hessian are never negative. Sadly, though, most deep learning problems do not fall into this category. Nonetheless it is a great tool to study optimization algorithms.
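Returning to the claim that saddle points are more likely than local minima in high dimensions, a rough heuristic (not a result from this section) gives a sense of scale. Suppose that each of the $k$ Hessian eigenvalues at a zero-gradient point were independently positive or negative with probability $1/2$. This independence is an assumption made purely for illustration; the eigenvalues of a real Hessian are correlated. Under it,

$$P(\text{all } k \text{ eigenvalues positive}) = 2^{-k}, \qquad P(\text{eigenvalues of mixed sign}) = 1 - 2^{1-k},$$

so for $k = 10$ a local minimum would occur with probability of only about $0.001$, while a saddle point would occur with probability of about $0.998$.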
Probably the most insidious problem to encounter is the vanishing gradient. For instance, assume that we want to minimize the function $f(x) = \tanh(x)$ and we happen to get started at $x = 4$. As we can see, the gradient of $f$ is close to nil. More specifically, $f'(x) = 1 - \tanh^2(x)$ and thus $f'(4) \approx 0.0013$. Consequently, optimization will get stuck for a long time before we make progress. This turns out to be one of the reasons that training deep learning models was quite tricky prior to the introduction of the ReLU activation function.
In the tanh plot below, the curve flattens toward the top right and becomes nearly parallel to the x-axis, which is where the gradient vanishes.
Function<Float, Float> tanh = x -> (float)Math.tanh(x);
NDArray X = manager.arange(-2.0f, 5.0f, 0.01f);
float[] x = X.toFloatArray();
float[] fx = callFunc(x, tanh);
Table data = Table.create("Data")
    .addColumns(
        FloatColumn.create("x", x),
        FloatColumn.create("f(x)", fx)
    );
LinePlot.create("tanh", data, "x", "f(x)");
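The effect of the flat region can be checked numerically. The sketch below evaluates $f'(x) = 1 - \tanh^2(x)$ at $x = 4$ and compares gradient descent runs started at $x = 4$ and at $x = 0$. The class `VanishingGradientSketch`, the learning rate of 0.1, and the 100-step budget are assumptions chosen for illustration.

```java
class VanishingGradientSketch {
    // Derivative of tanh: f'(x) = 1 - tanh^2(x)
    static double gradient(double x) {
        double t = Math.tanh(x);
        return 1.0 - t * t;
    }

    // Runs a fixed number of gradient descent steps on f(x) = tanh(x).
    static double descend(double start, double learningRate, int steps) {
        double x = start;
        for (int i = 0; i < steps; i++) {
            x -= learningRate * gradient(x);
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println("f'(4) = " + gradient(4.0));
        System.out.println("100 steps from x = 4: " + descend(4.0, 0.1, 100));
        System.out.println("100 steps from x = 0: " + descend(0.0, 0.1, 100));
    }
}
```

Under these settings the run started at $x = 4$ moves by only a few hundredths, while the run started at $x = 0$, where the gradient equals 1, travels well past $x = -1$ in the same number of steps.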
As we saw, optimization for deep learning is full of challenges. Fortunately there exists a robust range of algorithms that perform well and that are easy to use even for beginners. Furthermore, it is not really necessary to find the best solution. Local optima or even approximate solutions thereof are still very useful.