In [1]:
# %load /Users/facai/Study/book_notes/
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['axes.grid'] = False

#from IPython.display import SVG
def show_image(filename, figsize=None, res_dir=True):
    if figsize:
        plt.figure(figsize=figsize)
    if res_dir:
        filename = './res/{}'.format(filename)
    # display the image file without axis clutter
    plt.imshow(plt.imread(filename))
    plt.axis('off')


Chapter 8 Optimization for Training Deep Models

optimization: finding the parameters $\theta$ of a neural network that significantly reduce a cost function $J(\theta)$.

8.1 How Learning Differs from Pure Optimization

expectation is taken across the data generating distribution $p_{data}$ rather than just over the finite training set:

\begin{equation} J^*(\theta) = \mathbb{E}_{(x, y) \sim p_{data}} L(f(x; \theta), y) \end{equation}

8.1.1 Empirical Risk Minimization

Rather than optimizing the risk directly, we optimize the empirical risk, and hope that the risk decreases significantly as well.

8.1.2 Surrogate Loss Functions and Early Stopping

8.1.3 Batch and Minibatch Algorithms

Small batches can offer a regularizing effect.

gradient-based methods can handle small batch sizes like 100, while second-order methods typically require much larger batch sizes like 10,000.

minibatches must be selected randomly. For very large datasets, it is usually sufficient to shuffle the order of the dataset once and then store it in shuffled fashion.

On the second pass, the estimator becomes biased because it is formed by re-sampling values used before.
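The shuffle-once scheme can be sketched in a few lines of NumPy (the toy dataset and variable names are mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.arange(20).reshape(10, 2)   # toy dataset: 10 examples, 2 features
perm = rng.permutation(len(X))     # shuffle the example order once...
X_shuffled = X[perm]               # ...and store it in shuffled fashion

# Then read contiguous minibatches off the shuffled copy.
batch_size = 4
minibatches = [X_shuffled[i:i + batch_size]
               for i in range(0, len(X_shuffled), batch_size)]
```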

8.2 Challenges in Neural Network Optimization

8.2.1 Ill-Conditioning

To determine whether ill-conditioning is hurting training, one can monitor the squared gradient norm $g^T g$ and the $g^T H g$ term. In many cases, the gradient norm does not shrink significantly throughout learning, but the $g^T H g$ term grows by more than an order of magnitude.
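Both diagnostics are cheap once the gradient $g$ and a Hessian product are available; a minimal sketch with an explicit Hessian (only feasible for tiny models — function and variable names are mine):

```python
import numpy as np

def ill_conditioning_terms(g, H):
    """Return the squared gradient norm g^T g and the curvature term g^T H g."""
    return g @ g, g @ H @ g

g = np.array([1.0, 2.0])    # toy gradient
H = 2.0 * np.eye(2)         # toy Hessian
sq_norm, curvature = ill_conditioning_terms(g, H)
```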

8.2.2 Local Minima

model identifiability problem: models with latent variables are often not identifiable <= weight space symmetry (permuting hidden units leaves the function unchanged).

8.2.3 Plateaus, Saddle Points and Other Flat Regions

  • saddle point: local minimum along one cross-section, and local maximum along another cross-section.

    • in higher dimensional spaces, local minima are rare and saddle points are more common.
    • difficult for Newton's method, while easy for gradient descent.
  • maxima

  • wide, flat regions of constant value

8.2.4 Cliffs and Exploding Gradients

The dangerously large steps caused by cliffs can be avoided using the gradient clipping heuristic.
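Norm-based clipping is one common form of the heuristic; a minimal NumPy sketch (the threshold value is illustrative):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g to have L2 norm at most `threshold`, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([3.0, 4.0])            # norm 5: a "cliff" gradient
clipped = clip_gradient(g, 1.0)     # rescaled to norm 1, same direction
```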

8.2.5 Long-Term Dependencies

when the computational graph becomes extremely deep => vanishing and exploding gradient problems

Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function, while exploding gradients can make learning unstable.

8.2.6 Inexact Gradients

8.2.7 Poor Correspondence between Local and Global Structure

Many existing research directions are aimed at finding good initial points, rather than developing algorithms that use non-local moves.

8.2.8 Theoretical Limits of Optimization

8.3 Basic Algorithms

8.3.1 Stochastic Gradient Descent

In practice, it is necessary to gradually decrease the learning rate over time.

In practice, it is common to decay the learning rate linearly until iteration $\tau$:

\begin{equation} \epsilon_k = (1 - \alpha) \epsilon_0 + \alpha \epsilon_{\tau} \end{equation}
  • $\tau$: a few hundred passes through the training set
  • $\epsilon_\tau \approx 1\% \, \epsilon_0$
  • $\epsilon_0$: monitor the first several iterations and use a learning rate that is higher than the best-performing learning rate at this time, but not so high that it causes severe instability.
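The linear decay schedule above, with the $\epsilon_\tau \approx 1\%\,\epsilon_0$ rule of thumb built in, can be written as (a sketch; the function name is mine):

```python
def lr_schedule(k, eps0, tau, eps_tau=None):
    """Learning rate at iteration k: linear decay until tau, constant after."""
    if eps_tau is None:
        eps_tau = 0.01 * eps0       # rule of thumb: eps_tau ~ 1% of eps0
    alpha = min(k / tau, 1.0)       # fraction of the decay completed
    return (1 - alpha) * eps0 + alpha * eps_tau
```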

To study the convergence rate of an optimization algorithm, it is common to measure the excess error $J(\theta) - \min_\theta J(\theta)$.

  • when SGD is applied to a convex problem, the excess error is $O(\frac{1}{\sqrt{k}})$ after $k$ iterations
  • in the strongly convex case it is $O(\frac{1}{k})$.

8.3.2 Momentum

The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.

\begin{align} v &\gets \alpha v - \epsilon \nabla_\theta \left ( \frac1{m} \displaystyle \sum^m_{i = 1} L \left ( f(x^{(i)}; \theta), y^{(i)} \right ) \right ) \\ \theta &\gets \theta + v \end{align}

The larger $\alpha$ is relative to $\epsilon$, the more previous gradients affect the current direction.


with a constant gradient $g$, the terminal step size is $\frac{\epsilon \| g \|}{1 - \alpha} \implies$ it is thus helpful to think of the momentum hyperparameter in terms of $\frac1{1 - \alpha}$.

  • Common values of $\alpha$ used in practice include 0.5, 0.9 and 0.99.
  • Typically it begins with a small value and is later raised.
  • It is less important to adapt $\alpha$ over time than to shrink $\epsilon$ over time.
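The update pair above in code, applied to the toy objective $f(\theta)=\theta^2$ (function name and hyperparameter values are mine):

```python
def momentum_step(theta, v, grad, alpha=0.9, eps=0.01):
    """v <- alpha * v - eps * grad;  theta <- theta + v."""
    v = alpha * v - eps * grad
    return theta + v, v

theta, v = 1.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=2 * theta)  # gradient of theta^2
```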

8.3.3 Nesterov Momentum

Nesterov momentum: the gradient is evaluated after the current velocity is applied.

\begin{align} v &\gets \alpha v - \epsilon \nabla_\theta \left ( \frac1{m} \displaystyle \sum^m_{i = 1} L \left ( f(x^{(i)}; \theta + \color{blue}{\alpha v}), y^{(i)} \right ) \right ) \\ \theta &\gets \theta + v \end{align}
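The only change from standard momentum is the look-ahead point $\theta + \alpha v$ at which the gradient is taken; a sketch on the same toy objective $f(\theta)=\theta^2$ (function name is mine):

```python
def nesterov_step(theta, v, grad_fn, alpha=0.9, eps=0.01):
    """Evaluate the gradient at the look-ahead point theta + alpha * v."""
    g = grad_fn(theta + alpha * v)
    v = alpha * v - eps * g
    return theta + v, v

theta, v = 1.0, 0.0
for _ in range(200):
    theta, v = nesterov_step(theta, v, lambda t: 2 * t)  # gradient of t^2
```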


8.4 Parameter Initialization Strategies

Designing improved initialization strategies is a difficult task because neural network optimization is not yet well understood.

A further difficulty is that some initial points may be beneficial from the viewpoint of optimization but detrimental from the viewpoint of generalization.

complete certainty: break symmetry between different units

  • initialize each unit to compute a different function from all of the other units.
  • random initialization of the parameters.
    • Typically, set biases for each unit to heuristically chosen constants, and initialize only the weights randomly.

We can think of initializing the parameters $\theta$ to $\theta_0$ as being similar to imposing a Gaussian prior $p(\theta)$ with mean $\theta_0$.
$\implies$ choose $\theta_0$ to be near 0 = more likely that units do not interact with each other than that they do interact.

  1. normalized initialization: $W_{i, j} \sim U \left ( - \sqrt{\frac{6}{m + n}}, \sqrt{\frac{6}{m + n}} \right )$
  2. initializing to random orthogonal matrices
  3. preserve norms
  4. sparse initialization
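Strategy 1 (normalized initialization) for a fully connected layer with $m$ inputs and $n$ outputs, as a sketch (function name is mine):

```python
import numpy as np

def normalized_init(m, n, rng=None):
    """Draw W[i, j] ~ U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

W = normalized_init(100, 50)        # limit = sqrt(6/150) = 0.2
```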

A good rule of thumb for choosing the initial scales is to look at the range or standard deviation of activations or gradients on a single minibatch of data.

  1. Setting the biases to zero is compatible with most weight initialization schemes.
  2. a few situations where we may set some biases to non-zero values:
    • for an output unit, it is often beneficial to initialize the bias to obtain the right marginal statistics of the output.
    • choose the bias to avoid causing too much saturation at initialization.
      eg: set the bias of ReLU hidden unit to 0.1 rather than 0
    • Sometimes a unit controls whether other units are able to participate in a function => set its bias so that all units have a chance to learn.
Another strategy: initialize model parameters using machine learning.

eg: to initialize a supervised model with the parameters learned by an unsupervised model trained on the same inputs.

8.5 Algorithms with Adaptive Learning Rates

the cost is often highly sensitive to some directions in parameter space and insensitive to others. => adapt the learning rates of model parameters.

8.5.1 AdaGrad

gradient accumulation

$\text{rate} \propto \frac1{\delta + \sqrt{\sum \text{squared gradients}}}$, so parameters with consistently large gradients get rapidly shrinking learning rates.
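A scalar sketch of one AdaGrad update (function name and hyperparameter values are mine):

```python
import numpy as np

def adagrad_step(theta, r, grad, eps=0.1, delta=1e-7):
    """Accumulate squared gradients in r; scale the step by 1/sqrt(r)."""
    r = r + grad ** 2
    theta = theta - eps * grad / (delta + np.sqrt(r))
    return theta, r

theta, r = adagrad_step(theta=1.0, r=0.0, grad=2.0)
```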

8.5.2 RMSProp

changing the gradient accumulation into an exponentially weighted moving average.
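The same update with the accumulator changed to an exponentially weighted moving average with decay rate $\rho$, as a sketch (names and values are mine):

```python
import numpy as np

def rmsprop_step(theta, r, grad, eps=0.01, rho=0.9, delta=1e-6):
    """r is an exponential moving average of squared gradients."""
    r = rho * r + (1 - rho) * grad ** 2
    theta = theta - eps * grad / np.sqrt(delta + r)
    return theta, r

theta, r = rmsprop_step(theta=1.0, r=0.0, grad=2.0)
```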

8.5.3 Adam

adaptive moments: combination of RMSProp and momentum
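A scalar sketch of one Adam step, showing both moment estimates and their bias correction (function name is mine; hyperparameter defaults follow the usual presentation):

```python
import numpy as np

def adam_step(theta, s, r, t, grad, eps=0.001,
              rho1=0.9, rho2=0.999, delta=1e-8):
    """First moment s (momentum-like), second moment r (RMSProp-like)."""
    s = rho1 * s + (1 - rho1) * grad            # biased first moment
    r = rho2 * r + (1 - rho2) * grad ** 2       # biased second moment
    s_hat = s / (1 - rho1 ** t)                 # bias correction at step t
    r_hat = r / (1 - rho2 ** t)
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

theta, s, r = adam_step(theta=1.0, s=0.0, r=0.0, t=1, grad=2.0)
```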

Currently, the most popular optimization algorithms actively in use include

  • SGD,
  • SGD with momentum,
  • RMSProp,
  • RMSProp with momentum,
  • AdaDelta and Adam.

The choice of which algorithm to use, at this point, seems to depend largely on the user’s familiarity with the algorithm (for ease of hyperparameter tuning).

8.6 Approximate Second-Order Methods

8.6.1 Newton's Method

a two-step iterative procedure:

  • update or compute the inverse Hessian
  • update the parameters: $\theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta J(\theta_0)$
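On a quadratic objective the two steps are exact: a single Newton update from any starting point lands on the minimizer. A small sketch (toy Hessian and values are mine; `solve` avoids forming the inverse explicitly):

```python
import numpy as np

# J(theta) = 0.5 * theta^T A theta - b^T theta, so H = A and grad = A theta - b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # positive definite toy Hessian
b = np.array([1.0, 1.0])

theta0 = np.zeros(2)
grad = A @ theta0 - b                            # gradient at theta0
theta_star = theta0 - np.linalg.solve(A, grad)   # theta0 - H^{-1} grad
```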

8.6.2 Conjugate Gradients

8.6.3 BFGS


8.7 Optimization Strategies and Meta-Algorithms

8.7.1 Batch Normalization

adaptive reparametrization => training very deep models
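The reparametrization at the heart of batch normalization, for one layer's minibatch of activations H (rows = examples): normalize each unit over the batch, then rescale and shift with learnable gamma and beta. A minimal forward-pass sketch (names are mine):

```python
import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    """Normalize each column of H to zero mean / unit variance over the batch."""
    mu = H.mean(axis=0)
    sigma2 = H.var(axis=0)
    H_norm = (H - mu) / np.sqrt(sigma2 + eps)
    return gamma * H_norm + beta

rng = np.random.default_rng(0)
H = rng.normal(5.0, 3.0, size=(64, 4))     # minibatch of 64 examples, 4 units
out = batch_norm(H, gamma=1.0, beta=0.0)
```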

8.7.2 Coordinate Descent

weakness: performs poorly when the variables are strongly dependent on one another.

8.7.3 Polyak Averaging

averaging points.

8.7.4 Supervised Pretraining

training simpler models on simpler tasks first => then make the model more complex.

8.7.5 Designing Models to Aid Optimization

In practice, it is more important to choose a model family that is easy to optimize than to use a powerful optimization algorithm.

8.7.6 Continuation Methods and Curriculum Learning
