In this short notebook, we will see how to use the gradients obtained with autograd to optimize an objective function.
We will then present some off-the-shelf PyTorch optimizers, loss functions, and learning rate schedulers.
import sys
import torch
if 'google.colab' in sys.modules:  # Execute if you're using Google Colab
    !wget -q https://raw.githubusercontent.com/theevann/amld-pytorch-workshop/master/live_plot.py -O live_plot.py
    !pip install -q ipympl
    %matplotlib ipympl
torch.set_printoptions(precision=3)
We will start with a simple example: finding the zero of the cube function.
def f(x):
    return x ** 3
We will search for the zero of the function $f$ "by hand" by minimizing $\|f(x)\|_1$ using the gradient descent algorithm.
We will call $\|f(x)\|_1$ the loss, as this is the value we wish to minimize.
As a reminder, the update step of the gradient descent algorithm on a loss $\mathcal{L}$ is: $$x_{t+1} = x_{t} - \lambda \nabla_x \mathcal{L}(x_t)$$
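For instance, with $f(x) = x^3$ and a positive $x$, the loss is $x^3$ and its gradient is $3x^2$: starting from $x_0 = 7$ with $\lambda = 0.01$, one step gives $x_1 = 7 - 0.01 \cdot 3 \cdot 7^2 = 5.53$.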
Note:
- Use .abs() to compute the loss.
- The gradient of the loss with respect to x is available in x.grad once we run the backward function.
- Don't forget to clear x.grad after each iteration.
- Use the with torch.no_grad(): context for the update step, since we want to change x in place but don't want autograd to track this change.

x0 = 7
lr = 0.01
iterations = 10
x = torch.Tensor([x0]).requires_grad_()
y = f(x)
for i in range(iterations):
    # YOUR TURN
    # Compute y, the loss and the gradient of the loss wrt. x

    with torch.no_grad():
        ...  # Update x
        ...  # Clear the gradient of x

    print(y.item())
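If you want to check your answer, here is one possible way to complete the loop (a sketch using only the operations listed in the note above):

# One possible completion (sketch) of the gradient descent loop above
x = torch.Tensor([x0]).requires_grad_()

for i in range(iterations):
    y = f(x)
    loss = y.abs()        # the value we want to minimize
    loss.backward()       # fills x.grad with the gradient of the loss

    with torch.no_grad():
        x -= lr * x.grad  # gradient descent update
        x.grad.zero_()    # clear the gradient for the next iteration

    print(y.item())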
Why do we use with torch.no_grad()?

Because x "requires grad", any operation we apply to x is recorded for automatic differentiation. As we don't want to track the update step of the parameters, we need to "tell" autograd not to track this change. This is done by using the torch.no_grad() context.
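As a small illustration (a sketch, not part of the exercise): an in-place update of a tensor that "requires grad" raises an error outside of the context, while the same update inside torch.no_grad() goes through untracked.

# Sketch: the same update with and without torch.no_grad()
x = torch.Tensor([1.]).requires_grad_()
f(x).abs().backward()

try:
    x -= 0.1 * x.grad      # in-place update on a leaf tensor that requires grad
except RuntimeError as e:
    print("Outside no_grad:", e)

with torch.no_grad():
    x -= 0.1 * x.grad      # same update, not tracked by autograd
print("Inside no_grad:", x)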
PyTorch provides most common optimization algorithms encapsulated into "optimizer classes".
An optimizer is an object that automatically loops through all the numerous parameters of your model and performs the (potentially complex) update step for you.
You first need to import torch.optim.
import torch.optim as optim
Below are the most commonly used optimizers. Each of them has its own specific parameters, which you can check in the PyTorch documentation.
parameters = [x] # This should be the list of model parameters
optimizer = optim.SGD(parameters, lr=0.01, momentum=0.9)
optimizer = optim.Adam(parameters, lr=0.01)
optimizer = optim.Adadelta(parameters, lr=0.01)
optimizer = optim.Adagrad(parameters, lr=0.01)
optimizer = optim.RMSprop(parameters, lr=0.01)
optimizer = optim.LBFGS(parameters, lr=0.01)
# and there is more ...
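To connect this with what we did by hand: for plain SGD (no momentum), optimizer.step() applies exactly the x -= lr * x.grad update we wrote ourselves. A small sketch reusing the cube function f from above:

# Sketch: optimizer.step() performs the update we previously did by hand
x = torch.Tensor([7.]).requires_grad_()
optimizer = optim.SGD([x], lr=0.01)

f(x).abs().backward()
optimizer.step()   # equivalent here to: with torch.no_grad(): x -= 0.01 * x.grad
print(x)           # tensor([5.530], requires_grad=True)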
PyTorch comes with a lot of predefined loss functions:
L1Loss
MSELoss
CrossEntropyLoss
NLLLoss
PoissonNLLLoss
KLDivLoss
BCELoss
...
Check out the PyTorch Documentation.
loss_function = torch.nn.L1Loss()
x = torch.Tensor([1,1,1])
y = torch.Tensor([1,2,3])
loss_function(x, y)
This $L_1$ loss function (loss_function above) computes the average absolute difference between $x$ and $y$: $\dfrac{1}{\mathrm{size}(x)} \| x - y \|_{1}$.
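For the tensors above, that is $(|1-1| + |1-2| + |1-3|) / 3 = 1$.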
Now, let's use an optimizer to do the optimization!
You will need 2 new functions:
- optimizer.zero_grad(): this function sets the gradient of the parameters (x here) to 0 (otherwise it will get accumulated)
- optimizer.step(): this function applies an update step

We will also use a loss function, which you need to call with y and y_target.
x0 = 8
lr = 0.01
iterations = 10
x = torch.Tensor([x0]).requires_grad_()
y = f(x)
y_target = torch.Tensor([0])
# Define your optimizer
optimizer = ...      # < YOUR CODE HERE >
loss_function = ...  # < YOUR CODE HERE >

for i in range(iterations):
    # < YOUR CODE HERE >
    print(y.data)
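One possible completion (a sketch using SGD and the L1 loss; any of the optimizers listed above would work):

# One possible completion (sketch) of the optimizer-based loop above
x = torch.Tensor([x0]).requires_grad_()
y_target = torch.Tensor([0])

optimizer = optim.SGD([x], lr=lr)
loss_function = torch.nn.L1Loss()

for i in range(iterations):
    optimizer.zero_grad()              # reset x.grad
    y = f(x)
    loss = loss_function(y, y_target)  # here simply |x^3 - 0|
    loss.backward()                    # compute x.grad
    optimizer.step()                   # apply the SGD update
    print(y.data)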
In addition to an optimizer, a learning rate scheduler can be used to adjust the learning rate during training by reducing it according to a pre-defined schedule.
Below are some of the schedulers available in PyTorch.
optim.lr_scheduler.LambdaLR
optim.lr_scheduler.ExponentialLR
optim.lr_scheduler.MultiStepLR
optim.lr_scheduler.StepLR
# and some more ...
Let's try optim.lr_scheduler.ExponentialLR:
def f(x):
    return x.abs() * 5
x0 = 8
lr = 0.5
iterations = 150
x = torch.Tensor([x0]).requires_grad_()
y_target = torch.Tensor([0])
optimizer = optim.SGD([x], lr=lr)
loss_function = torch.nn.L1Loss()
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, 0.8)
for i in range(iterations):
    optimizer.zero_grad()
    y = f(x)
    loss = loss_function(y, y_target)
    loss.backward()
    optimizer.step()
    scheduler.step()

    print(y.data, " | lr : ", optimizer.param_groups[0]['lr'])
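With ExponentialLR, the learning rate is multiplied by gamma (0.8 here) at every scheduler.step() call, so after $t$ iterations it is $0.5 \cdot 0.8^{t}$.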
Below are some live plots to see what actually happens when you optimize a function.
You can play with learning rates, optimizers, and also define new functions to optimize!
Note: These are not, strictly speaking, live plots, as it is not possible to do so in Colab. We actually create a video of the optimization process instead.
from live_plot import anim_2d
def function_2d(x):
    return x ** 2 / 20 + x.sin().tanh()
x0 = 8
lr = 2
iterations = 15
points = []
x_range = torch.arange(-10, 10, 0.1)
x = torch.Tensor([x0]).requires_grad_()
optimizer = torch.optim.Adam([x], lr=lr)
for i in range(iterations):
    optimizer.zero_grad()
    f = function_2d(x)
    f.backward()

    points += [(x.item(), f.item())]
    optimizer.step()
anim_2d(x_range, function_2d, points, 400)