< Data | Autograd | Optimization >¶

Automatic Differentiation¶

Automatic differentiation (autodiff) is a key feature of PyTorch. PyTorch can differentiate the outcome of a computation built from its differentiable operations with respect to its inputs, so you don't need to compute the gradients yourself. This allows you to express and optimize complex models without worrying about differentiating the model correctly.

We will start by discussing a little bit of the math behind autodiff. We then cover PyTorch's .backward() method that does everything automatically for you. Finally, we have a quick look under the hood to see how PyTorch does its magic.

Table of Contents¶

1. Understanding gradient computation ¶

2. A linear regression example ¶

3. Useful features ¶

4. Advanced topics ¶

In [ ]:

import torch
torch.__version__

Understanding gradient computation¶

Requires grad attribute¶

The requires_grad attribute of a Tensor tells PyTorch to track the computations that involve this tensor. After you compute a quantity y (forward pass), you can compute the gradient of y with respect to all tensors that have requires_grad == True.

In [ ]:

x = torch.Tensor([2])
print(x)

In [ ]:

print(x.requires_grad)

In [ ]:

x.requires_grad_(True)  # note the trailing underscore: this modifies x in place

In [ ]:

print(x.requires_grad)

Checking how Autograd tracks operations¶

In [ ]:

y = x * x
print(y)

In [ ]:

z = y + 4
print(z)

In [ ]:

print(z.requires_grad)

In [ ]:

print(z.grad_fn)

Computing gradients with `.backward()`¶

The gradient computation (the backward pass) is triggered with z.backward(). This computes the gradient of z with respect to x and stores it in x.grad.

Before the backward pass, x.grad is None.

In [ ]:

print(x.grad)

In [ ]:

z.backward()

In [ ]:

print(x.grad)

Here, we have $z(x) = x^2 + 4$, and therefore $\frac{dz}{dx}(x) = 2x$. With x = 2, we can indeed check that x.grad equals 2 * x = 4.
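The check described above can also be done in code rather than by eye. The following is a minimal sketch that rebuilds the same example from scratch; the name `expected` is introduced here only for illustration.

```python
import torch

# Recreate the example: z(x) = x^2 + 4, evaluated at x = 2.
x = torch.tensor([2.0], requires_grad=True)
z = x * x + 4
z.backward()

# The analytic derivative is dz/dx = 2 * x.
# detach() gives a copy of x that is not tracked by autograd.
expected = 2 * x.detach()

print(x.grad, expected)
print(torch.allclose(x.grad, expected))
```

Comparing with torch.allclose instead of == is a habit worth keeping for floating-point gradients, even though the values here happen to be exact.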

This simple polynomial expression is easy enough to differentiate by hand. When expressions become tensor-valued and more complex, however, computing gradients becomes tedious and error-prone. The power of PyTorch is that it can compute gradients of any tensor with respect to its 'inputs' automatically. This greatly simplifies the optimization of complex, creative ML models.

Remember:

tensor.requires_grad
tensor.grad
tensor.backward()

A linear regression example¶

We consider a simple linear regression with the squared-error loss $loss = (x \cdot W + b - y)^{2}$.

Let's create a sample data point (x, y) and our regression parameters W and b.

Since we want to update W and b, we need the gradients of the loss with respect to them, so we set their requires_grad attribute to True.

In [ ]:

x = torch.Tensor([1,2,3])
y = torch.Tensor([1])

W = torch.rand((3,1), requires_grad=True)
b = torch.rand(1, requires_grad=True)
print(W, "\n\n", b)

In [ ]:

loss = (x @ W + b - y) ** 2
print(loss)

Before calling .backward(), all gradients are None.

In [ ]:

print(W.grad, b.grad)

In [ ]:

loss.backward()

After calling .backward(), the gradients of all parameters have been computed!

In [ ]:

print(W.grad, "\n\n", b.grad)

Note: No gradient of the loss is computed with respect to x and y, because they do not require gradients.

In [ ]:

print(x.grad, y.grad)
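To build some trust in these numbers, the gradients can be compared with a hand derivation. Writing $r = x \cdot W + b - y$, the chain rule gives $\partial loss / \partial W = 2 r x$ and $\partial loss / \partial b = 2 r$. The sketch below checks this; the names `r`, `manual_grad_W` and `manual_grad_b` and the fixed seed are assumptions made for this illustration only.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the sketch reproducible

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([1.0])
W = torch.rand((3, 1), requires_grad=True)
b = torch.rand(1, requires_grad=True)

loss = (x @ W + b - y) ** 2
loss.backward()

# Hand-derived gradients via the chain rule, with r = x @ W + b - y:
#   d loss / d W = 2 * r * x   (reshaped to the shape of W)
#   d loss / d b = 2 * r
r = (x @ W + b - y).detach()
manual_grad_W = (2 * r * x).reshape(3, 1)
manual_grad_b = 2 * r

print(torch.allclose(W.grad, manual_grad_W))
print(torch.allclose(b.grad, manual_grad_b))
```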

Gradients accumulate !¶

In [ ]:

loss = (x @ W + b - y) ** 2
loss.backward()

In [ ]:

print(W.grad, "\n\n", b.grad)

You see that after the second backward pass, the stored gradient is twice as big. This is because .backward() does not overwrite .grad: it adds the newly computed gradients to the values already stored there.

If you want fresh gradient values, you need to set the .grad attributes of the parameters to zero before you call .backward().
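One way to reset the gradients is the in-place .zero_() method on each .grad tensor. The following is a minimal sketch that repeats the regression example; `first_grad_W` is a name introduced here for the comparison.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([1.0])
W = torch.rand((3, 1), requires_grad=True)
b = torch.rand(1, requires_grad=True)

# First backward pass: .grad is populated.
loss = (x @ W + b - y) ** 2
loss.backward()
first_grad_W = W.grad.clone()

# Reset the stored gradients in place before the next backward pass.
W.grad.zero_()
b.grad.zero_()

# Second backward pass on the same data. Without the reset above,
# W.grad would now hold twice the first gradient.
loss = (x @ W + b - y) ** 2
loss.backward()

print(torch.allclose(W.grad, first_grad_W))
```

When the parameters are managed by a torch.optim optimizer, calling its zero_grad() method is the usual way to do this for all parameters at once.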

Useful features¶

Skipping history tracking with `torch.no_grad()`¶

After you have trained a model, you often just want to use it without computing gradients. Building a computation graph for every operation would be wasteful if you don't need it. You can skip this history tracking by wrapping your code in a with torch.no_grad(): block.

In [ ]:

x = torch.randn(3, requires_grad=True)
print("x.requires_grad : ", x.requires_grad)

y = (x ** 2)
print("y.requires_grad : ", y.requires_grad)

with torch.no_grad():
    y = (x ** 2)
    print("y.requires_grad : ", y.requires_grad)

Any tensor computed by an operation inside the no_grad context will have requires_grad == False, even if its inputs have requires_grad == True.

Dropping history with `.detach()`¶

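Some tensors are computed from others, but you may want to treat them as constants without computation history (such tensors are leaf variables). For that, you can use the .detach() method, which returns a new tensor that shares the same data but is cut off from the computation graph.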

In [ ]:

A = torch.rand(1,2, requires_grad=True)
B = A.mean()

print("B : ", B)
print("B.requires_grad :", B.requires_grad)
print("B.grad_fn :", B.grad_fn)

C = B.detach()
print("\n-- C = B.detach() -- \n")

print("C : ", C)
print("C.requires_grad :", C.requires_grad)
print("C.grad_fn :", C.grad_fn)
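The effect of .detach() on the backward pass is easiest to see by comparing gradients with and without it. The sketch below uses a small example that is not part of the cells above: `a`, `c`, `grad_full` and `grad_detached` are names chosen here for illustration.

```python
import torch

a = torch.tensor([2.0], requires_grad=True)

# Without detach: out = a * (a * a) = a^3, so d out / d a = 3 * a^2 = 12.
out = a * (a * a)
out.backward()
grad_full = a.grad.clone()

a.grad.zero_()  # reset, because gradients accumulate

# With detach: c is treated as the constant 4, so d out / d a = c = 4.
c = (a * a).detach()
out = a * c
out.backward()
grad_detached = a.grad.clone()

print("gradient through the full graph :", grad_full)
print("gradient with a * a detached    :", grad_detached)
```

The forward value of out is the same in both cases; only the path that the gradient can follow back to a differs.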

Advanced topics¶

Leaves vs Nodes¶

Advanced

PyTorch's autograd mechanism differentiates between two types of tensors:

node variables are the result of a PyTorch operation on tensors that require gradients
leaf variables are directly created by the user

We can use the .is_leaf property to differentiate between the two types.

In [ ]:

A = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
B = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True) + 2  # B is the result of an operation (+)
C = 5 * A  # C is the result of an operation (*)
print("A.is_leaf :", A.is_leaf)
print("B.is_leaf :", B.is_leaf)
print("C.is_leaf :", C.is_leaf)

When .backward() is called, by default only the leaf variables that require gradients have their gradients stored in their .grad attribute.

Differentiating w.r.t. intermediate values: `.retain_grad()`¶

Advanced

When doing the backward pass, Autograd computes the gradient of the output with respect to every intermediate variable in the computation graph. However, by default, only the gradients of variables that were created by the user (leaves) and have requires_grad set to True are saved.

Indeed, most of the time when training a model, you only need the gradient of a loss w.r.t. your model parameters (which are leaf variables).

In [ ]:

A = torch.Tensor([[1, 2], [3, 4]])
A.requires_grad_()

B = 5 * (A + 3)
C = B.mean()

print("A.grad :", A.grad)
print("B.grad :", B.grad)
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
print("B.grad :", B.grad)

In [ ]:

A = torch.Tensor([[1, 2], [3, 4]])
A.requires_grad_()

B = 5 * (A + 3)
B.retain_grad()  # <----- This line lets us access the gradient w.r.t. B after the backward pass
C = B.mean()


print("A.grad :", A.grad)
print("B.grad :", B.grad)
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
print("B.grad :", B.grad)

Inspecting PyTorch's computation graph¶

Advanced

You can explore how PyTorch keeps track of history by inspecting the tensor.grad_fn attribute:

In [ ]:

x = torch.tensor([2.0], requires_grad=True)
z = x * x + 4  # same computation as in the first section
print(z.grad_fn)
print(z.grad_fn.next_functions[0][0])
print(z.grad_fn.next_functions[0][0].next_functions[0][0])

Each value has a grad_fn corresponding to the operation that produced the value. Each operation's grad_fn points to its inputs through next_functions. For each input, next_functions contains a tuple of the input's grad_fn and, if the operation had multiple outputs, an index of the relevant output.

In [ ]:

# In our example, the final `add` operation has two inputs:
# - The first is the output of `multiplication`.
# - The second is a constant `4` for which we don't require a gradient.
z.grad_fn.next_functions
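Following next_functions by hand gets tedious for larger graphs. The sketch below walks the graph recursively and prints one node per line; `print_graph` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
import torch

def print_graph(fn, depth=0):
    """Recursively print a grad_fn node and the nodes it points to."""
    indent = "  " * depth
    if fn is None:
        # None marks an input that does not require a gradient (e.g. a constant).
        print(indent + "None")
        return
    print(indent + type(fn).__name__)
    # Each entry is a tuple (grad_fn of the input, index of the relevant output).
    for next_fn, _output_index in fn.next_functions:
        print_graph(next_fn, depth + 1)

x = torch.tensor([2.0], requires_grad=True)
z = x * x + 4
print_graph(z.grad_fn)
```

For this example, the printed tree starts at the addition node, descends into the multiplication node, and ends at the nodes that accumulate the gradient into x.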
