Automatic differentiation (autodiff) is a key feature of PyTorch. PyTorch can differentiate the outcome of any computation with respect to its inputs, so you don't need to compute the gradients yourself. This allows you to express and optimize complex models without worrying about differentiating them correctly.
We will start by discussing a little bit of the math behind autodiff. We then cover PyTorch's .backward() method, which does everything automatically for you. Finally, we have a quick look under the hood to see how PyTorch does its magic.
import torch
torch.__version__
The requires_grad property on a Tensor tells PyTorch to track computations based on this tensor.
After you compute a quantity y (forward pass), you can compute the gradient of y with respect to all tensors that have requires_grad==True.
x = torch.Tensor([2])
print(x)
print(x.requires_grad)
x.requires_grad_(True) # note the underscore
print(x.requires_grad)
y = x * x
print(y)
z = y + 4
print(z)
print(z.requires_grad)
print(z.grad_fn)
.backward()
The gradient computation (the backward pass) is triggered with z.backward(). You will find the computed gradients in x.grad.
This computes the gradient of z with respect to x.
print(x.grad)
z.backward()
print(x.grad)
Here, we have $z(x) = x^2 + 4$, therefore $\frac{dz}{dx}(x) = 2 x$. We can indeed check that x.grad = 2*x: with x = 2, the gradient is 4.
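The check described above can also be done programmatically. The following is a minimal sketch that rebuilds the same computation so it runs on its own; the variable name expected is an illustrative choice, not part of the original notebook.

```python
import torch

# Same computation as above, rebuilt so the snippet is self-contained.
x = torch.tensor([2.0], requires_grad=True)
z = x * x + 4
z.backward()

# Analytic derivative dz/dx = 2x, evaluated on the plain value of x.
expected = 2 * x.detach()
print(x.grad, expected)
assert torch.allclose(x.grad, expected)
```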
This simple polynomial expression is easy enough to differentiate by hand. When expressions become tensor-valued and more complex, however, computing gradients becomes tedious and error-prone. The power of PyTorch is that it can compute gradients of any tensor with respect to its 'inputs' automatically. This greatly simplifies the optimization of complex, creative ML models.
Remember:
tensor.requires_grad
tensor.grad
tensor.backward()
We have the simple linear regression $loss = (x \cdot W + b - y)^{2}$
Let's create our sample data point x and y, and our regression parameters W and b.
Since we want to update W and b, we need the gradient with respect to them, so we set their requires_grad attribute to True.
x = torch.Tensor([1,2,3])
y = torch.Tensor([1])
W = torch.rand((3,1), requires_grad=True)
b = torch.rand(1, requires_grad=True)
print(W, "\n\n", b)
loss = (x @ W + b - y) ** 2
print(loss)
Before calling backward, all gradients are None.
print(W.grad, b.grad)
loss.backward()
After calling backward, the gradients of all parameters have been computed!
print(W.grad, "\n\n", b.grad)
Note: No gradient of the loss is computed with respect to x and y since they do not require gradient.
print(x.grad, y.grad)
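For this small model the gradients can still be derived by hand: with the residual $r = x \cdot W + b - y$, we get $\frac{\partial loss}{\partial W} = 2 r x^{T}$ and $\frac{\partial loss}{\partial b} = 2 r$. The sketch below compares these hand-derived values with what autograd stores. The names residual, grad_W_manual and grad_b_manual are illustrative, and the seed is arbitrary.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the run reproducible

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([1.0])
W = torch.rand((3, 1), requires_grad=True)
b = torch.rand(1, requires_grad=True)

loss = (x @ W + b - y) ** 2
loss.backward()

# Hand-derived gradients. detach() gives plain values without history,
# which is all we need for the comparison.
residual = (x @ W + b - y).detach()            # shape (1,)
grad_W_manual = 2 * residual * x.unsqueeze(1)  # shape (3, 1)
grad_b_manual = 2 * residual                   # shape (1,)

print(torch.allclose(W.grad, grad_W_manual))
print(torch.allclose(b.grad, grad_b_manual))
```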
loss = (x @ W + b - y) ** 2
loss.backward()
print(W.grad, "\n\n", b.grad)
You see that the second time, the computed gradient is twice as big. This is because .backward() accumulates the gradients.
If you want fresh gradient values, you need to set the .grad attributes of the parameters to zero before you call .backward().
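One way to reset the gradients is to zero them in place between two backward passes. This is a minimal sketch under the same setup as above; compute_loss and first_W_grad are hypothetical helper names introduced only for this illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([1.0])
W = torch.rand((3, 1), requires_grad=True)
b = torch.rand(1, requires_grad=True)

def compute_loss():
    # Hypothetical helper: rebuilds the forward pass on every call.
    return (x @ W + b - y) ** 2

compute_loss().backward()
first_W_grad = W.grad.clone()

# Zero the accumulated gradients in place before the next backward pass.
W.grad.zero_()
b.grad.zero_()

compute_loss().backward()
print(torch.allclose(W.grad, first_W_grad))  # same values as the first pass
```

When the parameters are managed by a torch.optim optimizer, calling its zero_grad() method serves the same purpose.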
torch.no_grad()
After you have trained a model, you often just want to use it without computing gradients.
Building a computation graph for every operation would be wasteful if you don't need it.
Therefore, you can skip these operations by wrapping your code in the with torch.no_grad(): context.
x = torch.randn(3, requires_grad=True)
print("x.requires_grad : ", x.requires_grad)
y = (x ** 2)
print("y.requires_grad : ", y.requires_grad)
with torch.no_grad():
    y = (x ** 2)
print("y.requires_grad : ", y.requires_grad)
Any tensor computed within the no_grad context will have requires_grad==False.
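torch.no_grad() can also be applied as a decorator, which is convenient for inference helpers. The sketch below assumes a hypothetical predict function built on the linear model from earlier; it is an illustration, not part of the original notebook.

```python
import torch

W = torch.rand((3, 1), requires_grad=True)
b = torch.rand(1, requires_grad=True)

@torch.no_grad()  # decorator form: the whole function runs without graph tracking
def predict(x):
    # Hypothetical inference helper for the linear model.
    return x @ W + b

pred = predict(torch.tensor([1.0, 2.0, 3.0]))
print(pred.requires_grad)  # False
print(pred.grad_fn)        # None
print(W.requires_grad)     # True: the parameters themselves are unchanged
```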
.detach()
Some tensors are computed from others, but you may want to consider them constants without computation history (called leaf variables). For that, you can use the .detach() method.
A = torch.rand(1,2, requires_grad=True)
B = A.mean()
print("B : ", B)
print("B.requires_grad :", B.requires_grad)
print("B.grad_fn :", B.grad_fn)
C = B.detach()
print("\n-- C = B.detach() -- \n")
print("C : ", C)
print("C.requires_grad :", C.requires_grad)
print("C.grad_fn :", C.grad_fn)
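To see the effect of .detach() on the gradients themselves, the following sketch differentiates $x \cdot x$ twice: once with both factors tracked, and once with the second factor detached so that it acts as a constant. The name full_grad is illustrative.

```python
import torch

x = torch.tensor([3.0], requires_grad=True)

# Both factors are tracked: d(x * x)/dx = 2x = 6
(x * x).backward()
full_grad = x.grad.clone()
print(full_grad)  # tensor([6.])

x.grad = None  # discard the accumulated gradient before the second pass

# The second factor is detached and treated as the constant 3:
# d(x * c)/dx = c = 3
(x * x.detach()).backward()
print(x.grad)  # tensor([3.])
```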
Advanced
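PyTorch's autograd mechanism differentiates between two types of tensors:
leaf tensors, which are created directly by the user
non-leaf tensors, which are the result of an operation on other tensors
We can use the .is_leaf property to differentiate between the two types.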
A = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
B = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True) + 2 # B is the result of an operation (+)
C = 5 * A # C is the result of an operation (*)
print("A.is_leaf :", A.is_leaf)
print("B.is_leaf :", B.is_leaf)
print("C.is_leaf :", C.is_leaf)
When .backward() is called, only the leaf variables have their gradients stored in their .grad attribute.
.retain_grad()
Advanced
When doing the backward pass, autograd computes the gradient of the output with respect to every intermediate variable in the computation graph. However, by default, only the gradients of variables that were created by the user (leaf variables) and have requires_grad set to True are saved.
Indeed, most of the time when training a model you only need the gradient of a loss w.r.t. your model parameters (which are leaf variables).
A = torch.Tensor([[1, 2], [3, 4]])
A.requires_grad_()
B = 5 * (A + 3)
C = B.mean()
print("A.grad :", A.grad)
print("B.grad :", B.grad)
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
print("B.grad :", B.grad)
A = torch.Tensor([[1, 2], [3, 4]])
A.requires_grad_()
B = 5 * (A + 3)
B.retain_grad() # <----- This line lets us access the gradient w.r.t. B after the backward pass
C = B.mean()
print("A.grad :", A.grad)
print("B.grad :", B.grad)
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
print("B.grad :", B.grad)
Advanced
You can explore how PyTorch keeps track of history by inspecting the tensor.grad_fn attribute:
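# Rebuild a small graph first: x and y were overwritten in the sections above.
x = torch.tensor([2.], requires_grad=True)
y = x * x + 4
print(y.grad_fn)
print(y.grad_fn.next_functions[0][0])
print(y.grad_fn.next_functions[0][0].next_functions[0][0])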
Each value has a grad_fn corresponding to the operation that produced the value.
Each operation's grad_fn points to its inputs through next_functions.
For each input, next_functions contains a tuple of the input's grad_fn and, if the operation had multiple outputs, an index of the relevant output.
# In our example, the final `add` operation has two inputs:
# - The first is the output of `multiplication`.
# - The second is a constant `4` for which we don't require a gradient.
y.grad_fn.next_functions
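Instead of following next_functions by hand, the graph can be walked recursively. The sketch below uses a hypothetical helper, print_graph, that is not part of PyTorch; it prints the class name of each node and indents the nodes it points to. It assumes that every node exposes next_functions, as the nodes inspected above do.

```python
import torch

def print_graph(fn, depth=0):
    """Hypothetical helper: recursively print a grad_fn node and its inputs."""
    indent = "  " * depth
    if fn is None:
        # Inputs that do not require a gradient have no node in the graph.
        print(indent + "None (input that does not require a gradient)")
        return
    print(indent + type(fn).__name__)
    for next_fn, _output_index in fn.next_functions:
        print_graph(next_fn, depth + 1)

x = torch.tensor([2.0], requires_grad=True)
out = x * x + 4
print_graph(out.grad_fn)
```

The top-level node corresponds to the addition, its first child to the multiplication, and the None entry to the constant 4.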