In this notebook, we explain how Autograd, the automatic differentiation module of PyTorch, works.
We first show how to compute the gradient with respect to a specific variable and how to check its value. Then we use the backward function to do the gradient computation. Finally, we see how to detach a tensor from its computation history and how to tell PyTorch not to keep track of operations (useful at inference time!).
More advanced autograd functions are also explained, but we won't go through them during the workshop.
# execute only if you're using Google Colab
!wget -q https://raw.githubusercontent.com/ahug/amld-pytorch-workshop/master/binder/requirements.txt -O requirements.txt
!pip install -qr requirements.txt
import torch
torch.__version__
When you do operations on Tensors, PyTorch can keep track of the computation graph in order to be able to backpropagate.
To tell PyTorch to record operations performed on a tensor, each tensor has a function called requires_grad_.
If there’s at least one input to an operation that requires gradient, its output will also require gradient. Conversely, only if all inputs don’t require gradient, the output also won’t require it. Backward computation is never performed in the subgraphs, where all Tensors didn’t require gradients.
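The propagation rule above can be checked directly. This is a minimal sketch; the tensor names are illustrative and not part of the workshop material.

```python
import torch

# Illustrative tensors (hypothetical names)
a = torch.ones(2)                      # requires_grad defaults to False
b = torch.ones(2, requires_grad=True)  # this input requires gradient

no_grad_out = a + a   # no input requires gradient
grad_out = a + b      # at least one input requires gradient

print("no_grad_out.requires_grad :", no_grad_out.requires_grad)
print("grad_out.requires_grad :", grad_out.requires_grad)
```

Only the second result is attached to a computation graph, so only it can take part in a backward pass.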
In-place operations are not allowed on a leaf tensor that requires gradient, because they would overwrite values that autograd needs. That is why x.zero_() raises an error if x is such a tensor.
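A small sketch of this restriction, assuming x is a tensor created by the user with requires_grad=True. It also shows that the same call is accepted inside torch.no_grad(), where nothing is recorded.

```python
import torch

x = torch.ones(3, requires_grad=True)  # created by the user, requires gradient

try:
    x.zero_()  # in-place operation, rejected by autograd
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

# Inside torch.no_grad() the operation is not recorded, so it is allowed
with torch.no_grad():
    x.zero_()

print("x :", x)
```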
For a tensor x, the underlying data is stored in a tensor that is accessible via x.data. If you do an operation on x.data PyTorch does not add the operation to the computation graph.
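As a quick illustration (a sketch with made-up variable names), compare the same operation applied to x and to x.data. The PyTorch documentation generally recommends x.detach() over x.data for this purpose, since autograd can then still detect in-place modifications.

```python
import torch

x = torch.ones(3, requires_grad=True)

tracked = x * 2          # recorded in the computation graph
untracked = x.data * 2   # not recorded

print("tracked.grad_fn :", tracked.grad_fn)
print("untracked.grad_fn :", untracked.grad_fn)
print("untracked.requires_grad :", untracked.requires_grad)
```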
Each tensor has a property requires_grad specifying whether the gradient should be computed during the backward pass.
The function requires_grad_(bool) (notice the trailing _) is used to change this property.
A = torch.randint(10, (1,2), dtype=torch.float)
print("A : ", A)
print("A.requires_grad :", A.requires_grad)
A.requires_grad_(True)
print("A.requires_grad :", A.requires_grad)
A.requires_grad_(False)
print("A.requires_grad :", A.requires_grad)
Here we will see a simple example of how to compute the gradient of a function automatically with PyTorch. We will check that it corresponds to what we can compute manually.
Let's look at the function $f(x, y) = \sin\big( \langle x , y \rangle \big)$
X = torch.Tensor([1, 2, 3]).requires_grad_(True)
Y = torch.Tensor([5, 6, 7]).requires_grad_(True)
f = torch.sin(torch.dot(X,Y))
print("f =", f)
We simply need to call the backward function on $f$.
The backward function will automatically compute all the gradients of $f$ wrt. the inputs using the chain rule!
# Gradient is populated by the backward function
print("X.grad :", X.grad)
print("Y.grad :", Y.grad)
f.backward()
print("\n-- Backward --\n")
print("X.grad :", X.grad)
print("Y.grad :", Y.grad)
$f$ can be written as a composite function $f = h \circ g$
$h(z) = \sin(z)$ with derivative $\dfrac{d h}{d z}(z) = \cos(z)$
$g(x, y) = \langle x , y \rangle$ with partial derivatives $\dfrac{\partial g}{\partial x}(x, y) = y$ and $\dfrac{\partial g}{\partial y}(x, y) = x$
Using the chain rule, we can easily get the derivative of $f$ w.r.t. $x$ and $y$:
$\dfrac{d f }{d x} (x,y) = \cos\big( \langle x , y \rangle \big) \cdot y $
and
$\dfrac{d f }{d y} (x,y) = \cos\big( \langle x , y \rangle \big) \cdot x $
dfdx = torch.cos(torch.dot(X,Y)) * Y
print("df / dx = ", dfdx)
dfdy = torch.cos(torch.dot(X,Y)) * X
print("df / dy = ", dfdy)
Success !
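Instead of comparing the printed values by eye, the check can be automated. The following is a self-contained sketch that recomputes the same example and compares the autograd result with the manual formula using torch.allclose.

```python
import torch

X = torch.tensor([1., 2., 3.], requires_grad=True)
Y = torch.tensor([5., 6., 7.], requires_grad=True)

f = torch.sin(torch.dot(X, Y))
f.backward()  # populates X.grad and Y.grad

# Manual gradients from the chain rule; no graph is needed for this check
with torch.no_grad():
    scale = torch.cos(torch.dot(X, Y))
    manual_dx = scale * Y
    manual_dy = scale * X

x_matches = torch.allclose(X.grad, manual_dx)
y_matches = torch.allclose(Y.grad, manual_dy)
print("X.grad matches manual gradient :", x_matches)
print("Y.grad matches manual gradient :", y_matches)
```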
A variable that was created by the user, and is therefore not the result of any operation, is called a leaf variable.
All variables whose requires_grad property is False are also considered leaf variables.
A = torch.Tensor([[1, 2], [3, 4]]).requires_grad_()
B = torch.Tensor([[1, 2], [3, 4]]).requires_grad_() + 2 # B is the result of an operation (+)
C = 5 * A # C is the result of an operation (*)
print("A.is_leaf :", A.is_leaf)
print("B.is_leaf :", B.is_leaf)
print("C.is_leaf :", C.is_leaf)
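The second rule (a tensor with requires_grad set to False is a leaf even when it is the result of an operation) is not covered by the cell above. A minimal sketch, with hypothetical names P and Q:

```python
import torch

P = torch.ones(2, 2)  # requires_grad is False by default
Q = P * 3 + 1         # result of operations, but no gradient is tracked

print("Q.requires_grad :", Q.requires_grad)
print("Q.grad_fn :", Q.grad_fn)
print("Q.is_leaf :", Q.is_leaf)
```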
A variable can have a long computation history, but you may want to consider it as a new leaf variable without history.
For that, you can use the detach function, which detaches the tensor from its history.
A = torch.rand(1, 2, requires_grad=True)  # requires_grad=True so that B gets a computation history
B = A.mean()
print("B : ", B)
print("B.grad_fn :", B.grad_fn)
print("B.is_leaf :", B.is_leaf)
B.detach_()
print("\n -- B.detach_() -- \n")
print("B : ", B)
print("B.grad_fn :", B.grad_fn)
print("B.is_leaf :", B.is_leaf)
# This won't work since B has no history.
# B.backward()
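detach_() modifies the tensor in place. There is also an out-of-place variant, detach(), which returns a new tensor without history and leaves the original attached to the graph. A short sketch of the difference (the name D is illustrative):

```python
import torch

A = torch.rand(1, 2, requires_grad=True)
B = A.mean()
D = B.detach()  # new tensor without history; B itself is unchanged

print("B.grad_fn :", B.grad_fn)
print("D.grad_fn :", D.grad_fn)
print("D.is_leaf :", D.is_leaf)

# B still has its history, so backward works on it
B.backward()
print("A.grad :", A.grad)
```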
At inference time, you don't want PyTorch to build a computation graph.
This can be achieved by wrapping your inference code in the torch.no_grad() context manager.
x = torch.randn(3, requires_grad=True)
print("x.requires_grad : ", x.requires_grad)
y = (x ** 2)
print("y.requires_grad : ", y.requires_grad)
with torch.no_grad():
    y = (x ** 2)  # computed inside the no_grad block: no graph is built
print("y.requires_grad : ", y.requires_grad)
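torch.no_grad() can also be used as a function decorator, which is convenient when a whole function is only ever used for inference. The function predict below is a hypothetical example, not part of the workshop code.

```python
import torch

@torch.no_grad()
def predict(weights, inputs):
    # Hypothetical inference function: nothing computed here is tracked
    return inputs @ weights

w = torch.randn(3, requires_grad=True)
inp = torch.randn(4, 3)

out = predict(w, inp)
print("out.requires_grad :", out.requires_grad)

# Gradient tracking is only disabled inside the decorated function
print("torch.is_grad_enabled() :", torch.is_grad_enabled())
```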
In older versions of PyTorch, one had to wrap a Tensor into an autograd object called Variable.
Variable was a thin wrapper around a Tensor object that also held the gradient w.r.t. it and a reference to the function that created it. This reference allowed retracing the whole chain of operations that created the data.
Since PyTorch 0.4, Tensor and Variable are merged, so we don't need to worry about this anymore, but you may still encounter Variable in some "old" code.
from torch.autograd import Variable
x = Variable(torch.randn(5, 5))
x
The following concepts are more advanced and you may want to skip them for now.
We won't go through them, but they are here for you to come back to later when you feel more comfortable with PyTorch.
You can also check the PyTorch documentation.
When doing the backward pass, Autograd computes the gradient of the output with respect to every intermediate variable. However, by default, only the gradients of variables that were created by the user (leaf variables) and whose requires_grad property is True are saved.
Indeed, most of the time when training a model you only need the gradient of a loss w.r.t. your model parameters.
A = torch.Tensor([[1, 2], [3, 4]])
A.requires_grad_()
B = 5 * (A + 3)
C = B.mean()
print("A.grad :", A.grad)
print("B.grad :", B.grad)
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
print("B.grad :", B.grad)
A = torch.Tensor([[1, 2], [3, 4]])
A.requires_grad_()
B = 5 * (A + 3)
B.retain_grad() # <----- This line lets us access the gradient wrt. B after the backward pass
C = B.mean()
print("A.grad :", A.grad)
print("B.grad :", B.grad)
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
print("B.grad :", B.grad)
You can backward a first time and get a gradient for A, then do some other computation using A and then backward again.
Gradients will get accumulated in A.
A = torch.Tensor([[1, 2], [3, 4]]).requires_grad_()
print("A.grad :", A.grad)
B = 5 * (A + 3)
C = B.mean()
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
B = 5 * (A + 3)
C = B.mean()
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
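Because gradients accumulate, they usually have to be reset between independent backward passes. A minimal sketch of this, reusing the same computation (in a training loop, an optimizer's zero_grad() plays this role):

```python
import torch

A = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)

# Two backward passes without reset: the gradients add up
for _ in range(2):
    C = (5 * (A + 3)).mean()
    C.backward()
accumulated = A.grad.clone()

# Reset the stored gradient in place, then backward once more
A.grad.zero_()
C = (5 * (A + 3)).mean()
C.backward()
fresh = A.grad.clone()

print("accumulated :", accumulated)
print("fresh :", fresh)
```

Each element of A contributes 5/4 to the mean, so a single pass gives 1.25 everywhere and two accumulated passes give 2.5.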
This part is to give a glimpse of how it works under the hood. We don't need to do such inspection in practice.
Here, we have a look at the computation graph that autograd builds on the fly.
A = torch.Tensor([[1, 2], [3, 4]])
A.requires_grad_()
B = 5 * (A + A)
C = B.mean()
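Each tensor that is the result of an operation has a gradient function, stored in grad_fn. For a leaf tensor such as A, grad_fn is None.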
print(A.grad_fn)
print(B.grad_fn)
print(C.grad_fn)
We can also "walk" on the computation graph by using the next_functions attribute.
grad_fn = C.grad_fn
print(grad_fn)
grad_fn = grad_fn.next_functions
print(grad_fn)
grad_fn = grad_fn[0][0].next_functions
print(grad_fn)
grad_fn = grad_fn[0][0].next_functions
print(grad_fn)
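The manual walk above can be generalized into a small recursive helper. This is only a sketch for inspection purposes: the function walk is hypothetical, and the exact node names it prints (for example a numeric suffix such as 0) may vary between PyTorch versions.

```python
import torch

def walk(fn, depth=0, names=None):
    """Print the backward graph reachable from fn and collect node names."""
    if names is None:
        names = []
    if fn is None:  # inputs that do not require gradient have no node
        return names
    name = type(fn).__name__
    names.append(name)
    print("  " * depth + name)
    # next_functions is a tuple of (node, input_index) pairs
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, names)
    return names

A = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
C = (5 * (A + A)).mean()

names = walk(C.grad_fn)
```

The walk ends at AccumulateGrad nodes, which are where gradients are written into the grad attribute of leaf tensors such as A.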