#!/usr/bin/env python
# coding: utf-8

# <!--NAVIGATION-->
# # < [Basics](1-Basics.ipynb) | Autograd | [Optimization](3-Optimization.ipynb) >

# ## Notebook Introduction

# In this notebook, we explain how the automatic differentiation module of PyTorch works.
# This module is named **Autograd**.
# 
# We will first show how to tell PyTorch to track gradients for a specific tensor and how to check the value of the gradient. Then we will use the **backward** function to compute the gradients. Finally, we will see how to detach a tensor from its computation history and how to tell PyTorch not to keep track of the operations (useful at inference time!).
# 
# More advanced autograd functions are also explained, but we won't go through them during the workshop.

# ___

# ## Google Colab only!

# In[ ]:


# execute only if you're using Google Colab
get_ipython().system('wget -q https://raw.githubusercontent.com/ahug/amld-pytorch-workshop/master/binder/requirements.txt -O requirements.txt')
get_ipython().system('pip install -qr requirements.txt')


# ___

# In[ ]:


import torch
torch.__version__


# ### How does Autograd work?

# When you do operations on tensors, PyTorch can keep track of the computation graph in order to be able to backpropagate.
# To tell PyTorch to record the operations performed on a tensor, call the tensor's **`requires_grad_`** function.
# 
# If at least one input to an operation requires gradient, its output also requires gradient. Conversely, the output does not require gradient only if none of the inputs do. The backward computation is never performed in subgraphs where no tensor requires gradient.
# 
# In-place operations on a leaf tensor that requires gradient are not allowed. That is why `x.zero_()` raises an error if `x` is such a tensor.
# 
# For a tensor `x`, the underlying data is stored in a tensor that is accessible via **`x.data`**. If you do an operation on `x.data`, PyTorch does not add the operation to the computation graph.
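
# A minimal sketch of the rules above, using only the public `torch` API. The tensor names (`tracked`, `plain`, `leaf`) are illustrative and not part of the workshop code.

# In[ ]:


import torch

tracked = torch.ones(2, requires_grad=True)  # input that requires gradient
plain = torch.ones(2)                        # input that does not

mixed = tracked + plain    # at least one input requires gradient
untracked = plain + plain  # no input requires gradient
print("mixed.requires_grad     :", mixed.requires_grad)
print("untracked.requires_grad :", untracked.requires_grad)

# An in-place operation on a leaf tensor that requires gradient is rejected
leaf = torch.ones(2, requires_grad=True)
try:
    leaf.zero_()
    inplace_failed = False
except RuntimeError as err:
    inplace_failed = True
    print("RuntimeError :", err)

# Going through .data bypasses the graph: the operation is not recorded
leaf.data.zero_()
print("leaf after leaf.data.zero_() :", leaf)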

# ### Function requires_grad

# Each tensor has a property **`requires_grad`** specifying whether its gradient should be computed during the backward pass.
# 
# The function **`requires_grad_(bool)`** (notice the trailing **\_** ) is used to change this property. 

# In[ ]:


A = torch.randint(10, (1,2), dtype=torch.float)
print("A : ", A)

print("A.requires_grad :", A.requires_grad)

A.requires_grad_(True)
print("A.requires_grad :", A.requires_grad)

A.requires_grad_(False)
print("A.requires_grad :", A.requires_grad)
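

# As a complement, here is a small sketch showing that the property can also be set when the tensor is created, and why the cell above passes `dtype=torch.float`: integer tensors cannot require gradient. The names `W_demo` and `int_tensor` are illustrative only.

# In[ ]:


import torch

# requires_grad can be passed directly to most factory functions
W_demo = torch.zeros(2, 3, requires_grad=True)
print("W_demo.requires_grad :", W_demo.requires_grad)

# Without an explicit dtype, randint returns an integer tensor,
# and integer tensors are rejected by requires_grad_
int_tensor = torch.randint(10, (1, 2))
try:
    int_tensor.requires_grad_(True)
    int_rejected = False
except RuntimeError as err:
    int_rejected = True
    print("RuntimeError :", err)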


# ###  Backward function

# Here we will see a simple example of how to compute the gradient of a function automatically with PyTorch.
# We will check that it corresponds to what we can compute manually.

# Let's look at the function $f(x, y) = \sin\big( \langle x , y \rangle \big)$

# In[ ]:


X = torch.Tensor([1, 2, 3]).requires_grad_(True)
Y = torch.Tensor([5, 6, 7]).requires_grad_(True)

f = torch.sin(torch.dot(X,Y))
print("f =", f)


# We simply need to call the __backward__ function on $f$.
# 
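# The __backward__ function automatically computes the gradients of $f$ w.r.t. every input tensor that requires gradient (here `X` and `Y`) using the chain rule, and stores them in the `grad` attribute of these tensors.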

# In[ ]:


# Gradient is populated by the backward function

print("X.grad :", X.grad)
print("Y.grad :", Y.grad)
f.backward()
print("\n-- Backward --\n")
print("X.grad :", X.grad)
print("Y.grad :", Y.grad)


# #### Now let's compute it manually!

# 
# - $f$ can be written as a composite function $f = h \circ g$
# 
#   $h(z) = \sin(z)$ with derivative $\dfrac{d h}{d z}(z) = \cos(z)$
# 
#   $g(x, y) =  \langle x , y \rangle$ with partial derivatives $\dfrac{\partial g}{\partial x}(x, y) = y$ and $\dfrac{\partial g}{\partial y}(x, y) = x$

# -  Using the chain rule, we can easily get the partial derivatives of $f$ w.r.t. $x$ and $y$:
# 
#   $\dfrac{\partial f}{\partial x}(x, y) = \cos\big( \langle x , y \rangle \big) \cdot y$
# 
#   and
# 
#   $\dfrac{\partial f}{\partial y}(x, y) = \cos\big( \langle x , y \rangle \big) \cdot x$
# 

# In[ ]:


dfdx = torch.cos(torch.dot(X,Y)) * Y
print("df / dx = ", dfdx)


# In[ ]:


dfdy = torch.cos(torch.dot(X,Y)) * X
print("df / dy = ", dfdy)


# Success! The manually computed gradients match `X.grad` and `Y.grad`.
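
# Rather than comparing the printed values by eye, the check can be automated. The sketch below rebuilds the same function on fresh tensors (named `U` and `V` here to leave `X` and `Y` untouched) and compares the Autograd result with the manual formulas using `torch.allclose`.

# In[ ]:


import torch

U = torch.tensor([1., 2., 3.], requires_grad=True)
V = torch.tensor([5., 6., 7.], requires_grad=True)

# f(u, v) = sin(<u, v>), differentiated by Autograd
torch.sin(torch.dot(U, V)).backward()

# Manual formulas from the chain rule; no need to track these operations
with torch.no_grad():
    inner = torch.dot(U, V)
    manual_dfdu = torch.cos(inner) * V
    manual_dfdv = torch.cos(inner) * U

u_matches = torch.allclose(U.grad, manual_dfdu)
v_matches = torch.allclose(V.grad, manual_dfdv)
print("U.grad matches the manual gradient :", u_matches)
print("V.grad matches the manual gradient :", v_matches)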

# ### Leaf Variable 

# A tensor that __was created by the user__, and is therefore not the result of _any_ operation, is called a **leaf variable**.  
# By convention, all tensors whose __`requires_grad` property is False__ are also considered **leaf variables**.

# In[ ]:


A = torch.Tensor([[1, 2], [3, 4]]).requires_grad_()
B = torch.Tensor([[1, 2], [3, 4]]).requires_grad_() + 2  # B is the result of an operation (+)
C = 5 * A  # C is the result of an operation (*)
print("A.is_leaf :", A.is_leaf)
print("B.is_leaf :", B.is_leaf)
print("C.is_leaf :", C.is_leaf)
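

# The second rule can look surprising, so here is a short sketch of it: the result of an operation is still a leaf when nothing in the operation requires gradient. The names `plain_in`, `plain_out`, `tracked_in` and `tracked_out` are illustrative only.

# In[ ]:


import torch

plain_in = torch.ones(2, 2)   # requires_grad is False
plain_out = plain_in * 3      # result of an operation, but nothing is tracked

tracked_in = torch.ones(2, 2, requires_grad=True)
tracked_out = tracked_in * 3  # result of a tracked operation

print("plain_out.is_leaf   :", plain_out.is_leaf, "| grad_fn :", plain_out.grad_fn)
print("tracked_out.is_leaf :", tracked_out.is_leaf, "| grad_fn :", tracked_out.grad_fn)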


# ### Detach function

# A variable can have a long computation history, but you may want to consider it as a __new leaf variable__ without history.
# 
# For that, you can use the `detach` function, which returns a new tensor detached from its history, or its in-place variant `detach_`, which detaches the tensor itself.

# In[ ]:


A = torch.rand(1, 2, requires_grad=True)  # A must require gradient, otherwise B has no history to detach
B = A.mean()

print("B : ", B)
print("B.grad_fn :", B.grad_fn)
print("B.is_leaf :", B.is_leaf)

B.detach_()
print("\n -- B.detach_() -- \n")

print("B : ", B)
print("B.grad_fn :", B.grad_fn)
print("B.is_leaf :", B.is_leaf)


# In[ ]:


# This won't work since B has no history.
# B.backward()
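

# For comparison, here is a sketch of the out-of-place `detach()`, which leaves the original tensor and its history intact and returns a new tensor that shares the same memory. The names `src`, `out` and `out_detached` are illustrative only.

# In[ ]:


import torch

src = torch.rand(1, 2, requires_grad=True)
out = src * 2

out_detached = out.detach()  # new tensor without history; `out` is unchanged

print("out.grad_fn                :", out.grad_fn)
print("out_detached.grad_fn       :", out_detached.grad_fn)
print("out_detached.requires_grad :", out_detached.requires_grad)

# Both tensors point to the same underlying storage
shares_memory = out_detached.data_ptr() == out.data_ptr()
print("same memory :", shares_memory)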


# ### No_grad function

# At inference time, you don't need gradients, so you don't want PyTorch to build a computation graph.
# This can be achieved by wrapping your inference code in the __`with torch.no_grad()`__ context manager.

# In[ ]:


x = torch.randn(3, requires_grad=True)
print("x.requires_grad : ", x.requires_grad)

y = (x ** 2)
print("y.requires_grad : ", y.requires_grad)

with torch.no_grad():
    y = (x ** 2)
    print("y.requires_grad : ", y.requires_grad)
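

# `torch.no_grad()` can also be used as a decorator, which is convenient for a whole inference function. The sketch below assumes a hypothetical parameter `weights` and a hypothetical helper `predict`; neither is part of the workshop code.

# In[ ]:


import torch

weights = torch.randn(3, requires_grad=True)  # hypothetical model parameter


@torch.no_grad()
def predict(inputs):
    # Hypothetical inference helper: nothing computed here is recorded
    return (inputs * weights).sum()


prediction = predict(torch.ones(3))
print("prediction.requires_grad :", prediction.requires_grad)
print("prediction.grad_fn       :", prediction.grad_fn)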


# ## Note: Autograd in previous PyTorch versions

# In older versions of PyTorch (before 0.4), one had to wrap a `Tensor` into an Autograd object called `Variable`.
# 
# `Variable` was a thin wrapper around a `Tensor` object that also held the gradient w.r.t. it and a reference to the function that created it. This reference allowed retracing the whole chain of operations that created the data.
# 
# **Now, `Tensor` and `Variable` are merged and we don't need to worry about this anymore**, but you may still encounter `Variable` in some "old" code.

# In[ ]:


from torch.autograd import Variable

# Deprecated API: wrapping a tensor in Variable now simply returns a regular Tensor
x = Variable(torch.randn(5, 5))
x


# ## Advanced concepts of Autograd 

# The following concepts are more advanced and you may want to skip them for now.  
# We won't go through them, but they are here for you to come back to later when you feel more comfortable with PyTorch.  
# You can also check the [PyTorch documentation](https://pytorch.org/docs/stable/autograd.html).

# ### Retain Grad

# When doing the backward pass, Autograd computes the gradient of the output with respect to every intermediate variable. However, by default, only the gradients of variables that were **created by the user** (leaf) and have the __`requires_grad` property set to True__ are saved.
# 
# Indeed, most of the time when training a model, you only need the gradient of the loss w.r.t. your model parameters.

# In[ ]:


A = torch.Tensor([[1, 2], [3, 4]])
A.requires_grad_()

B = 5 * (A + 3)
C = B.mean()

print("A.grad :", A.grad)
print("B.grad :", B.grad)
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
print("B.grad :", B.grad)


# In[ ]:


A = torch.Tensor([[1, 2], [3, 4]])
A.requires_grad_()

B = 5 * (A + 3)
B.retain_grad()  # <----- This line lets us access the gradient w.r.t. B after the backward pass
C = B.mean()


print("A.grad :", A.grad)
print("B.grad :", B.grad)
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
print("B.grad :", B.grad)
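

# The printed values can be checked by hand. Since $C = \frac{1}{4} \sum_{i,j} B_{ij}$, every entry of the gradient w.r.t. $B$ is $\frac{1}{4}$, and since $B = 5 (A + 3)$, every entry of the gradient w.r.t. $A$ is $\frac{5}{4}$. The sketch below verifies this on fresh tensors (named `M`, `N` and `out_mean` here to leave `A`, `B` and `C` untouched).

# In[ ]:


import torch

M = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
N = 5 * (M + 3)
N.retain_grad()  # keep the gradient of the intermediate tensor
out_mean = N.mean()
out_mean.backward()

# Expected gradients derived by hand
n_grad_ok = torch.allclose(N.grad, torch.full((2, 2), 0.25))
m_grad_ok = torch.allclose(M.grad, torch.full((2, 2), 1.25))
print("N.grad is 1/4 everywhere :", n_grad_ok)
print("M.grad is 5/4 everywhere :", m_grad_ok)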


# ### Gradient accumulation

# You can call backward a first time and get a gradient for A, then do some other computation using A and call backward again.  
# The gradients are accumulated (summed) in `A.grad`.

# In[ ]:


A = torch.Tensor([[1, 2], [3, 4]]).requires_grad_()

print("A.grad :", A.grad)

B = 5 * (A + 3)
C = B.mean()
C.backward()

print("\n-- Backward --\n")
print("A.grad :", A.grad)

B = 5 * (A + 3)
C = B.mean()
C.backward()

print("\n-- Backward --\n")
print("A.grad :", A.grad)
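

# When accumulation is not wanted, the stored gradient has to be reset between two backward passes. The sketch below shows one way to do it, zeroing the `grad` tensor in place; the names `T`, `accumulated` and `fresh` are illustrative only.

# In[ ]:


import torch

T = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)

# Two backward passes without a reset: the gradients add up
(5 * (T + 3)).mean().backward()
(5 * (T + 3)).mean().backward()
accumulated = T.grad.clone()

# Reset the stored gradient, then run one more backward pass
T.grad.zero_()
(5 * (T + 3)).mean().backward()
fresh = T.grad.clone()

print("after two passes       :", accumulated)
print("after reset + one pass :", fresh)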


# ### Under the hood...

# This part is to give a glimpse of how it works under the hood. We don't need to do such inspection in practice.  
# Here, we have a look at the computation graph that autograd builds on the fly.

# In[ ]:


A = torch.Tensor([[1, 2], [3, 4]])
A.requires_grad_()

B = 5 * (A + A)
C = B.mean()


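# Each tensor that results from a tracked operation has a gradient function, stored in `grad_fn`. Leaf tensors such as `A` have `grad_fn` set to `None`.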

# In[ ]:


print(A.grad_fn)
print(B.grad_fn)
print(C.grad_fn)


# We can also "walk" through the computation graph by following the `next_functions` attribute.

# In[ ]:


grad_fn = C.grad_fn
print(grad_fn)

grad_fn = grad_fn.next_functions
print(grad_fn)

grad_fn = grad_fn[0][0].next_functions
print(grad_fn)

grad_fn = grad_fn[0][0].next_functions
print(grad_fn)
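

# The manual steps above can be generalised with a small recursive helper. This is only a sketch: `collect_nodes` is a hypothetical function written for this notebook, not part of PyTorch, and it relies on `next_functions` being a tuple of `(function, index)` pairs in which the function may be `None`.

# In[ ]:


import torch


def collect_nodes(fn, depth=0, nodes=None):
    # Hypothetical helper: gather (depth, node name) pairs by walking the graph
    if nodes is None:
        nodes = []
    if fn is None:  # inputs that do not require gradient have no node
        return nodes
    nodes.append((depth, type(fn).__name__))
    for next_fn, _ in fn.next_functions:
        collect_nodes(next_fn, depth + 1, nodes)
    return nodes


G = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
graph_nodes = collect_nodes((5 * (G + G)).mean().grad_fn)

for depth, name in graph_nodes:
    print("  " * depth + name)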


# ___

# ## Don't forget to download the notebook, otherwise your changes may be lost

# ![Download the notebook](figures/notebook-download.png)
