A tutorial

Antoine Prouvost

30th of March 2020

PyTorch Version:


PyTorch is a deep learning framework for modeling and training neural networks, with optional GPU acceleration. It is a very flexible framework with a Python-first interface that is executed dynamically (operations are performed when the Python instruction is executed), as opposed to compiled systems such as TensorFlow.

The outline of this tutorial is going to be as follows:


You need to know some basics about Python, numpy, machine learning, and deep learning. For a comprehensive introduction to deep learning, I recommend this series of videos, this one, and fast.ai, but of course there are many other resources online.

  • You can find the PyTorch documentation here
  • Some tutorials here;
  • And the discussions here.

Notebook configuration

This notebook is hosted on GitHub under the MIT license. You can get a nicer display using Jupyter nbviewer.


You can run this notebook locally using Python >= 3.6 and an installation of PyTorch. You can use PyTorch without a GPU (all the functionalities are supported on CPU); however, if you want GPU acceleration for your computations, you need an Nvidia GPU, CUDA, and CUDNN. Installation details are not provided in this tutorial.

On Google Colaboratory

We'll run the notebook on Google Colaboratory. This is Google Drive's Jupyter notebook tool. It runs automatically on Google servers and we can also access a GPU, all with very little configuration. You can save files and install anything on the virtual machine, but the machine will be killed after some time. Be sure to download the files that you want to keep (there are none in this tutorial).

Open this notebook in Colaboratory.

After opening, go to Runtime > Change runtime type and select Python 3 and GPU.

All the dependencies should already be installed in Google Colaboratory. If this is not the case, run the installation cell below.


Install Numpy, PyTorch, TorchVision, and Tensorboard. If you already have them installed, pip will not upgrade anything.

In [ ]:
!pip install numpy torch torchvision tensorboard

Notebook reminders

Commands that start with % (or %% for entire cells) are called magic commands. They are Jupyter notebook extensions. For instance, to time the computation of the sum of the first 100 square numbers, you can use %timeit (or %%timeit).

In [1]:
%%timeit
sum(i**2 for i in range(100))
19 µs ± 271 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Commands that start with ! are shell commands. They are run in a subshell where the Jupyter notebook is running. For instance, to see the files in the current folder, you can run the ls command.

In [2]:
!ls
data  img  runs  tutorial.ipynb

You can query the docstring (help) on anything (module, function, class, method, magic, ...) using ?. For instance, for the Python built-in min function, you can execute min?.


You can get completion by hitting the [TAB] key. For instance, if you want to know what methods exist on a string (or if you do not remember the exact name of one method), you can write

"Hello world!".

and press [TAB].

Notebooks keep a state with all the objects, classes, and functions that you have defined upon executing a given cell. Unless overwritten, they remain even after you edit/delete the cell, which can easily be confusing. More than ever, it is important to reduce the number of global variables (for instance by turning cells into functions).

A good rule of thumb for organizing your code is that you should be confident about restarting the notebook at any time.

1. Link to numpy

Compiled execution graphs such as TensorFlow's enable some optimization of the computations, but PyTorch remains highly efficient, with underlying operations being done in low-level languages (C, Cuda). In practice, being dynamic means that PyTorch behaves like numpy, doing the actual computation when the instruction is executed. On the contrary, static computation graphs use Python to build a description of what to do, and only perform the computation later on. As with numpy, we take advantage of vectorized operations by making them over an ndarray. In PyTorch, we call them Tensors but they behave similarly.

In [3]:
import numpy as np
import torch

Checking the PyTorch version:

In [4]:
torch.__version__
In [5]:
X = torch.tensor([[1, 2], [3, 4]])
In [6]:
# @ is matrix multiplication
X @ X + X
tensor([[ 8, 12],
        [18, 26]])

Tensors convert to numpy arrays naturally, so they can be used in numpy. We can also explicitly convert from and to numpy:

In [7]:
X_np = X.numpy()
X_np
array([[1, 2],
       [3, 4]])
In [8]:
Y_np = np.arange(5)
Y = torch.from_numpy(Y_np)

Beware that when using either of these two functions, X and X_np (and likewise Y and Y_np) share the same underlying memory. An example of what this means:

In [9]:
Y[3] = -1
In [10]:
Y_np
array([ 0,  1,  2, -1,  4])

This is to make efficient use of both frameworks together without having to copy the data every time. To create a new object with a copy of the memory, simply use np.array and torch.tensor:

In [11]:
torch.tensor(Y_np)
tensor([ 0,  1,  2, -1,  4])
In [12]:
np.array(Y)
array([ 0,  1,  2, -1,  4])
Conversion between PyTorch and numpy:
  • Sharing the same memory: use .numpy() and torch.from_numpy;
  • With a memory copy: use np.array and torch.tensor.

Tensors can be converted to different data types:

In [13]:
Y.float()
tensor([ 0.,  1.,  2., -1.,  4.])

This returns a new tensor with 32 bit floating point data.

The most used are .float(), .int(), .long(), and .byte(). One can also use the more general .to(dtype=torch.int64) with any given type from PyTorch. We can query the type of a Tensor using Tensor.dtype.
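For instance, a quick sketch of these conversions on a fresh integer tensor:

```python
import torch

# An integer tensor (torch infers int64 from Python ints)
Z = torch.tensor([0, 1, 2, -1, 4])
print(Z.dtype)            # torch.int64

# Shorthand conversions return new tensors
print(Z.float().dtype)    # torch.float32
print(Z.byte().dtype)     # torch.uint8

# The general form works with any PyTorch dtype
print(Z.to(dtype=torch.float64).dtype)  # torch.float64
```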

Overall, PyTorch is less rich than numpy in the collection of functions it implements. To find the names of the functions implemented, use completion, read the documentation, or search other resources online.

2. GPU acceleration

GPUs are processing units with many cores able to do small simple operations in parallel, together with very fast access to the (GPU) memory. We can use them to parallelize our Tensor operations, such as linear algebra. Neural networks make heavy use of tensor operations and get a nice speed-up with GPUs. Most of the PyTorch functions can be executed on GPU.

Initially, GPUs were only intended for graphical processing. With deep learning, GPUs for general computing are on the rise, with Nvidia dominating the market. This is because of its proprietary CUDA framework, which lets one program its GPUs in an efficient fashion. Nowadays all deep learning frameworks run on CUDA. You need to get an Nvidia GPU and install CUDA (+ CUDNN) if you want to get your own hardware.

Hopefully we'll eventually get frameworks that are hardware independent, maybe using OpenCL as is done in PlaidML.

First, let's see if we have GPUs available, and how many.

We use torch.cuda.is_available() to know if there are CUDA-compatible GPUs and torch.cuda.device_count() to know how many of them there are.

In [15]:
torch.cuda.is_available()
In [16]:
torch.cuda.device_count()

To do computation on a GPU, we need to have a tensor in the memory of that GPU, which is not the same across different GPUs, and not the same as the CPU RAM. Once this is done, computation will happen naturally on the associated device.

To get a copy of a Tensor on a GPU, we'll just use Tensor.to(device) where device is:

  • An int for the index of the GPU;
  • Equivalently, "cuda:0" for the GPU with index 0;
  • "cpu" for the RAM / CPU computing;
  • A torch.device object that is just a wrapper around the above.

Alternatively, Tensor.to() can also be used:

  • With a torch.dtype to return a Tensor of another type;
  • With a Tensor to get a copy of the original Tensor on the same device and with the same type as the Tensor passed as argument.

We use Tensor.device to query the device of a Tensor.

In [17]:
X = torch.tensor([[1., 2.], [3., 4.]])
In [18]:
X_cuda0 = X.to("cuda:0")
In [19]:
X.device
In [20]:
X_cuda0.device
device(type='cuda', index=0)
In [21]:
# Computation done on GPU, result tensor is also on GPU
X_2 = X_cuda0 @ X_cuda0
In [22]:
tensor([[ 7., 10.],
        [15., 22.]], device='cuda:0')

We cannot do cross-device operations:

In [23]:
try:
    X + X_cuda0
except RuntimeError as e:
    print(e)
expected device cpu but got device cuda:0

Operations done on a GPU are asynchronous: they are enqueued to be performed in order. This allows us to fully leverage multiple GPUs and the CPU at the same time. This is done automatically for the user, and synchronization operations are performed by PyTorch when required, e.g. when copying a Tensor from one device to another (including printing). In practice, this means that if you put a CPU operation that waits on the GPU in the middle of your network, you will not fully utilize your CPU and GPU.

There are other ways to manage the device on which computation is done, including Tensor.cuda(device) and Tensor.cpu(), but Tensor.to(device) is the most generic and lets us write device-agnostic code by simply changing the value of the device variable.

It is also possible to create a Tensor directly on a specific device, but this is limited to zeros, ones, and random initializations. This is for more advanced cases that we won't need in this tutorial.
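As a small sketch of both ideas, device-agnostic code and direct on-device creation (falling back to the CPU when no GPU is available, so the cell runs anywhere):

```python
import torch

# Pick a device once, then write device-agnostic code
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Factory functions accept a `device` argument directly
zeros = torch.zeros(2, 3, device=device)
noise = torch.randn(2, 3, device=device)

print(zeros.device, noise.shape)
```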

More information can be found in the documentation.

[Advanced] Sharing GPUs

Using PyTorch you cannot share a GPU with other processes. If you share a machine with multiple GPUs with other users without a job scheduler, you might end up getting conflicts.

Your PyTorch code should always assume contiguous GPU indexes starting from 0. Then, when running your job, GPU 0 may not be available (run nvidia-smi to see GPU availability). You would then run your code with the environment variable CUDA_VISIBLE_DEVICES set to the GPUs you want to use, for instance CUDA_VISIBLE_DEVICES=2 to use only GPU 2 and CUDA_VISIBLE_DEVICES=1,3 to use GPUs 1 and 3. In PyTorch you will see them as 0, 1, ... This also adjusts the result of torch.cuda.device_count(), etc.

3. Automatic differentiation

PyTorch is able to numerically compute gradients using reverse mode automatic differentiation, aka backprop' for the cool kids and chain rule for the mathematicians.

$$h: x \mapsto g(f(x))$$

$$h': x \mapsto g'(f(x)) \times f'(x)$$

Or with $y = f(x)$ and $z = g(f(x))$:

$$ \frac{\mathrm{d}z}{\mathrm{d}x} = \frac{\mathrm{d}z}{\mathrm{d}y} \frac{\mathrm{d}y}{\mathrm{d}x} $$

In multiple (input) dimensions, this is: $$ \frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i} $$

To do this automatically, PyTorch keeps track of the computations performed using a computation graph (built dynamically). When computing gradients, PyTorch successively applies the chain rule to every edge, starting from the output. Here is an example of a computation graph for a two-layer perceptron from Deep Learning [Goodfellow et al. 2016].

Computing derivatives in reverse mode requires PyTorch to remember the Jacobians, which is memory intensive.

PyTorch computes the derivatives only where required, which is set to nowhere by default.

In [24]:
import torch
import torch.autograd as autograd
In [25]:
X = torch.tensor([1, 2, 3], dtype=torch.float32)
l2_norm = X.norm()
In [26]:
try:
    # Gradient of the l2_norm of X, with respect to X
    autograd.grad(l2_norm, X)
except RuntimeError as e:
    print(e)
element 0 of tensors does not require grad and does not have a grad_fn

Indeed, we did not specify that X was to require a gradient:

In [27]:
X.requires_grad
False

To specify that X requires a gradient, we use the method .requires_grad_. This lets PyTorch know that we want to compute gradients with respect to this variable and that it should do what is necessary in order to do so.

In [28]:
X = X.requires_grad_()
l2_norm = X.norm()
In [29]:
# Gradient of the l2_norm of X, with respect to X
autograd.grad(l2_norm, X)
(tensor([0.2673, 0.5345, 0.8018]),)
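As a sanity check, we can derive this gradient by hand. For the Euclidean norm,

$$ \nabla_x \|x\|_2 = \frac{x}{\|x\|_2} $$

so for $x = (1, 2, 3)$ we get $x / \sqrt{14} \approx (0.2673, 0.5345, 0.8018)$, which matches the tensor returned by autograd.grad.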

Note that all Tensors that depend on X also need to receive a gradient for the backpropagation to work, so PyTorch will set them automatically:

In [30]:
l2_norm.requires_grad
True

Also note that after backpropagating, PyTorch frees the computation graph, so we cannot reuse it as is.

In [31]:
try:
    # Try to backpropagate through the graph a second time
    autograd.grad(l2_norm, X)
except RuntimeError as e:
    print(e)
Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
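If we do need several backward passes through the same graph, we can keep the buffers alive with retain_graph=True; a minimal sketch:

```python
import torch
import torch.autograd as autograd

X = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)
l2_norm = X.norm()

# The first call keeps the graph alive for later use
(g1,) = autograd.grad(l2_norm, X, retain_graph=True)
# The second call now succeeds on the same graph
(g2,) = autograd.grad(l2_norm, X)

print(torch.allclose(g1, g2))  # True
```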

The gradients are often computed in an object-oriented fashion using .backward(). This computes the gradients of everything in the computation graph that has requires_grad=True. The gradients are stored in the .grad attribute of the Tensors. That way, they can be accessed for the gradient descent without having to specify all the gradients one by one.

In [32]:
X = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)
l2_norm = X.norm()
In [33]:
l2_norm.backward()
X.grad
tensor([0.2673, 0.5345, 0.8018])

To reuse a piece of code without storing unnecessary Jacobians, we can use the following context manager (the piece of code that starts with the with keyword). This is convenient to use a neural network without further training. The context manager lets us reuse the exact same code during training and inference.

In [34]:
X = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)

with torch.no_grad():
    l2_norm = X.norm()

try:
    autograd.grad(l2_norm, X)
except RuntimeError as e:
    print(e)
element 0 of tensors does not require grad and does not have a grad_fn

We can also use detach() to disconnect the computation graph

In [35]:
X = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)

# Gradient of square norm of X
autograd.grad(X.norm() * X.norm(), X)
(tensor([2., 4., 6.]),)
In [36]:
X = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)

# Gradient of norm of X, times norm of X (first factor detached)
autograd.grad(X.norm().detach() * X.norm(), X)
(tensor([1., 2., 3.]),)

Overall, the default PyTorch mechanisms are defined according to how gradients are used in neural networks. If you want to know more about automatic differentiation in PyTorch, you can look at this tutorial.

4. API for neural networks

Recall that if $f_\theta$ is a neural network parametrized by $\theta$ (the weights), we optimize over $\theta$ so that the network behaves properly on the input points. More precisely, we have a per-example loss function $\ell$ (e.g. categorical cross entropy for classification or mean squared error for regression) and want to minimize the generalization error:

$$ \min_{\theta} \mathbb{E}_{x, y} \ell(f_{\theta}(x), y) $$

where $x$ and $y$ follow the unknown true distribution of the data. To make it tractable, we approximate this loss by the empirical training error (on the training data):

$$ \min_\theta \quad \mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \ell(f_\theta(x_i), y_i) $$

Because computing $f_\theta(x_i)$ and $\nabla_\theta f_\theta(x_i)$ for every $i$ is expensive, we compute it on a subset of the data points to make a gradient descent step over $\theta$. This is known as the stochastic gradient descent algorithm. It is stochastic because we compute the gradient of the loss function only over a mini-batch.

The points sampled to estimate the gradient are known as the batch (or mini-batch) and its size is the batch_size. We change the batch after every gradient step. We usually sample without replacement. Once all the points have been sampled, we start a new loop over the data. This is known as an epoch.

We monitor the loss on a validation set during training and evaluate the final model on a test set.

If the model is not able to fit the training data, its capacity is too low and the model is said to underfit. If it fits the data nicely but does not generalize to unseen examples, the model is said to overfit, and regularization can be used to mitigate it.

4.1 Data and dataloader

In machine learning, we represent our input points in a higher-dimensional tensor, where the first dimension (usual convention) is the index, and the others are feature dimensions. For instance, suppose we have data for predicting the price of an apartment:

size   downtown   renovated   parking
30     false      true        true
10.4   false      true        false
50     true       false       true

And the target vector


Would be represented as:

In [37]:
X = torch.tensor([
    [30,   0, 1, 1],
    [10.4, 0, 1, 1],
    [50,   1, 1, 1]
])
Y = torch.tensor([89.6, 56, 10])

In the case of Y, the second dimension (dim=1) has only one feature (the price), so we don't need to add a second axis to the vector.

Depending on the application, the feature dimensions can be organised in more than one dimension to better use the structure. For instance, if we had images, we would have two feature axes, hence having a three-dimensional tensor: 0: index, 1: x-axis, 2: y-axis. If we had movies, we would have a four-dimensional tensor: 0: index, 1: x-axis, 2: y-axis, 3: time. More advanced structured inputs (sequences, graphs, ...) require more carefully designing the tensor.
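As an illustration of these conventions (shapes only, with made-up sizes):

```python
import torch

# 10 grayscale images of 28 x 28 pixels: (index, x-axis, y-axis)
images = torch.zeros(10, 28, 28)
# 10 movies of 28 x 28 pixels and 16 frames: (index, x-axis, y-axis, time)
movies = torch.zeros(10, 28, 28, 16)

print(images.dim(), movies.dim())  # 3 4
```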

In traditional machine learning, we can pass the whole X and Y tensors to an algorithm (linear regression, random forest, SVM). In deep learning, we have way more inputs and use stochastic gradient descent.

A pseudo code for a simple gradient descent would look like the following.

for epoch in range(max_epoch):
    n_batchs = X.size(0) // batch_size + 1
    for i in range(n_batchs):
        X_batch = X[i*batch_size: (i+1)*batch_size]
        y_batch = y[i*batch_size: (i+1)*batch_size]

        ##### Detailed later
        # compute predictions
        # compute loss
        # compute gradients
        # apply gradient step

Notice we didn't truly go through the trouble of sampling properly. We just assume the matrix is shuffled and reuse the same order (in practice this is acceptable).

PyTorch introduces the Dataset and DataLoader classes to do that work. The idea behind the dataloader is that the data might not fit in memory (or GPU memory), so it will load it only as necessary. Even when that is not an issue, the dataloader comes in handy to avoid rewriting the strategies for sampling data points.

Note: in neural networks, we use floats to represent the data, even if it is integer, because we want to do operations with the weights, which are floats.


The Dataset class is just a representation of our data. We have to implement __len__ (the number of examples) and __getitem__ (return example number i; it doesn't need to support slices).

In [38]:
from torch.utils.data import Dataset

class MyDataset(Dataset):

    def __init__(self, X, Y):
        assert len(X) == len(Y)
        self._X = X
        self._Y = Y

    def __len__(self):
        return len(self._X)

    def __getitem__(self, i):
        return self._X[i], self._Y[i]

We can use it with our data:

In [39]:
ds = MyDataset(X, Y)
ds[0]
(tensor([30.,  0.,  1.,  1.]), tensor(89.6000))

One can use the opportunity of the Dataset class to read examples from disk, either one by one in __getitem__, or all at once in __init__. It can also be used to generate new ones on the fly.

In the case of simple Tensors, it is so straightforward that there is a factory function to build the previous dataset:

In [40]:
from torch.utils.data import TensorDataset

ds = TensorDataset(X, Y)
ds[0]
(tensor([30.,  0.,  1.,  1.]), tensor(89.6000))


The DataLoader combines a Dataset, a Sampler, and a batching function collate_fn to form batches ready to be used by the neural network.

  • The Sampler represents the sampling strategy; it outputs indexes that are passed to the Dataset. In practice, it can be constructed automatically by the DataLoader.
  • The collate_fn takes multiple examples and forms a batch. In most cases this is simply concatenation, but it can be changed for more advanced behaviors.
  • The DataLoader also has other useful parameters, such as the number of workers. All is explained in the documentation of the class.

In [41]:
from torch.utils.data import DataLoader

data_loader = DataLoader(ds, batch_size=2, shuffle=True)

for epoch in range(3):  # epoch loop
    print(f"Epoch: {epoch}")
    for batch in data_loader:  # mini batches loop
        x, y = batch
        print("\t", y)

        # compute loss and gradients
Epoch: 0
	 tensor([56., 10.])
Epoch: 1
	 tensor([10., 56.])
Epoch: 2
	 tensor([56.0000, 89.6000])

We see that the dataloader served the batches (of size 2) one by one in a random order (reshuffling between the three epochs). The last batch of every epoch is of size 1 because there weren't enough examples left to form a full mini-batch. That is usually not a problem, but it can be controlled with the drop_last option of the DataLoader.
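For instance, a sketch with our 3-example dataset: with drop_last=True the incomplete final batch is discarded.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.tensor([[30., 0, 1, 1], [10.4, 0, 1, 1], [50., 1, 1, 1]])
Y = torch.tensor([89.6, 56, 10])
ds = TensorDataset(X, Y)

# 3 examples, batches of 2: the leftover batch of size 1 is dropped
loader = DataLoader(ds, batch_size=2, shuffle=True, drop_last=True)
print(len(loader))  # 1 batch per epoch instead of 2
```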

In this section, we showed how mini-batches are handled in deep learning and presented PyTorch's convenient way of iterating through them. Using these classes is optional but reduces the amount of code to write.

Famous deep learning datasets come with their own class that downloads the data on the first use, saves it to disk, and reads it.

More information on data loading can be found in this tutorial.

4.2 Models (the actual neural network)

In a feed-forward neural network, we alternate between layers of different sizes. More complex networks have more complex operations (convolutions, recurrences, gating, attention, stochastic operations, ...), but in the end, we represent everything as a layer. For instance, we will say that a matrix multiplication is a Linear layer, and that an elementwise non-linearity (such as ReLU) is another.

In PyTorch, we call our layers and networks Modules. A Module can be a layer or a mix of other Modules. You can write a Module to do many things, as long as it can be expressed in PyTorch and a gradient can be computed or estimated.

In PyTorch, all layers and models are instances of the same class, Module. This is because just one layer can be a neural network, and combining neural networks also makes a neural network.

In the following, remember that a Module (a layer or a neural network) is just a succession of operations, or simply a function. The goal of the Module class is simply to keep track of some objects and to interface seamlessly with the rest of PyTorch.

Let's make a simple feed-forward neural network with a couple of Modules (or layers).

In [42]:
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 20),   # from 4 input neurons to 20 hidden neurons
    nn.ReLU(),          # an elementwise non-linearity
    nn.Linear(20, 20),  # this is a `Module`
    nn.ReLU(),          # this is also a `Module`
    nn.Linear(20, 1)    # output layer has one neuron for our regression
)

That's a neural network with two hidden layers, each having 20 neurons. The input layer has 4 entries (that was the number of features of our X), which means that we need a linear transformation mapping $\mathbb{R}^4$ to $\mathbb{R}^{20}$.

The whole network is a Module:

In [43]:
isinstance(model, nn.Module)

We can use it on our data:

In [44]:
model(X)
        [4.8541]], grad_fn=<AddmmBackward>)

Linear and ReLU are also Modules, and we can use them on some data as well:

In [45]:
issubclass(nn.Linear, nn.Module)
In [46]:
module = nn.Linear(4, 20)
module(X)
tensor([[-15.2087, -13.6489,  -8.8253,  -6.6724,   1.9840,  11.1273,  -1.1347,
          -6.9154,   1.0197, -12.3396,   5.0348,  -9.3840, -11.5080,  -3.6012,
          -7.4400,  -8.7197,   6.0067,   7.8242,  -7.7579,  10.9312],
        [ -5.7391,  -4.5603,  -2.9183,  -1.4600,   0.2065,   4.5511,  -0.3335,
          -1.5460,  -0.0746,  -3.9957,   1.8410,  -3.0285,  -4.3156,  -0.9034,
          -2.3375,  -3.0550,   2.5899,   2.5968,  -3.0556,   3.6963],
        [-24.5597, -23.2131, -14.7208, -12.0098,   4.2769,  17.9369,  -2.1907,
         -12.2362,   2.5220, -21.0751,   8.0836, -15.8825, -18.4969,  -6.0201,
         -12.3493, -14.5556,   9.3745,  12.7641, -12.7607,  18.6822]],

A Module has some methods that are the same as a Tensor's, such as .to(device), .to(dtype), and .float().

The class-oriented API is what you need to use to define more complex models. It lets you define how to do the computation using the .forward method.

In [47]:
class MyModel(nn.Module):
    def __init__(self):
        # Initialize the parent class, this is mandatory
        super().__init__()
        # Our previous network.
        # Creating these modules initializes them with different
        # weights. They are different objects representing
        # different variables.
        self.lin1 = nn.Linear(4, 20)
        self.relu1 = nn.ReLU()
        self.lin2 = nn.Linear(20, 20)
        self.relu2 = nn.ReLU()
        self.lin3 = nn.Linear(20, 1)
    # `forward` is just the method name used by PyTorch.
    # The parent class `nn.Module` implements __call__,
    # which calls our `forward` but also does some
    # extra work.
    def forward(self, X):
        # This is just some PyTorch computations.
        # Modules do PyTorch operations
        h = self.lin1(X)
        h = self.relu1(h)
        h = self.lin2(h)
        h = self.relu2(h)
        out = self.lin3(h)
        return out

There is no dark magic in the code above; it's simply a class initializing some Tensors, and then making some operations on the input X. The same forward computation could be written in numpy (but we're going to need to compute some gradients). This works the same as before:

In [48]:
model = MyModel()
# The result is not supposed to be the same because
# our linear `Module`s are initialized randomly.
        [-2.1556]], grad_fn=<AddmmBackward>)

Actually, there is a bit more than just the Tensor operations here. The reason we initialize the nn.Modules in __init__ is that PyTorch will keep track of all the Modules set as attributes of the object for us. That way, when calling methods that need to operate on all submodules, such as .float() or .to(device) (or .train(), .eval(), .parameters(), .apply(), seen later), PyTorch can apply the method recursively for us.

When setting another Module as an attribute of a Module, PyTorch will keep track of it to apply some methods recursively. The attribute must be a Module (not a list or dict of Modules). However, nn.ModuleList and nn.ModuleDict can be used to create a Module out of other Modules; they behave like a list and a dict respectively.

You're free to extend nn.Module the way you want, add other methods, etc. You can have more parameters and options in __init__ and forward. Usually you would put the size of your layers and other hyperparameters in your __init__ arguments.

Notice how in forward, we just feed the result of the previous Module to the next one? That's exactly what the Sequential model from before does; it's just a shorthand.

Let's dig deeper. Our ReLU doesn't have any parameters, so we don't need to keep track of them; we could simply have only one:

In [49]:
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.relu = nn.ReLU()
        self.lin1 = nn.Linear(4, 20)
        self.lin2 = nn.Linear(20, 20)
        self.lin3 = nn.Linear(20, 1)

    def forward(self, X):
        h = self.lin1(X)
        h = self.relu(h)
        h = self.lin2(h)
        h = self.relu(h)
        out = self.lin3(h)
        return out

Or even:

In [50]:
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.lin1 = nn.Linear(4, 20)
        self.lin2 = nn.Linear(20, 20)
        self.lin3 = nn.Linear(20, 1)

    def forward(self, X):
        h = self.lin1(X)
        h = torch.max(h, torch.zeros_like(h))
        h = self.lin2(h)
        h = torch.max(h, torch.zeros_like(h))
        out = self.lin3(h)
        return out

We can also replace the torch.max by a function of only one input

In [51]:
import torch.nn.functional as F

# Same as torch.max(X, torch.zeros_like(X))
F.relu(X)
tensor([[30.0000,  0.0000,  1.0000,  1.0000],
        [10.4000,  0.0000,  1.0000,  1.0000],
        [50.0000,  1.0000,  1.0000,  1.0000]])

In the case of ReLU, this just makes the code slightly more readable, but other functions that we'll use later are more complicated. It's also an opportunity for the developers to optimize the code behind it.

You can have a look at the code for Linear here (it's not hard to read). You'll see that it's very much like our own Module. The class just initializes Parameters in __init__, and uses them to do the affine transformation in forward.

F.linear is just a function without internal Parameters that returns X @ W.t() + b. This is known as the functional API; it achieves the same as the nn.Module API but in a stateless way (pure functions).
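To check the correspondence, we can feed the weights of an nn.Linear module into F.linear and compare; a small sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(3, 4)
module = nn.Linear(4, 20)

# The stateless call, given the module's own weight and bias,
# computes the same affine transformation
out_functional = F.linear(X, module.weight, module.bias)
out_module = module(X)

print(torch.allclose(out_functional, out_module))  # True
```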

Another important aspect to keep in mind is whether the model is in training mode or evaluation mode. Certain nn.Modules behave differently depending on the case. For instance, dropout is a regularization technique that randomly sets some outputs of a layer to zero. When the model is evaluated, all outputs are used (there is also a scaling so that the expectation stays the same at train and test time).

nn.Module keeps track of that for us. All we need to do is say which mode we want and it will propagate to all the nn.Modules in our network.

We use .train() to put the model in training mode and .eval() to put it in evaluation mode. The .training attribute tells us in which mode the network is.

In [52]:
# Put the `nn.Module` in train mode
model.train()
# Check if the `nn.Module` is in training mode
model.training
In [53]:
# Put the `nn.Module` in evaluation mode
model.eval()
# Check if the `nn.Module` is in training mode
model.training

If you use the functional API (torch.nn.functional as F) in your .forward method, you can pass self.training to the functions that behave differently at train and test time.

In [54]:
# When using Dropout at train time, a scaling is applied
# to keep the same expected activation downstream.
F.dropout(X, training=True)
tensor([[ 60.,   0.,   0.,   0.],
        [  0.,   0.,   2.,   2.],
        [100.,   0.,   2.,   2.]])
In [55]:
# no dropout at evaluation time
F.dropout(X, training=False)
tensor([[30.0000,  0.0000,  1.0000,  1.0000],
        [10.4000,  0.0000,  1.0000,  1.0000],
        [50.0000,  1.0000,  1.0000,  1.0000]])

In our forward method, applying dropout before the last layer would look like this:

def forward(self, X):
    h = self.lin1(X)
    h = torch.max(h, torch.zeros_like(h))
    h = self.lin2(h)
    h = torch.max(h, torch.zeros_like(h))
    h = F.dropout(h, training=self.training)
    out = self.lin3(h)
    return out

[Advanced] Parameters

The neural network is a function of both its input $x$ and its parameters (or weights) $\theta$. We're going to leave the internal weights for PyTorch to manage. That is the goal of the Parameter class. This class wraps a Tensor to let PyTorch know that this tensor needs to be updated. That means both computing its gradient (requires_grad=True) but also making the gradient step $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$.

To know what parameters to update, PyTorch will recursively look in your Module attributes for other Modules, Parameters, or sequences of the previous.

You only need to use Parameter if you implement a very specific type of layer. If you make a network, you can just reuse the layers already available. Even when building new layers, it's often possible to reuse other layers.
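For illustration, here is a hypothetical elementwise scaling layer (the layer name and sizes are made up for this sketch) showing how a raw Parameter registers itself:

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Hypothetical layer multiplying each feature by a learned weight."""

    def __init__(self, n_features):
        super().__init__()
        # Wrapping the tensor in Parameter registers it for gradient
        # computation and for the optimizer's updates
        self.weight = nn.Parameter(torch.ones(n_features))

    def forward(self, X):
        return X * self.weight

layer = Scale(4)
# The parameter is found automatically by .parameters()
print(sum(p.numel() for p in layer.parameters()))  # 4
```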

Our Linear has some parameters (a linear matrix transformation and a bias vector, so actually an affine transformation):

In [56]:
for p in nn.Linear(4, 20).parameters():
    print(p)
Parameter containing:
tensor([[-0.0738, -0.4063, -0.4762,  0.4152],
        [-0.4165,  0.2004,  0.4917,  0.2126],
        [ 0.1891, -0.1208,  0.2383, -0.1684],
        [-0.0259, -0.0307, -0.3393, -0.3913],
        [ 0.4377, -0.1525,  0.1949, -0.0407],
        [-0.3942, -0.1994,  0.1849,  0.2000],
        [-0.4184,  0.4102, -0.1622,  0.4625],
        [-0.2503,  0.0517, -0.2454,  0.1183],
        [ 0.4696, -0.2830,  0.2006,  0.2075],
        [ 0.1909,  0.4976, -0.2516, -0.3534],
        [ 0.1083,  0.0916, -0.2155, -0.3752],
        [-0.3225, -0.2905, -0.4984, -0.3459],
        [ 0.1735,  0.2357,  0.1983,  0.4463],
        [ 0.2195, -0.1318, -0.3424,  0.2164],
        [-0.3479, -0.0249, -0.2260,  0.1750],
        [ 0.0906,  0.0586,  0.2264, -0.3914],
        [ 0.2839, -0.4401,  0.1518, -0.3271],
        [ 0.1169, -0.0481,  0.1440, -0.3001],
        [-0.1690, -0.3332, -0.4675,  0.3139],
        [-0.2580, -0.3231, -0.4955, -0.4806]], requires_grad=True)
Parameter containing:
tensor([ 0.2168, -0.1561,  0.3519,  0.1316,  0.1077, -0.0440,  0.3274,  0.4643,
        -0.1312, -0.0610, -0.0377,  0.3464,  0.0214, -0.2595,  0.3406,  0.1708,
        -0.1816, -0.3534,  0.0196,  0.0814], requires_grad=True)

If you re-execute, you will see different values; that is because it is a different object. If you want to tie weights in your network, you can reuse the same Module object.
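
A minimal sketch of such weight tying (the TiedNet class is hypothetical): reusing one Linear object means only one weight matrix and one bias are registered:

```python
import torch
import torch.nn as nn

class TiedNet(nn.Module):
    """Hypothetical network applying the same `Linear` object twice."""

    def __init__(self):
        super().__init__()
        self.shared = nn.Linear(8, 8)  # one object, reused in forward

    def forward(self, x):
        # Both calls use the exact same weight matrix and bias.
        return self.shared(self.shared(x))

net = TiedNet()
# Only one weight and one bias are registered, not two of each.
print(len(list(net.parameters())))  # 2
```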

ReLU doesn't have any parameters:

In [57]:
for p in nn.ReLU().parameters():
    print(p)

Our model has all the parameters of its submodules because the .parameters() method is recursive. Let's print only one:

In [58]:
print(next(model.parameters()))
Parameter containing:
tensor([[-0.0746, -0.3022, -0.0724,  0.2296],
        [-0.1943,  0.4038,  0.4587,  0.0772],
        [-0.3246, -0.0725,  0.2456, -0.2552],
        [-0.0633, -0.4671, -0.3263, -0.4756],
        [ 0.0261,  0.1410, -0.2558, -0.4628],
        [-0.2547, -0.4481, -0.4605, -0.4946],
        [-0.4999, -0.3160, -0.2963,  0.2286],
        [-0.1822, -0.0750,  0.2529, -0.3423],
        [ 0.1483,  0.2040,  0.0973, -0.4850],
        [-0.3051, -0.4447,  0.4696, -0.2583],
        [ 0.4302, -0.1996, -0.2780,  0.1108],
        [ 0.2632,  0.3522, -0.3183, -0.2519],
        [-0.4305, -0.0214, -0.1234,  0.4299],
        [-0.4941, -0.3000,  0.1891, -0.1898],
        [ 0.3007,  0.4173,  0.4907, -0.0691],
        [ 0.3537, -0.0193,  0.3667, -0.0845],
        [-0.3454, -0.3124,  0.4312,  0.1642],
        [-0.2445,  0.3376,  0.0071,  0.3831],
        [-0.1442, -0.3874, -0.3315, -0.0485],
        [-0.3004, -0.3057, -0.0961,  0.1880]], requires_grad=True)

A good practice is to define a function to initialize the Parameters of the module.

The Module.apply method lets us apply a function on every module (hence every submodule) in a Module. To write an initialization function, we can use the initializations provided in torch.nn.init.

Remember that the function will be given all modules, including the main one, so we have to filter for the submodules that we wish to initialize.

In [59]:
# `.weight` is the name of the attribute used in
# the `Linear` layer. There is also `.bias`.
# This may be different for other layers with `Parameter`s.
lin = nn.Linear(3, 3)
print(lin.weight)
Parameter containing:
tensor([[-0.2010,  0.5769,  0.1207],
        [ 0.2950, -0.2763,  0.4994],
        [ 0.4017,  0.2964,  0.4787]], requires_grad=True)

Therefore we can use

In [64]:
def my_init(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.constant_(m.bias, 0)
In [65]:
model.apply(my_init)
  (lin1): Linear(in_features=4, out_features=20, bias=True)
  (relu1): ReLU()
  (lin2): Linear(in_features=20, out_features=20, bias=True)
  (relu2): ReLU()
  (lin3): Linear(in_features=20, out_features=1, bias=True)

Calling .apply(func) is the same as doing:

In [66]:
for m in model.modules():
    my_init(m)

State and serialization

Getting a frozen copy of your model and being able to save it to disk is important for many reasons:

  • Once you have trained a model, you want to store it and use it on some task;
  • During training, you want to keep the model (Parameters) performing best on the validation set;
  • During long training, you want to be able to restore your optimization in case the job gets killed (especially useful on clusters with low-priority computing).

The method .state_dict() of a Module returns a dictionary with all the inside information necessary to recover the state of the model (mostly the values of the inner Parameters). The method .load_state_dict(dict) will load that state.

The function torch.save can be used to serialize (save to disk) torch objects. We can pass it the result from .state_dict(). Similarly, the function torch.load can be used to read it back from disk.

It is possible to use the pickle module directly on torch objects (open the file as binary). It is also possible to serialize the whole Module object instead of its state, but there are some caveats.

The state returned by .state_dict() shares underlying memory with the Module Parameters. This means that if you keep the states in Python, they will keep changing as you optimize your model. To avoid that, you can save them to disk or make a deep copy with deepcopy from the copy module.

In [67]:
lin = nn.Linear(3, 3)
state = lin.state_dict()
state
OrderedDict([('weight',
              tensor([[ 0.0870,  0.3715, -0.2252],
                      [-0.0814,  0.3126, -0.0722],
                      [ 0.4684,  0.0365, -0.3016]])),
             ('bias', tensor([ 0.0319,  0.5305, -0.1515]))])
In [68]:
# Our state and `Module` parameters are linked.
state["weight"][:] = 0
lin.weight
Parameter containing:
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]], requires_grad=True)
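
If you want an independent snapshot in memory rather than on disk, a deep copy breaks this link; a minimal sketch:

```python
import copy
import torch
import torch.nn as nn

lin = nn.Linear(3, 3)
live_state = lin.state_dict()             # shares memory with the model
frozen_state = copy.deepcopy(live_state)  # independent snapshot

with torch.no_grad():
    lin.weight.zero_()  # simulate a training update

# The live state followed the change, the deep copy did not.
print(live_state["weight"].sum().item())              # 0.0
print(frozen_state["weight"].abs().sum().item() > 0)  # True (almost surely)
```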

We use it the following way:

In [69]:
lin = nn.Linear(3, 3)
# Save state dict.
torch.save(lin.state_dict(), "/tmp/my_model_state")
In [70]:
# Modify model.
lin.weight.data[:] = 0
lin.weight
Parameter containing:
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]], requires_grad=True)
In [71]:
# Reload the parameters.
lin.load_state_dict(torch.load("/tmp/my_model_state"))
# Parameters recovered:
lin.weight
Parameter containing:
tensor([[ 0.4257,  0.0643, -0.0683],
        [-0.4822, -0.1999, -0.3269],
        [ 0.5627, -0.2880,  0.4859]], requires_grad=True)

Moving a model to GPU

Because a torch.nn.Module holds Parameters (which are Tensors), it is important to move the model to the device on which we want to do our computations.

Use model.to(device) to move the nn.Module to the given device. Unlike Tensor.to(device), which returns a new tensor, this method moves the current model in place (and returns itself).
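
A minimal device-agnostic sketch (the model and sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# Pick the GPU if available, fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2)  # arbitrary small model
model.to(device)         # moves the parameters in place (also returns the model)

# Inputs must live on the same device as the model.
x = torch.rand(8, 4, device=device)
out = model(x)
print(out.shape)  # torch.Size([8, 2])
```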

[Extra] CNN and RNN layers

If you are doing deep learning, chances are you are doing computer vision or natural language processing. In each case, you will need to master the classical convolutional layers and recurrent layers respectively.

These layers, and others, work the same as what we have presented, but assume more structure on their input data. Convolutional layers need the data to be structured with width, height, and a number of channels (e.g. RGB for input images). Recurrent layers need a time dimension (careful: by default this is the first one, even before the batch dimension).
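
A quick sketch of the shapes these layers expect (layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# Convolutional layers expect [batch, channels, height, width].
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
images = torch.rand(8, 3, 28, 28)  # 8 RGB images of 28x28 pixels
conv_out = conv(images)
print(conv_out.shape)  # torch.Size([8, 16, 28, 28])

# Recurrent layers expect [time, batch, features] by default
# (time first, before the batch dimension!).
rnn = nn.RNN(input_size=10, hidden_size=20)
sequence = torch.rand(5, 8, 10)  # 5 time steps, batch of 8, 10 features
rnn_out, hidden = rnn(sequence)
print(rnn_out.shape)  # torch.Size([5, 8, 20])
```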

[Advanced] Parameters/layer sharing

CNNs, RNNs, and other types of layers that you can build use weight sharing, meaning that some weights are mathematically the same in multiple operations. For instance, in an RNN, the weights are reused at every time step. To do this in PyTorch, just do what you would intuitively do: reuse the same Python object (a Tensor or an nn.Module). In the forward pass, the same value will be used, while in the backward pass, gradients from different children in the compute graph will be summed, as they should be according to the chain rule.

A simple example: a neural network that uses the same weights for five hidden layers:

In [72]:
class WeightSharingNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(4, 20)
        self.lin2 = nn.Linear(20, 20)
        self.lin3 = nn.Linear(20, 1)

    def forward(self, X):
        # First layer 4 -> 20
        h = F.relu(self.lin1(X))
        # lin2 weights reused five times 20 -> 20
        for _ in range(5):
            h = F.relu(self.lin2(h))
        out = self.lin3(h)
        return out
weight_sharing_network = WeightSharingNetwork()

Everything works normally:

In [73]:
# Forward OK
out = weight_sharing_network(X)
# Backward OK
out.sum().backward()

4.3 Optimizers

Now that we have an API for neural networks and automatic gradient computation for the parameters, gradient descent is going to be as easy as $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$.

PyTorch provides different optimizers that inherit from the torch.optim.Optimizer base class. The simplest one is optim.SGD, which does exactly the update mentioned previously. optim.Adam (ref here) is a nice go-to optimizer.

To instantiate an optimizer, we need to give it the parameters to optimize over, as well as optimization parameters. Thankfully, we've seen how to get all the parameters in our nn.Module.

To create an optimizer, choose an algorithm from torch.optim and pass it the model parameters (torch.nn.Module.parameters), along with other hyperparameters (learning rate, ...).

More information on optimizers is available in the documentation.

In [74]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
In [75]:
# lr is learning rate, a very important hyperparameter
optimizer = optim.Adam(model.parameters(), lr=1e-3)

Note that there is a weight_decay ($\ell_2$ norm regularization) parameter in optimizers. This is a shorthand: that way, one does not need to list all the parameters in the loss function. Be careful: usually one does not regularize the biases of the model.
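
One way to exclude biases from regularization is optimizer parameter groups; a sketch (the grouping-by-name heuristic is an assumption, not from the tutorial):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(4, 20), nn.ReLU(), nn.Linear(20, 1))

# Split parameters by name: biases go in a group without weight decay.
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = optim.Adam(
    [
        {"params": decay, "weight_decay": 1e-4},    # regularize weights
        {"params": no_decay, "weight_decay": 0.0},  # but not biases
    ],
    lr=1e-3,
)
print(len(optimizer.param_groups))  # 2
```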

To write an optimization step we use two functions from torch.optim.Optimizer.

We need optimizer.zero_grad() to reset the .grad attribute of our parameters to zero (otherwise gradients would be summed, which is a feature required by backpropagation). Then we compute new gradients and use optimizer.step() to perform a gradient update.

In [76]:
# This is going to make sure all `.grad` attributes
# in the tensor parameters of our network are reset
# to zero.
# Otherwise, as we keep computing gradients, they
# are summed in this `.grad` attribute.
optimizer.zero_grad()

# Forward pass
y_pred = model(x)

# Compute the loss between the predictions and the true labels.
# Here we use the mean square error for regression.
# Note: y_pred has shape `batch_size x 1`, while y only has
# shape `batch_size`. We use `squeeze` to remove the 1-dimension.
# We could also have added `squeeze` at the end of our `.forward`
# method.
loss = F.mse_loss(y_pred.squeeze(1), y)

# Compute the gradients of the parameters in the compute
# graph.
loss.backward()

# The optimizer applies a descent step to the parameters.
optimizer.step()

Note: optimizers always minimize; if you want to maximize, you can take the negative of your loss.

Overall, once we've created our model (nn.Module), our Dataset, our DataLoader, and our Optimizer, the training loop looks like this:

In [77]:
for epoch in range(6):  # epoch loop
    print(f"Epoch: {epoch}")
    for batch in data_loader:  # mini batches loop
        model.train()  # make sure the model is in training mode
        x, y = batch
        optimizer.zero_grad()  # reset the gradients
        y_pred = model(x)  # forward pass
        loss = F.mse_loss(y_pred.squeeze(1), y)
        loss.backward()  # compute the gradients
        optimizer.step()  # gradient descent step
        print("\t", f"Loss: {loss.item()}")
Epoch: 0
	 Loss: 2542.23486328125
	 Loss: 2197.262451171875
Epoch: 1
	 Loss: 2548.02490234375
	 Loss: 2201.3642578125
Epoch: 2
	 Loss: 2549.058349609375
	 Loss: 2192.867919921875
Epoch: 3
	 Loss: 3123.35400390625
	 Loss: 1061.0064697265625
Epoch: 4
	 Loss: 3098.276611328125
	 Loss: 1072.787109375
Epoch: 5
	 Loss: 3089.1875
	 Loss: 1071.0848388671875

This loop is not supposed to converge: the data is tiny, and every design decision here was made at random.

This double loop is where the training happens. It can last for days on big models/datasets. The training needs to be babysat: during this loop, it is important to monitor the performance on the training and validation sets, save the parameters to restart in case the program gets killed, save the best set of parameters found so far, etc.

Monitoring the training loop is tedious. We'll present some directions in the last section but won't have time to go into the details.

State and serialization

Optimizers also have an internal state (learning rate schedules, momentum buffers, ...). State and serialization work exactly the same as for Modules.
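
A common checkpointing pattern saves the model and optimizer states together (the file path and the epoch field here are arbitrary examples):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(3, 3)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Save everything needed to resume training in one file.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "epoch": 12,  # arbitrary example value
}
torch.save(checkpoint, "/tmp/checkpoint.pt")

# Later (e.g. after the job was killed), restore both states.
restored = torch.load("/tmp/checkpoint.pt")
model.load_state_dict(restored["model"])
optimizer.load_state_dict(restored["optimizer"])
print(restored["epoch"])  # 12
```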

5. Some good practices

Some PyTorch:
  • Use device-agnostic code: have a function/class parameter device and use Tensor.to(device), torch.rand(..., device=device), etc.
  • Similarly, you can change the data type of a tensor without worrying about the device: don't use torch.cuda.FloatTensor but Tensor.float(), Tensor.to(dtype), torch.arange(..., dtype=dtype), etc., where dtype is something like torch.float. PyTorch accepts the device and dtype arguments almost everywhere.
  • A good default for both device and dtype is None. It will keep the current device/type, or use the default one when you construct a tensor.
  • Do use the classes Dataset, DataLoader, and nn.Module, even if your code is simple. It will make it easier to grow your code and increase readability. Do not make these classes do more than they should; keep simple, modular, easy-to-understand abstractions.
General good practices:
  • Test your code (unittest, pytest);
  • Use loggers instead of printing;
  • Document your code;
  • Write clear, explicit code;
  • Format your code homogeneously (e.g. PEP8, PEP257, black);
  • Use version control (git) and back it up (GitHub);
  • Write a separate script to run experiments; save all the results and the model;
  • For long trainings, make frequent backups of your model;
  • Keep track of the dependencies of your code;
  • Notebooks are great for quickly experimenting or visually presenting results, but terrible for reproducibility, modularity, VCS... (Joel Grus, 2018).
My take:
  • Very explicitly separate your code logic from your experiments, for instance:
    • At the root of your project, have a Python package with your models, algorithms, etc.
    • Have separate scripts (in another folder) to run different experiments, or anything else. Note you may need to run them using PYTHONPATH="." to make your package visible, or better, install your package in your virtual environment with pip install -e . and run using python scripts/this_experiment.py
  • Unpopular: don't use a global argparse to manage all your hyperparameters.
  • Actually, don't even use argparse to get the hyperparameters. Put them in a config file, with a unique id (e.g. uuid), and save it somewhere. It literally takes one line of Python to read it into a dictionary. Use the same id to save your results/models. Keep the command line arguments for things related to the execution settings, such as CPU/GPU, number of GPUs, visdom port...

You wouldn't change fonts in the middle of a paper; likewise, mind the quality of your code.

6. Exercise: training a neural network for digit recognition

We're going to train a neural network for digit recognition on the MNIST dataset. This dataset is small (70,000 28x28 images) and considered solved. Being able to fit MNIST is necessary but not sufficient to claim improvement in image recognition.

To do:
  • Create the dataset classes
  • Create the data loaders
  • Create the model
    • Use .view(-1, 28*28) to reshape images into vectors
    • Use dropout
    • The last layer size should be the number of classes (10)
  • Write the training loop
    • Classification loss
    • Use the GPU (with .to(device)). Don't move the whole dataset to the GPU, only your current batch.
  • Add evaluation of the model on the training and validation sets at every epoch (or so)
    • Remember to use .eval() on your model and with torch.no_grad()
    • Measure the loss function and the accuracy
    • Keep track of the best parameters (with state_dict)
  • Cross validate different parameters and models
    • Tune the learning rate
    • Tune the batch size
    • Tune the number of epochs
    • Tune the sizes and number of layers
    • Try using convolution, pooling, spatial dropout, and batchnorm

Numerical stability: in theory, we need to use a softmax on the last layer to get a probability distribution over classes, then use the cross entropy loss function. In practice this is numerically unstable, so the cross entropy function directly takes the linear activations (logits). Another possibility is to use the log-softmax on the last layer and then use the negative log-likelihood (NLL) loss as the loss function.

Note that these functions come in both the nn.Module API and the torch.nn.functional API.
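
A quick check of the equivalence between the two options (on random logits and targets):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(5, 10)           # unnormalized last layer, 5 examples
targets = torch.randint(0, 10, (5,))  # true classes

# Option 1: cross entropy directly on the logits (numerically stable).
loss_a = F.cross_entropy(logits, targets)
# Option 2: log-softmax followed by the negative log-likelihood loss.
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(loss_a, loss_b))  # True
```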

torchvision is a package with utilities for image recognition tasks. It implements the Dataset class for MNIST. This class is quite nice: it will load the dataset from a given disk location and do anything necessary to serve us the Tensors. It will also download the dataset to the disk location if it is not found :)

In [ ]:
import random
from typing import Dict, Mapping, Iterable

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import Dataset, DataLoader, Subset
from torch.utils.tensorboard import SummaryWriter

# Set random seeds for reproducibility.
random.seed(0)
torch.manual_seed(0)
In [ ]:
# This is all the data available for training according to MNIST.
# Because we will use a validation dataset, this data is both the
# training and validation data.
train_valid_dataset = datasets.MNIST(
    'data',  # Path where we're storing the dataset
    train=True,
    download=True,
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
    ]),
)

# We split the dataset in validation and training data.
# List of all training and validation indices, shuffled deterministically.
train_valid_indices = list(range(len(train_valid_dataset)))
random.shuffle(train_valid_indices)

# Number of element in the validation set
n_valid = 15000

# Indices in the list from n_valid to the end are for training.
train_dataset = Subset(train_valid_dataset, train_valid_indices[n_valid:])
# Indices in the list from the beginning to n_valid are for validation.
valid_dataset = Subset(train_valid_dataset, train_valid_indices[:n_valid])

# The test dataset does not need to be split.
test_dataset = datasets.MNIST(
    'data',
    train=False,
    download=True,
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
    ]),
)

The transforms are just a way to do dynamic transformations on the data before passing it to the neural network. The first one says we want the image as a Tensor (otherwise it's a plain image, not usable by PyTorch but convenient for visualizations); the second is a rescaling of the pixel values so that they end up in a nice range (in ML we always need to rescale the inputs, usually to the $[-1, 1]$ or $[0, 1]$ range).

Validation: you should not measure anything on the test set until you've finished all training (not even picking the best model on the test set); otherwise it is p-value hacking.

We'll use a validation set that you can use to measure performance during training and to try out different hyperparameters. This makes use of the Subset class.

You can look at the PyTorch example of MNIST for inspiration. Note that this example does not use a validation set, and uses convolution, not presented here.

You can also use the code structure below. Feel free to adapt it to your liking; this is the way that felt natural to me, and it may not be the best for you. If you do use it, I recommend starting with train_model, as it is the main function running. It will help you figure out what goes in the other functions.

In [ ]:
class ConvNeurNet(nn.Module):
    """Convolutional Neural Network for MNIST classification."""

    def __init__(self) -> None:
        """Initialize the layers."""

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        """Forward pass of the neural network.

            A tensor with shape [N, 28, 28] representing a set of images, where N is the
            number of examples (i.e. images), and 28*28 is the size of the images.

            Unnormalized last layer of the neural network (without softmax) with shape
            [N, 10], where N is the number of input examples and 10 is the number of
            categories to classify (10 digits).


def train_model(
    model: nn.Module,
    train_loader: DataLoader,
    valid_loader: DataLoader,
    optimizer: optim.Optimizer,
    n_epoch: int,
    device: torch.device,
) -> None:
    """Train a model.

    This is the main training function that starts from an untrained model and
    fully trains it.

        The neural network to train.
        The dataloader with the examples to train on.
        The dataloader with the examples used for validation.
        The optimizer initialized with the model parameters.
        The number of epochs (iterations over the complete training set) to train for.
        The device (CPU/GPU) on which to train the model.

    # For using Tensorboard
    writer = SummaryWriter(flush_secs=5)
    writer.add_graph(model, next(iter(train_loader))[0])

def update_model(
    model: nn.Module,
    inputs: torch.Tensor,
    targets: torch.Tensor,
    optimizer: optim.Optimizer,
) -> torch.Tensor:
    """Do a gradient descent iteration on the model.

        The neural network being trained.
        The inputs to the model. A tensor with shape [N, 28, 28], where N is the
        number of examples.
        The true category for each example, with shape [N].
        The optimizer that applies the gradient update, initialized with the model
        parameters.

        Unnormalized last layer of the neural network (without softmax), with shape
        [N, 10]. Detached from the computation graph.


def accuracy_from_logits(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Compute the accuracy of a minibatch.

        Unnormalized last layer of the neural network (without softmax), with shape
        [N, 10], where N is the number of examples.
        The true category for each example, with shape [N].

        As a percentage, so between 0 and 100.


def evaluate_on_batch(logits: torch.Tensor, targets: torch.Tensor) -> Dict[str, float]:
    """Compute a number of metrics on the minibatch.

        Unnormalized last layer of the neural network (without softmax), with shape
        [N, 10], where N is the number of examples.
        The true category for each example, with shape [N].

        A dictionary mapping metric name to value.


def evaluate_model(
    model: nn.Module, loader: DataLoader, device: torch.device
) -> Dict[str, float]:
    """Compute some metrics over a dataset.

        The neural network to evaluate.
        A dataloader over a dataset. The method can be used with the validation
        dataloader (during training for instance), or the test dataloader (after
        training to compute the final performance).

        A dictionary mapping metric name to value.


def log_metrics(
    writer: SummaryWriter, metrics: Mapping[str, float], step: int, suffix: str
) -> None:
    """Log metrics in Tensorboard.

        The summary writer used to log the values.
        A dictionary mapping metric name to value.
        The value on the abscissa axis for the metric curves.
        A string to append to the name of the metric to group them in Tensorboard.
        For instance "Train" on training data, and "Valid" on validation data.

    for name, value in metrics.items():
        writer.add_scalar(f"{name}/{suffix}", value, step)
In [ ]:
# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
is_device_gpu = (device.type == "cuda")

def make_dataloader(dataset: Dataset, batch_size: int) -> DataLoader:
    """Factory function to create dataloaders for different datasets."""
    return DataLoader(
        dataset, batch_size=batch_size, shuffle=True, pin_memory=is_device_gpu
    )
# This batch size influence optimization. It needs to be tuned.
train_loader = make_dataloader(train_dataset, batch_size=64)
# This batch size is just for evaluation; we want it as big as the GPU can support.
valid_loader = make_dataloader(valid_dataset, batch_size=512)
test_loader = make_dataloader(test_dataset, batch_size=512)

model = ConvNeurNet()
optimizer = optim.Adam(model.parameters(), lr=1e-4)


This cell will launch Tensorboard in the notebook to visualize results. You will need to properly use the log_metrics function to see something.

In [ ]:
%load_ext tensorboard
%tensorboard --logdir=runs


Here is one possible solution.

7. Going further

In this section, I'll just give pointers to other features that exist in or around PyTorch. We won't go into much detail, but it is useful to know they exist.

Finding models

PyTorch Hub lets you easily download the most popular (pretrained) models (see doc).

Alternatively, searching for the name of the model along with PyTorch in a search engine should give plenty of implementations ready to use and adapt.

Even if you implement a model from scratch, it is a good practice to look for existing implementations, as you might learn something about performance, numerical stability, etc. by reading the code.

Sparse Tensors

PyTorch has limited support for sparse Tensors (including on GPU). Few operations are implemented, and it is usually not possible to take the gradient with respect to a sparse Tensor. This is because the gradient of a sparse tensor has no reason to be sparse.
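
A minimal sketch of creating a sparse tensor in COO format (the values here are arbitrary):

```python
import torch

# A 3x3 matrix with two non-zero entries, stored in COO format.
indices = torch.tensor([[0, 2],   # row indices of the non-zeros
                        [1, 0]])  # column indices of the non-zeros
values = torch.tensor([3.0, 4.0])
sparse = torch.sparse_coo_tensor(indices, values, (3, 3))

dense = sparse.to_dense()
print(dense[0, 1].item())  # 3.0
print(dense[2, 0].item())  # 4.0
```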

More information in the documentation

Reduce boilerplate

The double for loop (over epochs and batches) can last quite some time (up to days), and it's important to add monitoring, logging, plotting, and checkpointing to it.

Some PyTorch frameworks exist to avoid maintaining an overly large training function, to facilitate code reuse, and to separate the core algorithm from these monitoring operations, while leaving room for customization.

The most popular of these general frameworks are Ignite, Lightning, and TorchBearer.

There exist many more task-specific frameworks; have a look at the PyTorch Ecosystem.

Real time visualization

Training neural networks can be hard, and one needs to visualize what is happening during training to improve on it. For instance, one can visualize the learning curve (loss function over time / optimization iterations).

Tensorboard is a powerful tool, coming from the Tensorflow ecosystem, to monitor many aspects of neural network training. It is now supported by PyTorch through the torch.utils.tensorboard module.

Another option is Visdom. It's more flexible, so it takes a bit more time to define your curves. A nice thing is that it is not limited to neural networks and works with numpy and matplotlib, so you can use it with whatever project you have.

Finally, there are some services (with free academic versions) such as tensorboard.dev, comet.ml, and Weights and Biases that host your experiments online, on top of providing visualization.

Multi GPU (one machine)

The most straightforward way to leverage more parallelism in deep neural networks is to use data parallelism (parallelizing across the batch dimension). This can be done across multiple GPUs in PyTorch by wrapping the model in the nn.DataParallel class (tutorial).

PyTorch also has some implementations for multiple machines, for instance using MPI.

Production setting

With version 1.0, PyTorch improved its production story. There is a new C++ interface to PyTorch, along with TorchScript, a subset of Python that can be compiled and run outside of Python. Both interfaces should be compatible, but they are still experimental.

It is also possible to export (some) models to another framework using ONNX.


Visit the PyTorch documentation and PyTorch tutorials for more features, such as trading memory for compute, and named tensors.