PyTorch is an open source machine learning framework that allows you to write your own neural networks and optimize them efficiently. We choose to teach PyTorch because it is well established, has a huge developer community (it was originally developed by Facebook), is very flexible, and is especially popular in research. Many current papers publish their code in PyTorch, so it is good to be familiar with it as well.
## Standard libraries
import os
import math
import numpy as np
import time
## Imports for plotting
import matplotlib.pyplot as plt
%matplotlib inline
We will start by reviewing the very basic concepts of PyTorch. As a prerequisite, we recommend being familiar with the numpy package, as most machine learning frameworks are based on very similar concepts. If you are not familiar with numpy yet, don't worry: here is a tutorial to go through.
Before continuing, make sure PyTorch is installed; installation instructions can be found on the official PyTorch website.
File "<ipython-input-44-9d33066124d3>", line 1 Install PyTorch ^ SyntaxError: invalid syntax
So, let's start with importing PyTorch. The package is called torch, based on its original framework Torch. As a first step, we can check its version:
import torch
print("Using torch", torch.__version__)
Using torch 1.7.1
In general, it is recommended to keep your PyTorch version up to date, although a slightly older version is usually fine as well: the interface between PyTorch versions does not change much, so the code here should also run with newer versions.
As in every machine learning framework, PyTorch provides stochastic functions, e.g. for generating random numbers. A very good practice, however, is to set up your code so that it is reproducible with the exact same random numbers. This is why we set a seed below.
torch.manual_seed(42) # Setting the seed
<torch._C.Generator at 0x7f550dcc49d8>
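Note that this seed only covers the CPU random number generators; if you later run stochastic operations on the GPU (covered further below), you may want to seed those generators as well. A small optional sketch:
if torch.cuda.is_available():       # only relevant if a GPU is present
    torch.cuda.manual_seed_all(42)  # seed the random number generators of all GPUs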
Tensors are the PyTorch equivalent of numpy arrays, with the added benefit of GPU acceleration support (more on that later). The name "tensor" is a generalization of concepts you already know: for instance, a vector is a 1-D tensor, and a matrix is a 2-D tensor. When working with neural networks, we will use tensors of various shapes and numbers of dimensions.
Let's first look at different ways of creating a tensor. There are many possible options; the simplest one is to call torch.Tensor, passing the desired shape as input argument:
x = torch.Tensor(2, 3, 4)
print(x)
tensor([[[3.3962e-02, 4.5678e-41, 3.3962e-02, 4.5678e-41], [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00], [1.4013e-45, 0.0000e+00, 1.5502e-09, 3.0894e-41]], [[1.4013e-45, 0.0000e+00, 1.5501e-09, 3.0894e-41], [3.9236e-44, 3.0894e-41, 1.0838e-01, 4.5678e-41], [2.8026e-45, 4.5678e-41, 2.8026e-45, 4.5677e-41]]])
The function torch.Tensor allocates memory for the desired tensor, but reuses whatever values were already in that memory. To directly assign values to the tensor during initialization, there are many alternatives, including:
- torch.zeros: Creates a tensor filled with zeros
- torch.ones: Creates a tensor filled with ones
- torch.rand: Creates a tensor with random values uniformly sampled between 0 and 1
- torch.randn: Creates a tensor with random values sampled from a normal distribution with mean 0 and variance 1
- torch.arange: Creates a tensor containing the values $N,N+1,N+2,...,M$
- torch.Tensor (input list): Creates a tensor from the list elements you provide

# Create a tensor from a (nested) list
x = torch.Tensor([[1, 2], [3, 4]])
print(x)
tensor([[1., 2.], [3., 4.]])
# Create a tensor with random values between 0 and 1 with the shape [2, 3, 4]
x = torch.rand(2, 3, 4)
print(x)
tensor([[[0.8823, 0.9150, 0.3829, 0.9593], [0.3904, 0.6009, 0.2566, 0.7936], [0.9408, 0.1332, 0.9346, 0.5936]], [[0.8694, 0.5677, 0.7411, 0.4294], [0.8854, 0.5739, 0.2666, 0.6274], [0.2696, 0.4414, 0.2969, 0.8317]]])
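For completeness, a quick look at a few of the other constructors from the list above:
print(torch.zeros(2, 3))   # 2x3 tensor filled with zeros
print(torch.ones(2, 3))    # 2x3 tensor filled with ones
print(torch.arange(2, 8))  # tensor([2, 3, 4, 5, 6, 7]) -- the end value is exclusive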
You can obtain the shape of a tensor in the same way as in numpy (x.shape), or using the .size method:
shape = x.shape
print("Shape:", x.shape)
size = x.size()
print("Size:", size)
dim1, dim2, dim3 = x.size()
print("Size:", dim1, dim2, dim3)
Shape: torch.Size([2, 3, 4]) Size: torch.Size([2, 3, 4]) Size: 2 3 4
Most common functions you know from numpy can also be used on tensors. In fact, since numpy arrays are so similar to tensors, we can convert most tensors to numpy arrays (and back), although we will not need this very often.
# convert numpy to tensor or vice versa
np_data = np.arange(6).reshape((2, 3))
torch_data = torch.from_numpy(np_data)
tensor2array = torch_data.numpy()
print(
'\nnumpy array:', np_data, # [[0 1 2], [3 4 5]]
'\ntorch tensor:', torch_data, # 0 1 2 \n 3 4 5 [torch.LongTensor of size 2x3]
'\ntensor to array:', tensor2array, # [[0 1 2], [3 4 5]]
)
numpy array: [[0 1 2] [3 4 5]] torch tensor: tensor([[0, 1, 2], [3, 4, 5]]) tensor to array: [[0 1 2] [3 4 5]]
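One detail worth knowing: torch.from_numpy shares its memory with the numpy array, so changing one also changes the other. A small sketch of this behavior:
arr = np.zeros((2, 2))
t = torch.from_numpy(arr)  # t and arr share the same underlying memory
arr[0, 0] = 99.0           # modifying the numpy array ...
print(t)                   # ... is visible in the tensor as well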
Most operations that exist in numpy also exist in PyTorch. A full list of operations can be found in the PyTorch documentation, but we will see the most important ones here.
The simplest operation is to add two tensors:
x1 = torch.rand(2, 3)
x2 = torch.rand(2, 3)
y = x1 + x2
print("X1", x1)
print("X2", x2)
print("Y", y)
X1 tensor([[0.1053, 0.2695, 0.3588], [0.1994, 0.5472, 0.0062]]) X2 tensor([[0.9516, 0.0753, 0.8860], [0.5832, 0.3376, 0.8090]]) Y tensor([[1.0569, 0.3448, 1.2448], [0.7826, 0.8848, 0.8151]])
x1 = torch.rand(2, 3)
x2 = torch.rand(2, 3)
print("X1 (before)", x1)
print("X2 (before)", x2)
x2.add(x1) # add() is not in-place: it returns a new tensor and leaves x2 unchanged
#x2 = torch.add(x1, x2) # equivalent out-of-place addition via torch.add
print("X1 (after)", x1)
print("X2 (after)", x2)
x2.add_(x1) # in place version of add()
print("X1 (in place)", x1)
print("X2 (in place)", x2)
X1 (before) tensor([[0.7539, 0.1952, 0.0050], [0.3068, 0.1165, 0.9103]]) X2 (before) tensor([[0.6440, 0.7071, 0.6581], [0.4913, 0.8913, 0.1447]]) X1 (after) tensor([[0.7539, 0.1952, 0.0050], [0.3068, 0.1165, 0.9103]]) X2 (after) tensor([[0.6440, 0.7071, 0.6581], [0.4913, 0.8913, 0.1447]]) X1 (in place) tensor([[0.7539, 0.1952, 0.0050], [0.3068, 0.1165, 0.9103]]) X2 (in place) tensor([[1.3979, 0.9024, 0.6632], [0.7981, 1.0078, 1.0550]])
Another common operation aims at changing the shape of a tensor. A tensor of size (2,3) can be re-organized into any other shape with the same number of elements (e.g. a tensor of size (6), or (3,2), ...). In PyTorch, this operation is called view:
x = torch.arange(6)
print("X", x)
X tensor([0, 1, 2, 3, 4, 5])
x = x.view(2, -1)
print("X", x)
X tensor([[0, 1, 2], [3, 4, 5]])
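Besides view, another handy reshaping-related operation is permute, which swaps dimensions. A small extra example:
print("X", x.permute(1, 0))  # swaps dimensions 0 and 1, resulting in shape [3, 2]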
Other commonly used operations include matrix multiplications, which are essential for neural networks:

- torch.matmul: Performs the matrix product over two tensors, where the specific behavior depends on the dimensions. If both inputs are matrices (2-dimensional tensors), it performs the standard matrix product. For higher-dimensional inputs, the function supports broadcasting.
- torch.mm: Performs the matrix product over two matrices, but doesn't support broadcasting.
# matrix multiplication
data = [[1,2], [3,4]]
tensor = torch.FloatTensor(data) # 32-bit floating point
# the same matrix product computed with numpy and with PyTorch
print(
'\nmatrix multiplication (matmul)',
'\nnumpy: ', np.matmul(data, data), # [[7, 10], [15, 22]]
'\ntorch.matmul: ', torch.matmul(tensor,tensor),
'\ntorch.mm: ', torch.mm(tensor, tensor) # [[7, 10], [15, 22]]
)
matrix multiplication (matmul) numpy: [[ 7 10] [15 22]] torch.matmul: tensor([[ 7., 10.], [15., 22.]]) torch.mm: tensor([[ 7., 10.], [15., 22.]])
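To illustrate the higher-dimensional behavior of torch.matmul mentioned above, here is a minimal sketch of a batched matrix product:
a = torch.randn(10, 2, 3)        # a batch of 10 matrices of shape 2x3
b = torch.randn(10, 3, 4)        # a batch of 10 matrices of shape 3x4
print(torch.matmul(a, b).shape)  # torch.Size([10, 2, 4]) -- one matrix product per batch element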
We often have the situation where we need to select a part of a tensor. Indexing works just like in numpy, so let's try it:
x = torch.arange(12).view(3, 4)
print("X", x)
X tensor([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]])
print(x[:, 1]) # Second column
tensor([1, 5, 9])
print(x[0]) # First row
tensor([0, 1, 2, 3])
print(x[:2, -1]) # First two rows, last column
tensor([3, 7])
print(x[1:3, :]) # Middle two rows
tensor([[ 4, 5, 6, 7], [ 8, 9, 10, 11]])
One of the main reasons for using PyTorch in Deep Learning projects is that we can automatically get gradients/derivatives of functions that we define. We will mainly use PyTorch for implementing neural networks, and they are just fancy functions. If we use weight matrices in our function that we want to learn, then those are called the parameters or simply the weights.
If our neural network outputs a single scalar value, we talk about taking the derivative, but you will see that quite often we have multiple output variables ("values"); in that case we talk about gradients, which is the more general term.
Given an input $\mathbf{x}$, we define our function by manipulating that input, usually by matrix-multiplications with weight matrices and additions with so-called bias vectors. As we manipulate our input, we are automatically creating a computational graph. This graph shows how to arrive at our output from our input. PyTorch is a define-by-run framework; this means that we can just do our manipulations, and PyTorch will keep track of that graph for us. Thus, we create a dynamic computation graph along the way.
So, to recap: the only thing we have to do is to compute the output, and then we can ask PyTorch to automatically get the gradients.
Note: Why do we want gradients? Consider that we have defined a function, a neural net, that is supposed to compute a certain output $y$ for an input vector $\mathbf{x}$. We then define an error measure that tells us how wrong our network is; how bad it is in predicting output $y$ from input $\mathbf{x}$. Based on this error measure, we can use the gradients to update the weights $\mathbf{W}$ that were responsible for the output, so that the next time we present input $\mathbf{x}$ to our network, the output will be closer to what we want.
The first thing we have to do is to specify which tensors require gradients. By default, when we create a tensor, it does not require gradients.
x = torch.ones((3,))
print(x.requires_grad)
False
x.requires_grad_(True)
print(x.requires_grad)
True
In order to get familiar with the concept of a computational graph, we will create one for the following function:
$$y = \frac{1}{|\mathbf{x}|}\sum_i \left[(x_i + 2)^2 + 3\right]$$

You could imagine that $\mathbf{x}$ are our parameters, and we want to optimize (either maximize or minimize) the output $y$. For this, we want to obtain the gradients $\partial y / \partial \mathbf{x}$. For our example, we'll use $\mathbf{x}=[0,1,2]$ as our input.
x = torch.arange(3, dtype=torch.float32, requires_grad=True) # Only float tensors can have gradients
print("X", x)
X tensor([0., 1., 2.], requires_grad=True)
Now let's build the computational graph step by step.
a = x + 2
b = a ** 2
c = b + 3
y = c.mean()
print("Y", y)
print("b",b)
Y tensor(12.6667, grad_fn=<MeanBackward0>) b tensor([ 4., 9., 16.], grad_fn=<PowBackward0>)
Using the statements above, we have created a computational graph that looks similar to the figure below:
We calculate $a$ based on the inputs $x$ and the constant $2$, $b$ is $a$ squared, and so on. The visualization is an abstraction of the dependencies between inputs and outputs of the operations we have applied.
Each node of the computation graph has automatically defined a function for calculating the gradients with respect to its inputs, grad_fn. You can see this when we printed the output tensor $y$. This is also why the computation graph is usually visualized in the reverse direction (arrows point from the result to the inputs).
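We can peek at this structure directly by inspecting the grad_fn attributes (a quick sketch):
print(y.grad_fn)                 # the mean operation that produced y
print(y.grad_fn.next_functions)  # grad_fn objects of its inputs, i.e. the next nodes when walking backwards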
We can perform backpropagation on the computation graph by calling the function backward() on the last output, which effectively calculates the gradients for each tensor that has the property requires_grad=True:
y.backward()
x.grad will now contain the gradient $\partial y/ \partial \mathbf{x}$, and this gradient indicates how a change in $\mathbf{x}$ will affect the output $y$ given the current input $\mathbf{x}=[0,1,2]$:
print(x.grad)
tensor([1.3333, 2.0000, 2.6667])
We can also verify these gradients by hand. We will calculate the gradients using the chain rule, in the same way as PyTorch did it:
$$\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial c_i}\frac{\partial c_i}{\partial b_i}\frac{\partial b_i}{\partial a_i}\frac{\partial a_i}{\partial x_i}$$

Note that we have simplified this equation to index notation, using the fact that all operations besides the mean do not combine the elements of the tensor. The partial derivatives are:
$$ \frac{\partial a_i}{\partial x_i} = 1,\hspace{1cm} \frac{\partial b_i}{\partial a_i} = 2\cdot a_i,\hspace{1cm} \frac{\partial c_i}{\partial b_i} = 1,\hspace{1cm} \frac{\partial y}{\partial c_i} = \frac{1}{3} $$

Hence, with the input being $\mathbf{x}=[0,1,2]$, our gradients are $\partial y/\partial \mathbf{x}=[4/3,2,8/3]$. The previous code cell should have printed the same result.
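If you want, you can also let PyTorch confirm this calculation by recomputing the analytic gradient $\frac{2}{3}(x_i + 2)$ in code and comparing it with the autograd result (a small sanity check):
with torch.no_grad():
    manual_grad = 2 * (x + 2) / 3           # the analytic gradient derived above
print(manual_grad)                          # tensor([1.3333, 2.0000, 2.6667])
print(torch.allclose(x.grad, manual_grad))  # True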
A crucial feature of PyTorch is its support for GPUs (Graphics Processing Units). A GPU can perform many thousands of small operations in parallel, making it very well suited for performing large matrix operations in neural networks.
First, let's check whether you have a GPU available:
gpu_avail = torch.cuda.is_available()
print(f"Is the GPU available? {gpu_avail}")
Is the GPU available? True
If you have a GPU on your computer but the command above returns False, make sure you have the correct CUDA version installed.
By default, all tensors you create are stored on the CPU. We can push a tensor to the GPU using the function .to(...), or .cuda(). However, it is often good practice to define a device object in your code that points to the GPU if you have one, and otherwise to the CPU. Then, you can write your code with respect to this device object, which allows you to run the same code on both a CPU-only system and one with a GPU. We can specify the device as follows:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print("Device", device)
Device cuda
Now let's create a tensor and push it to the device:
x = torch.zeros(2, 3)
x = x.to(device)
print("X", x)
X tensor([[0., 0., 0.], [0., 0., 0.]], device='cuda:0')
In case you have a GPU, you should now see the attribute device='cuda:0' printed next to your tensor. The zero next to cuda indicates that this is the zeroth GPU device on your computer. PyTorch also supports multi-GPU systems, but you will only need that once you train very large networks.
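Tensors can be moved back to the CPU in the same way, which is needed for instance before converting them to numpy arrays. A short sketch:
x_cpu = x.cpu()       # copy the tensor back to CPU memory
print(x_cpu.device)   # cpu
print(x_cpu.numpy())  # .numpy() only works on CPU tensors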
We can also compare the runtime of a large matrix multiplication on the CPU with the same operation on the GPU:
x = torch.randn(5000, 5000)
## CPU version
start_time = time.time()
_ = torch.matmul(x, x)
end_time = time.time()
print(f"CPU time: {(end_time - start_time):6.5f}s")
## GPU version
x = x.to(device)
# CUDA is asynchronous, so we need to use different timing functions
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
_ = torch.matmul(x, x)
end.record()
torch.cuda.synchronize() # Waits for everything to finish running on the GPU
print(f"GPU time: {0.001 * start.elapsed_time(end):6.5f}s") # Milliseconds to seconds
CPU time: 1.31549s GPU time: 0.12480s
Depending on the size of the operation and the CPU/GPU in your system, the speedup of this operation can be different.
A Module is PyTorch's building block for performing operations on tensors. Modules are implemented as subclasses of the torch.nn.Module class. All modules are callable and can be composed together to create complex functions.
The package torch.nn defines a series of useful classes like linear network layers, activation functions, loss functions etc. A full list can be found here. In case you need a certain network layer, check the documentation of the package first before writing the layer yourself, as the package likely contains the code for it already. We import it below:
import torch.nn as nn
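As a first small example, a linear layer is a module that can be created and then called directly on a tensor (a minimal sketch):
layer = nn.Linear(in_features=4, out_features=2)  # holds a 2x4 weight matrix and a bias vector of size 2
inp = torch.rand(3, 4)                            # a batch of 3 examples with 4 features each
out = layer(inp)                                  # calling the module runs its forward computation
print(out.shape)                                  # torch.Size([3, 2])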
Most of the functionality implemented for modules can also be accessed in functional form via torch.nn.functional, although for layers with parameters the functional form requires you to create and manage the weight tensors yourself. Still, the functional package is useful in many situations (for example, for stateless operations like activation functions), so we import it here as well.
import torch.nn.functional as F
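For example, the ReLU activation exists both as a module and as a function, and both compute the same result (a small sketch):
act = nn.ReLU()    # module version, convenient when composing layers, e.g. in nn.Sequential
z = torch.randn(4)
print(act(z))      # identical result to the functional call below
print(F.relu(z))   # functional version: a plain function call, no object needed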
In PyTorch, a neural network is built up out of modules. Modules can contain other modules, and a neural network is considered to be a module itself as well. The basic template of a module is as follows:
class MyModule(nn.Module):
def __init__(self):
super().__init__()
# Some init for my module
def forward(self, x):
# Function for performing the calculation of the module.
pass
The forward function is where the computation of the module takes place, and it is executed when you call the module (net = MyModule(); net(x)). In the init function, we usually create the parameters of the module using nn.Parameter, or define other modules that are used in the forward function. The backward calculation is done automatically, but it can be overridden explicitly if needed.
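To make nn.Parameter concrete, here is a purely illustrative toy module (the name ScaleShift is made up) with two learnable scalars; parameters registered this way show up automatically in .parameters():
class ScaleShift(nn.Module):
    # Toy example: computes a * x + b with two learnable scalars
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(1.0))  # registered as a learnable parameter
        self.b = nn.Parameter(torch.tensor(0.0))

    def forward(self, x):
        return self.a * x + self.b

toy = ScaleShift()
print(list(toy.parameters()))  # both scalars appear automatically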
Now we use a simple regression problem to show how a network is defined and trained in PyTorch:
x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1) # x data (tensor), shape=(100, 1)
y = x.pow(2) + 0.2*torch.rand(x.size()) # noisy y data (tensor), shape=(100, 1)
plt.scatter(x.data.numpy(), y.data.numpy())
plt.show()
class Net(torch.nn.Module):
def __init__(self, n_feature, n_hidden, n_output):
super(Net, self).__init__()
self.hidden = torch.nn.Linear(n_feature, n_hidden) # hidden layer
self.predict = torch.nn.Linear(n_hidden, n_output) # output layer
def forward(self, x):
x = F.relu(self.hidden(x)) # activation function for hidden layer
x = self.predict(x) # linear output
return x
net = Net(n_feature=1, n_hidden=10, n_output=1) # define the network
print(net) # net architecture
Net( (hidden): Linear(in_features=1, out_features=10, bias=True) (predict): Linear(in_features=10, out_features=1, bias=True) )
optimizer = torch.optim.SGD(net.parameters(), lr=0.2)
loss_func = torch.nn.MSELoss() # mean squared error loss for regression
plt.ion() # turn on interactive mode so the plot updates during training
for t in range(100):
prediction = net(x) # input x and predict based on x
loss = loss_func(prediction, y) # must be (1. nn output, 2. target)
optimizer.zero_grad() # clear gradients from the previous step
loss.backward() # backpropagation, compute gradients
optimizer.step() # apply gradients
if t % 10 == 0:
# plot and show learning process
plt.cla()
plt.scatter(x.data.numpy(), y.data.numpy())
plt.plot(x.data.numpy(), prediction.data.numpy(), 'r-', lw=5)
plt.text(0.5, 0, 'Loss=%.4f' % loss.data.numpy(), fontdict={'size': 20, 'color': 'red'})
plt.show()
plt.pause(0.1)
plt.ioff()
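After training, predictions are usually computed without tracking gradients. A short sketch of this common pattern (the test inputs here are made up for illustration):
with torch.no_grad():   # no computational graph is built inside this block
    test_x = torch.unsqueeze(torch.linspace(-1, 1, 10), dim=1)
    print(net(test_x))  # predictions of the trained network on new inputs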
A more advanced tutorial can be found here.
Data Augmentation in PyTorch