%matplotlib inline
from preamble import *
import torch
import torch.nn as nn
SEED = 10
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
np.random.seed(SEED)
In this notebook we will train a neural network to fit an arbitrary function.

As a reminder, the input vector $\pmb{X}$ and the weight matrix $\pmb{\theta}^j$ connecting layer $j$ (with $u^j$ units) to layer $j+1$ (with $u^{j+1}$ units) have the form

$$ \pmb{X} = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_p \end{bmatrix} \ \ ; \ \ \pmb{\theta}^j = \begin{bmatrix} \theta_{10} & \dots & \theta_{1(u^j + 1)} \\ \theta_{20} & \ddots & \\ \vdots & & \\ \theta_{(u^{j+1})0} & & \theta_{(u^{j+1})(u^j + 1)} \end{bmatrix} $$

where the $+1$ accounts for the bias unit. Check slide 13 of the lecture notes for a visualization of the dimensions.
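For reference, an nn.Linear layer in PyTorch stores the same information, except that the bias terms are kept in a separate vector rather than as an extra column of the weight matrix. A minimal sketch, with arbitrary layer sizes chosen only for illustration:

layer = nn.Linear(4, 3)    # hypothetical sizes: u^j = 4 inputs, u^(j+1) = 3 outputs
print(layer.weight.shape)  # torch.Size([3, 4]) -- one row per output unit
print(layer.bias.shape)    # torch.Size([3])    -- the bias terms (the "+1" above)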
Let's create a simple non-linear function to fit with our neural network:
sample_points = 3e3
x_lim = 100
x = np.linspace(0, x_lim, int(sample_points))
y = np.sin(x * x_lim * 1e-4) * np.cos(x * x_lim * 1e-3) * 3
plt.plot(x, y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Function to be fitted')
n_input = 1
n_out = 1
In order for the model to take the data points one by one, we need some additional re-shaping: each entry gets an extra dimension, so that the arrays have shape (number of samples, number of features):
x_reshape = x.reshape((int(len(x) / n_input), n_input))
y_reshape = y.reshape((int(len(y) / n_out), n_out))
# # Uncomment to check the shape change
# print(x.shape, y.shape)
# print(x_reshape.shape, y_reshape.shape)
# print(x[10], x_reshape[10])
The data that we will input to the model needs to be of the type torch.float32.

Side remark: the default dtype of torch tensors (and also of the layer parameters) is torch.float32, which is related to GPU performance optimization. If one wants to use torch.float64/torch.double instead, one can convert the tensors to double precision via v = v.double(), or set the global default via torch.set_default_dtype(torch.float64). Just keep in mind that the NN parameters and the input tensors must have the same precision.
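For example, a small sketch of the conversions mentioned above:

v = torch.zeros(3)
print(v.dtype)           # torch.float32, the default
print(v.double().dtype)  # torch.float64, per-tensor conversion
# torch.set_default_dtype(torch.float64)  # would change the global default instead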
Before starting, let's convert our data numpy arrays to torch tensors:
x_torch = torch.from_numpy(x_reshape)
y_torch = torch.from_numpy(y_reshape)
# Type checking:
print(x.dtype, y.dtype)
print(x_torch.dtype, y_torch.dtype)
float64 float64
torch.float64 torch.float64
The type is still not correct, but we can easily convert it:
x_torch = x_torch.to(dtype=torch.float32)
y_torch = y_torch.to(dtype=torch.float32)
# Type checking:
print(x.dtype, y.dtype)
print(x_torch.dtype, y_torch.dtype)
float64 float64
torch.float32 torch.float32
We will also need to normalize the data to make sure we are in the non-linear region of the activation functions:
x_norm = torch.nn.functional.normalize(x_torch, p=5, dim=0)
y_norm = torch.nn.functional.normalize(y_torch, p=5, dim=0)
The torch.nn.functional.normalize function performs $L_p$ normalization along the chosen dimension, i.e. it divides each element by the $L_p$ norm

$$ \lVert v \rVert_p = \left( \sum_i \lvert v_i \rvert^p \right)^{1/p} $$
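As a minimal sanity check, we can reproduce the result by dividing by the $L_5$ norm computed along the first dimension ourselves (the eps safeguard inside normalize plays no role here):

x_manual = x_torch / x_torch.norm(p=5, dim=0, keepdim=True)
print(torch.allclose(x_manual, x_norm))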
plt.plot(x_norm.detach().numpy(), y_norm.detach().numpy())
plt.title('Normalized function')
A Sequential module is a sequential container: modules are added to it in order and connected in a cascading way, so that the output of each module is forwarded as input to the next one.
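For illustration only, the same cascading behaviour can be written out explicitly as an nn.Module subclass; the class name and structure below are just a sketch, not part of the exercise:

class TinyNet(nn.Module):
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()
        self.hidden = nn.Linear(n_in, n_hidden)
        self.out = nn.Linear(n_hidden, n_out)

    def forward(self, x):
        # each module's output is passed on to the next, as in nn.Sequential
        x = torch.tanh(self.hidden(x))
        return torch.tanh(self.out(x))

# TinyNet(1, 5, 1) would be equivalent to the first Sequential model below.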
n_hidden_01 = 5
model0 = nn.Sequential(nn.Linear(n_input, n_hidden_01),
nn.Tanh(),
nn.Linear(n_hidden_01, n_out),
nn.Tanh()
)
print(model0)
Sequential(
  (0): Linear(in_features=1, out_features=5, bias=True)
  (1): Tanh()
  (2): Linear(in_features=5, out_features=1, bias=True)
  (3): Tanh()
)
n_hidden_11 = 10
model1 = nn.Sequential(nn.Linear(n_input, n_hidden_11),
nn.Tanh(),
nn.Linear(n_hidden_11, n_out),
nn.Tanh()
)
print(model1)
Sequential(
  (0): Linear(in_features=1, out_features=10, bias=True)
  (1): Tanh()
  (2): Linear(in_features=10, out_features=1, bias=True)
  (3): Tanh()
)
n_hidden_21 = 5
n_hidden_22 = 5
model2 = nn.Sequential(nn.Linear(n_input, n_hidden_21),
nn.Tanh(),
nn.Linear(n_hidden_21, n_hidden_22),
nn.Tanh(),
nn.Linear(n_hidden_22, n_out),
nn.Tanh()
)
print(model2)
Sequential(
  (0): Linear(in_features=1, out_features=5, bias=True)
  (1): Tanh()
  (2): Linear(in_features=5, out_features=5, bias=True)
  (3): Tanh()
  (4): Linear(in_features=5, out_features=1, bias=True)
  (5): Tanh()
)
You can uncomment and execute the next line to explore the methods of the model objects you created:
# dir(model0)
Try the parameters method (note that it is a method, so it needs to be called with parentheses).
model0.parameters()
<generator object Module.parameters at 0x16923d350>
The parameters method gives back a generator, which means it needs to be iterated over to produce its output:
for element in model0.parameters():
print(element.shape)
torch.Size([5, 1])
torch.Size([5])
torch.Size([1, 5])
torch.Size([1])
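These are the weights and biases of the two Linear layers. A quick way to count the total number of trainable parameters (a small sketch):

n_params0 = sum(p.numel() for p in model0.parameters())
print(n_params0)  # 5*1 + 5 + 1*5 + 1 = 16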
Let's have a look at the contents of those tensors:
for element in model0.parameters():
print(element)
Parameter containing:
tensor([[-0.0838],
        [-0.0343],
        [-0.3750],
        [ 0.2300],
        [-0.5721]], requires_grad=True)
Parameter containing:
tensor([-0.1763,  0.3876,  0.9386,  0.2356, -0.3393], requires_grad=True)
Parameter containing:
tensor([[ 0.0429, -0.0501,  0.1825,  0.0512,  0.1752]], requires_grad=True)
Parameter containing:
tensor([0.4337], requires_grad=True)
torch.nn provides many different types of loss functions. One of the most popular ones is the Mean Squared Error (MSE), since it can be applied to a wide variety of cases.

loss_function = nn.MSELoss()
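As a quick check, MSELoss is simply the mean of the squared differences; a tiny sketch with made-up numbers:

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([1.5, 2.0, 2.0])
print(loss_function(a, b), ((a - b) ** 2).mean())  # both give the same value (~0.4167)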
torch.optim provides implementations of various optimization algorithms. The optimizer object holds the current state and updates the parameters of the model based on the computed gradients. It takes as input an iterable containing the model parameters, which we explored before.
batch_size = 200 # how many points to pass to the model at a time (not used in the full-batch training below)
learning_rate = 0.015
optimizer0 = torch.optim.Adam(model0.parameters(), lr=learning_rate)
optimizer1 = torch.optim.Adam(model1.parameters(), lr=learning_rate)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=learning_rate)
# optimizer0 = torch.optim.SGD(model0.parameters(), lr=learning_rate)
# optimizer1 = torch.optim.SGD(model1.parameters(), lr=learning_rate)
# optimizer2 = torch.optim.SGD(model2.parameters(), lr=learning_rate)
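The optimizer keeps its hyperparameters (and references to the parameters it updates) in param_groups; for instance, a small sketch:

print(optimizer0.param_groups[0]['lr'])  # 0.015, the learning rate set above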
The model learns iteratively over a given number of epochs. Each iteration of the loop consists of:
- a forward pass of the input data through the model to obtain the predictions,
- resetting the gradients stored in the optimizer (optimizer.zero_grad()),
- computing the loss between the predictions and the targets,
- backpropagating the loss to obtain the gradients (loss.backward()),
- updating the model parameters (optimizer.step()).
epochs = 1000
losses0 = []
for epoch in range(epochs):
pred_y0 = model0(x_norm)
optimizer0.zero_grad()
loss0 = loss_function(pred_y0, y_norm)
losses0.append(loss0.item())
loss0.backward()
optimizer0.step()
losses1 = []
for epoch in range(epochs):
pred_y1 = model1(x_norm)
optimizer1.zero_grad()
loss1 = loss_function(pred_y1, y_norm)
losses1.append(loss1.item())
loss1.backward()
optimizer1.step()
losses2 = []
for epoch in range(epochs):
pred_y2 = model2(x_norm)
optimizer2.zero_grad()
loss2 = loss_function(pred_y2, y_norm)
losses2.append(loss2.item())
loss2.backward()
optimizer2.step()
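Note that batch_size defined above is not actually used: every epoch passes the full dataset through the model at once. For reference, a mini-batch variant could look like the following sketch (left commented out so it does not further train the models; the shuffling strategy is just one possible choice):

# for epoch in range(epochs):
#     perm = torch.randperm(x_norm.shape[0])
#     for start in range(0, x_norm.shape[0], batch_size):
#         idx = perm[start:start + batch_size]
#         optimizer0.zero_grad()
#         loss = loss_function(model0(x_norm[idx]), y_norm[idx])
#         loss.backward()
#         optimizer0.step()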
plt.plot(losses0, label='Model 0')
plt.plot(losses1, label='Model 1')
plt.plot(losses2, label='Model 2')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.title("Learning rate %f"%(learning_rate))
plt.legend()
plt.show()
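To compare the three models numerically as well, we can look at the final loss values (a small check):

print(losses0[-1], losses1[-1], losses2[-1])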
test_points = 50
x_test = np.random.uniform(0, np.max(x_norm.detach().numpy()), test_points)
x_test_reshape = x_test.reshape((int(len(x_test) / n_input), n_input))
x_test_torch = torch.from_numpy(x_test_reshape)
x_test_torch = x_test_torch.to(dtype=torch.float32)
Now we predict the y-values with our models:
y0_test_torch = model0(x_test_torch)
y1_test_torch = model1(x_test_torch)
y2_test_torch = model2(x_test_torch)
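Side note: since we only run inference here, the forward passes could also be wrapped in torch.no_grad(), which skips building the autograd graph and makes the later .detach() calls unnecessary. Uncomment to try:

# with torch.no_grad():
#     y0_test_torch = model0(x_test_torch)
#     y1_test_torch = model1(x_test_torch)
#     y2_test_torch = model2(x_test_torch)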
plt.plot(x_norm.detach().numpy(), y_norm.detach().numpy())
plt.scatter(x_test_torch.detach().numpy(), y0_test_torch.detach().numpy(), color='red', label='Model 0')
plt.scatter(x_test_torch.detach().numpy(), y1_test_torch.detach().numpy(), color='orange', label='Model 1')
plt.scatter(x_test_torch.detach().numpy(), y2_test_torch.detach().numpy(), color='green', label='Model 2')
plt.legend()
Some ideas:
- Change the number of epochs in Section 5 to 5000 and re-train the models. What happens?
- Change the SEED in the Reproducibility cell at the very top. How do the results change?
- Change the optimizer in Section 4 from Adam to SGD and re-train the models. What happens?
- Change the number of epochs in Section 5 to 1000000. What happens?
- Change the learning rate in Section 4 to 0.05. How do the results change? What does it tell us about our previous value?

# %reset -f