import numpy as np
import matplotlib.pyplot as plt
Disclaimer
The book mistakenly refers to the page for Exercise 4.2 when introducing Exercise 4.1, etc. Of course, these numbers should match: Book Exercise 4.1 is discussed under Exercise 4.1.
Simple Network
We continue with the dataset first encountered in Exercise 3.2. Please refer to the discussion there for an introduction to the data and the learning objective.
Here, we manually implement a simple network architecture.
# The code snippet below is responsible for downloading the dataset
# - for example when running via Google Colab.
#
# You can also directly download the file using the link if you work
# with a local setup (in that case, ignore the !wget)
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
# Before working with the data,
# we download and prepare all features
# load all examples from the file
data = np.genfromtxt('winequality-white.csv',delimiter=";",skip_header=1)
print("data:", data.shape)
# Prepare for proper training
np.random.shuffle(data) # randomly sort examples
# take the first 3000 examples for training
# (remember array slicing from last week)
X_train = data[:3000,:11] # all features except last column
y_train = data[:3000,11] # quality column
# and the remaining examples for testing
X_test = data[3000:,:11] # all features except last column
y_test = data[3000:,11] # quality column
print("First example:")
print("Features:", X_train[0])
print("Quality:", y_train[0])
data: (4898, 12) First example: Features: [5.900e+00 3.400e-01 2.200e-01 2.400e+00 3.000e-02 1.900e+01 1.350e+02 9.894e-01 3.410e+00 7.800e-01 1.390e+01] Quality: 7.0
The goal is to implement the training of a neural network with one input layer, one hidden layer, and one output layer using gradient descent. We first (below) define the matrices and initialise them with random values. We need W, b, W' and b'. The shapes will be:
W: (hidden_nodes, n_inputs)
b: (hidden_nodes,)
Wp: (hidden_nodes,)
bp: (1,)
Your tasks are:
implement the forward pass in the function dnn (see below)
implement the weight updates in the function update_weights (see the skeleton below)
# Initialise weights with suitable random distributions
hidden_nodes = 50 # number of nodes in the hidden layer
n_inputs = 11 # input features in the dataset
# See section 4.3 of the book for more information on
# how to initialise network parameters
W = np.random.randn(hidden_nodes,n_inputs)*np.sqrt(2./n_inputs)
b = np.random.randn(hidden_nodes)*np.sqrt(2./n_inputs)
Wp = np.random.randn(hidden_nodes)*np.sqrt(2./hidden_nodes)
bp = np.random.randn(1)
print(W.shape)
(50, 11)
# You can use this implementation of the ReLU activation function
def relu(x):
return np.maximum(x, 0)
def dnn(x,W,b,Wp,bp):
# TODO Calculate and return network output of forward pass
# See Hint 1 for additional information
return -1 # change to the calculated output
def update_weights(x,y, W, b, Wp, bp):
learning_rate = 0.01
# TODO: Calculate the network output (use the function dnn defined above)
# TODO: Derive the gradient for each of W,b,Wp,bp by taking the partial
# derivative of the loss function with respect to the variable and
# then implement the resulting weight-update procedure
# See Hint 2 for additional information
# You might need these numpy functions:
# np.dot, np.outer, np.heaviside
# Hint: Use .shape and print statements to make sure all operations
# do what you want them to
# TODO: Update the weights/bias following the rule: weight_new = weight_old - learning_rate * gradient
return -1 # no return value needed, you can modify the weights in-place
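Since the hints mention np.heaviside, it is worth seeing how np.maximum and np.heaviside act element-wise before using them: np.maximum(x, 0) is exactly the ReLU above, and np.heaviside(x, 0.0) is its derivative (almost everywhere). A quick sanity check:

```python
import numpy as np

x = np.array([-2.0, 0.0, 3.0])
# element-wise ReLU: negative entries are clipped to zero
print(np.maximum(x, 0))       # [0. 0. 3.]
# element-wise Heaviside step; the second argument fixes the value at x == 0
print(np.heaviside(x, 0.0))   # [0. 0. 1.]
```

The second argument of np.heaviside sets the (conventionally arbitrary) value at exactly zero; 0.0 is a common choice for the ReLU derivative.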
# The code below implements the training.
# If you correctly implement dnn and update_weights above,
# you should not need to change anything below.
# (apart from increasing the number of epochs)
train_losses = []
test_losses = []
# How many epochs to train
# This will just train for one epoch
# You will want a higher number once everything works
n_epochs = 1
# Loop over the epochs
for ep in range(n_epochs):
# Each epoch is a complete pass over the training data
for i in range(X_train.shape[0]):
# pick one example
x = X_train[i]
y = y_train[i]
# use it to update the weights
update_weights(x,y,W,b,Wp,bp)
# Calculate predictions for the full training and testing sample
y_pred_train = [dnn(x,W,b,Wp,bp)[0] for x in X_train]
y_pred = [dnn(x,W,b,Wp,bp)[0] for x in X_test]
# Calculate the average loss per example over the epoch
train_loss = sum((y_pred_train-y_train)**2) / y_train.shape[0]
test_loss = sum((y_pred-y_test)**2) / y_test.shape[0]
# print some information
print("Epoch:",ep, "Train Loss:", train_loss, "Test Loss:", test_loss)
# and store the losses for later use
train_losses.append(train_loss)
test_losses.append(test_loss)
# After the training:
# Prepare scatter plot
y_pred = [dnn(x,W,b,Wp,bp)[0] for x in X_test]
print("Best loss:", min(test_losses), "Final loss:", test_losses[-1])
print("Correlation coefficient:", np.corrcoef(y_pred,y_test)[0,1])
plt.scatter(y_pred_train,y_train)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
# Plot the loss over time
plt.plot(train_losses,label="train")
plt.plot(test_losses,label="test")
plt.legend()
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()
We want a network with one hidden layer. As activation $\sigma$ in the hidden layer we apply the element-wise ReLU, while no activation is used for the output layer. The forward pass of the network then reads: $$\hat{y}=\mathbf{W}^{\prime} \sigma(\mathbf{W} \vec{x}+\vec{b})+b^{\prime}$$
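The forward pass above translates almost line by line into NumPy. The following is a possible solution sketch (dnn_sketch is an illustrative name; the array shapes follow the initialisation cell, i.e. W is (hidden_nodes, n_inputs) and bp has shape (1,)):

```python
import numpy as np

def dnn_sketch(x, W, b, Wp, bp):
    # hidden layer: element-wise ReLU applied to the affine map W x + b
    hidden = np.maximum(np.dot(W, x) + b, 0)
    # linear output layer: W' . hidden + b'
    return np.dot(Wp, hidden) + bp  # shape (1,), so dnn(...)[0] works
```

Returning an array of shape (1,) rather than a bare scalar matches the `dnn(x,W,b,Wp,bp)[0]` indexing used in the training loop.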
For the regression problem the objective function is the mean squared error between the prediction and the true label $y$: $$ L=(\hat{y}-y)^{2} $$
Taking the partial derivatives - and diligently applying the chain rule - with respect to the different objects yields:
$$ \begin{aligned} \frac{\partial L}{\partial b^{\prime}} &=2(\hat{y}-y) \\ \frac{\partial L}{\partial b_{k}} &=2(\hat{y}-y) \mathbf{W}_{k}^{\prime} \theta\left(\sum_{i} \mathbf{W}_{i k} x_{i}+b_{k}\right) \\ \frac{\partial L}{\partial \mathbf{W}_{k}^{\prime}} &=2(\hat{y}-y) \sigma\left(\sum_{i} \mathbf{W}_{i k} x_{i}+b_{k}\right) \\ \frac{\partial L}{\partial \mathbf{W}_{k m}} &=2(\hat{y}-y) \mathbf{W}_{m}^{\prime} \theta\left(\sum_{i} \mathbf{W}_{i m} x_{i}+b_{m}\right) x_{k} \end{aligned} $$Here, $\theta$ denotes the Heaviside step function.
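These four gradients can be turned into the gradient-descent update rule weight_new = weight_old - learning_rate * gradient. The following is one possible sketch (update_weights_sketch is an illustrative name; note that the code stores W as (hidden_nodes, n_inputs), i.e. transposed relative to the $\mathbf{W}_{ik}$ indexing in the equations):

```python
import numpy as np

def update_weights_sketch(x, y, W, b, Wp, bp, learning_rate=0.01):
    # forward pass, keeping the pre-activation z for the backward pass
    z = np.dot(W, x) + b             # shape (hidden_nodes,)
    hidden = np.maximum(z, 0)        # ReLU activations
    y_hat = np.dot(Wp, hidden) + bp  # shape (1,)

    # gradients from the equations above
    delta = 2.0 * (y_hat - y)                 # dL/d(y_hat)
    grad_bp = delta
    grad_Wp = delta * hidden
    grad_b = delta * Wp * np.heaviside(z, 0.0)
    grad_W = np.outer(grad_b, x)              # shape (hidden_nodes, n_inputs)

    # gradient-descent step, modifying the arrays in place
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b
    Wp -= learning_rate * grad_Wp
    bp -= learning_rate * grad_bp
```

Because the updates use in-place operations on the NumPy arrays, no return value is needed, matching the skeleton above.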