#!/usr/bin/env python
# coding: utf-8

# ## Week 6. Neural Nets and backpropagation

# In this session we will continue the work we started on neural networks. In particular, in this exercise, you will get to code a neural network from scratch and you will be asked to understand and code backpropagation to train the network.

# *(figure: neural network illustration)*
#
# image credit: [https://appen.com](https://appen.com)

# __Exercise 1.__ Let us go back briefly to a simple dataset to make sure we understand how things work. As a first exercise, we will code a one hidden layer neural network that outputs a binary 0/1 variable indicating the class of our data. Throughout this exercise, we will use the notation $z_j^{\ell+1} = \sigma(a_j^{\ell+1})$ to denote the output of neuron $j$ in the $(\ell+1)^{th}$ layer, where $a_j^{\ell+1} = \sum_{k} w_{j,k}^{\ell+1} z_k^{\ell}$ is the weighted combination of the previous layer's outputs that is fed to that neuron.
#
# *(figure: the network architecture used in this exercise)*
#
# image credit: [MaviccPRP@web.studio](https://maviccprp.github.io)

# We will use the function ['minimize'](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html) from scipy. Check the documentation of that function. We will set the 'jac' parameter of the optimizer to 'True', which implies that the (objective) function that we provide as input should return both the value of the loss and the value of its gradient.
#
# For this first exercise, you are asked to write a function $f(W)$ which takes as argument a vector $W$ containing all the parameters of your network (as a first step, consider building a network with only a few hidden units, in order to make sure the model is working). As indicated above, the function should return (1) the value of the binary cross entropy for the given set of weights and (2) the value of its gradient derived through backpropagation (since we set 'jac' to True).
#
# We will split the writing of your function into several steps. Once you have coded each step, you should gather them together in a single $fun(W)$ body that you will then pass as input to the minimize function.

# __Exercise 1.a.__ Start by loading the data using the lines below and plot it using scatter( ).

# In[ ]:


from scipy.optimize import minimize
import scipy.io as sio

data1 = sio.loadmat('pointsClass1Week6.mat')
data2 = sio.loadmat('pointsClass2Week6.mat')

from numpy import linalg as LA

data1 = data1['pointsClass1Week6']
data2 = data2['pointsClass2Week6']

# put your code here


# __Exercise 1.b__ As indicated above, we want to apply the network to the simple binary dataset that you loaded above. We want to build an architecture similar to the one shown above, except that, since we only consider a 2D dataset, we only need 2 inputs. We want all the activation functions to be sigmoids. Start by defining the sigmoid function and the gradient of this function. Once you are done, check your derivative. Also make sure you can compute the entrywise sigmoid of any numpy array, i.e. that your function still works when its input is a numpy array.

# In[ ]:


# put the code here
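# A minimal sketch for Exercises 1.a and 1.b is given below: it plots the two classes and defines an entrywise sigmoid together with its derivative, checked against a finite-difference approximation. It assumes each of data1 and data2 stores one point per row; transpose the arrays if the points are stored as columns.

# In[ ]:


import numpy as np
import matplotlib.pyplot as plt

# Scatter plot of the two classes (assumed stored one point per row)
plt.scatter(data1[:, 0], data1[:, 1], label='class 0')
plt.scatter(data2[:, 0], data2[:, 1], label='class 1')
plt.legend()
plt.show()


def sigmoid(a):
    """Entrywise sigmoid; works on scalars and on numpy arrays."""
    return 1.0 / (1.0 + np.exp(-a))


def sigmoid_prime(a):
    """Derivative of the sigmoid, sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)


# Quick finite-difference check of the derivative
a0 = np.linspace(-3, 3, 7)
eps = 1e-6
fd = (sigmoid(a0 + eps) - sigmoid(a0 - eps)) / (2 * eps)
print(np.max(np.abs(fd - sigmoid_prime(a0))))  # should be tiny (around 1e-10 or less)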
# __Exercise 1.c__ Now that we have the sigmoid and its gradient, we will code the loss. As you might remember from previous labs, the MLPClassifier from scikit-learn optimizes the 'log-loss' function (a.k.a. binary cross entropy), which, for a set of $N$ prototypes $x_i$ with binary $0/1$ targets $y_i$, reads
#
# $$-\frac{1}{N}\sum_{i=1}^N \left(y_i \log(p_{W}(x_i))+(1-y_i)\log(1-p_{W}(x_i))\right)$$
#
# Here the probability $p_{W}(x_i)$ is the output of your network. Instead of minimizing this function directly, we will consider its $\ell_2$ regularized version
#
# $$-\frac{1}{N}\sum_{i=1}^N \left(y_i \log(p_{W}(x_i))+(1-y_i)\log(1-p_{W}(x_i))\right) + \lambda \sum_{j\in\text{weights}\setminus \text{bias}} W^2_j$$
#
# where the $W_j$ encode the weights. Code that function for a labeled dataset $X, t$ such as the one given above and a set of weights stored in a vector $W$ (you can use a list or a dictionary if you want, but ultimately you will need to store the weights in a numpy vector for use with the optimization routine). Note that one typically does not regularize the bias terms.

# In[ ]:


# put your code here


# __Exercise 1.d. Towards Backpropagation.__ The example above is simple enough that backpropagation is not really needed. Compute the gradient of the binary cross entropy loss with respect to the weights directly.

# In[ ]:


# put your code here


# __Exercise 1.e.__ Combine all your previous steps into a single function and pass this function to the 'minimize' routine. To choose the initial values of the weights, a common heuristic is to initialize the weights of layer $\ell$ uniformly at random in the interval $[-\varepsilon_\ell, \varepsilon_\ell]$, with $\varepsilon_\ell$ defined as $\sqrt{\frac{2}{\text{size layer}_{\ell-1} + \text{size layer}_{\ell}}}$.

# In[ ]:


# put the body of your function here
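# As a possible point of comparison for Exercises 1.c–1.e, here is a hedged sketch (certainly not the only valid organization) of a complete $fun(W)$ for a single hidden layer: the weights and biases are packed into one numpy vector, the function returns both the regularized binary cross entropy and its gradient computed directly as in Exercise 1.d, and the result is passed to 'minimize' with jac=True. It reuses the sigmoid defined above and the data1/data2 arrays from Exercise 1.a (assumed to store one point per row); the helper names (unpack, n_hidden, lam) and the regularization strength are choices made for the sketch.

# In[ ]:


import numpy as np
from scipy.optimize import minimize

# Stack the two classes and build the 0/1 targets (assuming one point per row)
X = np.vstack((data1, data2))
t = np.hstack((np.zeros(len(data1)), np.ones(len(data2))))

n_in, n_hidden, n_out = 2, 5, 1
lam = 1e-3  # regularization strength (hypothetical value, tune as needed)


def unpack(W):
    """Split the flat parameter vector into (W1, b1, W2, b2)."""
    i = 0
    W1 = W[i:i + n_hidden * n_in].reshape(n_hidden, n_in); i += n_hidden * n_in
    b1 = W[i:i + n_hidden]; i += n_hidden
    W2 = W[i:i + n_out * n_hidden].reshape(n_out, n_hidden); i += n_out * n_hidden
    b2 = W[i:i + n_out]
    return W1, b1, W2, b2


def fun(W):
    """Regularized binary cross entropy and its gradient for one hidden layer."""
    W1, b1, W2, b2 = unpack(W)
    N = X.shape[0]

    # Forward pass
    A1 = X @ W1.T + b1            # (N, n_hidden)
    Z1 = sigmoid(A1)
    A2 = Z1 @ W2.T + b2           # (N, 1)
    P = sigmoid(A2)[:, 0]         # network output p_W(x_i)

    eps = 1e-12                   # avoid log(0)
    loss = -np.mean(t * np.log(P + eps) + (1 - t) * np.log(1 - P + eps))
    loss += lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))   # no regularization on the biases

    # Gradient, computed directly (Exercise 1.d)
    dA2 = ((P - t) / N)[:, None]                    # (N, 1)
    gW2 = dA2.T @ Z1 + 2 * lam * W2                 # (1, n_hidden)
    gb2 = dA2.sum(axis=0)
    dA1 = (dA2 @ W2) * Z1 * (1 - Z1)                # (N, n_hidden)
    gW1 = dA1.T @ X + 2 * lam * W1                  # (n_hidden, n_in)
    gb1 = dA1.sum(axis=0)

    grad = np.concatenate([gW1.ravel(), gb1, gW2.ravel(), gb2])
    return loss, grad


# Initialization heuristic from Exercise 1.e
rng = np.random.default_rng(0)
eps1 = np.sqrt(2.0 / (n_in + n_hidden))
eps2 = np.sqrt(2.0 / (n_hidden + n_out))
W0 = np.concatenate([
    rng.uniform(-eps1, eps1, n_hidden * n_in), np.zeros(n_hidden),
    rng.uniform(-eps2, eps2, n_out * n_hidden), np.zeros(n_out),
])

res = minimize(fun, W0, jac=True)
print(res.fun)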
# __Bonus 1__ Change the activation to the ReLU and repeat the steps above. How do you compute the gradient in this case?

# __Exercise 2.__ Coding a one hidden layer neural network is good to warm up, but to really understand backpropagation we will now add a few hidden layers. Still relying on the code that you developed above and using backpropagation, train a depth 4 neural network with 10 neurons in each layer on the binary dataset.

# __Exercise 2a.__ In this exercise, because of the multiple layers of the network, you will get to really code backpropagation. To do this, follow the steps below (see [Bishop, Pattern Recognition and Machine Learning](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) for more details).
#
# - Take any sample $x_n$ (once you have completed one step of backprop, you should re-apply it to the next sample to get the gradient contribution of that next sample, and so on). Forward propagate that sample through the network and compute the input $a_j$ and the activation $z_j = \sigma(a_j)$ of every unit.
#
# - Evaluate $\delta_k = y_k(x_n) - t_{n,k}$ for all the output units.
#
# - Backpropagate the $\delta_k$ using the formula
#
# $$\delta_j = \sigma'(a_j) \sum_{k} w_{k,j} \delta_k$$
#
# - Finally, compute the derivatives as
#
# $$\frac{\partial \text{loss}}{\partial w_{ij}} = \delta_j z_i$$
#
# Note that when coding the backpropagation algorithm, you don't need to account for the regularization part of the loss: you can simply add the gradient of the regularization term to the result of the backpropagation algorithm. (A vectorized sketch of this procedure, combined with the multiclass output of Exercise 3, is given at the end of the notebook.)

# In[ ]:


from scipy.optimize import minimize

# put your code here


# __Exercise 3.__ Extending to multiple classes. Load the data using the lines given below and visualize it with scatter. How can you extend the binary cross entropy function to the multiclass problem?

# In[ ]:


import scipy.io as sio

data1 = sio.loadmat('xWeek6Ex3PointsClass1.mat')
data2 = sio.loadmat('xWeek6Ex3PointsClass2.mat')
data3 = sio.loadmat('xWeek6Ex3PointsClass3.mat')

from numpy import linalg as LA

data1 = data1['xWeek6Ex3PointsClass1']
data2 = data2['xWeek6Ex3PointsClass2']
data3 = data3['xWeek6Ex3PointsClass3']

# put your code here
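# To close the loop on Exercises 2 and 3 together, here is a hedged, vectorized sketch of backpropagation for a deeper fully connected network, with sigmoid hidden layers and a softmax output trained on the categorical cross entropy (the natural multiclass extension of the binary cross entropy). It is applied to the three-class data loaded just above (assumed to store one point per row) and reuses the sigmoid from Exercise 1.b; for Exercise 2 on the binary dataset you would instead keep a single sigmoid output and the binary cross entropy. The layer sizes, regularization strength and helper names are choices made for the sketch.

# In[ ]:


import numpy as np
from scipy.optimize import minimize

# One-hot targets for the three classes (assuming one point per row in each array)
X3 = np.vstack((data1, data2, data3))
labels = np.concatenate([np.full(len(d), k) for k, d in enumerate((data1, data2, data3))])
T = np.eye(3)[labels]                      # (N, 3) one-hot target matrix

layer_sizes = [2, 10, 10, 10, 3]           # 2 inputs, three hidden layers of 10 units, 3 outputs
lam = 1e-3                                 # regularization strength (hypothetical value)

shapes = [(n_out, n_in) for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]


def unpack_all(W):
    """Split the flat vector into a list of (weight matrix, bias vector) pairs."""
    params, i = [], 0
    for (n_out, n_in) in shapes:
        Wl = W[i:i + n_out * n_in].reshape(n_out, n_in); i += n_out * n_in
        bl = W[i:i + n_out]; i += n_out
        params.append((Wl, bl))
    return params


def softmax(A):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)


def fun_multi(W):
    """Categorical cross entropy + L2 penalty, gradient computed by backpropagation."""
    params = unpack_all(W)
    N = X3.shape[0]

    # Forward pass: sigmoid hidden layers, softmax output; keep all layer outputs
    Z = [X3]
    for l, (Wl, bl) in enumerate(params):
        A = Z[-1] @ Wl.T + bl
        Z.append(softmax(A) if l == len(params) - 1 else sigmoid(A))
    P = Z[-1]

    loss = -np.mean(np.sum(T * np.log(P + 1e-12), axis=1))
    loss += lam * sum(np.sum(Wl ** 2) for Wl, _ in params)

    # Backward pass: delta at the output, then delta_j = sigma'(a_j) sum_k w_kj delta_k
    grads = []
    delta = (P - T) / N                              # softmax + cross entropy at the output
    for l in range(len(params) - 1, -1, -1):
        Wl, _ = params[l]
        gW = delta.T @ Z[l] + 2 * lam * Wl           # dloss/dw_ij = delta_j z_i (+ regularization)
        gb = delta.sum(axis=0)
        grads.append((gW, gb))
        if l > 0:
            delta = (delta @ Wl) * Z[l] * (1 - Z[l]) # backpropagate through the sigmoid
    grads = grads[::-1]

    grad = np.concatenate([np.concatenate([gW.ravel(), gb]) for gW, gb in grads])
    return loss, grad


# Initialization heuristic from Exercise 1.e, applied layer by layer
rng = np.random.default_rng(0)
W0 = np.concatenate([
    np.concatenate([rng.uniform(-np.sqrt(2.0 / (n_in + n_out)),
                                np.sqrt(2.0 / (n_in + n_out)), n_out * n_in),
                    np.zeros(n_out)])
    for (n_out, n_in) in shapes
])

res = minimize(fun_multi, W0, jac=True)
print(res.fun)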