#!/usr/bin/env python
# coding: utf-8

# # Deep Learning methods and Neural Networks
#
# **Universal approximation theorem**: A continuous function can be approximated to arbitrary accuracy by a neural network with at least one hidden layer containing a finite number of neurons. The non-linear/activation function can be a sigmoid (fermi) function [Cybenko 1989], or more generally any non-polynomial bounded activation function [Leshno 1993, Pinkus 1999].
#
# The multilayer architecture is what gives neural networks the potential of being universal approximators.
#
# Given a function $y=F(x)$ with $x\in [0,1]^d$, a non-linear bounded activation function $f(z)$, and a chosen accuracy $\epsilon>0$, there is a one-hidden-layer NN with weights $w\in \mathbb{R}^{d\times n}$ and biases $b\in \mathbb{R}^n$, with $z_j=\sum_i w_{ij} x_i + b_j$, such that $|\sum_j w^{(2)}_{j} f(z_j)+b^{(2)}-F(x)|<\epsilon$ for all $x$.

# Conceptually, it is helpful to divide neural networks into four
# categories:
# 1. general purpose neural networks for supervised learning,
#
# 2. neural networks designed specifically for image processing, the most prominent example of this class being Convolutional Neural Networks (CNNs),
#
# 3. neural networks for sequential data such as Recurrent Neural Networks (RNNs), and
#
# 4. neural networks for unsupervised learning such as Deep Boltzmann Machines.
#
# In the natural sciences, DNNs and CNNs have already found numerous
# applications. In statistical physics, they have been applied to detect
# phase transitions in 2D Ising and Potts models, lattice gauge
# theories, and different phases of polymers, or to solve the
# Navier-Stokes equation in weather forecasting. Deep learning has also
# found interesting applications in quantum physics. Various quantum
# phase transitions can be detected and studied using DNNs and CNNs,
# including topological phases and even non-equilibrium many-body
# localization. Representing quantum states as DNNs and performing quantum state
# tomography are among the impressive achievements that reveal the
# potential of DNNs to facilitate the study of quantum systems.
#
#
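# As a minimal numerical illustration of the universal approximation theorem stated above
# (an added sketch, not part of the original notes), we can fit $F(x)=\sin(2\pi x)$ on
# $[0,1]$ with a single hidden layer using scikit-learn's MLPRegressor; the network size
# and solver below are arbitrary choices.

# In[ ]:

from numpy import linspace, sin, pi
from sklearn.neural_network import MLPRegressor

x = linspace(0, 1, 200).reshape(-1, 1)
F = sin(2*pi*x).ravel()
# one hidden layer with a finite number of neurons and a bounded nonlinearity
nn = MLPRegressor(hidden_layer_sizes=(50,), activation='tanh',
                  solver='lbfgs', max_iter=5000, random_state=0)
nn.fit(x, F)
print('max |NN(x)-F(x)| =', abs(nn.predict(x) - F).max())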
Figure: Sketch of the neural network. The input layer ($x_i=a_i^{(0)}$) is on the left and the output layer ($a_i^{(L)}$) on the right; the latter is compared through the cost function with the target $t_i$. The layers between $0$ and $L$ are called hidden layers and increase the flexibility of the network. $f(z)$ is a nonlinear activation function.
#
# An artificial neural network (ANN) is a computational model that
# consists of layers of connected neurons (sometimes called nodes or units).
#
# The equations are sketched in the figure, and in matrix form read:
# \begin{eqnarray}
# \textbf{z}^l &=& \textbf{a}^{l-1} \textbf{w}^l + \textbf{b}^l\\
# a_i^l &=& f(z_i^l)
# \end{eqnarray}
# Here $l$ in $a_i^l,z_i^l$ stands for the layer $l$. Using the input $a^{l-1}$ from layer $l-1$ we get the output $z^l$, which is then passed through a non-linear activation function $f$ to obtain $a^l$. This in turn allows one to calculate the next layer. Note that we used many yet to be determined weights $w$ and biases $b$, which are determined so that they best fit the known data, i.e., for input $x_i$ they give as good an approximation to the target $t_i$ as possible.
#
#
# We start with the input $x_i$, which defines the first layer $a^{0}$, and we end with the output layer $a^L$, which delivers the output and is needed to evaluate the cost function:
# \begin{eqnarray}
# a_i^{0}&\equiv& x_i\\
# C(\{w,b\})&=&\frac{1}{2}\sum_i (a_i^L-t_i)^2
# \end{eqnarray}
# The target $t$ is the known data we train on, which was called $y$ in linear regression. To compare with linear regression, the output layer $a_i^L$ plays the role of $\widetilde{y}$.
#
#
# A NN is supposed to mimic a biological nervous system by letting each
# neuron interact with other neurons by sending signals in the form of
# mathematical functions between layers. A wide variety of different
# ANNs have been developed, but most of them consist of an input layer,
# an output layer and possibly layers in-between, called *hidden
# layers*. All layers can contain an arbitrary number of nodes, and each
# connection between two nodes is associated with a weight $w_{ij}$, while each node carries a bias $b_i$.
#
#
# Without the nonlinear activation function the NN would be equivalent to linear regression (convince yourself). The added nonlinearity through the activation function $f$ is thus crucial for the success of a NN. Many choices of activation functions are in use; we mention just a few (a short numpy sketch of these functions is given below):
# * sigmoid (fermi) $f(z)=1/(e^{-z}+1)$
# * rectified linear unit (Relu) $f(z)=\max(0,z)$
# * $\tanh(z)$, which is related to the fermi function by $\tanh(z/2)=f(z)-f(-z)$
# * exponential linear unit (Elu): $f(z)=\alpha(e^{z}-1)$ if $z<0$, else $f(z)=z$, with $\alpha>0$ a parameter
# * leaky Relu: $f(z)=\alpha z$ if $z<0$, else $f(z)=z$, with $\alpha\ll 1$

# ### Simple example: OR and XOR gate
#
# As we will show, the OR gate can easily be fit with linear regression; the XOR gate, however, cannot, and requires at least one hidden layer.
#

#Figure: OR and XOR gate with a line that can or cannot describe it.
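# Before turning to the truth tables, here is a short numpy sketch (added for
# illustration, not part of the original notes) of the activation functions listed above.

# In[ ]:

from numpy import exp, tanh, where, linspace

def sigmoid(z):                 # fermi function
    return 1/(exp(-z) + 1)

def relu(z):                    # rectified linear unit
    return where(z < 0, 0.0, z)

def elu(z, alpha=1.0):          # exponential linear unit
    return where(z < 0, alpha*(exp(z) - 1), z)

def leaky_relu(z, alpha=0.01):  # leaky ReLU, alpha << 1
    return where(z < 0, alpha*z, z)

z = linspace(-2, 2, 5)
print(sigmoid(z), tanh(z), relu(z), elu(z), leaky_relu(z), sep='\n')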
#
# The OR gate
# \begin{equation}
# \begin{array}{c|c|c}
# x_1 & x_2 & t\\
# \hline
# 0 & 0 & 0\\
# 0 & 1 & 1\\
# 1 & 0 & 1\\
# 1 & 1 & 1
# \end{array}
# \end{equation}
# and the XOR gate
# \begin{equation}
# \begin{array}{c|c|c}
# x_1 & x_2 & t\\
# \hline
# 0 & 0 & 0\\
# 0 & 1 & 1\\
# 1 & 0 & 1\\
# 1 & 1 & 0
# \end{array}
# \end{equation}
#
# Let's try linear regression. The design matrix should contain a constant and the linear terms, i.e., each row is $[1,x_1,x_2]$, which gives
#
# \begin{equation}
# X=\begin{bmatrix}
# 1& 0 & 0 \\
# 1& 0 & 1 \\
# 1& 1 & 0 \\
# 1& 1 & 1
# \end{bmatrix}
# \end{equation}
# and linear regression gives $\widetilde{y} = X \beta = X (X^T X)^{-1} X^T y$.
#
# It is easy to check that for $y^T_{OR}=[0,1,1,1]$ we get
# $X(X^T X)^{-1} X^T y_{OR} =[1/4,3/4,3/4,5/4]$, while for $y^T_{XOR}=[0,1,1,0]$ we get $X(X^T X)^{-1} X^T y_{XOR}=[1/2,1/2,1/2,1/2]$. If we interpret $\widetilde{y}_i<1/2$ as 0 and $\widetilde{y}_i>1/2$ as 1, we reproduce the OR gate, but clearly fail at XOR, where all predictions are exactly $1/2$.
#
# As we will show below, one hidden layer can easily give the XOR gate. A small technicality first: in linear regression we wanted a constant in the fit, hence each row of $X$ started with unity (to allow a constant $\beta_0$). In a NN the constant is added explicitly through the bias (see equations above), hence $X$ does not need the column of ones, and each row is just $[x_1,x_2]$. More precisely,
# \begin{equation}
# X=\begin{bmatrix}
# 0 & 0 \\
# 0 & 1 \\
# 1 & 0 \\
# 1 & 1
# \end{bmatrix}
# \end{equation}
# For the activation function $f(z)$ we will choose **Relu**: $f(z)=\max(z,0)$.
#
# We will choose two neurons in the hidden layer, hence $\textbf{w}^h$ is a $2\times 2$ matrix and $\textbf{b}^h$ is a two-component vector, in terms of which $\textbf{z}^h=\textbf{X}\textbf{w}^h+\textbf{b}^h$, $\textbf{a}^h=f(\textbf{z}^h)$, and the output is $\textbf{y}\equiv \textbf{a}^o=\textbf{a}^{h}\textbf{w}^o+\textbf{b}^o$.
#
# The minimization would give, for example, the following weights
# \begin{eqnarray}
# && \textbf{w}^h=\begin{bmatrix}
# 1 & 1 \\
# 1 & 1
# \end{bmatrix}\\
# && \textbf{b}^h=\begin{bmatrix}
# 0 & -1
# \end{bmatrix}\\
# && \textbf{w}^o=\begin{bmatrix}
# 1 \\
# -2
# \end{bmatrix}\\
# && \textbf{b}^o=0
# \end{eqnarray}
#
# which means that
# \begin{eqnarray}
# \textbf{z}^h=
# \begin{bmatrix}
# 0 & -1 \\
# 1 & 0\\
# 1 & 0\\
# 2 & 1
# \end{bmatrix}\\
# \textbf{a}^h=
# \begin{bmatrix}
# 0 & 0 \\
# 1 & 0\\
# 1 & 0\\
# 2 & 1
# \end{bmatrix}
# \end{eqnarray}
# and finally
# \begin{eqnarray}
# \textbf{a}^h \textbf{w}^o=
# \begin{bmatrix}
# 0 \\ 1 \\ 1 \\ 0
# \end{bmatrix}
# \end{eqnarray}
# which is identical to the target $t$ for the XOR gate, and concludes our example (a short numerical check of these weights is given right after the list of steps below).

# To solve the NN problem we usually distinguish the following steps:
# 0) Randomly initialize weights and biases.
#
# 1) The feed-forward stage, which calculates all $\textbf{a}^l$ and $\textbf{z}^l$, including the output $\textbf{a}^L$ to be used in the cost function, and compares it with the target $\textbf{t}$.
#
# 2) The back-propagation stage follows, in which one calculates the gradients with respect to the weights $\textbf{w}$ and biases $\textbf{b}$ ($\partial C/\partial \textbf{w}$ and $\partial C/\partial \textbf{b}$). Using a minimization algorithm (possibly a stochastic one), we move towards a minimum, which is hopefully close or equal to the global minimum.
#
# 3) We repeat steps (1) and (2) until the error of the cost function is acceptable and the model is trained well enough.
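# Returning to the XOR example, here is a quick numerical check (an added sketch, not
# part of the original notes) that the weights quoted above indeed reproduce the XOR
# gate with the ReLU activation.

# In[ ]:

from numpy import array, maximum, matmul

X_xor = array([[0,0],[0,1],[1,0],[1,1]])
w_h   = array([[1,1],[1,1]])
b_h   = array([0,-1])
w_o   = array([[1],[-2]])
b_o   = 0

z_h = matmul(X_xor, w_h) + b_h   # hidden-layer input
a_h = maximum(z_h, 0)            # ReLU activation
y   = matmul(a_h, w_o) + b_o     # output layer
print(y.ravel())                 # [0 1 1 0], the XOR target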
# ## Back propagation and automatic differentiation
#
# It is convenient to differentiate from the end of the NN towards the start, hence we call this back propagation. We start by differentiating the cost function
# $$C(\{w,b\})=\frac{1}{2}\sum_i (a_i^L-t_i)^2,$$ which gives
# \begin{eqnarray}
# &&\frac{\partial C}{\partial w_{kj}^L}=\sum_i (a_i^L-t_i) \frac{\partial a_i^L}{\partial w_{kj}^L}\\
# &&\frac{\partial C}{\partial b_{j}^L}=\sum_i (a_i^L-t_i) \frac{\partial a_i^L}{\partial b_{j}^L}
# \end{eqnarray}
# Because $a_i^L=f(z_i^L)$, we have
# \begin{eqnarray}
# &&\frac{\partial C}{\partial w_{kj}^L}=\sum_i (a_i^L-t_i) f'(z_i^L) \frac{\partial z_i^L}{\partial w_{kj}^L}
# \\
# &&\frac{\partial C}{\partial b_{j}^L}=\sum_i (a_i^L-t_i) f'(z_i^L) \frac{\partial z_i^L}{\partial b_{j}^L}
# \end{eqnarray}
# Finally $z_i^L = \sum_j a_j^{L-1} w_{ji}^L+b_i^L$, hence
# \begin{eqnarray}
# &&\frac{\partial z_i^L}{\partial w_{kj}^L}=a_k^{L-1} \delta_{ij}\\
# &&\frac{\partial z_i^L}{\partial b_{j}^L} = \delta_{ij}
# \end{eqnarray}
# which finally gives
# \begin{eqnarray}
# &&\frac{\partial C}{\partial w_{kj}^L}=(a_j^L-t_j) f'(z_j^L) a_k^{L-1}
# \\
# &&\frac{\partial C}{\partial b_{j}^L}=(a_j^L-t_j) f'(z_j^L)
# \end{eqnarray}
# Next we define the quantity $$\delta_j^L\equiv (a_j^L-t_j) f'(z_j^L)$$ in terms of which we can express
# \begin{eqnarray}
# &&\frac{\partial C}{\partial w_{kj}^L}=\delta_j^L a_k^{L-1}
# \\
# &&\frac{\partial C}{\partial b_{j}^L}=\delta_j^L
# \end{eqnarray}
# Note that $\delta_j^L$ can also be viewed as $$\delta_j^L=\frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial z_j^L}=\frac{\partial C}{\partial z_j^L}$$
#
#
# We then proceed to the previous layer, and obtain
# \begin{eqnarray}
# \frac{\partial C}{\partial w_{kj}^{L-1}}=\sum_{i,n} \frac{\partial C}{\partial a_n^{L}}\frac{\partial a_n^L}{\partial z_n^L}\frac{\partial z_n^L}{\partial a_i^{L-1}}\frac{\partial a_i^{L-1}}{\partial z_i^{L-1}}
# \frac{\partial z_i^{L-1}}{\partial w_{kj}^{L-1}}
# \end{eqnarray}
# We then note that $$\frac{\partial C}{\partial a_n^{L}}\frac{\partial a_n^L}{\partial z_n^L}=\delta^L_n$$
# and because
# $z_n^L=\sum_i a_i^{L-1} w^L_{in} + b_n^L$ we have
# $$\frac{\partial z_n^L}{\partial a_i^{L-1}}=w_{in}^L$$
# furthermore
# $$\frac{\partial a_i^{L-1}}{\partial z_i^{L-1}}=f'(z_i^{L-1})$$
# and further
# $z_i^{L-1}=\sum_k a_k^{L-2} w^{L-1}_{ki} + b_i^{L-1}$ so that
# $$\frac{\partial z_i^{L-1}}{\partial w_{kj}^{L-1}}=a_k^{L-2}\delta_{ij}$$
# Collecting all of that leads to
# $$
# \frac{\partial C}{\partial w_{kj}^{L-1}}=\sum_{i,n}\delta_n^L w_{in}^L f'(z_i^{L-1})\delta_{ij}a_k^{L-2}=
# \sum_n\delta_n^L w_{jn}^L f'(z_j^{L-1})a_k^{L-2}
# $$
# Now we require that
# \begin{eqnarray}
# \frac{\partial C}{\partial w_{kj}^{l}}=\delta_j^l a_k^{l-1}
# \end{eqnarray}
# which gives us the following expression
# \begin{eqnarray}
# \delta_j^{L-1}=\sum_n\delta_n^L w_{jn}^L f'(z_j^{L-1})
# \end{eqnarray}
# We can verify that this equation connects every layer with the previous layer, i.e., this equation is valid for every $l$, not just $L-1$.
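# The recursion just derived (together with the analogous result for the biases, shown
# right below) can be implemented for a generic stack of layers. The following is an
# added illustrative sketch with a sigmoid activation; it is not used in the MNIST
# example later.

# In[ ]:

from numpy import exp, matmul
from numpy.random import randn, seed

def f(z):                      # sigmoid activation
    return 1/(exp(-z) + 1)

def fprime(z):                 # f'(z) = f(z)(1 - f(z))
    s = f(z)
    return s*(1 - s)

def forward(X, Ws, bs):
    "Feed forward; returns the lists of z^l and a^l, with a^0 = X."
    a, zs, As = X, [], [X]
    for W, b in zip(Ws, bs):
        z = matmul(a, W) + b
        a = f(z)
        zs.append(z)
        As.append(a)
    return zs, As

def backward(zs, As, t, Ws):
    "Gradients dC/dW^l and dC/db^l from delta^L = (a^L - t) f'(z^L)."
    delta = (As[-1] - t) * fprime(zs[-1])        # starting condition
    dWs, dbs = [], []
    for l in range(len(Ws)-1, -1, -1):
        dWs.insert(0, matmul(As[l].T, delta))    # dC/dW^l = a^{l-1,T} delta^l
        dbs.insert(0, delta.sum(axis=0))         # dC/db^l = delta^l
        if l > 0:
            delta = matmul(delta, Ws[l].T) * fprime(zs[l-1])  # recursion
    return dWs, dbs

# tiny usage example on random data, just to check the shapes
seed(0)
Xd, td = randn(5, 3), randn(5, 2)
Ws, bs = [randn(3, 4), randn(4, 2)], [randn(4), randn(2)]
zs, As = forward(Xd, Ws, bs)
dWs, dbs = backward(zs, As, td, Ws)
print([d.shape for d in dWs], [d.shape for d in dbs])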
# Similarly we can show that the derivative with respect to $b$ has the same form, namely,
# \begin{eqnarray}
# \frac{\partial C}{\partial b_{j}^{l}}=\delta_j^l
# \end{eqnarray}
#
# In conclusion, we just showed that automatic differentiation in back propagation leads to the following set of equations
#
# \begin{eqnarray}
# \frac{\partial C}{\partial w_{kj}^{l}}&=&\delta_j^l a_k^{l-1}\\
# \frac{\partial C}{\partial b_{j}^{l}}&=&\delta_j^l
# \end{eqnarray}
# in which all the $\delta_j^l$ can be obtained from the recursion relation
# \begin{eqnarray}
# \delta_j^{l}&=&\sum_n\delta_n^{l+1} w_{jn}^{l+1} f'(z_j^{l})
# \end{eqnarray}
# and the starting condition
# \begin{eqnarray}
# \delta_j^{L}&=&(a_j^L-t_j) f'(z_j^L)
# \end{eqnarray}

# ### Final algorithm
# 1) Initialize all variables to be minimized $\{w,b\}$ and perform the forward pass to compute all $a^l$ parameters.
# 2) With the current values of $\{w,b\}$ and $a^l$ we compute all gradients $\frac{\partial C}{\partial w_{kj}^{l}}$ and $\frac{\partial C}{\partial b_{j}^{l}}$, and using one of the available minimization routines we take a step towards more optimal variables $\{w,b\}$. Usually one uses some type of gradient descent method, as discussed previously,
# $$
# w^{(j+1)} = w^{(j)} - \gamma_j \frac{\partial C}{\partial w^{(j)}}
# $$
# where $j$ stands for the iteration.
# 3) We repeat (1) and (2) until we find a local minimum.
# 4) We change hyperparameters or initial conditions to try to find different (possibly better) local minima.

# ### Example code from MNIST dataset on handwritten numbers
#
# We will develop NN code to recognize handwritten digits. The data is the handwritten-digits dataset included in sklearn (a small, MNIST-like set of $8\times 8$ images).
#

# In[1]:

from numpy import *
import matplotlib.pyplot as plt
from sklearn import datasets

# ensure the same random numbers appear every time
random.seed(0)

# display images in notebook
get_ipython().run_line_magic('matplotlib', 'inline')
plt.rcParams['figure.figsize'] = (12,12)

# load the digits dataset
digits = datasets.load_digits()

# define inputs and labels
inputs = digits.images   # x_i
labels = digits.target   # t_i

print('inputs = (n_inputs, pixel_width, pixel_height) =', inputs.shape)
print('labels = (n_inputs) =', labels.shape)

# Here we reshape the images so that we have a **design matrix** whose rows are the 64 pixels of each image. We also print a few example digits.

# In[2]:

# flatten the images: each 8x8 image becomes a row of 64 features
n_inputs, nx, ny = inputs.shape
inputs = inputs.reshape(n_inputs, nx*ny)
print('X = (n_inputs, n_features) =', inputs.shape)

# choose some random images to display
random_indices = random.choice(range(n_inputs), size=5)

for i, image in enumerate(digits.images[random_indices]):
    plt.subplot(1, 5, i+1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title("Label: %d" % digits.target[random_indices[i]])
plt.show()

# First we split the data into 80% training and 20% testing data. Which data points end up in the training and in the testing set should be chosen at random.
# In[3]:

from sklearn.model_selection import train_test_split

# one-liner from the scikit-learn library
train_size = 0.8
X_train, X_test, Y_train, Y_test = train_test_split(inputs, labels, train_size=train_size, test_size=1-train_size)

# equivalently in numpy
def train_test_split_numpy(inputs, labels, train_size):
    n_inputs = len(inputs)
    # use one common permutation so that inputs and labels stay aligned
    permutation = random.permutation(n_inputs)
    inputs_shuffled = inputs[permutation]
    labels_shuffled = labels[permutation]

    train_end = int(n_inputs*train_size)
    X_train, X_test = inputs_shuffled[:train_end], inputs_shuffled[train_end:]
    Y_train, Y_test = labels_shuffled[:train_end], labels_shuffled[train_end:]

    return X_train, X_test, Y_train, Y_test

#X_train, X_test, Y_train, Y_test = train_test_split_numpy(inputs, labels, train_size)

print("Number of training images: " + str(len(X_train)))
print("Number of test images: " + str(len(X_test)))

# The input and output data have dimensions
# \begin{eqnarray}
# && X\in [n\times 64]\\
# && t \in [n].
# \end{eqnarray}
#
# It is easier to change the output vector to the so-called one-hot representation, in which $y=0$ translates into $y=[1,0,0,0,0,0,0,0,0,0]$ and $y=2$ into $y=[0,0,1,0,0,0,0,0,0,0]$, etc.
#
# In this way we can use the equations for a binary choice for each of the 10 categories. The output array `Y_onehot` is of dimension $n\times 10$, rather than $n$.
#
#
# The function `to_categorical_numpy` implements the one-hot representation.

# In[4]:

# to_categorical turns our integer vector into a one-hot representation
def to_categorical_numpy(integer_vector):
    # integer_vector[n_inputs] contains numbers between 0...9
    n_inputs = len(integer_vector)          # inputs
    n_categories = max(integer_vector) + 1  # 10 categories
    onehot_vector = zeros((n_inputs, n_categories), dtype=int)
    onehot_vector[range(n_inputs), integer_vector] = 1
    return onehot_vector

# In[5]:

integer_vector=[3,5,4,8,0]
to_categorical_numpy(integer_vector)

# In[6]:

Y_train_onehot, Y_test_onehot = to_categorical_numpy(Y_train), to_categorical_numpy(Y_test)

#Figure: NN for recognizing digits.
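# An equivalent one-hot construction (an added aside) using numpy fancy indexing;
# it assumes the labels are integers 0...9, as for these digits.

# In[ ]:

from numpy import eye

def to_categorical_eye(integer_vector, n_categories=10):
    # row i of the identity matrix is exactly the one-hot vector for label i
    return eye(n_categories, dtype=int)[integer_vector]

print(to_categorical_eye([3,5,4,8,0]))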
# As said before, the input and the one-hot output data have dimensions
# \begin{eqnarray}
# && X\in [n\times 64]\\
# && Y \in [n\times 10].
# \end{eqnarray}
#
# We will use 50 neurons in the hidden layer, and we have 10 categories, hence our weights will have dimensions:
# \begin{eqnarray}
# &&w^{(1)}\in [64\times 50]\\
# &&b^{(1)}\in [50]\\
# &&w^{(2)}\in [50\times 10]\\
# &&b^{(2)}\in [10]\\
# &&a^{(2)} \in [n\times 10]
# \end{eqnarray}
#
#
# The equations for our NN model are:
# \begin{eqnarray}
# && z^{(1)} = X w^{(1)} + b^{(1)} \in [n\times 50]\\
# && a^{(1)} = f^{(1)}(z^{(1)}) \in [n\times 50]\\
# && z^{(2)} = a^{(1)} w^{(2)} + b^{(2)} \in [n\times 10]\\
# && a^{(2)} = f^{(2)}(z^{(2)}) \in [n\times 10]
# \end{eqnarray}
#
# where $$f^{(1)}(z)=1/(\exp(-z)+1)$$ and
# $$f^{(2)}(z_c)=\frac{\exp{z_c}}{\sum_{c'=0}^9 \exp{z_{c'}}}$$
#
#
# Note that the output layer uses the `softmax` activation function, because we have a multiple-choice output. The cost function in this case is obtained from the cross entropy: we want to maximize the likelihood, i.e., the probability that the model gets all the answers correct, which is given by
# \begin{eqnarray}
# P({\cal D}|\{w,b\}) = \prod_{i=1}^n \prod_{c=0}^9 P(y_{ic}=1)^{y_{ic}} \left(1-P(y_{ic}=1)\right)^{1-y_{ic}}
# \end{eqnarray}
# Here $y_{ic}$ can only take the values 0 or 1, $c$ runs from 0 to 9, and $i$ runs over all $n$ input data. ${\cal D}$ is the collection of all input data. This is facilitated by the one-hot representation implemented above, in which $y\in\{0,1,...,9\}$ is changed into the one-hot vector $y_{ic}$.
#
# To maximize $P({\cal D}|\{w,b\})$ we minimize $C(\{w,b\})=-\log(P({\cal D}|\{w,b\}))$. The cost function therefore is
# \begin{eqnarray}
# C(\{w,b\})=-\sum_{i,c}\left[ y_{ic} \log(P_{ic})+(1-y_{ic})\log(1-P_{ic})\right]
# \end{eqnarray}
#
# Note that $a_{ic}^{(2)}\equiv P_{ic}$ is the result of our NN.
#
#
# Later we will also regularize the cost function with the $L_2$ norm of the weights in the following way:
# \begin{eqnarray}
# C(\{w,b\})=-\sum_{i,c}\left[ y_{ic} \log(P_{ic})+(1-y_{ic})\log(1-P_{ic})\right] + \frac{\lambda}{2} \sum_{hc} (w_{hc}^{(2)})^2 +\frac{\lambda}{2}\sum_{ph}(w_{ph}^{(1)})^2
# \end{eqnarray}
#
# First we create a random configuration of weights.

# In[7]:

# building our neural network
n_inputs, n_features = X_train.shape
n_hidden_neurons = 50
n_categories = 10

# we make the weights normally distributed using numpy.random.randn
def GiveStartingRandomWeights():
    random.seed(0)
    # weights and bias in the hidden layer
    W_1 = random.randn(n_features, n_hidden_neurons)
    b_1 = zeros(n_hidden_neurons) + 0.01
    # weights and bias in the output layer
    W_2 = random.randn(n_hidden_neurons, n_categories)
    b_2 = zeros(n_categories) + 0.01
    return (W_1, b_1, W_2, b_2)

# Next we evaluate the NN with the feed-forward algorithm and check the accuracy of its predictions. As an aside, the sketch below first shows how the (regularized) cross-entropy cost itself could be evaluated.
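# A small added sketch (not used elsewhere in this notebook): evaluating the regularized
# cross-entropy cost $C(\{w,b\})$ defined above, given the one-hot targets $Y$ and the
# softmax probabilities $P=a^{(2)}$ returned by the forward pass.

# In[ ]:

from numpy import log, sum as np_sum

def cross_entropy_cost(P, Y, W_1, W_2, lmbd=0.0, eps=1e-12):
    P = P.clip(eps, 1-eps)                            # avoid log(0)
    C = -np_sum(Y*log(P) + (1-Y)*log(1-P))            # cross entropy
    C += 0.5*lmbd*(np_sum(W_1**2) + np_sum(W_2**2))   # L2 regularization
    return C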
# In[8]:

def mfermi(x):
    return 1/(1 + exp(-x))

def feed_forward(X, all_weights):
    "feed-forward pass: returns the hidden-layer activations a_1 and the output probabilities"
    W_1, b_1, W_2, b_2 = all_weights
    # weighted sum of inputs to the hidden layer
    z_1 = matmul(X, W_1) + b_1
    # activation in the hidden layer
    a_1 = mfermi(z_1)
    # weighted sum of inputs to the output layer
    z_2 = matmul(a_1, W_2) + b_2
    # softmax output
    # axis 0 holds each input and axis 1 the probabilities of each category
    exp_term = exp(z_2)
    probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)
    # for backpropagation we need the activations in the hidden and output layers
    return a_1, probabilities

# we obtain a prediction by taking the class with the highest likelihood
def predict(X, all_weights):
    a_1, probabilities = feed_forward(X, all_weights)
    return (probabilities, argmax(probabilities, axis=1))

# Checking the prediction of the NN for one data point. The weights are not yet optimized.

# In[9]:

all_weights = GiveStartingRandomWeights()
(probabilities, predictions) = predict(X_train, all_weights)

print("probabilities = (n_inputs, n_categories) = " + str(probabilities.shape))
print("probability that image 0 is in category 0,1,2,...,9 = \n" + str(probabilities[0]))
print("probabilities sum up to: " + str(probabilities[0].sum()))
print()
print("predictions = (n_inputs) = " + str(predictions.shape))
print("prediction for image 0: " + str(predictions[0]))
print("correct label for image 0: " + str(Y_train[0]))

# We use `accuracy_score` from sklearn to measure what percentage of the data is correctly predicted.
#

# In[10]:

from sklearn.metrics import accuracy_score

(probabilities, predictions) = predict(X_train, all_weights)
print("Old accuracy on training data:", accuracy_score(predictions, Y_train))

# Next we derive the gradients, which are implemented for back-propagation in the function `backpropagation` below.
#
# The gradients are somewhat different from those derived above because the cost function is now obtained from the cross entropy. Let's first use the cost function $C$ without the regularization $\lambda$.
#
# The gradients are:
# \begin{eqnarray}
# \frac{\partial C}{\partial w_{jc}^{(2)}}=-\sum_i \left(\frac{y_{ic}}{P_{ic}}-\frac{1-y_{ic}}{1-P_{ic}}\right)
# \frac{\partial P_{ic}}{\partial w_{jc}^{(2)}}=
# -\sum_i \frac{y_{ic}-P_{ic}}{P_{ic}(1-P_{ic})}
# \frac{\partial P_{ic}}{\partial w_{jc}^{(2)}}
# \end{eqnarray}
# Next
# $$\frac{\partial P_{ic}}{\partial w_{jc}^{(2)}}=\frac{\partial P_{ic}}{\partial z_{ic}^{(2)}}\frac{\partial z_{ic}^{(2)}}{\partial w_{jc}^{(2)}}$$
#
# Since $P_{ic}=f^{(2)}(z_{ic}^{(2)})$ and
# $z_{ic}^{(2)} = \sum_{j\in hidden} a_{i j}^{(1)} w_{j c}^{(2)} + b_{c}^{(2)}$
# we have
# $$\frac{\partial P_{ic}}{\partial w_{jc}^{(2)}}=P_{ic}(1-P_{ic}) a_{ij}^{(1)}$$
# which finally gives
#
# \begin{eqnarray}
# \frac{\partial C}{\partial w_{jc}^{(2)}}=
# \sum_i (P_{ic}-y_{ic}) a_{ij}^{(1)} = \left({a^{(1)}}^T (a^{(2)}-Y)\right)_{jc}
# \end{eqnarray}
# where we took into account that $a_{ic}^{(2)}=P_{ic}$ and $Y_{ic}=y_{ic}$. (Strictly speaking, the softmax also has off-diagonal derivatives $\partial P_{ic}/\partial z_{id}^{(2)}$ for $d\neq c$, which we dropped here; keeping them and using the cross entropy $-\sum_{i,c} y_{ic}\log P_{ic}$ leads to exactly the same simple result $P_{ic}-y_{ic}$.)
# Similarly we can see that
# \begin{eqnarray}
# \frac{\partial C}{\partial b_{c}^{(2)}}=
# \sum_i (P_{ic}-y_{ic})
# \end{eqnarray}

# Next we evaluate the derivative with respect to the hidden-layer weights, i.e.,
# \begin{eqnarray}
# \frac{\partial C}{\partial w_{ph}^{(1)}}=\sum_{i,c} \frac{\partial C}{\partial P_{ic}}
# \frac{\partial P_{ic}}{\partial z_{ic}^{(2)}}\frac{\partial z_{ic}^{(2)}}{\partial a_{ih}^{(1)}}
# \frac{\partial a_{ih}^{(1)}}{\partial z_{ih}^{(1)}}
# \frac{\partial z_{ih}^{(1)}}{\partial w_{ph}^{(1)}}
# \end{eqnarray}
# which comes from the fact that $P_{ic}=f^{(2)}(z_{ic}^{(2)})$, $z_{ic}^{(2)}=\sum_h a_{ih}^{(1)} w_{hc}^{(2)}+b_c^{(2)}$, $a_{ih}^{(1)}=f^{(1)}(z_{ih}^{(1)})$ and $z_{ih}^{(1)}=\sum_p X_{ip} w_{ph}^{(1)} +b_h^{(1)}$.

# We see that $\frac{\partial C}{\partial P_{ic}}=(P_{ic}-y_{ic})/(P_{ic}(1-P_{ic}))$, further $\frac{\partial P_{ic}}{\partial z_{ic}^{(2)}}=P_{ic}(1-P_{ic})$, $\frac{\partial z_{ic}^{(2)}}{\partial a_{ih}^{(1)}}=w^{(2)}_{hc}$, $\frac{\partial a_{ih}^{(1)}}{\partial z_{ih}^{(1)}}=a_{ih}^{(1)}(1-a_{ih}^{(1)})$, and $\frac{\partial z_{ih}^{(1)}}{\partial w_{ph}^{(1)}}=X_{ip}$.
#
#
# Taking all this into account, we get
# \begin{eqnarray}
# \frac{\partial C}{\partial w_{ph}^{(1)}}=
# \sum_i X_{ip}\, a_{ih}^{(1)}(1-a_{ih}^{(1)})\sum_c (P_{ic}-y_{ic}) w_{hc}^{(2)}
# \end{eqnarray}
# which can also be written as
# \begin{equation}
# \frac{\partial C}{\partial w_{ph}^{(1)}}= \left(X^T \left[ a^{(1)}\circ (1-a^{(1)})\circ \left((a^{(2)}-Y) ({w^{(2)}})^T\right) \right]\right)_{ph}
# \end{equation}
# where we introduced the elementwise product $\circ$, defined by $c_{ih}=a_{ih} b_{ih}$ for $c= a\circ b$.

# Similarly
# \begin{eqnarray}
# \frac{\partial C}{\partial b_{h}^{(1)}}=
# \sum_{i} a_{ih}^{(1)}(1-a_{ih}^{(1)})\sum_c (P_{ic}-y_{ic}) w_{hc}^{(2)}
# \end{eqnarray}
#
#
# Finally, when $\lambda$ is nonzero, we just add the regularization terms to the derivatives:
# \begin{eqnarray}
# \frac{\partial C}{\partial w_{ph}^{(1)}} &\to& \frac{\partial C}{\partial w_{ph}^{(1)}} + \lambda w_{ph}^{(1)}\\
# \frac{\partial C}{\partial w_{jc}^{(2)}} &\to& \frac{\partial C}{\partial w_{jc}^{(2)}} + \lambda w_{jc}^{(2)}
# \end{eqnarray}

# In[11]:

def backpropagation(X, Y, all_weights):
    a_1, probabilities = feed_forward(X, all_weights)
    W_1, b_1, W_2, b_2 = all_weights

    # error in the output layer
    error_output = probabilities - Y
    # error in the hidden layer
    error_hidden = matmul(error_output, W_2.T) * a_1 * (1 - a_1)

    # gradients for the output layer
    dW2 = matmul(a_1.T, error_output)
    dB2 = sum(error_output, axis=0)
    # gradients for the hidden layer
    dW1 = matmul(X.T, error_hidden)
    dB1 = sum(error_hidden, axis=0)

    return dW2, dB2, dW1, dB1

# In[13]:

dW2, dB2, dW1, dB1 = backpropagation(X_train, Y_train_onehot, all_weights)
print('shapes of gradients=', dW2.shape, dB2.shape, dW1.shape, dB1.shape)

# First we use the simple gradient descent method with a fixed `learning rate` $\gamma$=`eta`. We evaluate the gradient `num_iterations` times and move towards a local minimum.

# In[14]:

def SimpleGradientMethod(X_train, Y_train, all_weights, eta, lmbd, num_iterations):
    (W_1, b_1, W_2, b_2) = all_weights
    for i in range(num_iterations):
        # calculate gradients
        dW_2, dB_2, dW_1, dB_1 = backpropagation(X_train, Y_train, [W_1, b_1, W_2, b_2])

        # regularization term gradients
        dW_2 += lmbd * W_2
        dW_1 += lmbd * W_1

        # update weights and biases
        W_1 -= eta * dW_1
        b_1 -= eta * dB_1
        W_2 -= eta * dW_2
        b_2 -= eta * dB_2
    return (W_1, b_1, W_2, b_2)

# We also add the regularization term $\frac{\lambda}{2} ||w||_2^2$ to the cost function. The accuracy after 100 steps is barely improved.
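# Before running the optimization, it is reassuring to verify the analytic gradients
# numerically. The following added sketch compares one entry of `dW2` from
# `backpropagation` with a finite-difference estimate of the cross entropy
# $-\sum y\log P$ (whose softmax gradient is exactly $P-y$, as noted above).

# In[ ]:

from numpy import log, sum as np_sum

def cost_of(weights, X, Y, eps=1e-12):
    _, P = feed_forward(X, weights)
    return -np_sum(Y*log(P.clip(eps, 1-eps)))

W_1, b_1, W_2, b_2 = GiveStartingRandomWeights()
Xs, Ys = X_train[:50], Y_train_onehot[:50]     # a small subset is enough
dW2, dB2, dW1, dB1 = backpropagation(Xs, Ys, (W_1, b_1, W_2, b_2))

h = 1e-5
W_2p = W_2.copy(); W_2p[0, 0] += h
W_2m = W_2.copy(); W_2m[0, 0] -= h
numeric = (cost_of((W_1, b_1, W_2p, b_2), Xs, Ys) -
           cost_of((W_1, b_1, W_2m, b_2), Xs, Ys))/(2*h)
print('analytic:', dW2[0, 0], ' numeric:', numeric)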
# In[15]:

eta = 0.01
lmbd = 0.01
num_iterations = 100

all_weights = GiveStartingRandomWeights()
all_weights = SimpleGradientMethod(X_train, Y_train_onehot, all_weights, eta, lmbd, num_iterations)
error = accuracy_score(predict(X_train, all_weights)[1], Y_train)
print('Accuracy on training data: ', error)

# Next we implement **stochastic gradient descent (SGD)**, which takes a random subset of the data (of size `batch_size`), and we compute the gradient only for this subset of points. We then move in the steepest descent direction for this subset only.
# The randomness introduced this way decreases the chance
# that our optimization scheme gets stuck in a local minimum.
#
# If the size of the minibatches is small relative to the number of
# datapoints ($M < n$), the computation of the gradient is much
# cheaper, since we sum over the datapoints in the $k$-th minibatch and not
# over all $n$ datapoints.

# In[16]:

def StochasticGradientMethod(X_train, Y_train, all_weights, eta, lmbd, batch_size, epochs):
    (W_1, b_1, W_2, b_2) = all_weights
    data_indices = arange(len(X_train))
    iterations = len(X_train) // batch_size
    print('Number of iterations=', iterations)
    for i in range(epochs):
        for j in range(iterations):
            # pick a random minibatch of training data
            chosen_datapoints = random.choice(data_indices, size=batch_size, replace=False)
            X_batch = X_train[chosen_datapoints]
            Y_batch = Y_train[chosen_datapoints]

            dW_2, dB_2, dW_1, dB_1 = backpropagation(X_batch, Y_batch, [W_1, b_1, W_2, b_2])

            # regularization term gradients
            dW_2 += lmbd * W_2
            dW_1 += lmbd * W_1

            # update weights and biases
            W_1 -= eta * dW_1
            b_1 -= eta * dB_1
            W_2 -= eta * dW_2
            b_2 -= eta * dB_2
    return (W_1, b_1, W_2, b_2)

# Finally we use the SGD method with learning rate $\gamma$=`eta`$=0.01$ and $\lambda=0.1$, with `batch_size=100`. The number of passes over the minibatches (`epochs`) is also chosen to be 100. This gives an excellent prediction accuracy of over 98%; a human typically reads digits with an accuracy of about 98%.

# In[17]:

eta = 0.01
lmbd = 0.1
epochs = 100
batch_size = 100

all_weights = GiveStartingRandomWeights()
all_weights = StochasticGradientMethod(X_train, Y_train_onehot, all_weights, eta, lmbd, batch_size, epochs)
error = accuracy_score(predict(X_train, all_weights)[1], Y_train)
error2 = accuracy_score(predict(X_test, all_weights)[1], Y_test)
print('Accuracy on training and test data: ', error, error2)

# ### Adjust hyperparameters
#
# We now perform a grid search to find the optimal hyperparameters for the network.
# Note that we are only using one hidden layer with 50 neurons, and human performance is estimated to be around $98\%$ ($2\%$ error rate).

# In[18]:

eta_vals = logspace(-5, 1, 7)
lmbd_vals = logspace(-5, 1, 7)

# store the training accuracies for later use
DNN_numpy = zeros((len(eta_vals), len(lmbd_vals)), dtype=object)

# grid search
for i, eta in enumerate(eta_vals):
    for j, lmbd in enumerate(lmbd_vals):
        all_weights = GiveStartingRandomWeights()
        all_weights = StochasticGradientMethod(X_train, Y_train_onehot, all_weights, eta, lmbd, batch_size, epochs)
        error = accuracy_score(predict(X_train, all_weights)[1], Y_train)
        error2 = accuracy_score(predict(X_test, all_weights)[1], Y_test)
        DNN_numpy[i][j] = error
        print('Learning rate=', eta, 'Lambda=', lmbd, 'Accuracy=', error, error2)

# In[ ]:
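# An optional visualization (an added sketch): display the grid-search training
# accuracies stored in `DNN_numpy` as a heat map over the learning rate and $\lambda$.

# In[ ]:

acc = DNN_numpy.astype(float)            # rows: eta values, columns: lambda values
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(acc, origin='lower', cmap='viridis')
ax.set_xticks(range(len(lmbd_vals)))
ax.set_xticklabels(['%.0e' % l for l in lmbd_vals])
ax.set_yticks(range(len(eta_vals)))
ax.set_yticklabels(['%.0e' % e for e in eta_vals])
ax.set_xlabel(r'$\lambda$')
ax.set_ylabel(r'learning rate $\eta$')
ax.set_title('Training accuracy')
fig.colorbar(im, ax=ax)
plt.show()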