#!/usr/bin/env python
# coding: utf-8
# ### Machine Learning
#
# # 4. Artificial Neural Networks
#
# ### [Luis Martí](http://lmarti.com)
# #### [Instituto de Computação](http://www.ic.uff)
# #### [Universidade Federal Fluminense](http://www.uff.br)
#
# $\newcommand{\vec}[1]{\boldsymbol{#1}}$

# In[1]:


import random, itertools
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from mpl_toolkits.mplot3d import Axes3D


# In[2]:


plt.rc('text', usetex=True); plt.rc('font', family='serif')
plt.rcParams['text.latex.preamble'] ='\\usepackage{libertine}\n\\usepackage[utf8]{inputenc}'
import seaborn
seaborn.set(style='whitegrid'); seaborn.set_context('talk')
get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")


# In[3]:


random.seed(a=42)


# In[4]:


from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets


# In[5]:


# tikzmagic extension for figures - https://github.com/mkrphys/ipython-tikzmagic
get_ipython().run_line_magic('load_ext', 'tikzmagic')
# for rendering graphviz dot files - https://github.com/cjdrake/ipython-magic
get_ipython().run_line_magic('load_ext', 'gvmagic')


# # Why study bio-inspired methods
#
# * Nature is one of the best problem-solvers we know.
# * Evolutionary optimization.
# * Natural intelligence and artificial intelligence.
# * Cellular automata
# * **Neural computation**
# * Evolutionary computation
# * Swarm intelligence
# * Artificial immune systems
# * Membrane computing
# * Amorphous computing

# ## Pigeons as art connoisseurs (Watanabe et al., 1995)
#
# * Pigeons were put in a Skinner box, and
# * presented with photos of paintings by Monet and Picasso.
# * They were rewarded when they correctly recognized the painter of the work they were shown.
# [Skinner box](https://en.wikipedia.org/wiki/Operant_conditioning_chamber)
#
# [Claude Monet](https://en.wikipedia.org/wiki/Claude_Monet)
#
# [Pablo Picasso](https://en.wikipedia.org/wiki/Pablo_Picasso)
# Watanabe, S., Sakamoto, J., & Wakita, M. (1995). Pigeons’ discrimination of paintings by Monet and Picasso. Journal of the Experimental Analysis of Behavior, 63(2), 165–174. http://doi.org/10.1901/jeab.1995.63-165

# ## Results
#
# * Pigeons were capable of discriminating between the painters with an accuracy of **95%** when confronted with paintings from the **training set**.
# * Surprisingly, they scored **85%** on paintings they had never seen during training (the **validation set**).
# * They were not just exhaustively learning which painting belonged to which painter.
# * They were able to recognize **styles** or **patterns** and
# * to **generalize** from what they had seen before.
#
# > In AI, we have been trying to replicate this capacity (and many others) in a computer for about 60 years.

# # Artificial neural networks
#
# * Inspired (to varying degrees) by the brain and the nervous system.
# * Massive parallelization of relatively simple processing units.
# * Simple principles lead to complex behaviours.
# * Capable of learning from data.

# ## Creating [artificial neurons](https://en.wikipedia.org/wiki/Artificial_neuron)
#
# Artificial neurons are designed to mimic aspects of their biological counterparts.
# * **Dendrites** – act as the input receptors, allowing the cell to receive signals from a large (>1000) number of neighboring neurons. Each dendrite is able to perform a "multiplication" by that dendrite's "weight value".
#
# * **Soma** – acts as a summation function. As positive and negative signals (exciting and inhibiting, respectively) arrive in the soma from the dendrites, they are added together.
#
# * **Axon** – gets its signal from the summation behavior that occurs inside the soma. The opening to the axon samples the electrical potential inside the soma; once the soma reaches a certain potential, the axon transmits an all-or-nothing signal pulse down its length.
#
# * In this regard, the axon is what allows us to connect our artificial neuron to other artificial neurons.
# #### Note:
#
# * Unlike most artificial neurons, biological neurons fire in discrete pulses.
# * Each time the electrical potential inside the soma reaches a certain threshold, a pulse is transmitted down the axon.
# * This pulsing can be translated into continuous values. The rate (activations per second, etc.) at which an axon fires converts directly into the rate at which neighboring cells get signal ions introduced into them.
# * The faster a biological neuron fires, the faster nearby neurons accumulate electrical potential (or lose electrical potential, depending on the "weighting" of the dendrite that connects to the neuron that fired).
# * It is this conversion that allows computer scientists and mathematicians to simulate biological neural networks using artificial neurons that can output distinct values (often from −1 to 1).

# The first artificial neuron was the **Threshold Logic Unit (TLU)**, proposed by McCulloch and Pitts (1943).
#
# * The model was specifically targeted as a computational model of the "nerve net" in the brain.
# * As a transfer function, it employed a threshold, equivalent to using the Heaviside step function.
# * A simple model was considered, with binary inputs and outputs,
# * restrictions on the possible weights, and a more flexible threshold value.
#
# * Any Boolean function could be implemented by networks of such devices, which is easily seen from the fact that one can implement the `AND` and `OR` functions and combine them in disjunctive or conjunctive normal form.
# * Cyclic TLU networks, with feedback through neurons, could define dynamical systems with memory, but
# * most research concentrated (and still does) on strictly feed-forward networks because they are easier to analyze.

# McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133. http://link.springer.com/article/10.1007%2FBF02478259

# ## Artificial neuron as a neuron abstraction

# In[6]:


get_ipython().run_line_magic('tikz', '-s 700,200 -sc 1.0 -l shapes,calc,shapes,arrows -f svg \\input{imgs/05/neuron.tikz}')


# In general terms, an input $\vec{x}\in\mathbb{R}^n$ is multiplied by a weight vector $\vec{w}$ and a bias $b$ is added, producing the net activation, $\text{net}$. $\text{net}$ is passed to the *activation function* $f()$ that computes the neuron's output $\hat{y}$:
#
# $$
# \hat{y} = f\left(\text{net}\right)= f\left(\vec{w}\cdot\vec{x}+b\right) = f\left(\sum_{i=1}^{n}{w_i x_i} + b\right).
# $$

# **Note:** This is a rather simplistic approximation of natural neurons. See [Spiking Neural Networks](https://en.wikipedia.org/wiki/Spiking_neural_network) for a more biologically plausible representation.

# # The Perceptron
#
# The [Perceptron](https://en.wikipedia.org/wiki/Perceptron) and its learning algorithm pioneered the research in neurocomputing.
#
# * The perceptron is an algorithm for learning a linear binary classifier.
# * That is, a function that maps its input $\vec{x}\in\mathbb{R}^n$ (a real-valued vector) to an output value $f(\vec{x})$ (a single binary value) as
#
# $$
# f(\vec{x}) = \begin{cases}
# 1 & \text{if }\vec{w} \cdot \vec{x} + b > 0\,,\\
# 0 & \text{otherwise};
# \end{cases}
# $$
#
# where $\vec{w}$ is a vector of real-valued *weights*, $\vec{w} \cdot \vec{x}$ is the *dot product* $\sum_{i=1}^n w_i x_i$, and $b$ is known as the *bias*.
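# To make the definition concrete, here is a minimal sketch of a single forward pass with the Heaviside step activation. The weights, bias and input below are arbitrary values chosen only for illustration.

w = np.array([0.5, -0.6])    # weight vector w (illustrative values)
b = 0.1                      # bias b
x = np.array([1.0, 0.2])     # input x

net = np.dot(w, x) + b       # net activation: w . x + b = 0.5 - 0.12 + 0.1 ~ 0.48
y_hat = 1 if net > 0 else 0  # Heaviside step as the activation function f
print(net, y_hat)            # ~0.48 > 0, so the neuron fires: y_hat = 1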
# ## Perceptron learning
#
# Learning proceeds by calculating the prediction of the perceptron, $\hat{y}$, as
#
# $$\hat{y} = f\left(\vec{w}\cdot\vec{x} + b\right) = f\left( w_{1}x_{1} + w_2x_{2} + \cdots + w_nx_{n}+b\right)\,.$$
#
# After that, we update the weights and the bias using the perceptron rule:
#
# $$
# \begin{align*}
# w_i & = w_i + \alpha (y - \hat{y}) x_{i} \,,\ i=1,\ldots,n\,;\\
# b & = b + \alpha (y - \hat{y})\,.
# \end{align*}
# $$
#
# Here $\alpha\in\left(0,1\right]$ is known as the *learning rate*. A worked example of a single update is sketched just before the implementation below.

# ## Mark I: The Perceptron implementation
#
# * An array of cadmium sulfide photocells was used for capturing 20x20 (400) pixel images that were used as inputs.
# * A switchboard was used for manually selecting which input elements (pixels) were passed to the perceptrons.
# * They used potentiometers as variable weights.
# * Electric motors automatically modified the weights.
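# Before jumping into the implementation, here is a quick worked example of a single application of the update rule above. The sample and initial parameters are arbitrary values chosen only for illustration.

alpha = 0.1                               # learning rate
w = np.array([0.1, -0.1]); b = 0.02       # illustrative initial weights and bias
x = np.array([0.2, 0.9]); y = 1           # x_1 + x_2 > 1, so the target is 1

y_hat = 1 if np.dot(w, x) + b > 0 else 0  # net = 0.02 - 0.09 + 0.02 = -0.05 -> y_hat = 0 (misclassified)
w = w + alpha * (y - y_hat) * x           # -> [0.12, -0.01]
b = b + alpha * (y - y_hat)               # -> 0.12
print(y_hat, w, b)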
# # Implementing the Perceptron
#
# We are going to start by implementing a perceptron as a class.

# In[7]:


class Perceptron:
    'A simple Perceptron implementation.'

    def __init__(self, weights, bias, alpha=0.1):
        self.weights = weights
        self.bias = bias
        self.alpha = alpha

    def propagate(self, x):
        return self.activation(self.net(x))

    def activation(self, net):
        if net > 0:
            return 1
        return 0

    def net(self, x):
        return np.dot(self.weights, x) + self.bias

    def learn(self, x, y):
        y_hat = self.propagate(x)
        self.weights = [w_i + self.alpha*x_i*(y-y_hat)
                        for (w_i, x_i) in zip(self.weights, x)]
        self.bias = self.bias + self.alpha*(y-y_hat)
        return np.abs(y_hat - y)


# **Note**: Bear in mind that I have made the implementation as clear and easy to follow as possible and, therefore, I have sacrificed performance for the sake of clarity. There are many points where it could be improved.

# ### Testing our `Perceptron` class
#
# Now that the perceptron implementation is ready, we need an example data set.
#
# We are going to create a data set containing random points in $\left[0,1\right]^2$.

# In[8]:


size = 50  # size of data set


# In[9]:


data = pd.DataFrame(columns=('$x_1$', '$x_2$'),
                    data=np.random.uniform(size=(size, 2)))


# So far, our data set looks like this (we are showing only the first ten elements):

# In[10]:


data.head(10)


# We need to add a *target* or *classification* attribute. In this example, we are going to make this target equal to one if the point lies in the upper-right triangle of the $\left[0,1\right]\times\left[0,1\right]$ square and zero otherwise:
#
# *description of the data set*
# We can formalize this condition as:
#
# $$
# y = \begin{cases}
# 1 & \ \text{if}\ x_1 + x_2 > 1\,,\\
# 0 & \ \text{otherwise}\,.
# \end{cases}
# $$

# Let's code it...

# In[11]:


def condition(x):
    return int(np.sum(x) > 1)


# ...and apply it to the data set.

# In[12]:


data['y'] = data.apply(condition, axis=1)


# The resulting data set looks like this:

# In[13]:


data.head(10)


# We can now take a better look at the data set in graphical form. Elements with $y=1$ are shown in green ($\color{green}{\bullet}$) and those with $y=0$ are shown in red ($\color{red}{\bullet}$):

# In[14]:


def plot_data(data, ax):
    data[data.y==1].plot(kind='scatter', x='$x_1$', y='$x_2$', color='green', ax=ax)
    data[data.y==0].plot(kind='scatter', x='$x_1$', y='$x_2$', color='red', ax=ax)
    ax.set_xlim(-0.1,1.1); ax.set_ylim(-0.1,1.1)


# In[15]:


fig = plt.figure(figsize=(5,5))
plot_data(data, fig.gca())


# ## Iterating over the data set
#
# Having the data set, we can now code how the perceptron learns it by iterating through it.

# In[16]:


def learn_data(perceptron, data):
    'Returns the number of errors made.'
    count = 0
    for i, row in data.iterrows():
        count += perceptron.learn(row[0:2], row[2])
    return count


# ## Visualizing learning
#
# We now need to plot the decision boundary, or threshold, of the perceptron.
#
# To calculate it we start with the equation that describes the boundary,
# $$w_1x_1+w_2x_2 + b =0.$$
#
# From it we can obtain $x_2$ for a given $x_1$ with some fairly simple algebra,
# $$x_2 = \frac{-w_1x_1-b}{w_2}.$$

# In[17]:


def threshold(perceptron, x_1):
    return (-perceptron.weights[0] * x_1 - perceptron.bias) / perceptron.weights[1]


# In[18]:


def plot_perceptron_threshold(perceptron, ax):
    xlim = ax.get_xlim(); ylim = ax.get_ylim()
    x2s = [threshold(perceptron, x1) for x1 in xlim]
    ax.plot(xlim, x2s)
    ax.set_xlim(-0.1,1.1); ax.set_ylim(-0.1,1.1)


# A function that plots the perceptron's threshold together with the data set.

# In[19]:


def plot_all(perceptron, data, t, ax=None):
    if ax is None:
        fig = plt.figure(figsize=(5,4))
        ax = fig.gca()
    plot_data(data, ax)
    plot_perceptron_threshold(perceptron, ax)
    ax.set_title('$t='+str(t+1)+'$')


# ## Our perceptron in action
#
# All set now! Let's create a perceptron and train it.
#
# **Note**: Normally the initial weights and the bias should be set to *small* random values. I am setting them by hand to values that I know look good in the examples.

# In[20]:


perceptron = Perceptron([0.1,-0.1], 0.02)


# In[21]:


f, axarr = plt.subplots(2, 4, sharex=True, sharey=True, figsize=(10,6))
axs = list(itertools.chain.from_iterable(axarr))
for t in range(8):
    plot_all(perceptron, data, t, ax=axs[t])
    learn_data(perceptron, data)
f.tight_layout()


# It is clear how the perceptron threshold is progressively adjusted according to the data set.

# ## Animating the Perceptron
#
# These results are better understood in animated form.

# In[22]:


from matplotlib import animation
from IPython.display import HTML


# In[23]:


def animate(frame_index, perceptron, data, ax):
    ax.clear()
    plot_data(data, ax=ax)
    ax.set_title('$t='+str(frame_index)+'$')
    if not frame_index:
        return None
    plot_perceptron_threshold(perceptron, ax=ax)
    learn_data(perceptron, data)
    return None


# In[24]:


fig = plt.figure(figsize=(4,4))
ax = fig.gca()
perceptron = Perceptron([0.1,-0.1], 0.02)
anim = animation.FuncAnimation(fig, lambda i: animate(i, perceptron, data, ax),
                               frames=10, interval=600, blit=False)
plt.tight_layout()
plt.close()


# In[25]:


HTML(anim.to_html5_video())


# ## Self-study
#
# * Experiment with the learning rate ($\alpha$).
#   How does it impact learning? Do you remember whether we have seen a similar parameter in previous classes?
# * Create a new data set with a non-linear boundary. What happens now with our perceptron? How would you fix it?

# ## Suggested reading
#
# * Minsky M. L. and Papert S. A. (1969). *Perceptrons*. Cambridge, MA: MIT Press.
# * Gallant, S. I. (1990). *Perceptron-based learning algorithms*. IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179–191.
# * Mikel Olazaran (1996). *A Sociological Study of the Official History of the Perceptrons Controversy*. Social Studies of Science 26 (3): 611–659. doi:10.1177/030631296026003005.

# ## The XOR issue
#
# Take a data set that contains all the possible value combinations of the logical XOR function:

# In[26]:


X = np.array([[0, 1],
              [1, 0],
              [0, 0],
              [1, 1]])
Y = np.array([1, 1, 0, 0])
N = Y.shape[0]


# The data set has ones (represented in black) when $x_1 = 1$ and $x_2 = 0$ or when $x_1 = 0$ and $x_2 = 1$, as defined by the XOR function.

# In[27]:


plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=cm.gray_r, s=100)
plt.xlabel('$x_1$');plt.ylabel('$x_2$');


# It is evident that a perceptron is unable to solve this simple problem, as it can only separate the space with a hyperplane.

# To solve the XOR problem we need to stack perceptrons.

# In[28]:


get_ipython().run_line_magic('dotstr', "open('imgs/05/xor-nn.dot', 'r').read()")


# But how can we train the weights of N1 and N2?
#
# * This is known as a credit assignment problem.
# * Our current methods cannot determine how the weights of N1 and N2 influence the error.
# * Let's visualize it.

# ### A "handcoded" XOR neural network
#
# Forward propagation for 2 inputs (x1, x2), 2 hidden nodes, 1 output.
#
# * We will replace the perceptron's threshold activation with the logistic (sigmoid) function, implemented below as `logit`:
#
# $$\hat{y} = f(\text{net}) = \frac{1}{1+\exp{\left(-\text{net}\right)}}.$$

# In[29]:


def logit(a):
    return 1.0 / (1+np.exp(-a))


# In[30]:


def fprop(x1, x2, w1= 0.1, w2= 0.2, b1= 0.1, w3=-0.2, w4= 0.2, b2=-0.1, w5=-0.3, w6=-0.25, b3=0.2):
    y_hat_1 = logit(b2 + w3*x1 + w4*x2)  # N1
    y_hat_2 = logit(b3 + w5*x1 + w6*x2)  # N2
    return logit(b1 + w1*y_hat_1 + w2*y_hat_2)


# ## Error Surface of the XOR Problem

# In[31]:


@interact(i=(1,6), j=(1,6))
def error_plot(i=5, j=6):
    W1, W2 = np.meshgrid(np.arange(-10, 10, 0.5), np.arange(-10, 10, 0.5))
    E = np.sum([(fprop(X[n, 0], X[n, 1], **{"w%d"%(i) : W1, "w%d"%(j) : W2})-Y[n])**2
                for n in range(N)], axis=0)
    ax = plt.figure(figsize=(7, 4.5)).add_subplot(111, projection="3d")
    surf = ax.plot_surface(W1, W2, E, rstride=1, cstride=1, cmap=cm.viridis, lw=0.11, alpha=0.74)
    plt.setp(ax, xlabel="$w_%d$" % (i), ylabel="$w_%d$" % (j), zlabel="$E()$");


# ## Local Minima

# In[32]:


@interact(i=(0,5), j=(0,5))
def errors_plot(i=0, j=1):
    plt.figure(figsize=(12, 3))
    W = np.arange(-10, 10, 0.25)
    errors = [(fprop(X[n, 0], X[n, 1], **{"w%d"%(i+1) : W, "w%d"%(j+1) : W+2})-Y[n])**2
              for n in range(N)]
    plt.subplot(1, 2, 1)
    for n in range(N):
        plt.plot(W, errors[n], label="$E^{(%d)}$" % (n+1))
    plt.setp(plt.gca(), xlabel="$w$", ylabel="$E$", title='Split errors');plt.legend(loc="best", frameon=True)
    plt.subplot(1, 2, 2)
    plt.plot(W, np.sum(errors, axis=0), label="$E(\cdot)$")
    plt.setp(plt.gca(), xlabel="$w$", ylabel="$E$", title='Total error');plt.legend(loc="best", frameon=True);


# # The Multilayer Perceptron (MLP)
#
# The composition of layers of perceptrons can capture complex relations between inputs and outputs in a hierarchical way.
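# As a small illustration of this, a hand-wired stack of threshold units can compute XOR exactly, something a single perceptron cannot do. This is only a sketch: the weights below are hand-picked (not learned) so that the hidden units compute OR and NAND and the output unit computes their AND.

def step(a):
    # Heaviside step activation, as in the TLU/perceptron
    return (a > 0).astype(int)

def xor_net(x):
    # x has shape (n_samples, 2)
    h1 = step(x @ np.array([ 1.0,  1.0]) - 0.5)  # hidden unit 1: OR(x1, x2)
    h2 = step(x @ np.array([-1.0, -1.0]) + 1.5)  # hidden unit 2: NAND(x1, x2)
    return step(h1 + h2 - 1.5)                   # output unit: AND(h1, h2)

print(xor_net(X))  # [1 1 0 0], which matches Y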
# In[33]:


get_ipython().run_line_magic('tikz', '-s 600,400 -sc 1.0 -l positioning -f svg \\input{imgs/05/neural-network.tikz}')


# ...but how can we adapt the weights of the neurons in the hidden layers?

# ## Improving the notation
#
# In order to proceed we need to improve the notation we have been using. To that end, for each layer $1\leq l\leq L$, the activations and outputs are calculated as:
#
# $$\text{net}^l_j = \sum_i w^l_{ji} x^l_i$$
# $$y^l_j = f^l(\text{net}^l_j)$$
#
# where
#
# * $y^l_j$ is the $j$th output of layer $l$,
# * $x^l_i$ is the $i$th input to layer $l$,
# * $w^l_{ji}$ is the weight of the $j$-th neuron connected to input $i$,
# * $\text{net}^l_{j}$ is called the net activation, and
# * $f^l(\cdot)$ is the activation function of layer $l$, e.g. $\tanh()$ in the hidden layers and the identity in the last layer (for regression).

# ## Error function (SSE)
#
# * For $\Psi=\left\{\left<\vec{x}^{(1)},\vec{y}^{(1)}\right>,\ldots,\left<\vec{x}^{(k)},\vec{y}^{(k)}\right>,\ldots\left<\vec{x}^{(K)},\vec{y}^{(K)}\right>\right\}$
#
# $$
# E =
# \frac{1}{2} \sum_{k=1}^{K}{\ell(\hat{\vec{y}}^{L}(\vec{x}^{(k)}), \vec{y}^{(k)})} =
# \frac{1}{2} \sum_{k=1}^{K} \sum_{j=1}^{m} \left( \hat{y}_j^{L}(\vec{x}^{(k)}) - y_j^{(k)} \right)^2
# $$

# ## Training MLPs with Backpropagation
#
# * Backpropagation of errors is a procedure to compute the **gradient of the error function with respect to the weights** of a neural network.
# * We can use the gradient from backpropagation to apply **gradient descent**!

# #### A math flashback
#
# The chain rule can be applied to composite functions as
# $$
# \left( f \circ g\right)'(x) = \left(f\left(g\left(x\right)\right)\right)'= f'\left(g(x)\right)g'(x),
# $$
# or, in Leibniz notation,
# $$
# \frac{\partial f\left(g\left(x\right)\right)}{\partial x} =
# \frac{\partial f\left(g\left(x\right)\right)}{\partial g\left(x\right)} \cdot
# \frac{\partial g\left(x\right)}{\partial x}.
# $$
#
# The **total derivative** of $f(x_1,x_2,\ldots,x_n)$ with respect to $x_i$ is
# $$
# \frac{\partial f}{\partial x_i}=
# \sum_{j=1}^n{\frac{\partial f}{\partial x_j}\cdot\frac{\partial x_j}{\partial x_i}}
# $$

# #### To apply gradient descent we need... to calculate the gradients
#
# Applying the chain rule,
# $$
# \frac{\partial \ell}{\partial w^l_{ji}}=
# \color{blue}{\overbrace{\frac{\partial \ell}{\partial \text{net}^l_j}}^{\delta^l_j}}
# \color{forestgreen}{\underbrace{\frac{\partial{\text{net}^l_j}}{\partial w^l_{ji}}}_{\frac{\partial\left(\sum_{i}w^l_{ji}x^l_i\right)}{\partial w^l_{ji}}=x^l_i}}
# $$
# hence we can write
# $$
# \frac{\partial \ell}{\partial w^l_{ji}}=
# \color{blue}{\delta^l_j}
# \color{forestgreen}{x^l_i}
# $$

# #### For the output layer ($l=L$)
# $$
# \delta^L_j = \frac{\partial \ell}{\partial \text{net}^L_j} =
# \color{red}{
# \overbrace{
# \frac{\partial \ell}{\partial\hat{y}^L_j}}^{\frac{\partial\left(\frac{1}{2}\sum_j{\left(\hat{y}^L_j-y_j\right)^2} \right)}
# {\partial\hat{y}^L_j}=\left(\hat{y}^L_j-y_j\right)}}
# \cdot
# \color{magenta}{
# \underbrace{\frac{\partial\hat{y}^L_j}{\partial\text{net}^L_j}}_{f'(\text{net}_j^L)}}
# =\color{red}{\left(\hat{y}^L_j-y_j\right)}\color{magenta}{f'(\text{net}_j^L)},
# $$
# therefore
# $$
# \frac{\partial \ell}{\partial w^L_{ji}}=
# \color{red}{\left(\hat{y}^L_j-y_j\right)}\color{magenta}{f'(\text{net}_j^L)}
# \color{forestgreen}{x^L_i}
# $$

# #### What about the hidden layers ($1\leq l
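# Before answering that, here is a quick numerical sanity check of the output-layer expression derived above. This is only a sketch under stated assumptions: a single output neuron with logistic activation, one training pair, and the loss $\ell = \frac{1}{2}(\hat{y}-y)^2$; the values are arbitrary and chosen for illustration.

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
x_L = rng.uniform(size=3)   # inputs to the output layer, x^L
w_L = rng.uniform(size=3)   # output-layer weights, w^L
y = 1.0                     # target

def loss(w):
    y_hat = logistic(np.dot(w, x_L))
    return 0.5 * (y_hat - y)**2

# analytical gradient: (y_hat - y) * f'(net) * x, with f'(net) = y_hat * (1 - y_hat)
net = np.dot(w_L, x_L)
y_hat = logistic(net)
grad_analytic = (y_hat - y) * y_hat * (1.0 - y_hat) * x_L

# central finite-difference approximation of the same gradient
eps = 1e-6
grad_numeric = np.array([(loss(w_L + eps*e) - loss(w_L - eps*e)) / (2*eps)
                         for e in np.eye(3)])

print(np.allclose(grad_analytic, grad_numeric))  # expect True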