Thank you for finding your way to my tutorial series on Artificial Neural Networks (ANNs). These guides are intended to be an introduction to Machine Learning (ML) applications of neural nets for people who have at least a moderate level of experience with Python, NumPy, data analysis, linear algebra, and some calculus. It is my hope that these tutorials will help to demystify the inner workings of neural networks and give the reader a strong starting point for designing and engineering neural nets to solve their own unique problems.
This tutorial is written as a Jupyter Notebook in Python 3 so that you can actually run the experiments as you read! I will assume that you have some experience with array-based programming in NumPy and a basic knowledge of linear algebra and calculus. If you don't have prior experience with Python and NumPy, it is probably a good idea to brush up on those tools first. If you don't have a strong background in mathematics, you can still continue and skim over some of the details of the derivations.
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0,6.0)
In machine learning and data science we frequently think in terms of data and models. If we know the model that generated a given dataset, then making predictions is straightforward. More often, however, we have some dataset and wish to derive a model that could plausibly have generated it. If we can find a good model, then we can make predictions about new data as we encounter it.
Another useful way to think about models is in terms of inputs and outputs. For a given observation (the model inputs) the model generates some type of prediction (the model outputs). Statisticians like to call model inputs "predictors" and model outputs "response variables," but those terms have always seemed less descriptive to me.
Regression is the act of finding an approximate model that maps input variables to output variables.
Linear Least-Squares Regression, often simply called Linear Regression, is a straightforward yet incredibly powerful tool for performing regression. Linear regression assumes that our model outputs can be described as a linear combination of our model inputs. Suppose that a single observation to be fed into our model is a row vector \begin{equation} \textbf{x} = [x_1, x_2, \ldots, x_F] \end{equation} where $F$ is the number of input dimensions to our model, also called the number of features. The predictions of a linear regression model can then be described as \begin{equation} y_j = \sum\limits_{i=1}^{F} x_i \cdot w_{i,j} + b_j \end{equation} where $b_j$ is a constant bias value, $w_{i,j}$ is the model weight associated with the $i$'th input dimension and the $j$'th output dimension, and $y_j$ is the predicted value along the $j$'th output dimension.
Linear regression can also conveniently be written entirely in matrix notation. Let $\mathbf{X} \in \mathbb{R}^{N \times F}$ be a matrix where the rows represent observations and the columns represent input dimensions, and let $\mathbf{W} \in \mathbb{R}^{(F+1) \times K}$ be a matrix of weights, where $K$ is the number of output dimensions. A linear regression model can then be written as \begin{equation} \mathbf{Y} = \mathbf{\tilde{X}} \mathbf{W} \end{equation} where the tilde above $\mathbf{X}$ denotes that a column of ones has been appended in order to incorporate our bias terms (which form the last row of $\mathbf{W}$) and where $\mathbf{Y} \in \mathbb{R}^{N \times K}$ is our matrix of predictions.
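To make the shapes concrete, here is a minimal sketch of the matrix form with arbitrary values for $N$, $F$, $K$, $\mathbf{X}$, and $\mathbf{W}$ (the numbers are only illustrative):
# illustrative shape check for the matrix form above (values are arbitrary)
N, F, K = 6, 3, 2                          # observations, features, outputs
X = np.random.normal(size=(N, F))          # input matrix
W = np.random.normal(size=(F + 1, K))      # weights; the last row plays the role of the bias
X_tilde = np.hstack((X, np.ones((N, 1))))  # append the column of ones
Y = X_tilde @ W                            # predictions
print(Y.shape)                             # (6, 2): one K-dimensional prediction per observation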
A linear regression model can also be visualized as a network graph, which illustrates how information flows through the model.
Now that we have established how information flows through a linear regression model, we need to determine the values of $\mathbf{W}$ that allow the model to actually fit a given dataset.
First, we need a target dataset $\mathbf{G} \in \mathbb{R}^{N \times K}$ containing the values we want our model to predict.
We then wish to minimize the squared error \begin{align} E(\mathbf{W}) = (\mathbf{Y} - \mathbf{G})^2 = (\mathbf{\tilde{X}}\mathbf{W} - \mathbf{G})^2 \end{align} where the square is understood element-wise and the result is summed over all entries.
From calculus, recall that the derivative of a differentiable function must be zero at its minimum. The gradient of our error function is the matrix of its partial derivatives, \begin{align} \nabla E(\mathbf{W}) = & 2 \mathbf{\tilde{X}}^T (\mathbf{\tilde{X}}\mathbf{W} - \mathbf{G}) \\ = & 2 \mathbf{\tilde{X}}^T \mathbf{\tilde{X}}\mathbf{W} - 2 \mathbf{\tilde{X}}^T \mathbf{G}. \\ \end{align}
We can then set the gradient to zero in order to find the minimum \begin{align} & \nabla E(\mathbf{W}) = 0 \\ \Rightarrow & 2 \mathbf{\tilde{X}}^T \mathbf{\tilde{X}}\mathbf{W} = 2 \mathbf{\tilde{X}}^T \mathbf{G} \\ \Rightarrow & \mathbf{W} = (\mathbf{\tilde{X}}^T \mathbf{\tilde{X}})^{-1} \mathbf{\tilde{X}}^T \mathbf{G} \\ \end{align} As it turns out, the linear least-squares problem is convex, meaning that there is only one minimum on the error surface. We can verify this by finding the second-order gradient, which is left as an exercise for the reader.
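As a quick sanity check of the derivation (separate from the implementation that follows), the normal-equations solution can be compared against NumPy's built-in least-squares solver on a small random problem; the data here are arbitrary and only serve to confirm that the two agree:
# compare the normal equations with np.linalg.lstsq on random data
N, F, K = 50, 3, 2
X = np.random.normal(size=(N, F))
G = np.random.normal(size=(N, K))
X_tilde = np.hstack((X, np.ones((N, 1))))
W_normal = np.linalg.inv(X_tilde.T @ X_tilde) @ X_tilde.T @ G   # closed-form solution
W_lstsq, *_ = np.linalg.lstsq(X_tilde, G, rcond=None)           # NumPy's least-squares solver
print(np.allclose(W_normal, W_lstsq))                           # expect True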
Let's implement linear regression in Python...
def bias(v):
"""Add a column of ones for the bias term.
"""
ns = v.shape[0]
return np.hstack((v, np.ones(ns).reshape((ns,-1))))
# example
a = np.arange(5)[:,None]
bias(a)
array([[0., 1.],
       [1., 1.],
       [2., 1.],
       [3., 1.],
       [4., 1.]])
class LinearRegression(object):
"""Linear least squares regression
"""
def __init__(self, x, g, **kwargs):
self.train(x, g, **kwargs)
def train(self, x, g):
# analytical solution, fails for underdetermined problems
x1 = bias(x)
self.w = np.linalg.solve(x1.T @ x1, x1.T @ g)
def predict(self, x):
return x @ self.w[:-1] + self.w[-1]
Let's consider a simple example of a linear function with normally distributed noise...
ns = 100
x = np.linspace(0, 1, ns)[:,None]
x -= x.mean(axis=0)
x /= x.std(axis=0)
g = 20.0*x + np.random.normal(size=(ns,1))
g -= g.mean(axis=0)
g /= g.std(axis=0)
lm = LinearRegression(x, g)
y = lm.predict(x)
plt.scatter(x, g)
plt.plot(x, y, color="red");
plt.xlabel("Input")
plt.ylabel("Output");
As its name suggests, linear regression is purely linear in its inputs, meaning that it is not directly able to model nonlinear curves. Consider the nonlinear example below.
# a quadratic curve
g_curved = 10*x**2 + np.random.normal(size=(ns,1))
g_curved -= g_curved.mean(axis=0)
g_curved /= g_curved.std(axis=0)
lm_curved = LinearRegression(x, g_curved)
y_curved = lm_curved.predict(x)
plt.scatter(x, g_curved);
plt.plot(x, y_curved, color="red");
plt.xlabel("Input")
plt.ylabel("Output");
There are, however, many tricks that can be used to model nonlinear problems using the linear least-squares approach. For example, we can transform our inputs so that the model is nonlinear over the original inputs. Below, we combine both $\mathbf{X}$ and $\mathbf{X}^2$ so that our linear regression model can fit our quadratic curve. This approach can be extended further using the kernel trick, which leads to kernel ridge regression (a sketch follows the plot below).
x_curved = np.hstack((x, x**2))
lm_curved2 = LinearRegression(x_curved, g_curved)
y_curved2 = lm_curved2.predict(x_curved)
plt.scatter(x, g_curved);
plt.plot(x, y_curved2, color="red");
plt.xlabel("Input")
plt.ylabel("Output");
One of the primary challenges with transforming the inputs to a linear model is that it requires some level of a priori knowledge about the type of transformation that is necessary. In the above example, we could see from the visualization that a quadratic transformation was appropriate. In sophisticated, high-dimensional problems we often have no way to know what type of input transformation will work.
In order to solve this problem in a generic way, we can add a new layer to our model that contains flexible nonlinearities whose shapes are learned automatically from the data. This approach is known as a Neural Network (NN) and is also sometimes called a Multilayer Perceptron (MLP).
Networks of this form are universal approximators: with enough hidden units, they can approximate any continuous function on a bounded domain to arbitrary accuracy.
Again, assume that our input matrix is $\mathbf{X} \in \mathbb{R}^{N \times F}$. We now have two weight matrices: a hidden weight matrix $\mathbf{W}_h \in \mathbb{R}^{(F+1) \times M}$ for our nonlinearities and a visible weight matrix $\mathbf{W}_v \in \mathbb{R}^{(M+1) \times K}$ for our final linear regression, where $M$ is the number of hidden units.
Equations for the forward pass (to evaluate the network): \begin{align} \mathbf{Z} = & \phi( \mathbf{\tilde{X}} \mathbf{W}_h ) \\ \mathbf{Y} = & \mathbf{\tilde{Z}} \mathbf{W}_v \end{align} where $\phi$ is a nonlinear transfer function applied element-wise (here $\tanh$) and the tildes again denote an appended column of ones.
Our error function is, again, squared error: \begin{align} E = & (\mathbf{Y} - \mathbf{G})^2 \end{align}
We can find the gradient with respect to the visible weights using the chain rule \begin{align} \nabla_{W_v} E = & \nabla_{Y} E \cdot \nabla_{W_v} Y \\ = & \mathbf{\tilde{Z}}^T \, 2(\mathbf{Y} - \mathbf{G}) \end{align} where $\mathbf{\tilde{Z}}$ is the bias-augmented matrix of hidden unit outputs $\phi(\mathbf{\tilde{X}} \mathbf{W}_h)$
and the gradient with respect to the hidden weights by backpropagating the error through the visible layer \begin{align} \nabla_{W_h} E = & \nabla_{Y} E \cdot \nabla_{Z} Y \cdot \nabla_{W_h} Z \\ = & \mathbf{\tilde{X}}^T \left[ 2(\mathbf{Y} - \mathbf{G}) \mathbf{W}_v^T \odot \phi^\prime(\mathbf{\tilde{X}} \mathbf{W}_h) \right] \end{align} where $\odot$ denotes element-wise multiplication and the bias row of $\mathbf{W}_v$ is excluded from $\mathbf{W}_v^T$.
We cannot, however, set the gradient to zero like we did for linear regression.
This is because optimizing a neural network is not a convex problem: the error surface can contain many separate minima, called local minima, and gradient descent may settle into any one of them.
To see this, notice that the hidden units can be permuted: swapping any two hidden units, along with their incoming and outgoing weights, produces a different weight configuration with exactly the same error (see the sketch below).
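Here is a small sketch of that symmetry, using the bias helper from earlier and arbitrary weight values; swapping two hidden units together with their weights leaves the network output, and therefore the error, unchanged:
# swapping two hidden units (and their weights) does not change the output
xs = np.random.normal(size=(5, 1))            # a few arbitrary inputs
wh = np.random.normal(size=(2, 3))            # hidden weights: 1 input + bias, 3 hidden units
wv = np.random.normal(size=(4, 1))            # visible weights: 3 hidden units + bias, 1 output

def net_out(xs, wh, wv):
    return bias(np.tanh(bias(xs) @ wh)) @ wv

perm = [1, 0, 2]                              # swap hidden units 0 and 1
wh_perm = wh[:, perm]                         # permute the columns of the hidden weights
wv_perm = np.vstack((wv[:-1][perm], wv[-1:])) # permute the matching rows of the visible weights
print(np.allclose(net_out(xs, wh, wv), net_out(xs, wh_perm, wv_perm)))  # expect True
With $M$ hidden units there are $M!$ such permutations, plus sign-flip symmetries for odd transfer functions like $\tanh$, so any minimum of the error surface has many equivalent copies.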
def weight_init(size):
"""Initialize NN weight matrices, Lecun fast backprop
"""
return np.random.uniform(-np.sqrt(3.0 / size[0]),
np.sqrt(3.0 / size[0]), size=size)
class ForwardNet(object):
def __init__(self, x, g, nh,
phi=np.tanh,
phi_prime=lambda v: 1.0 - np.tanh(v)**2,
**kwargs):
"""Feedforward neural network with two fully-connected layers.
"""
ni = x.shape[1] # num inputs
no = g.shape[1] # num outputs
self.phi = phi # transfer func
self.phi_prime = phi_prime # grad of trans func
# initialize hidden weights
self.hw = weight_init((ni+1, nh))
# initialize visible weights
self.vw = weight_init((nh+1, no))
self.train(x, g, **kwargs)
def gradient(self, x, g):
"""Compute MSE and error gradients
"""
# forward pass
x1 = bias(x)
h = x1 @ self.hw
z1 = bias(self.phi(h))
z_prime = self.phi_prime(h)
y = z1 @ self.vw
# error components
e = y - g
delta = 2.0 * (y - g) / e.size
# visible layer gradient
vg = z1.T @ delta
# error backprop through visible layer
delta = delta @ self.vw[:-1].T * z_prime
# hidden layer gradient
hg = x1.T @ delta
# error, hidden grad, visible grad
return np.mean(e**2), hg, vg
def train(self, x, g,
learning_rate=0.1, maxiter=5000, tol=1.0e-5):
"""Steepest descent
"""
i = 0
prev_error = np.inf
while True:
error, hg, vg = self.gradient(x, g)
if (i % 100) == 0:
print(i, error)
if np.abs(prev_error - error) < tol:
print("reached tol")
break
if i > maxiter:
print("reached iter")
break
self.hw -= learning_rate * hg
self.vw -= learning_rate * vg
prev_error = error
i += 1
def predict(self, x):
"""Compute predictions for inputs x
"""
h = x @ self.hw[:-1] + self.hw[-1]
z = self.phi(h)
y = z @ self.vw[:-1] + self.vw[-1]
return y
def eval_hidden(self, x):
"""Generate outputs of hidden units.
"""
h = x @ self.hw[:-1] + self.hw[-1]
z = self.phi(h)
return z
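Before training in earnest, it is worth checking the analytic gradients against a finite-difference approximation. This is only a sketch on a small random problem; maxiter=0 is used to keep the constructor's training output minimal, and the tolerances are indicative rather than principled:
# finite-difference check of the hidden-layer gradient from ForwardNet.gradient
x_chk = np.random.normal(size=(20, 2))
g_chk = np.random.normal(size=(20, 1))
net_chk = ForwardNet(x_chk, g_chk, nh=3, maxiter=0)     # constructor trains, so stop almost immediately

_, hg, vg = net_chk.gradient(x_chk, g_chk)              # analytic gradients
eps = 1.0e-6
hg_fd = np.zeros_like(net_chk.hw)
for i in range(net_chk.hw.shape[0]):
    for j in range(net_chk.hw.shape[1]):
        net_chk.hw[i, j] += eps
        e_plus = net_chk.gradient(x_chk, g_chk)[0]      # MSE with weight nudged up
        net_chk.hw[i, j] -= 2.0 * eps
        e_minus = net_chk.gradient(x_chk, g_chk)[0]     # MSE with weight nudged down
        net_chk.hw[i, j] += eps                         # restore the original weight
        hg_fd[i, j] = (e_plus - e_minus) / (2.0 * eps)  # central difference
print(np.allclose(hg, hg_fd, atol=1.0e-6))              # expect True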
Our quadratic example again...
nn_curved = ForwardNet(x, g_curved, nh=5, learning_rate=0.1)
y_curved_nn = nn_curved.predict(x)
plt.scatter(x, g_curved);
plt.plot(x, y_curved_nn, color="red");
plt.xlabel("Input")
plt.ylabel("Output");
0 1.2723877003432735
100 0.1112930946913207
200 0.04307358497418107
300 0.028289726612308108
400 0.022211492667359445
500 0.01915311024513723
600 0.017435331235376474
reached tol
z_curved_nn = nn_curved.eval_hidden(x)
plt.plot(x, z_curved_nn);
plt.xlabel("Input (x)")
plt.ylabel("Hidden unit response (z)");
Now let's try something a little more sophisticated: a linear trend with a sinusoidal ripple.
g_ripple = g + 0.5 * np.sin(4*x)
g_ripple -= g.mean(axis=0)
g_ripple /= g.std(axis=0)
nn_ripple = ForwardNet(x, g_ripple, nh=10, learning_rate=0.1)
y_ripple = nn_ripple.predict(x)
plt.scatter(x, g_ripple);
plt.plot(x, y_ripple, color="red");
plt.xlabel("Input")
plt.ylabel("Output");
0 1.6171976294298265 100 0.11186027633140838 200 0.11024823513110617 300 0.1089209635292726 400 0.10759353403007871 500 0.10608882308015342 600 0.10427279235774833 700 0.10201478818278634 800 0.09916321157636462 900 0.09553540684355748 1000 0.09093770500961687 1100 0.08523943368063996 1200 0.07847174842907936 1300 0.07084730964255596 1400 0.06268887575174613 1500 0.05437468447478249 1600 0.04632270787501674 1700 0.03895016204410438 1800 0.03257916637565257 1900 0.027351673446375374 2000 0.023225231994035086 2100 0.020036337854225216 2200 0.017577306849734263 2300 0.015652417040407568 2400 0.014105440873672801 2500 0.012824823826253638 2600 0.011736725831783376 reached tol
z_ripple = nn_ripple.eval_hidden(x)
plt.plot(x, z_ripple);
plt.xlabel("Input (x)")
plt.ylabel("Hidden unit response (z)");
Steepest descent as implemented above has a number of problems. Gradient descent comes in several variants that differ in how much data is used for each weight update:

- Batch / offline: compute the full gradient over the entire dataset for each update.
- Stochastic / online: update the weights using one observation at a time.
- Mini-batch stochastic: update the weights using a small mini-batch of observations.

Stochastic Gradient Descent (SGD) updates the weights for each observation, or for a small mini-batch of them. It can help avoid local minima, works well if the data cannot fit into memory or the model is continually being updated, and tends to require fewer passes over all the data. On the other hand, it is noisy (stochastic) and performs many small gradient evaluations, which can be collectively expensive.
class ForwardNetSGD(ForwardNet):
"""Feedforward network using Stochastic Gradient Descent
"""
def train(self, x, g, batch_size=10,
learning_rate=0.1, maxiter=5000, tol=1.0e-5):
i = 0
prev_error = np.mean((self.predict(x) - g)**2)
while True:
if i > maxiter:
print("reached iter")
break
            for start in range(0, x.shape[0], batch_size):
                end = min(start + batch_size, x.shape[0])
_, hg, vg = self.gradient(x[start:end], g[start:end])
self.hw -= learning_rate * hg
self.vw -= learning_rate * vg
error = np.mean((self.predict(x) - g)**2)
if (i % 100) == 0:
print(i, error)
if np.abs(prev_error - error) < tol:
print("reached tol")
break
prev_error = error
i += 1
x_scale = 1.0 * x + 2.0
g = 20.0*x_scale + np.random.normal(size=(ns,1))
g -= g.mean(axis=0)
g /= g.std(axis=0)
g_ripple = g + 0.5 * np.sin(4*x_scale)
g_ripple -= g.mean(axis=0)
g_ripple /= g.std(axis=0)
nn_ripple_sgd = ForwardNetSGD(x_scale, g_ripple, nh=10, learning_rate=0.01)
y_ripple_sgd = nn_ripple_sgd.predict(x_scale)
plt.scatter(x_scale, g_ripple);
plt.plot(x_scale, y_ripple_sgd, color="red");
plt.xlabel("Input")
plt.ylabel("Output");
0 1.59381196409321 100 0.13399162650915225 200 0.12721549764481888 300 0.12500571258241205 400 0.12345153113444972 500 0.12195269061012654 600 0.12042761244416149 700 0.1188034098097414 800 0.11695818995401604 900 0.11473273876224432 1000 0.11212174327286119 1100 0.10931380706517672 1200 0.10633119396964195 1300 0.10311223698659912 1400 0.09965964683605268 1500 0.09601783521233984 1600 0.09223883807715462 1700 0.08837034938406786 1800 0.08445184117778629 1900 0.08051310142340302 2000 0.07657667624604574 2100 0.07266380633954346 2200 0.06879864267526976 2300 0.06500370221353298 2400 0.06128314114772787 2500 0.0576004238685464 2600 0.053889210235500216 2700 0.050176934315300174 2800 0.04665651993392122 2900 0.043443541333521976 3000 0.04050268785972002 3100 0.03777499396521457 3200 0.03522088643077816 3300 0.03281362017949456 3400 0.030532830620618352 3500 0.028362322238931977 3600 0.026287711389304334 3700 0.024294650456415066 3800 0.02237228425429318 3900 0.020520069466889012 4000 0.01874927885359521 4100 0.017077298439337075 4200 0.015520881364415977 4300 0.014092478005196212 4400 0.01279923912977919 4500 0.011643205005571073 reached tol
Quasi-Newton methods such as BFGS use the full gradient and assume that the error surface can be approximated by a quadratic function in a local neighborhood of the current weights. Below we simply hand our weights, error function, and gradient to scipy.optimize.minimize using the BFGS method.
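To spell that out, collect all of the weights into a single vector $\mathbf{w}$; a second-order Taylor expansion of the error around the current weights is \begin{align} E(\mathbf{w} + \boldsymbol{\Delta}) \approx E(\mathbf{w}) + \nabla E(\mathbf{w})^T \boldsymbol{\Delta} + \tfrac{1}{2} \boldsymbol{\Delta}^T \mathbf{H} \boldsymbol{\Delta} \end{align} where $\mathbf{H}$ is the Hessian matrix of second derivatives. Minimizing the right-hand side over $\boldsymbol{\Delta}$ gives the Newton step $\boldsymbol{\Delta} = -\mathbf{H}^{-1} \nabla E(\mathbf{w})$. Quasi-Newton methods such as BFGS build up an approximation to $\mathbf{H}^{-1}$ from successive gradient evaluations rather than computing the Hessian directly, which is why we only need to supply an error function and its gradient below.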
import scipy.optimize as spopt
class ForwardNetCG(ForwardNet):
def train(self, x, g,
maxiter=500, tol=1.0e-6):
def error_func(v):
self.hw.flat[...] = v[:self.hw.size]
self.vw.flat[...] = v[self.hw.size:]
return np.mean((self.predict(x) - g)**2)
def grad_func(v):
self.hw.flat[...] = v[:self.hw.size]
self.vw.flat[...] = v[self.hw.size:]
_, hg, vg = self.gradient(x, g)
return np.concatenate((hg.flatten(), vg.flatten()))
def callback(v):
if (callback.i % 100) == 0:
self.hw.flat[...] = v[:self.hw.size]
self.vw.flat[...] = v[self.hw.size:]
print(callback.i, np.mean((self.predict(x) - g)**2))
callback.i += 1
callback.i = 0
method = "BFGS"
options = {"maxiter": maxiter}
w = np.concatenate((self.hw.flat, self.vw.flat))
optres = spopt.minimize(fun=error_func, method=method,
x0=w, tol=tol, jac=grad_func, options=options)#, callback=callback)
print(optres)
print("error: ", np.mean((self.predict(x) - g)**2))
nn_ripple_cg = ForwardNetCG(x, g_ripple, nh=10, maxiter=150)
y_ripple_cg = nn_ripple_cg.predict(x)
plt.scatter(x, g_ripple);
plt.plot(x, y_ripple_cg, color="red");
fun: 0.00246915745731666
hess_inv: array([[ 5.0327e+01, 2.8182e-02, 5.4347e+01, 9.7661e+00, ..., 3.5631e+00, -3.0847e+01, -4.0374e+01, -9.7853e+00],
       [ 2.8182e-02, 2.7082e+01, 1.0074e+02, 2.9949e+01, ..., 1.8472e+01, -2.9060e+01, -1.0531e+02, 4.3195e+01],
       [ 5.4347e+01, 1.0074e+02, 4.8906e+02, 1.3734e+02, ..., 8.2194e+01, -1.6427e+02, -4.8530e+02, 1.6620e+02],
       [ 9.7661e+00, 2.9949e+01, 1.3734e+02, 4.7224e+01, ..., 2.8770e+01, -5.3429e+01, -1.5838e+02, 5.3672e+01],
       ...,
       [ 3.5631e+00, 1.8472e+01, 8.2194e+01, 2.8770e+01, ..., 1.9211e+01, -3.2451e+01, -9.8318e+01, 3.4489e+01],
       [-3.0847e+01, -2.9060e+01, -1.6427e+02, -5.3429e+01, ..., -3.2451e+01, 7.2523e+01, 1.8494e+02, -5.1743e+01],
       [-4.0374e+01, -1.0531e+02, -4.8530e+02, -1.5838e+02, ..., -9.8318e+01, 1.8494e+02, 5.5171e+02, -1.8725e+02],
       [-9.7853e+00, 4.3195e+01, 1.6620e+02, 5.3672e+01, ..., 3.4489e+01, -5.1743e+01, -1.8725e+02, 7.9582e+01]])
jac: array([-9.7644e-04, 2.4594e-04, -5.2496e-06, -4.4166e-04, 5.0041e-05, -1.5085e-04, -6.5780e-04, 2.2680e-04, -1.3022e-04, -5.2381e-04, -9.3809e-04, 1.4269e-04, 6.4173e-06, -3.6616e-04, 1.9276e-04, 2.1294e-04, 1.3697e-03, 1.3742e-04, 3.4747e-05, 3.0874e-04, -1.2928e-05, -1.7463e-04, 2.3611e-04, 1.0086e-04, -3.2683e-04, -1.9041e-04, 2.3476e-04, 5.3742e-05, 3.8784e-04, -2.9698e-04, 2.3511e-04])
message: 'Maximum number of iterations has been exceeded.'
nfev: 155
nit: 150
njev: 155
status: 1
success: False
x: array([-2.0947, -0.0208, 2.9069, 1.1244, -1.2106, -1.035 , 1.5862, -0.0281, 1.7245, -1.6365, 2.2363, -0.9213, 6.8374, -0.9267, -1.1209, -0.1256, 0.7542, 0.2795, 2.2306, -3.0978, -2.6676, 1.3533, -2.8993, -1.8429, 1.963 , 2.1612, 4.7253, 0.6161, -1.3654, -3.2126, 2.4249])
error: 0.00246915745731666