Notebook

Neural Networks

Neural networks are one approach to machine learning that attempts to deal with the problem of large data dimensionality. The neural network approach uses a fixed number of basis functions - in contrast to methods such as support vector machines that attempt to adapt the number of basis functions - that are themselves parameterized by the model parameters. This is a significant departure from linear regression and logistic regression methods where the models consisted of linear combinations of fixed basis functions, $\phi(\mathbf{x})$, that dependend only on the input vector, $\mathbf{x}$. In neural networks, the basis functions can now depend on both the model parameters and the input vector and thus take the form $\phi(\mathbf{x} | \mathbf{w})$.

Here we will cover only feed-forward neural networks. One can envision a neural network as a series of layers where each layer has some number of nodes and each node is a function. Each layer represents a single linear regression model. The nodes are connected to each other through inputs accepted from and outputs passed to other nodes. A feed-forward network is one in which these connections do not form any directed cycles. See here for more detail.

alt text

As a matter of convention, we will refer the model as a $N$ layer model where $N$ is the number of layers for which adaptive parameters, $\mathbf{w}$, must be determined. Thus for a model consisting of an input layer, one hidden layer, and an output layer, we consider $N$ to be 2 since parameters are determined only for the hidden and output layers.

Feed-forward Network Functions

We will consider a basic two layer nueral network model, i.e a model that maps inputs to a hidden layer and then to an output layer. We will make **the following assumptions**

The final output will be a vector $Y$ with $K$ elements, $y_k$, where $y_k(\mathbf{x},\mathbf{w}) = p(C_1|\mathbf{x})$ is the probability that node $k$ is in class $C_1$ and $p(C_2|\mathbf{x}) = 1-p(C_1|\mathbf{x})$
The activation function at a given layer is an arbitrary nonlinear function of a linear combination of the inputs and parameters for that layer
The network is fully connected, i.e. every node at the input layer is connected to every node in the hidden layer and every node in the hidden layer is connected to every node in the output layer
A bias parameter is included at the hidden and output layers

Working from the input layer toward the output layer, we can build this model as follows:

Input Layer

Assume we have an input vector $\mathbf{x} \in \Re^D$. Then the input layer consists of $D+1$ nodes where the value of the $i^{th}$ node for $i=0\ldots D$, is 0 if $i=0$ and $x_i$, i.e. the $i^{th}$ value of $\mathbf{x}$, otherwise.

Hidden Layer

At the hidden layer we construct $M$ nodes where the value of $M$ depends on the specifics of the particular modelling problem. For each node, we define a *unit activation*, $a_m$, for $m=1\ldots M$ as
$a_m = \sum_{i=0}^D w_{ji}^{(1)}x_i$
where the $(1)$ superscript indicates this weight is for the hidden layer. The output from each node, $z_m$, is then given by the value of a *fixed nonlinear function*, $h$, known as the *activation function*, acting on the unit activation
$z_m = h(a_m) = h \left( \sum_{i=0}^D w_{mi}^{(1)}x_i \right)$
Notice that $h$ is the same function for all nodes.

Output Layer

The process at the output layer is essentially the same as at the hidden layer. We construct $K$ nodes, where again the value of $K$ depends on the specific modeling problem. For each node, we again define a *unit activation*, $a_k$, for $k=1 \ldots K$ by
$a_k = \sum_{m=0}^M w_{km}^{(2)} z_m$
We again apply a nonlinear activation function, say $y$, to produce the output
$y_k = y(a_k)$

Thus, the entire model can be summarized as a $K$ dimensional output vector $Y \in \Re^K$ where each element $y_k$ by
$y_k(\mathbf{x},\mathbf{w}) = y \left( \sum_{m=0}^M w_{km}^{(2)} h \left( \sum_{i=0}^D w_{mi}^{(1)}x_i \right) \right)$

Generalizations

There are a wide variety of generalizations possible for this model. Some of the more important ones for practical applications include

Addition of hidden layers
Inclusion of skip-layer connections, e.g. a connection from an input node directly to an output node
Sparse network, i.e. not a fully connected network

Network Training

Here we will consider how to determine the network model parameters. We will use a *maximum likelihood* approach. The likelihood function for the network is dependent on the type of problem the network models. Of particular importance is the number and type of values generated at the output layer. We consider several cases below.

Regression Single Gaussian Target

*Assume* that our target variable, *t*, is Gaussian distributed with an $\mathbf{x}$ dependent mean, $\mu(\mathbf{x})$ and constant variance $\sigma^2 = 1/\beta$. Our network model is designed to estimate the unknown function for the mean, i.e. $y(\mathbf{x},\mathbf{w}) \approx \mu(\mathbf{x})$ so that the distribution for the target value, *t*, is modeled by

$p(t|\mathbf{x},\mathbf{w}) = ND(t|y(\mathbf{x},\mathbf{w}), \beta^{-1})$

where $ND(\mu,\sigma^2)$ is the normal distribution with mean $\mu$ and variance $\sigma^2$. *Assume* the output layer activation function is the identity, i.e. the output is simply the the unit activations, and that we have $N$ i.i.d. observations $\mathbf{X}=\\{\mathbf{x}_1, \ldots, \mathbf{x}_2 \\}$ and target values $\mathbf{t} = \\{t_1, \ldots, t_2\\}$, the likelihood function is

$p(\mathbf{t}|\mathbf{X},\mathbf{w}, \beta) = \prod_{n=1}^N p(t_n|\mathbf{x}_n, \mathbf{w}, \beta)$

The **total error function** is defined as the negative logarithm of the likelihood function given by

$\frac {\beta}{2} \sum_{n=1}^N \\{y(\mathbf{x}_n,\mathbf{w}) - t_n \\}^2 - \frac{N}{2} \ln(\beta) + \frac{N}{2} \ln (2\pi)$

The parameter estimates, $\mathbf{w}^{(ML)}$ (ML indicates maximum likelihood) are found by maximizing the equivalent sum-of-squares error function

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^N \\{y(\mathbf{x}_n,\mathbf{w}) - t_n \\}^2$

Note, due to the nonlinearity of the network model, $E(\mathbf{w})$ is not necessisarily convex, and thus may have local minima and hence it is not possible to know that the global minimum has been obtained. The model parameter, $\beta$ is found by first finding $\mathbf{w}_{ML}$ and then solving

$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^N \\{ y(\mathbf{x}_n,\mathbf{w}^{(ML)}) - t_n \\}^2$

Regression Multiple Gaussian Targets

Now *assume* that we have $K$ Gaussian distributed target variables, $\[t_1, \ldots, t_K\]$, each with a mean that is independently conditional on $\mathbf{x}$, i.e. the mean of $t_k$ is defined by some function $\mu_k(\mathbf{x})$. Also assume that all $K$ variables share the same variance, $\sigma^2=1/\beta$. Assuming the network output layer has $K$ nodes where $y_k(\mathbf{x},\mathbf{w}) \approx \mu_k(\mathbf{x})$ and letting $\mathbf{y}(\mathbf{x},\mathbf{w}) = \[y_1(\mathbf{x},\mathbf{w}), \ldots, y_K(\mathbf{x},\mathbf{w})\]$, and that we again have $N$ training target values $\mathbf{t}$ ($\mathbf{t}$ is a $K \times N$ matrix of the training values), the conditional distribution of the target training values is given by

$p(\mathbf{t}|\mathbf{x},\mathbf{w}) = ND(\mathbf{t} | \mathbf{y}(\mathbf{x},\mathbf{w}), \beta^{-1} \mathbf{I})$

The parameter estimates, $\mathbf{w}^{(ML)}$ are again found by minimizing the sum-of-squares error function, $E(\mathbf{w})$, and the estimate for $\beta$ is found from

$\frac{1}{\beta_{ML}} = \frac{1}{NK}\sum_{n=1}^N ||y(\mathbf{x}_n,\mathbf{w}^{(ML)}) - t_n ||^2$

Binary Classification Single Target

Now *assume* we have a single target variable, $t\in \\{0,1\\}$, such that $t=1$ denotes class $C_1$ and $t=0$ denotes class $C_2$. *Assume* that the network has a single output node whos activation function is the logistic sigmoid

$y=1/(1+\exp(-a))$

where $a$ is the output of the hidden layer. We can interpret the network output, $y(\mathbf{x},\mathbf{w})$ as the conditional probability $p(C_1|\mathbf{x})$ with $p(C_2|\mathbf{x})=1-y(\mathbf{x},\mathbf{w})$ so that the coniditional probability takes a Bernoulli distribution

$p(t|\mathbf{x},\mathbf{w}) = y(\mathbf{x},\mathbf{w})^t\[1 - y(\mathbf{x},\mathbf{w}) \]^{1-t}$

Given a training set with $N$ observations the error function is given by

$E(\mathbf{w}) = - \sum_{n=1}^N \\{t_n \ln (y_n) + (1 - t_n) \ln (1-y_n) \\}$

where $y_n = y(\mathbf{x_n},\mathbf{w})$. This model assumes all training inputs are correctly labeled.

Binary Classification $K$ Seperate Targets

Assume we have $K$ seperate binary classification target variables (i.e. there are K classification sets each with two classes. The K sets are independent and the input will be assigned to one class from each set). This can be modeled with network having $K$ nodes in the output layer where the activation function of each node is the logistic sigmoid. *Assume* the class labels are independent, i.e. $p(x_i \in C_1| x_j \in C_1) = p(x_i \in C_1)$ $\forall \\{i,j\\}$ (this is often an invalid assumption in practice), and that we have some training set, $\mathbf{t}$, then the coniditional probability of a single output vector $\mathbf{t}$ is

$p(\mathbf{t}|\mathbf{x},\mathbf{w}) = \prod_{k=1}^K y_k(\mathbf{x},\mathbf{w})^{t_k} \[ 1 - y_k(\mathbf{x},\mathbf{w})\]^{1-t_k}$

Given a training input with $N$ values (note that the training set is an $K \times N$ matrix) the error function is

$E(\mathbf{w} = - \sum_{n=1}^N \sum_{k=1}^K \[ t_{nk} \ln (y_{nk}) + (1-t_{nk}) \ln (1-y_{nk}) \]$

where $y_{nk} = y_k(\mathbf{x}_n, \mathbf{w}).

Multi-class Classification

Assume we have a $K$ **mutually exclusive** classes (i.e. the input can only be in one of the $K$ classes). This network is also modelled with $K$ output nodes where each output node represents the probability that the input is in class $k$, $y_k(\mathbf{x},\mathbf{w}) = p(t_k = 1 | \mathbf{x})$. The error function is given by

$E(\mathbf{w}) = - \sum_{n=1}^N \sum_{k=1}^K t_{nk} \ln y_k(\mathbf{x},\mathbf{w})$

where the output activation function $y_k$ is the softmax function

$y_k(\mathbf{x},\mathbf{w}) = \frac{\exp(a_k(\mathbf{x},\mathbf{w})}{\sum_j \exp(a_j(\mathbf{x},\mathbf{w}))}$

Error Backpropagation

Most training algorithms involve a two step procedure for minimizing the model parameter dependent error function

Evaluate the derivatives of the error function with respect to the parameters
Use the derivatives to iterate the values of the parameters

In this section, we consider error backpropagation which provides an efficient way to evaluate a feed-forward neural network's error function. Note that this only satisfies step 1 above. It is still necessary to choose an optimization technique in order to use the derivative information provided by error backpropagation to update the parameter estimates as indicated by step 2.

The results presented here are applicable to

An arbitary feed-forward network toplogy
Arbitrary differentiable nonlinear activation functions

The one assumption that is made is that the error function can be expressed as a sum of terms, one for each training data point, $t_n$ for $n \in \\{1\ldots N\\}$, so that

$E(\mathbf{w}) = \sum_{n=1}^N E_n(\mathbf{w},t_n)$

For such error functions, the derivative of the error function with respect to $\mathbf{w}) takes the form

$\bigtriangledown E(\mathbf{w}) = \sum_{n=1}^N \bigtriangledown E_n(\mathbf{w},t_n)$

Thus we need only consider the evaluation of $\bigtriangledown E_n(\mathbf{w},t_n)$ which may be used directly for sequential optimization techniques or summed over the training set for batch techniques (see next section). The error backpropagation method can be summarized as follows

Given an input vector, $\mathbf{x}_n$, forward propagate through the network using
1. $a_j = \sum_i w_{ji}z_i$
2. $z_j = h(a_j)$
Evaluate $\delta_k \equiv \frac{\partial E_n} {\partial a_k}$ for the ouptput layer as
1. $\delta_k = y_k - t_k$
Backpropagate the $\delta$'s to obtain $\delta_j$ for each hidden unit in the network using
1. $\delta_j = h'(a_j) \sum_k w_{kj}\delta_k$
Evaluate the derivatives using
1. $\frac{\partial E_n} {\partial w_{ji}} = \delta_j z_i$

Parameter Optimization

Given the neural network error function derivative estimate provided by *error backpropagation*, one can use a variety of numerical optimization techniques to find appropriate parameter estimates. The simplest technique is the method of *steepest descent* where the new parameter estimate $\mathbf{w}^{(\tau+1)}$ is given by

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \bigtriangledown E(\mathbf{w}^{(\tau)})$

where $\eta>0$ is the learning rate. Other more advanced optimization algorithms inlcude conjugate gradients and quasi-Newton methods.

Example: Exlusive or

As a first example, let's consider the classical [Exlusive Or (XOR) problem](http://en.wikipedia.org/wiki/Exclusive_or). We will consider the XOR problem involving two inputs $x_1$ and $x_2$ where $x_i \in \\{0,1\\}$. This problem states that the output should be 1 if exactly one of the inputs is 1 and 0 otherwise. Thus this problem has a very simple known input output relationship

$x_1$	$x_2$	Output
0	0	0
1	0	1
0	1	1
1	1	0

While this may be a trivial example, it illustrates all of the features of an arbitary feed-forward network while remaining simple enough to understand everything that is happening in the network. The XOR network is typically presented as having 2 input nodes, 2 hidden layer nodes, and one output nodes, for example see here. Such an arrangement however requires that the nodes in the hidden layer use a thresholding scheme whereby the output of the node is 0 or 1 depending on the sum of the input being greater than some fixed threshold value. However, such a network would violoate our model assumptions for training the network. Specifically, we could not use error backpropagation because the node activation functions would be step functions which are not differentiable. To overcome this, we will the hyperbolic tangent, tanh, as the hidden layer activation function and add a third node to the hidden layer representing a bias term. Our output layer will have a single node. We will interpret the out values as being 0 (or false) for output values less than 0 and 1 (or true) for output values greater than 0. The figure below provides a graphical representation of the network. Note, the parameters $w$ and $a$ are distinct for each layer, e.g. $w_{11}$ on the edge between $x_1$ and $h(a_1)$ is not the same $w_{11}$ on the edge from $h(a_1)$ to $\sigma(a_1)$.

"The feedforward neural network for the 2 input Exclusive Or problem"

In [5]:

import networkx as nx
import sys
from math import tanh
sys.path.append('pysrc')
import neural_networks as nn
from functools import partial

DG = nx.DiGraph()
DG.add_nodes_from(['X1','X2','H0','H1','H2','O1'])

#set the activation functions
DG.node['H0']['af']=1.0
for node in ['H1','H2','O1']:
    DG.node[node]['af']=tanh
    
#set the derivatives of the activation functions for all nodes except the output, this is done below
def zero(x):
    return 0
for node in ['X1','X2','H0']: #the inputs and bias terms have zero derivatives
    DG.node[node]['daf'] = zero
def dtanh(x):
    return 1.0 - tanh(x) * tanh(x)
for node in ['H1','H2']:
    DG.node[node]['daf']=dtanh
    
#create the edges
for source in ['X1','X2']:
    for target in ['H1','H2']:
        DG.add_weighted_edges_from([(source,target,1.0)])
for source in ['H0','H1','H2']:
    DG.add_weighted_edges_from([(source, 'O1', 1.0)])
    
#set the input values
DG.node['X1']['af']=0
DG.node['X2']['af']=1

#given these inputs, the correct output should be 1
#we'll use a partial function so we can assign the correct 'daf' value
#dynamically when we iteratively train the network
def dout(x, t):
    if x<0:
        xx = 0
    else:
        xx = 1
    return xx - t
DG.node['O1']['daf']=partial(dout, t=1)

nn.forward_prop(DG)

nn.error_back_prop(DG)

for node in ['X1','X2','H0','H1','H2','O1']:
    print "node {0} has output {1} and error {2}".format(node, DG.node[node]['o'], DG.node[node]['e'])

node X1 has output 0 and error 0.0
node X2 has output 1 and error 0.0
node H0 has output 1.0 and error 0.0
node H1 has output 0.761594155956 and error 0.0
node H2 has output 0.761594155956 and error 0.0
node O1 has output 0.987217029728 and error 0.0

In [ ]: