Neural networks are one approach to machine learning that attempts to deal with the problem of large data dimensionality. The neural network approach uses a fixed number of basis functions - in contrast to methods such as support vector machines that attempt to adapt the number of basis functions - that are themselves parameterized by the model parameters. This is a significant departure from linear regression and logistic regression methods where the models consisted of linear combinations of fixed basis functions, $\phi(\mathbf{x})$, that dependend only on the input vector, $\mathbf{x}$. In neural networks, the basis functions can now depend on both the model parameters and the input vector and thus take the form $\phi(\mathbf{x} | \mathbf{w})$.

Here we will cover only **feed-forward** neural networks. One can envision a neural network as a series of layers where each layer has some number of nodes and each node is a function. Each layer represents a single linear regression model. The nodes are connected to each other through inputs accepted from and outputs passed to other nodes. A *feed-forward* network is one in which these connections do not form any directed cycles. See here for more detail.

As a matter of convention, we will refer the model as a $N$ layer model where $N$ is the number of layers for which adaptive parameters, $\mathbf{w}$, must be determined. Thus for a model consisting of an input layer, one hidden layer, and an output layer, we consider $N$ to be 2 since parameters are determined only for the hidden and output layers.

- The final output will be a vector $Y$ with $K$ elements, $y_k$, where $y_k(\mathbf{x},\mathbf{w}) = p(C_1|\mathbf{x})$ is the probability that node $k$ is in class $C_1$ and $p(C_2|\mathbf{x}) = 1-p(C_1|\mathbf{x})$
- The activation function at a given layer is an arbitrary nonlinear function of a linear combination of the inputs and parameters for that layer
- The network is fully connected, i.e. every node at the input layer is connected to every node in the hidden layer and every node in the hidden layer is connected to every node in the output layer
- A bias parameter is included at the hidden and output layers

Working from the input layer toward the output layer, we can build this model as follows:

$a_m = \sum_{i=0}^D w_{ji}^{(1)}x_i$

where the $(1)$ superscript indicates this weight is for the hidden layer. The output from each node, $z_m$, is then given by the value of a *fixed nonlinear function*, $h$, known as the *activation function*, acting on the unit activation

$z_m = h(a_m) = h \left( \sum_{i=0}^D w_{mi}^{(1)}x_i \right)$

Notice that $h$ is the same function for all nodes.

$a_k = \sum_{m=0}^M w_{km}^{(2)} z_m$

We again apply a nonlinear activation function, say $y$, to produce the output

$y_k = y(a_k)$

Thus, the entire model can be summarized as a $K$ dimensional output vector $Y \in \Re^K$ where each element $y_k$ by

$y_k(\mathbf{x},\mathbf{w}) = y \left( \sum_{m=0}^M w_{km}^{(2)} h \left( \sum_{i=0}^D w_{mi}^{(1)}x_i \right) \right)$

- Addition of hidden layers
- Inclusion of
*skip-layer*connections, e.g. a connection from an input node directly to an output node - Sparse network, i.e. not a fully connected network

$p(t|\mathbf{x},\mathbf{w}) = ND(t|y(\mathbf{x},\mathbf{w}), \beta^{-1})$

where $ND(\mu,\sigma^2)$ is the normal distribution with mean $\mu$ and variance $\sigma^2$. *Assume* the output layer activation function is the identity, i.e. the output is simply the the unit activations, and that we have $N$ i.i.d. observations $\mathbf{X}=\\{\mathbf{x}_1, \ldots, \mathbf{x}_2 \\}$ and target values $\mathbf{t} = \\{t_1, \ldots, t_2\\}$, the likelihood function is

$p(\mathbf{t}|\mathbf{X},\mathbf{w}, \beta) = \prod_{n=1}^N p(t_n|\mathbf{x}_n, \mathbf{w}, \beta)$

The **total error function** is defined as the negative logarithm of the likelihood function given by

$\frac {\beta}{2} \sum_{n=1}^N \\{y(\mathbf{x}_n,\mathbf{w}) - t_n \\}^2 - \frac{N}{2} \ln(\beta) + \frac{N}{2} \ln (2\pi)$

The parameter estimates, $\mathbf{w}^{(ML)}$ (ML indicates maximum likelihood) are found by maximizing the equivalent sum-of-squares error function

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^N \\{y(\mathbf{x}_n,\mathbf{w}) - t_n \\}^2$

Note, due to the nonlinearity of the network model, $E(\mathbf{w})$ is not necessisarily convex, and thus may have *local minima* and hence it is not possible to know that the global minimum has been obtained. The model parameter, $\beta$ is found by first finding $\mathbf{w}_{ML}$ and then solving

$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^N \\{ y(\mathbf{x}_n,\mathbf{w}^{(ML)}) - t_n \\}^2$

$p(\mathbf{t}|\mathbf{x},\mathbf{w}) = ND(\mathbf{t} | \mathbf{y}(\mathbf{x},\mathbf{w}), \beta^{-1} \mathbf{I})$

The parameter estimates, $\mathbf{w}^{(ML)}$ are again found by minimizing the sum-of-squares error function, $E(\mathbf{w})$, and the estimate for $\beta$ is found from

$\frac{1}{\beta_{ML}} = \frac{1}{NK}\sum_{n=1}^N ||y(\mathbf{x}_n,\mathbf{w}^{(ML)}) - t_n ||^2$

$y=1/(1+\exp(-a))$

where $a$ is the output of the hidden layer. We can interpret the network output, $y(\mathbf{x},\mathbf{w})$ as the conditional probability $p(C_1|\mathbf{x})$ with $p(C_2|\mathbf{x})=1-y(\mathbf{x},\mathbf{w})$ so that the coniditional probability takes a Bernoulli distribution

$p(t|\mathbf{x},\mathbf{w}) = y(\mathbf{x},\mathbf{w})^t\[1 - y(\mathbf{x},\mathbf{w}) \]^{1-t}$

Given a training set with $N$ observations the error function is given by

$E(\mathbf{w}) = - \sum_{n=1}^N \\{t_n \ln (y_n) + (1 - t_n) \ln (1-y_n) \\}$

where $y_n = y(\mathbf{x_n},\mathbf{w})$. This model *assumes* all training inputs are correctly labeled.

$p(\mathbf{t}|\mathbf{x},\mathbf{w}) = \prod_{k=1}^K y_k(\mathbf{x},\mathbf{w})^{t_k} \[ 1 - y_k(\mathbf{x},\mathbf{w})\]^{1-t_k}$

Given a training input with $N$ values (note that the training set is an $K \times N$ matrix) the error function is

$E(\mathbf{w} = - \sum_{n=1}^N \sum_{k=1}^K \[ t_{nk} \ln (y_{nk}) + (1-t_{nk}) \ln (1-y_{nk}) \]$

where $y_{nk} = y_k(\mathbf{x}_n, \mathbf{w}).

$E(\mathbf{w}) = - \sum_{n=1}^N \sum_{k=1}^K t_{nk} \ln y_k(\mathbf{x},\mathbf{w})$

where the output activation function $y_k$ is the *softmax function*

$y_k(\mathbf{x},\mathbf{w}) = \frac{\exp(a_k(\mathbf{x},\mathbf{w})}{\sum_j \exp(a_j(\mathbf{x},\mathbf{w}))}$

- Evaluate the derivatives of the error function with respect to the parameters
- Use the derivatives to iterate the values of the parameters

In this section, we consider **error backpropagation** which provides an efficient way to evaluate a feed-forward neural network's error function. *Note* that this only satisfies step 1 above. It is still necessary to choose an optimization technique in order to use the derivative information provided by error backpropagation to update the parameter estimates as indicated by step 2.

The results presented here are applicable to

- An
**arbitary**feed-forward network toplogy **Arbitrary**differentiable nonlinear activation functions

The one *assumption* that is made is that the error function can be expressed as a *sum of terms*, one for each training data point, $t_n$ for $n \in \\{1\ldots N\\}$, so that

$E(\mathbf{w}) = \sum_{n=1}^N E_n(\mathbf{w},t_n)$

For such error functions, the derivative of the error function with respect to $\mathbf{w}) takes the form

$\bigtriangledown E(\mathbf{w}) = \sum_{n=1}^N \bigtriangledown E_n(\mathbf{w},t_n)$

Thus we need only consider the evaluation of $\bigtriangledown E_n(\mathbf{w},t_n)$ which may be used directly for sequential optimization techniques or summed over the training set for batch techniques (see next section). The error backpropagation method can be summarized as follows

- Given an input vector, $\mathbf{x}_n$, forward propagate through the network using
- $a_j = \sum_i w_{ji}z_i$
- $z_j = h(a_j)$

- Evaluate $\delta_k \equiv \frac{\partial E_n} {\partial a_k}$ for the ouptput layer as
- $\delta_k = y_k - t_k$

- Backpropagate the $\delta$'s to obtain $\delta_j$ for each hidden unit in the network using
- $\delta_j = h'(a_j) \sum_k w_{kj}\delta_k$

- Evaluate the derivatives using
- $\frac{\partial E_n} {\partial w_{ji}} = \delta_j z_i$

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \bigtriangledown E(\mathbf{w}^{(\tau)})$

where $\eta>0$ is the *learning rate*. Other **more advanced** optimization algorithms inlcude *conjugate gradients* and *quasi-Newton* methods.

$x_1$ | $x_2$ | Output |
---|---|---|

0 | 0 | 0 |

1 | 0 | 1 |

0 | 1 | 1 |

1 | 1 | 0 |

While this may be a trivial example, it illustrates all of the features of an arbitary feed-forward network while remaining simple enough to understand everything that is happening in the network. The XOR network is typically presented as having 2 input nodes, 2 hidden layer nodes, and one output nodes, for example see here. Such an arrangement however requires that the nodes in the hidden layer use a thresholding scheme whereby the output of the node is 0 or 1 depending on the sum of the input being greater than some fixed threshold value. However, such a network would violoate our model assumptions for training the network. Specifically, we could not use error backpropagation because the node activation functions would be step functions which are not differentiable. To overcome this, we will the *hyperbolic tangent*, *tanh*, as the hidden layer activation function and add a third node to the hidden layer representing a bias term. Our output layer will have a single node. We will interpret the out values as being 0 (or false) for output values less than 0 and 1 (or true) for output values greater than 0. The figure below provides a graphical representation of the network. Note, the parameters $w$ and $a$ are distinct for each layer, e.g. $w_{11}$ on the edge between $x_1$ and $h(a_1)$ is not the same $w_{11}$ on the edge from $h(a_1)$ to $\sigma(a_1)$.

"The feedforward neural network for the 2 input Exclusive Or problem"

In [5]:

```
import networkx as nx
import sys
from math import tanh
sys.path.append('pysrc')
import neural_networks as nn
from functools import partial
DG = nx.DiGraph()
DG.add_nodes_from(['X1','X2','H0','H1','H2','O1'])
#set the activation functions
DG.node['H0']['af']=1.0
for node in ['H1','H2','O1']:
DG.node[node]['af']=tanh
#set the derivatives of the activation functions for all nodes except the output, this is done below
def zero(x):
return 0
for node in ['X1','X2','H0']: #the inputs and bias terms have zero derivatives
DG.node[node]['daf'] = zero
def dtanh(x):
return 1.0 - tanh(x) * tanh(x)
for node in ['H1','H2']:
DG.node[node]['daf']=dtanh
#create the edges
for source in ['X1','X2']:
for target in ['H1','H2']:
DG.add_weighted_edges_from([(source,target,1.0)])
for source in ['H0','H1','H2']:
DG.add_weighted_edges_from([(source, 'O1', 1.0)])
#set the input values
DG.node['X1']['af']=0
DG.node['X2']['af']=1
#given these inputs, the correct output should be 1
#we'll use a partial function so we can assign the correct 'daf' value
#dynamically when we iteratively train the network
def dout(x, t):
if x<0:
xx = 0
else:
xx = 1
return xx - t
DG.node['O1']['daf']=partial(dout, t=1)
nn.forward_prop(DG)
nn.error_back_prop(DG)
for node in ['X1','X2','H0','H1','H2','O1']:
print "node {0} has output {1} and error {2}".format(node, DG.node[node]['o'], DG.node[node]['e'])
```

In [ ]:

```
```