In this Guided Project, we'll explore the effectiveness of deep, feedforward neural networks at classifying images.

In [32]:

```
# Scikit-learn contains a number of datasets pre-loaded with the library, within the namespace of sklearn.datasets.
# The load_digits() function returns a copy of the hand-written digits dataset from UCI.
from sklearn.datasets import load_digits
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

In [33]:

```
# Create a dataframe that contains the handwritten digits features and a series that contains the corresponding labels
handwritten_digits_features, handwritten_digits_labels = load_digits(return_X_y=True)
handwritten_digits_features = pd.DataFrame(handwritten_digits_features)
handwritten_digits_labels = pd.Series(handwritten_digits_labels)
```

In [34]:

```
handwritten_digits_features.head(), handwritten_digits_labels.head()
```

Out[34]:

**In handwritten_digits_features, each image (one per row) has 64 grayscale values, which means the original images were 8×8 pixels.**

In [35]:

```
# Reshape a sample of rows to their original 8x8 pixel grids and visualize them
rows = [0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500]
fig = plt.figure(figsize=(3, 3))
i = 1
for row in rows:
    # Transform the selected data point to an np array, then reshape to the original grid
    img = handwritten_digits_features.iloc[row].values.reshape(8, 8)
    ax = fig.add_subplot(4, 4, i)
    ax.imshow(img, cmap='gray_r')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    i += 1
```

In [36]:

```
# Train k-nearest neighbors models.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

# Perform 4-fold cross validation using k-nearest neighbors
def cross_validate(features, labels, n):
    # random_state only takes effect when shuffle=True
    kf = KFold(n_splits=4, shuffle=True, random_state=1)
    scores = []
    # Split data
    for train_idx, test_idx in kf.split(features, labels):
        X_train, X_test = features.loc[train_idx], features.loc[test_idx]
        y_train, y_test = labels.loc[train_idx], labels.loc[test_idx]
        # Train & predict on each fold
        knn = KNeighborsClassifier(n_neighbors=n)
        knn.fit(X_train, y_train)
        predictions = knn.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        scores.append(accuracy)
    return scores
```

In [37]:

```
# Measure average accuracy across different values of k
k_num = range(1, 16)
avg_accuracy = []
for n in k_num:
    scores = cross_validate(handwritten_digits_features, handwritten_digits_labels, n)
    avg_accuracy.append(np.mean(scores))
```

In [38]:

```
# Visualize how accuracy varies with k
plt.plot(k_num, avg_accuracy)
plt.xlim(1, 15)
```

Out[38]:

**While the accuracy score is not bad, there are a few downsides to using k-nearest neighbors:**

- high memory usage (for each new unseen observation, many comparisons need to be made to seen observations)
- no model representation to debug and explore

Let's now try a neural network with a single hidden layer. scikit-learn offers one out of the box in its MLPClassifier class, but below we build the network from scratch to see how it works.

*Based on guides from the Coursera Machine Learning course by Andrew Ng (Stanford).*
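Before building the network by hand, it may help to see the off-the-shelf route. This is a minimal sketch using scikit-learn's MLPClassifier; the hidden layer width (32) and iteration cap are illustrative choices, not tuned values:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# One hidden layer with 32 neurons; max_iter kept modest for a quick demo
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=1)
mlp.fit(X_tr, y_tr)
test_accuracy = mlp.score(X_te, y_te)
```

With settings like these, the holdout accuracy typically lands well above 90%, which gives a baseline to compare the from-scratch model against.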

In [39]:

```
# Split data into train and test sets
from sklearn.model_selection import train_test_split

# First split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(handwritten_digits_features,
                                                    handwritten_digits_labels,
                                                    test_size=0.2,
                                                    random_state=1)
# Second split: carve 25% of the remaining 80% off as validation,
# for a final 60% train / 20% validation / 20% test split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size=0.25,
                                                  random_state=1)
```

In [40]:

```
# Vectorized sigmoid function
def sigmoid(X):
    return 1/(1 + np.exp(-X))
```

When training neural networks, it is important to randomly initialize the parameters for symmetry breaking. One effective strategy for random initialization is to randomly select values for $\Theta^{l}$ uniformly in the range $[-\epsilon_{init}, \epsilon_{init}]$.^{1, 2} This range of values ensures that the parameters are kept small and makes the learning more efficient.

- One effective strategy for choosing $\epsilon_{init}$ is to base it on the number of units in the network. A good choice is $\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in}+L_{out}}}$, where $L_{in} = s_{l}$ and $L_{out} = s_{l+1}$ are the numbers of units (neurons) in the layers adjacent to $\Theta^{(l)}$.
- For $x \in [0, 1]$, $x\times2\epsilon_{init}-\epsilon_{init} \in [-\epsilon_{init}, \epsilon_{init}]$.
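The scaling identity above can be checked numerically. A small sketch (the layer sizes 64 and 8 are arbitrary examples, not taken from the model):

```python
import numpy as np

L_in, L_out = 64, 8
eps = np.sqrt(6) / np.sqrt(L_in + L_out)

rng = np.random.default_rng(0)
x = rng.random((L_out, L_in + 1))   # uniform draws in [0, 1)
w = x * 2 * eps - eps               # mapped into [-eps, eps)

# Every entry should now lie in the symmetric interval around zero
in_range = (w >= -eps).all() and (w < eps).all()
```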

In [41]:

```
# Randomly initialize weights; hidden_layer_sizes is a list of hidden layer widths
# Returns the same weights in both matrix form and unrolled (vector) form
def rand_weights_init(X, n_labels, hidden_layer_sizes):
    # Number of neurons in the input layer
    input_layer_size = X.shape[1]
    # Number of neurons in every layer of the neural network model
    all_layers = [input_layer_size] + hidden_layer_sizes + [n_labels]
    n_layers = len(all_layers)
    # Weights for all layers, in matrix form
    weights = []
    # The same weights unrolled, for later use
    weights_vec = []
    for n in range(n_layers - 1):
        in_num = all_layers[n]
        out_num = all_layers[n+1]
        # Initialize epsilon based on the adjacent layer sizes
        epsilon_init = np.sqrt(6)/np.sqrt(in_num + out_num)
        np.random.seed(0)
        # w has size (out_num, 1 + in_num); the first column handles the "bias" terms
        w = np.random.rand(out_num, 1 + in_num) * 2 * epsilon_init - epsilon_init
        # Unroll w for later use in fmin_cg to minimize the cost
        w_vec = np.concatenate(w)
        weights.append(w)
        weights_vec.append(w_vec)
    print('Number of weight matrices: ', len(weights))
    return weights, np.concatenate(weights_vec)
```

In [42]:

```
# Test run with weights initialization: 1 hidden layer with 8 neurons
weights, weights_vector = rand_weights_init(X_train, 10, [8])
print('Theta1 shape: ', weights[0].shape, '\n','Theta2 shape: ', weights[1].shape)
print('Unrolled weights: ', weights_vector.shape)
```

In [43]:

```
# Test run with weights initialization: 2 hidden layers, both with 8 neurons
weights2, weights_vector2 = rand_weights_init(X_train, 10, [8, 8])
print('Theta1 shape: ', weights2[0].shape,
'\n','Theta2 shape: ', weights2[1].shape,
'\n','Theta3 shape: ', weights2[2].shape)
print('Unrolled weights: ', weights_vector2.shape)
```

The cost function for the neural network without regularization is:
$ J(\theta)= \frac{1}{m}\sum\limits_{i=1}^m\sum\limits_{k=1}^K[-y^{(i)}_{k}\log((h_\theta(x^{(i)}))_k)-(1-y^{(i)}_{k})\log(1-(h_\theta(x^{(i)}))_k)] $, where $m$ is the number of samples, $K$ is the number of labels, and $h_\theta(x)$ is the hypothesis, in our case built from the sigmoid function.

The regularized cost function adds the regularization term: $ \frac{\lambda}{2m}\sum\limits_{l = 1}^{L-1}\sum\limits_{no=1}^{N_{out}}\sum\limits_{ni=1}^{N_{in}}(\Theta^{(l)}_{no, ni})^{2} $, where $L$ is the total number of layers in the model, and $N_{in}$ and $N_{out}$ are the numbers of units (neurons) in the layers adjacent to $\Theta^{(l)}$.
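To make the formula concrete, here is the unregularized cost evaluated on a tiny two-sample, two-class example with made-up probabilities:

```python
import numpy as np

# Two samples, two classes: one-hot targets and predicted probabilities
y = np.array([[1, 0],
              [0, 1]])
h = np.array([[0.9, 0.1],
              [0.2, 0.8]])
m = y.shape[0]

# J = (1/m) * sum over samples and classes of -y*log(h) - (1-y)*log(1-h)
J = np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)) / m   # ≈ 0.3285
```

Note the natural log: the cost is low here because both samples put most of their probability mass on the correct class.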

In [44]:

```
# Unregularized cost function
def cost_func(y_prob, y_actual, sample_size):
    """
    Convert y_actual into dummy/indicator (one-hot) variables.
    Note: specify dtype when converting the categorical dataframe to a numpy array.
    If dtype is not specified, -y operates on a string type, where y = 1 gives -y = 255.
    """
    y = np.asarray(pd.get_dummies(y_actual.astype('category')), dtype=np.int64)
    # Compute cost (natural log, to match the cost function above)
    cost = np.sum(-y*np.log(y_prob) - (1 - y)*np.log(1 - y_prob))/sample_size
    return cost
```
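The cost function relies on `pd.get_dummies` to one-hot encode the labels. A quick sketch of what that produces, on a made-up four-sample, three-class label series:

```python
import numpy as np
import pandas as pd

labels = pd.Series([0, 2, 1, 2])
one_hot = np.asarray(pd.get_dummies(labels), dtype=np.int64)
# Each row has a single 1, in the column corresponding to its label
```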

In [45]:

```
# Feedforward using the sigmoid function
def feedforward(vec_weights, hidden_layer_sizes, n_labels, X, y, alpha):
    # Define some useful variables
    n_samples = X.shape[0]
    input_layer_size = X.shape[1]
    # Number of neurons in every layer of the neural network model
    all_layers = [input_layer_size] + hidden_layer_sizes + [n_labels]
    n_activation = len(all_layers) - 1
    # Initialize input
    x = np.asarray(X)
    # List of weights in matrix form
    weights_matrix = []
    # Weights without bias, for the regularization term
    weights_no_bias = []
    # Activations x, for calculating the 'error term' in backpropagation later
    activations_x = []
    # Start of the current layer's slice in the unrolled weights vector
    init_slice = 0
    # Feedforward
    for n in range(n_activation):
        # Add bias column
        x = np.insert(x, 0, 1, axis=1)
        # Append activation x to activations; the last activation is also the output
        activations_x.append(x)
        # Reshape this layer's slice of the weights vector into a weights matrix
        weights_count = (all_layers[n] + 1)*all_layers[n+1]
        weights = np.reshape(vec_weights[init_slice:(init_slice + weights_count)],
                             (all_layers[n+1], all_layers[n] + 1))
        # Append weights to the list of weights in matrix form
        weights_matrix.append(weights)
        # Compute activation
        activation = sigmoid(np.dot(x, weights.T))
        # Assign output as the input for the next layer, or as final output when the loop ends
        x = activation
        # Get weights without the bias column for the regularization term
        no_bias = np.delete(weights, 0, axis=1)
        # Unroll the weights of each layer into a 1d array & append to weights_no_bias
        weights_no_bias.append(np.concatenate(no_bias))
        # Advance the slice past this layer's weights
        init_slice += weights_count
    # The final activation is the output: one probability per class
    y_prob = x.copy()
    # Take the index of the max value of each row as the prediction
    y_pred = np.argmax(y_prob, axis=1)
    # Compute unregularized cost
    cost = cost_func(y_prob, y, n_samples)
    # Unroll the bias-free weights into one 1d array
    weights_no_bias = np.concatenate(weights_no_bias)
    # Regularization term
    reg_term = np.sum(weights_no_bias**2)*alpha/(2*n_samples)
    # Compute regularized cost
    reg_cost = cost + reg_term
    return reg_cost, y_prob, y_pred, activations_x, weights_matrix
```

In [46]:

```
# Test feedforward with weights initialization:
# 1 hidden layer with 8 neurons
# from the rand_weights_init test run, with alpha in range(5)
for a in range(5):
    reg_cost1, probabilities1, predictions1, activations_x1, weights_matrix1 = feedforward(
        vec_weights=weights_vector,
        hidden_layer_sizes=[8],
        n_labels=10,
        X=X_train, y=y_train,
        alpha=a)
    print('With alpha {}, the regularized cost is: '.format(a), reg_cost1)
    print('*'*55)
print('Probabilities shape: ', probabilities1.shape)
print('Predictions shape: ', predictions1.shape, 'predictions: ', predictions1)
print('Activations number:', len(activations_x1))
print('Activation x1 shape', activations_x1[0].shape)
print('Activation x2 shape', activations_x1[1].shape)
print('weights 1 shape', weights_matrix1[0].shape)
print('weights 2 shape', weights_matrix1[1].shape)
```

In [47]:

```
# Test feedforward with weights initialization:
# 2 hidden layers, both with 8 neurons
# from the rand_weights_init test run, with alpha in range(5)
for a in range(5):
    reg_cost2, probabilities2, predictions2, activations_x2, weights_matrix2 = feedforward(
        vec_weights=weights_vector2,
        hidden_layer_sizes=[8, 8],
        n_labels=10,
        X=X_train, y=y_train,
        alpha=a)
    print('With alpha {}, the regularized cost is: '.format(a), reg_cost2)
    print('*'*55)
print('Probabilities shape: ', probabilities2.shape)
print('Predictions shape: ', predictions2.shape, 'predictions: ', predictions2)
print('Activations x number:', len(activations_x2))
print('Activation x1 shape', activations_x2[0].shape)
print('Activation x2 shape', activations_x2[1].shape)
print('Activation x3 shape', activations_x2[2].shape)
print('weights 1 shape', weights_matrix2[0].shape)
print('weights 2 shape', weights_matrix2[1].shape)
print('weights 3 shape', weights_matrix2[2].shape)
```

*The intuition behind the backpropagation algorithm is as follows:*

Given a training example $(x^{(t)},y^{(t)})$, we will first run a “forward pass” to compute all the activations throughout the network, including the output value of the hypothesis $h_Θ(x)$.

Then, for each node $j$ in layer $l$, we would like to compute an “error term” $\delta_j^{(l)}$ that measures how much that node was “responsible” for any errors in our output. For an output node, we can directly measure the difference between the network’s activation and the true target value, and use that to define $\delta_j^{(output)}$. For the hidden units, we compute $\delta_j^{(l)}$ based on a weighted average of the error terms of the nodes in layer $(l + 1)$.

- Sigmoid function: $ g(x) = \frac{1}{1+ e^{-x}} = (1+e^{-x})^{-1} $
- Apply the chain rule: $ g'(x) = -1\times(1+e^{-x})^{-2}\times(1+e^{-x})' $
- Apply the exponential rule: $ g'(x) = -(1+e^{-x})^{-2}\times(e^{-x}\times(-x)') $
- $ g'(x) = -(1+e^{-x})^{-2}\times(e^{-x}\times-1) $
- $ g'(x) = \frac{1}{(1+e^{-x})^{2}}\times e^{-x} $
- $ g'(x) = \frac{1}{1+e^{-x}}\times\frac{e^{-x}}{1+e^{-x}} $
- $ g'(x) = g(x)\times\frac{1+e^{-x}-1}{1+e^{-x}} $
- $ g'(x) = g(x)\times(1-g(x)) $
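The closing identity $g'(x) = g(x)(1-g(x))$ can also be checked against a central finite difference, a quick numeric sketch:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
h = 1e-6
# Central difference approximation of the derivative
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
# Closed-form derivative from the derivation above
analytic = sigmoid(x) * (1 - sigmoid(x))
max_err = np.abs(numeric - analytic).max()
```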

In [48]:

```
# Derivative of the sigmoid function
def sigmoid_gradient(X):
    return sigmoid(X)*(1 - sigmoid(X))
```

In [49]:

```
# Sanity check sigmoid_gradient
sigmoid_gradient(0)
```

Out[49]:

In [50]:

```
"""
Implement the backpropagation algorithm to compute the gradients of weights to obtain
the gradient for the neural network cost function.
This function returns the partial derivatives of the cost function with respect to the weights in each hidden layer,
for later use in fmin_cg function to minimize cost.
"""
def backpropagation(vec_weights, hidden_layer_sizes, n_labels, X, y, alpha):
reg_cost, y_prob, y_pred, activations_x, weights = feedforward(vec_weights,
hidden_layer_sizes,
n_labels,
X, y, alpha)
n_activations = len(activations_x)
n_samples = X.shape[0]
# Turn y into dummy index
y = pd.get_dummies(y.astype('category'))
y = np.array(y, dtype = np.int64)
# Compute 'error term' delta of the output layer
last_layer_weights_delta = y_prob - y
# Initialze a list of deltas from each hidden layer
layer_weights_delta = []
# Compute 'error term' delta of the hidden layers
for n in reversed(range(n_activations)):
delta = np.dot(last_layer_weights_delta.T, activations_x[n])
avg_delta = delta/n_samples
# Compute regularization of the corresponding weights_grad
weights_temp = weights[n].copy()
# Replace the first column of weights with 0 to leave out bias column in regularized cost
weights_temp[:,1] = 0
reg_term = (alpha/n_samples)*weights_temp.copy()
# Update weights_grad with regularization term
layer_grad_reg = avg_delta+reg_term
print('Theta gradient {} shape: '.format(n+1), layer_grad_reg.shape)
# Unroll delta and insert it as the first item in weights_gradient list
layer_weights_delta.insert(0, np.concatenate(layer_grad_reg))
# Update last_layer_weights_delta
last_layer_weights_delta = np.dot(last_layer_weights_delta,
weights[n][:,1:])*sigmoid_gradient(activations_x[n][:,1:])
return np.concatenate(layer_weights_delta)
```

In [51]:

```
# Test backpropagation with the feedforward test setup (2 hidden layers, both with 8 neurons,
# to match weights_vector2)
weights_grad = backpropagation(vec_weights=weights_vector2,
                               hidden_layer_sizes=[8, 8],
                               n_labels=10,
                               X=X_train, y=y_train,
                               alpha=1)
weights_grad.shape
```

Out[51]:

*Put all the functions we build so far together, as one function that computes the regularized Neural Network model cost and weights gradient*

In [52]:

```
# Create a function to minimize that returns the cost
def nn_cost(vec_weights, *args):
    hidden_layer_sizes, n_labels, X, y, alpha = args
    reg_cost, y_prob, predictions, activations_x, weights = feedforward(vec_weights,
                                                                        hidden_layer_sizes,
                                                                        n_labels,
                                                                        X, y,
                                                                        alpha)
    return reg_cost
```

In [53]:

```
# Create a function to minimize that returns the cost gradient with respect to the weights
def nn_cost_gradient(vec_weights, *args):
    hidden_layer_sizes, n_labels, X, y, alpha = args
    return backpropagation(vec_weights,
                           hidden_layer_sizes,
                           n_labels,
                           X, y,
                           alpha)
```

In [54]:

```
# Initialize new weights with rand_weights_init and test nn_cost, nn_cost_gradient
# (the hidden layer sizes must match between the weights and the args tuple)
test_weights_m, test_weights_vec = rand_weights_init(X_train, 10, [8])
args = ([8], 10, X_train, y_train, 0)
cost = nn_cost(test_weights_vec, *args)
grad = nn_cost_gradient(test_weights_vec, *args)
print(cost)
```

*Using the function fmin_cg from the SciPy library to minimize our cost function*
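Before applying it to the network, here is the shape of the `fmin_cg` API on a simple quadratic (a minimal sketch unrelated to the digits model): it takes the cost function `f`, an initial parameter vector `x0`, and optionally the analytic gradient `fprime`.

```python
import numpy as np
from scipy.optimize import fmin_cg

def f(x):
    # Quadratic bowl with its minimum at (1, -2)
    return (x[0] - 1) ** 2 + (x[1] + 2) ** 2

def fprime(x):
    # Analytic gradient of f
    return np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])

x_opt = fmin_cg(f=f, x0=np.zeros(2), fprime=fprime, disp=False)
```

Supplying `fprime` avoids the slow numerical gradient approximation that `fmin_cg` otherwise falls back on.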

In [55]:

```
from scipy.optimize import fmin_cg
from scipy.optimize import minimize
```

In [62]:

```
weights_1_m, weights_1_v = rand_weights_init(X_train, 10, [8])
```

In [64]:

```
# Minimize nn_cost, supplying the analytic gradient from backpropagation
optim_weights = fmin_cg(f=nn_cost, x0=weights_1_v, fprime=nn_cost_gradient,
                        args=args, maxiter=100)
```

In [65]:

```
nn_cost(optim_weights, *args)
```

Out[65]:

In [66]:

```
reg_cost_optm, y_prob_optm, predictions_optm, activations_x_optm, weights_optm = feedforward(
    optim_weights, [8], 10, X_train, y_train, alpha=1)
```

In [67]:

```
# Fraction of training samples predicted correctly
accuracy = (predictions_optm == y_train).mean()
```

In [68]:

```
accuracy
```

Out[68]:
