# Building A Handwritten Digits Classifier¶

In this Guided Project, we'll explore the effectiveness of deep, feedforward neural networks at classifying images.

In [32]:
# Scikit-learn contains a number of datasets pre-loaded with the library, within the namespace of sklearn.datasets.
# The load_digits() function returns a copy of the hand-written digits dataset from UCI.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


## Read in dataset as Dataframe¶

In [33]:
# Create a dataframe that contains the handwritten_digits features and a series that contains corresponding labels
handwritten_digits_features = pd.DataFrame(handwritten_digits_features)
handwritten_digits_labels = pd.Series(handwritten_digits_labels)

In [34]:
handwritten_digits_features.head(), handwritten_digits_labels.head()

Out[34]:
(    0    1    2     3     4     5    6    7    8    9  ...    54   55   56  \
0  0.0  0.0  5.0  13.0   9.0   1.0  0.0  0.0  0.0  0.0 ...   0.0  0.0  0.0
1  0.0  0.0  0.0  12.0  13.0   5.0  0.0  0.0  0.0  0.0 ...   0.0  0.0  0.0
2  0.0  0.0  0.0   4.0  15.0  12.0  0.0  0.0  0.0  0.0 ...   5.0  0.0  0.0
3  0.0  0.0  7.0  15.0  13.0   1.0  0.0  0.0  0.0  8.0 ...   9.0  0.0  0.0
4  0.0  0.0  0.0   1.0  11.0   0.0  0.0  0.0  0.0  0.0 ...   0.0  0.0  0.0

57   58    59    60    61   62   63
0  0.0  6.0  13.0  10.0   0.0  0.0  0.0
1  0.0  0.0  11.0  16.0  10.0  0.0  0.0
2  0.0  0.0   3.0  11.0  16.0  9.0  0.0
3  0.0  7.0  13.0  13.0   9.0  0.0  0.0
4  0.0  0.0   2.0  16.0   4.0  0.0  0.0

[5 rows x 64 columns], 0    0
1    1
2    2
3    3
4    4
dtype: int64)

In the handwritten_digits_features we see that each image (as each row) has 64 grayscale points, which means they were 8*8 images.

## Visualize a few sample data¶

In [35]:
# Transform row 0, 100, 200, 300, 1000, 1100, 1200, 1300 to 8*8 pixel grid and visualize the data
rows = [0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500]
fig = plt.figure(figsize = (3, 3))
i = 1
for row in rows:
# Transform selected datapoint to np array then reshape to original grid
img = handwritten_digits_features.iloc[row].values.reshape(8, 8)
ax.imshow(img, cmap = 'gray_r')
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
i+=1


## Experiment with KNeighborsClassifier¶

In [36]:
# Training k-nearest neighbors models.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

# Performs 4-fold cross validation using k-nearest neighbors
def cross_validate(features, labels, n):
kf = KFold(n_splits = 4, random_state = 1)
scores = []

# Split data
for train_idx, test_idx in kf.split(features, labels):
X_train, X_test = features.loc[train_idx], features.loc[test_idx]
y_train, y_test = labels.loc[train_idx], labels.loc[test_idx]
# Train & predict with each fold
knn = KNeighborsClassifier(n_neighbors = n)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
accuracy = accuracy_score(y_test, predictions, normalize = True)
scores.append(accuracy)
return scores

In [37]:
# Experiment average recall score with different k(s)
k_num = range(1, 16)
avg_accuracy = []
for n in k_num:
scores = cross_validate(handwritten_digits_features, handwritten_digits_labels, n)
avg_accuracy.append(np.mean(scores))

In [38]:
# Visualize K, recall score correlation
plt.plot(k_num, avg_accuracy)
plt.xlim(1, 15)

Out[38]:
(1, 15)

While the accuracy score is not bad, there are a few downsides to using k-nearest neighbors:

• high memory usage (for each new unseen observation, many comparisons need to be made to seen observations)
• no model representation to debug and explore

Let's now try a neural network with a single hidden layer. Use the MLPClassifier package from scikit-learn.

## Build a Neural Network model from scratch¶

-- based on guides from Coursera Machine Learning course by Andrew Ng from Stanford

### Split data¶

In [39]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split
# Split for train and test
X_train, X_test, y_train, y_test = train_test_split(handwritten_digits_features,
handwritten_digits_labels,
test_size = 0.2,
random_state = 1)

# Split for final result of train(60%) and val(20%)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
test_size = 0.25,
random_state = 1)


### Feedforward Propagation¶

#### Activation function¶

In [40]:
# Vectorized Sigmoid function
def sigmoid(X):
return 1/(1+np.exp(-X))


#### Randomly initialize the parameters for symmetry breaking¶

When training neural networks, it is important to randomly initialize the parameters for symmetry breaking. One effective strategy for random initialization is to randomly select values for $\Theta^{l}$ uniformly in the range $[-\epsilon_{init}, \epsilon_{init}]$.1, 2 This range of values ensures that the parameters are kept small and makes the learning more efficient.

1. One effective strategy for choosing εinit is to base it on the number of units in the network. A good choice of $\epsilon{init}$ is $\epsilon{init} = \frac{\sqrt{6}}{\sqrt{L{in}+L{out}}}$, where $L{in} = s{l}$ and $L{out}= s_{l+1}$ are the number of units(neurons) in the layers adjacent to $\Theta^{l}$.

2. For $x\in[0, 1]$, $x\times2\times\epsilon_{init}-\epsilon_{init}\in[-\epsilon_{init}, \epsilon_{init}]$
In [41]:
# Randomly initialize weights with parameter hidden_layer_size in list datatype
# Returns the same weights in matrix and vector form

def rand_weights_init(X, n_labels, hidden_layer_sizes):
# Number of neurons in input layer
input_layer_size = X.shape[1]

# Compute number of neurons in all layers in the nueral network model
all_layers = [input_layer_size]
all_layers = all_layers+hidden_layer_sizes+ [n_labels]

n_layers = len(all_layers)

# Initialize weights for all layers
weights = []

# Initialize unrolled weights for later use
weights_vec = []

for n in range(n_layers-1):
in_num = all_layers[n]
out_num = all_layers[n+1]

# Initialize epsilon
epsilon_init = np.log(6)/np.log(in_num + out_num)

np.random.seed(0)
# w is set to a matrix of size(out_num, 1 + in_num) as the first column of w handles the "bias" terms
w = np.random.rand(out_num, 1 + in_num) * 2 * epsilon_init - epsilon_init

# Unroll w for later use in fmin_cg function to minimize cost
w_vec = np.concatenate(w)

weights.append(w)
weights_vec.append(w_vec)

print('Number of weights in matrix: ', len(weights))

return np.asarray(weights), np.concatenate(np.asarray(weights_vec))

In [42]:
# Test run with weights initialization: 1 hidden layer with 8 neurons
weights, weights_vector = rand_weights_init(X_train, 10, [8])
print('Theta1 shape: ', weights[0].shape, '\n','Theta2 shape: ', weights[1].shape)
print('Unrolled weights: ', weights_vector.shape)

Number of weights in matrix:  2
Theta1 shape:  (8, 65)
Theta2 shape:  (10, 9)
Unrolled weights:  (610,)

In [43]:
# Test run with weights initialization: 2 hidden layer both with 8 neurons
weights2, weights_vector2 = rand_weights_init(X_train, 10, [8, 8])
print('Theta1 shape: ', weights2[0].shape,
'\n','Theta2 shape: ', weights2[1].shape,
'\n','Theta3 shape: ', weights2[2].shape)
print('Unrolled weights: ', weights_vector2.shape)

Number of weights in matrix:  3
Theta1 shape:  (8, 65)
Theta2 shape:  (8, 9)
Theta3 shape:  (10, 9)
Unrolled weights:  (682,)


#### Regularized Cost function of Neural Network with Sigmoid function¶

The cost function for the neural network without regularization is: $J(\theta)= \frac{1}{m}\sum\limits_{i=1}^m\sum\limits_{k=1}^K[−y^{(i)}_{k}log((h_\theta(x^{(i)}))_k)−(1−y^{(i)}_{k})log(1−(h_\theta(x^{(i)}))_k)]$, where $m$ is the number of samples, $K$ is the number of labels, $h_\theta(x)$ is the hypothesis, in our case the Sigmoid function.

The regularized cost function adds the regularization terms: $\frac{\lambda}{2m}\sum\limits_{l = 1}^{L-1}\sum\limits_{no=1}^{N_{out}}\sum\limits_{ni=1}^{N_{in}}(\Theta^{(l)}_{no, ni})^{2}$, where L is the total number of layers in the model, $N_{in}$ and $N_{out}$ are the number of units(neurons) in the layers adjacent to $\Theta^{l}$

In [44]:
# Unregularized cost function
def cost_func(y_prob, y_actual, sample_size):

# Convert y_actual into dummy/indicator variables
y = y_actual.astype('category').copy()

"""
Note to specify dtype when converting categorical dataframe to numpy array
If dtype not specified, -y will be operated on string type where if y = 1, -y = 255
"""
y = np.asarray(pd.get_dummies(y_actual), dtype = np.int64)

#     # Convert y_pred into dummy/indicator variables
#     # Step 1: Initialize y_pred dummie form with all zeros
#     h = np.zeros(y.shape)

#     # Step 2: Replace zeros with ones using y_pred as indices
#     rows = np.arange(h.shape[0])
#     h[rows, y_pred] = 1

# Compute cost
cost = np.sum(-y*np.log10(y_prob)-(1-y)*np.log10(1-y_prob))/sample_size

return cost


#### Feedforward with Regularized Cost function¶

In [45]:
# Feedforward using sigmoid function
def feedforward(vec_weights, hidden_layer_sizes, n_labels, X, y, alpha):

# Define some useful variables
n_samples = X.shape[0]
input_layer_size = X.shape[1]

# Compute number of neurons in all layers in the nueral network model
all_layers = [input_layer_size]
all_layers = all_layers+hidden_layer_sizes+ [n_labels]

#     print(all_layers)

n_activation = len(all_layers)-1

# Initialize input
x = np.asarray(X)

#     # Initialize output
#     predictions = np.empty((n_samples, n_labels))

# Initialize a list of weights in matrix form
weights_matrix = []

# Initialize weights without bias for regularization term
weights_no_bias = []

# Initialize activaitons x for calculating 'error term' in backpropagation later
activations_x = []

# Feedforward
for n in range(n_activation):
x = np.insert(x, 0, 1, axis = 1)

# Append activation x to activations, last activation is also the output
activations_x.append(x)

# Reshape weights vector to weights matrix based on neural network structure
init_slice = 0
weights_count = (all_layers[n]+1)*all_layers[n+1]
weights = np.reshape(vec_weights[init_slice:(init_slice + weights_count)], (all_layers[n+1], (all_layers[n]+1)))

# Append weights to the list of weights in matrix form
weights_matrix.append(weights)

# Compute activation
activation = sigmoid(np.dot(x, weights.T))

# Assign output as the input for next layer, or as final output if loop ends
x = activation

# Get weights without bias in each layerfor later calculating regularization term
no_bias= np.delete(weights, 0, axis = 1)

# Unroll weights in each layer into a numpy 1d array & append to weights_no_bias
weights_no_bias.append(np.concatenate(no_bias))

# Update init_slice
init_slice = weights_count

# Sanity check: x as output should match the shape of predefined p
y_prob = x.copy()

# Assign index of the max value of each row as predictions
y_pred = np.argmax(y_prob, axis = 1)

# Compute unregularized cost
cost = cost_func(y_prob, y, n_samples)

# Unroll weights into a numpy 1d array
weights_no_bias = np.concatenate(np.array(weights_no_bias))

# Regularization term
reg_term = np.sum(weights_no_bias**2)*alpha/(2*n_samples)

# Compute regularized cost
reg_cost = cost + reg_term

return reg_cost, y_prob, y_pred, activations_x, weights_matrix

In [46]:
# Test feedforward with weights initialization:
# 1 hidden layer with 8 neurons
# from rand_weights_init test run & alpha range(5)

for a in range(5):
reg_cost1, probabilties1, predictions1, activations_x1, weights_matrix1 = feedforward(
vec_weights = weights_vector,
n_labels=10,
X=X_train, y = y_train,
hidden_layer_sizes = [8],
alpha = a)
print('With alpha {}, the regularized cost is: '.format(a), reg_cost1)
print('*'*55)

print('Probabilities shape: ', probabilties1.shape)
print('Predictions shape: ', predictions1.shape, 'predictions: ', predictions1)
print('Activations number:', len(activations_x1))
print('Activation x1 shape', activations_x1[0].shape)
print('Activation x2 shape', activations_x1[1].shape)
print('weights 1 shape', weights_matrix1[0].shape)
print('weights 2 shape', weights_matrix1[1].shape)

With alpha 0, the regularized cost is:  2.945120500303429
*******************************************************
With alpha 1, the regularized cost is:  2.9614436955038648
*******************************************************
With alpha 2, the regularized cost is:  2.9777668907043
*******************************************************
With alpha 3, the regularized cost is:  2.9940900859047357
*******************************************************
With alpha 4, the regularized cost is:  3.0104132811051714
*******************************************************
Probabilities shape:  (1077, 10)
Predictions shape:  (1077,) predictions:  [4 4 2 ... 4 0 2]
Activations number: 2
Activation x1 shape (1077, 65)
Activation x2 shape (1077, 9)
weights 1 shape (8, 65)
weights 2 shape (10, 9)

In [47]:
# Test feedforward with weights initialization:
# 2 hidden layer both with 8 neurons
# from rand_weights_init test run & alpha range(5)

for a in range(5):
reg_cost2, probabilties2, predictions2, activations_x2, weights_matrix2 = feedforward(
vec_weights = weights_vector2,
hidden_layer_sizes = [8, 8],
n_labels=10,
X=X_train, y = y_train,
alpha = a)
print('With alpha {}, the regularized cost is: '.format(a), reg_cost2)
print('*'*55)

print('Probabilities shape: ', probabilties2.shape)
print('Predictions shape: ', predictions2.shape, 'predictions: ', predictions2)
print('Activations x number:', len(activations_x2))
print('Activation x1 shape', activations_x2[0].shape)
print('Activation x2 shape', activations_x2[1].shape)
print('Activation x3 shape', activations_x2[2].shape)
print('weights 1 shape', weights_matrix2[0].shape)
print('weights 2 shape', weights_matrix2[1].shape)
print('weights 3 shape', weights_matrix2[2].shape)

With alpha 0, the regularized cost is:  3.0412919452900082
*******************************************************
With alpha 1, the regularized cost is:  3.059339249288558
*******************************************************
With alpha 2, the regularized cost is:  3.077386553287108
*******************************************************
With alpha 3, the regularized cost is:  3.0954338572856575
*******************************************************
With alpha 4, the regularized cost is:  3.1134811612842075
*******************************************************
Probabilities shape:  (1077, 10)
Predictions shape:  (1077,) predictions:  [2 2 2 ... 2 2 2]
Activations x number: 3
Activation x1 shape (1077, 65)
Activation x2 shape (1077, 9)
Activation x3 shape (1077, 9)
weights 1 shape (8, 65)
weights 2 shape (8, 9)
weights 3 shape (10, 9)


### Backpropagation¶

The intuition behind the backpropagation algorithm is as follows:

• Given a training example $(x^{(t)},y^{(t)})$, we will first run a “forward pass” to compute all the activations throughout the network, including the output value of the hypothesis $h_Θ(x)$.

• Then, for each node $j$ in layer $l$, we would like to compute an “error term” $\delta_j^{(l)}$ that measures how much that node was “responsible” for any errors in our output. For an output node, we can directly measure the difference between the network’s activation and the true target value, and use that to define $\delta_j^{(output)}$. For the hidden units, you will compute $\delta_j^{(l)}$ based on a weighted average of the error terms of the nodes in layer $(l + 1)$.

#### Derivation of the Sigmoid function (for calculating gradient of the cost function in backpropagation)¶

1. Sigmoid function: $g(x) = \frac{1}{(1+ e^{-x})}$

2. $g(x) = \frac{1}{(1+ e^{-x})} = (1+e^{-x})^{-1}$

3. Apply Chain Rule: $g'(x) = -1\times(1+e^{-x})^{-2}\times(1+e^{-x})'$

4. Apply Exponential Rule $g'(x) = -(1+e^{-x})^{-2}\times(e^{-x}\times(-x)')$

5. $g'(x) = -(1+e^{-x})^{-2}\times(e^{-x}\times-1)$

6. $g'(x) = \frac{1}{(1+e^{-x})^{2}}\times e^{-x}$

7. $g'(x) = \frac{1}{(1+e^{-x})}\times\frac{e^{-x}}{{(1+e^{-x})}}$

8. $g'(x) = g(x)\times\frac{1+e^{-x}-1}{{(1+e^{-x})}}$

9. $g'(x) = g(x)\times(1-g(x))$
In [48]:
# Derivative of the Sigmoid function
return sigmoid(X)*(1-sigmoid(X))

In [49]:
# Sanity check sigmoid_gradient

Out[49]:
0.25
In [50]:
"""
Implement the backpropagation algorithm to compute the gradients of weights to obtain
the gradient for the neural network cost function.

This function returns the partial derivatives of the cost function with respect to the weights in each hidden layer,
for later use in fmin_cg function to minimize cost.
"""

def backpropagation(vec_weights, hidden_layer_sizes, n_labels, X, y, alpha):

reg_cost, y_prob, y_pred, activations_x, weights = feedforward(vec_weights,
hidden_layer_sizes,
n_labels,
X, y, alpha)

n_activations = len(activations_x)
n_samples = X.shape[0]

# Turn y into dummy index
y = pd.get_dummies(y.astype('category'))
y = np.array(y, dtype = np.int64)

# Compute 'error term' delta of the output layer
last_layer_weights_delta = y_prob - y

# Initialze a list of deltas from each hidden layer
layer_weights_delta = []

# Compute 'error term' delta of the hidden layers
for n in reversed(range(n_activations)):
delta = np.dot(last_layer_weights_delta.T, activations_x[n])
avg_delta = delta/n_samples

# Compute regularization of the corresponding weights_grad
weights_temp = weights[n].copy()

# Replace the first column of weights with 0 to leave out bias column in regularized cost
weights_temp[:,1] = 0
reg_term = (alpha/n_samples)*weights_temp.copy()

# Update weights_grad with regularization term

# Unroll delta and insert it as the first item in weights_gradient list

# Update last_layer_weights_delta
last_layer_weights_delta = np.dot(last_layer_weights_delta,

return np.concatenate(layer_weights_delta)

In [51]:
# Test backpropagation with feedforwd test results
hidden_layer_sizes = [8, 16, 8],
n_labels=10,
X=X_train, y = y_train,
alpha = 1)

Theta gradient 4 shape:  (10, 9)
Theta gradient 3 shape:  (8, 17)
Theta gradient 2 shape:  (16, 9)
Theta gradient 1 shape:  (8, 65)

Out[51]:
(890,)

### Put everything together¶

Put all the functions we build so far together, as one function that computes the regularized Neural Network model cost and weights gradient

In [52]:
# Create a function to minize that returns the cost
def nn_cost(vec_weights, *args):

hidden_layer_sizes, n_labels, X, y, alpha = args
reg_cost, y_prob, predictions, activations_x, weights = feedforward(vec_weights,
hidden_layer_sizes,
n_labels,
X, y,
alpha)

return reg_cost

In [53]:
# Create a function to minize that returns the weights cost gradient
hidden_layer_sizes, n_labels, X, y, alpha = args
return backpropagation(vec_weights,
hidden_layer_sizes,
n_labels,
X, y,
alpha)

In [54]:
# Initialize new weights with rand_weights_init and test nn_cost, nn_cost_gradient
test_weights_m, test_weights_vec = rand_weights_init(X_train, 10, [8,16,8])

args = ([8], 10, X_train, y_train, 0)
cost = nn_cost(test_weights_vec, *args)
print(cost)

Number of weights in matrix:  4
Theta gradient 2 shape:  (10, 9)
Theta gradient 1 shape:  (8, 65)
2.945120500303429


### Minimize the nn_cost function using a nonlinear conjugate gradient algorithm¶

Using the function fmin_cg from SciPy library to minimize our cost function

In [55]:
from scipy.optimize import fmin_cg
from scipy.optimize import minimize

In [62]:
weights_1_m, weights_1_v = rand_weights_init(X_train, 10, [8])

Number of weights in matrix:  2

In [64]:
# Using nn_cost
optim_weights = fmin_cg(f = nn_cost, x0 = weights_1_v, args = args, maxiter = 100)

Warning: Maximum number of iterations has been exceeded.
Current function value: 0.644364
Iterations: 100
Function evaluations: 110772

In [65]:
nn_cost(optim_weights, *args)

Out[65]:
0.6443644765515987
In [66]:
reg_cost_optm, y_prob_optm, predictions_optm, activations_x_optm, weights_optm = feedforward(optim_weights,
[8],
10,
X_train, y_train,
alpha=1)

In [67]:
accuracy = (predictions_optm == y_train).mean()

In [68]:
accuracy

Out[68]:
0.649025069637883
In [ ]: