Building A Handwritten Digits Classifier


In this Guided Project, we'll explore the effectiveness of deep, feedforward neural networks at classifying images.

In [32]:
# Scikit-learn contains a number of datasets pre-loaded with the library, within the namespace of sklearn.datasets. 
# The load_digits() function returns a copy of the hand-written digits dataset from UCI.
from sklearn.datasets import load_digits
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

Read in dataset as Dataframe

In [33]:
# Create a dataframe that contains the handwritten_digits features and a series that contains corresponding labels
handwritten_digits_features, handwritten_digits_labels = load_digits(return_X_y= True)
handwritten_digits_features = pd.DataFrame(handwritten_digits_features)
handwritten_digits_labels = pd.Series(handwritten_digits_labels)
In [34]:
handwritten_digits_features.head(), handwritten_digits_labels.head()
Out[34]:
(    0    1    2     3     4     5    6    7    8    9  ...    54   55   56  \
 0  0.0  0.0  5.0  13.0   9.0   1.0  0.0  0.0  0.0  0.0 ...   0.0  0.0  0.0   
 1  0.0  0.0  0.0  12.0  13.0   5.0  0.0  0.0  0.0  0.0 ...   0.0  0.0  0.0   
 2  0.0  0.0  0.0   4.0  15.0  12.0  0.0  0.0  0.0  0.0 ...   5.0  0.0  0.0   
 3  0.0  0.0  7.0  15.0  13.0   1.0  0.0  0.0  0.0  8.0 ...   9.0  0.0  0.0   
 4  0.0  0.0  0.0   1.0  11.0   0.0  0.0  0.0  0.0  0.0 ...   0.0  0.0  0.0   
 
     57   58    59    60    61   62   63  
 0  0.0  6.0  13.0  10.0   0.0  0.0  0.0  
 1  0.0  0.0  11.0  16.0  10.0  0.0  0.0  
 2  0.0  0.0   3.0  11.0  16.0  9.0  0.0  
 3  0.0  7.0  13.0  13.0   9.0  0.0  0.0  
 4  0.0  0.0   2.0  16.0   4.0  0.0  0.0  
 
 [5 rows x 64 columns], 0    0
 1    1
 2    2
 3    3
 4    4
 dtype: int64)

In the handwritten_digits_features we see that each image (as each row) has 64 grayscale points, which means they were 8*8 images.

Visualize a few sample data

In [35]:
# Transform row 0, 100, 200, 300, 1000, 1100, 1200, 1300 to 8*8 pixel grid and visualize the data
rows = [0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500]
fig = plt.figure(figsize = (3, 3))
i = 1
for row in rows:
    # Transform selected datapoint to np array then reshape to original grid
    img = handwritten_digits_features.iloc[row].values.reshape(8, 8) 
    ax = fig.add_subplot(4, 4, i)
    ax.imshow(img, cmap = 'gray_r')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    i+=1  

Experiment with KNeighborsClassifier

In [36]:
# Training k-nearest neighbors models.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

# Performs 4-fold cross validation using k-nearest neighbors
def cross_validate(features, labels, n):
    kf = KFold(n_splits = 4, random_state = 1)
    scores = []

    # Split data
    for train_idx, test_idx in kf.split(features, labels):
        X_train, X_test = features.loc[train_idx], features.loc[test_idx]
        y_train, y_test = labels.loc[train_idx], labels.loc[test_idx]
        # Train & predict with each fold 
        knn = KNeighborsClassifier(n_neighbors = n)
        knn.fit(X_train, y_train)
        predictions = knn.predict(X_test)
        accuracy = accuracy_score(y_test, predictions, normalize = True)
        scores.append(accuracy)
    return scores    
In [37]:
# Experiment average recall score with different k(s)
k_num = range(1, 16)
avg_accuracy = []
for n in k_num:
    scores = cross_validate(handwritten_digits_features, handwritten_digits_labels, n)
    avg_accuracy.append(np.mean(scores))
In [38]:
# Visualize K, recall score correlation
plt.plot(k_num, avg_accuracy)
plt.xlim(1, 15)
Out[38]:
(1, 15)

While the accuracy score is not bad, there are a few downsides to using k-nearest neighbors:

  • high memory usage (for each new unseen observation, many comparisons need to be made to seen observations)
  • no model representation to debug and explore

Let's now try a neural network with a single hidden layer. Use the MLPClassifier package from scikit-learn.

Build a Neural Network model from scratch

-- based on guides from Coursera Machine Learning course by Andrew Ng from Stanford

Split data

In [39]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split
# Split for train and test
X_train, X_test, y_train, y_test = train_test_split(handwritten_digits_features, 
                                                    handwritten_digits_labels, 
                                                   test_size = 0.2,
                                                   random_state = 1)

# Split for final result of train(60%) and val(20%)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                 test_size = 0.25, 
                                                 random_state = 1)

Feedforward Propagation

Activation function

In [40]:
# Vectorized Sigmoid function
def sigmoid(X):
    return 1/(1+np.exp(-X))

Randomly initialize the parameters for symmetry breaking


When training neural networks, it is important to randomly initialize the parameters for symmetry breaking. One effective strategy for random initialization is to randomly select values for $\Theta^{l}$ uniformly in the range $[-\epsilon_{init}, \epsilon_{init}]$.1, 2 This range of values ensures that the parameters are kept small and makes the learning more efficient.


  1. One effective strategy for choosing εinit is to base it on the number of units in the network. A good choice of $ \epsilon{init} $ is $ \epsilon{init} = \frac{\sqrt{6}}{\sqrt{L{in}+L{out}}} $, where $L{in} = s{l} $ and $L{out}= s_{l+1}$ are the number of units(neurons) in the layers adjacent to $\Theta^{l}$.

  2. For $ x\in[0, 1] $, $x\times2\times\epsilon_{init}-\epsilon_{init}\in[-\epsilon_{init}, \epsilon_{init}] $
In [41]:
# Randomly initialize weights with parameter hidden_layer_size in list datatype
# Returns the same weights in matrix and vector form

def rand_weights_init(X, n_labels, hidden_layer_sizes):
    # Number of neurons in input layer
    input_layer_size = X.shape[1]
    
    # Compute number of neurons in all layers in the nueral network model 
    all_layers = [input_layer_size] 
    all_layers = all_layers+hidden_layer_sizes+ [n_labels]

    n_layers = len(all_layers)
    
    # Initialize weights for all layers 
    weights = []
    
    # Initialize unrolled weights for later use
    weights_vec = []
    
    for n in range(n_layers-1):
        in_num = all_layers[n]
        out_num = all_layers[n+1]
        
        # Initialize epsilon
        epsilon_init = np.log(6)/np.log(in_num + out_num)
 
        np.random.seed(0)
        # w is set to a matrix of size(out_num, 1 + in_num) as the first column of w handles the "bias" terms
        w = np.random.rand(out_num, 1 + in_num) * 2 * epsilon_init - epsilon_init
        
        # Unroll w for later use in fmin_cg function to minimize cost
        w_vec = np.concatenate(w)
        
        weights.append(w)
        weights_vec.append(w_vec)
        
    print('Number of weights in matrix: ', len(weights))    
                     
    return np.asarray(weights), np.concatenate(np.asarray(weights_vec))
In [42]:
# Test run with weights initialization: 1 hidden layer with 8 neurons
weights, weights_vector = rand_weights_init(X_train, 10, [8])
print('Theta1 shape: ', weights[0].shape, '\n','Theta2 shape: ', weights[1].shape)
print('Unrolled weights: ', weights_vector.shape)
Number of weights in matrix:  2
Theta1 shape:  (8, 65) 
 Theta2 shape:  (10, 9)
Unrolled weights:  (610,)
In [43]:
# Test run with weights initialization: 2 hidden layer both with 8 neurons
weights2, weights_vector2 = rand_weights_init(X_train, 10, [8, 8])
print('Theta1 shape: ', weights2[0].shape, 
      '\n','Theta2 shape: ', weights2[1].shape, 
      '\n','Theta3 shape: ', weights2[2].shape)
print('Unrolled weights: ', weights_vector2.shape)
Number of weights in matrix:  3
Theta1 shape:  (8, 65) 
 Theta2 shape:  (8, 9) 
 Theta3 shape:  (10, 9)
Unrolled weights:  (682,)

Regularized Cost function of Neural Network with Sigmoid function


The cost function for the neural network without regularization is: $ J(\theta)= \frac{1}{m}\sum\limits_{i=1}^m\sum\limits_{k=1}^K[−y^{(i)}_{k}log((h_\theta(x^{(i)}))_k)−(1−y^{(i)}_{k})log(1−(h_\theta(x^{(i)}))_k)] $, where $m$ is the number of samples, $K$ is the number of labels, $h_\theta(x)$ is the hypothesis, in our case the Sigmoid function.

The regularized cost function adds the regularization terms: $ \frac{\lambda}{2m}\sum\limits_{l = 1}^{L-1}\sum\limits_{no=1}^{N_{out}}\sum\limits_{ni=1}^{N_{in}}(\Theta^{(l)}_{no, ni})^{2} $, where L is the total number of layers in the model, $N_{in}$ and $N_{out}$ are the number of units(neurons) in the layers adjacent to $\Theta^{l}$

In [44]:
# Unregularized cost function
def cost_func(y_prob, y_actual, sample_size):

    # Convert y_actual into dummy/indicator variables
    y = y_actual.astype('category').copy()
    
    """
    Note to specify dtype when converting categorical dataframe to numpy array 
    If dtype not specified, -y will be operated on string type where if y = 1, -y = 255
    """
    y = np.asarray(pd.get_dummies(y_actual), dtype = np.int64)

#     # Convert y_pred into dummy/indicator variables 
#     # Step 1: Initialize y_pred dummie form with all zeros    
#     h = np.zeros(y.shape)
    
#     # Step 2: Replace zeros with ones using y_pred as indices 
#     rows = np.arange(h.shape[0])
#     h[rows, y_pred] = 1
    
    # Compute cost 
    cost = np.sum(-y*np.log10(y_prob)-(1-y)*np.log10(1-y_prob))/sample_size
    
    return cost

Feedforward with Regularized Cost function

In [45]:
# Feedforward using sigmoid function 
def feedforward(vec_weights, hidden_layer_sizes, n_labels, X, y, alpha):
    
    # Define some useful variables
    n_samples = X.shape[0]
    input_layer_size = X.shape[1]
    
    # Compute number of neurons in all layers in the nueral network model     
    all_layers = [input_layer_size] 
    all_layers = all_layers+hidden_layer_sizes+ [n_labels]

#     print(all_layers)
    
    n_activation = len(all_layers)-1
   

    # Initialize input 
    x = np.asarray(X)
    
#     # Initialize output
#     predictions = np.empty((n_samples, n_labels))

    # Initialize a list of weights in matrix form 
    weights_matrix = []
    
    # Initialize weights without bias for regularization term
    weights_no_bias = []
    
    # Initialize activaitons x for calculating 'error term' in backpropagation later
    activations_x = []
    
    # Feedforward 
    for n in range(n_activation):
        # Add bias column 
        x = np.insert(x, 0, 1, axis = 1)
        
        # Append activation x to activations, last activation is also the output
        activations_x.append(x)
        
        # Reshape weights vector to weights matrix based on neural network structure
        init_slice = 0
        weights_count = (all_layers[n]+1)*all_layers[n+1]
        weights = np.reshape(vec_weights[init_slice:(init_slice + weights_count)], (all_layers[n+1], (all_layers[n]+1)))
        
        # Append weights to the list of weights in matrix form 
        weights_matrix.append(weights)
        
        # Compute activation
        activation = sigmoid(np.dot(x, weights.T))  
        
        # Assign output as the input for next layer, or as final output if loop ends 
        x = activation
        
        # Get weights without bias in each layerfor later calculating regularization term
        no_bias= np.delete(weights, 0, axis = 1)
        
        # Unroll weights in each layer into a numpy 1d array & append to weights_no_bias
        weights_no_bias.append(np.concatenate(no_bias))
        
        # Update init_slice 
        init_slice = weights_count
        
    # Sanity check: x as output should match the shape of predefined p
    y_prob = x.copy()

    # Assign index of the max value of each row as predictions 
    y_pred = np.argmax(y_prob, axis = 1)
    
    # Compute unregularized cost 
    cost = cost_func(y_prob, y, n_samples)
    
    # Unroll weights into a numpy 1d array
    weights_no_bias = np.concatenate(np.array(weights_no_bias))
    
    # Regularization term 
    reg_term = np.sum(weights_no_bias**2)*alpha/(2*n_samples)
    
    # Compute regularized cost 
    reg_cost = cost + reg_term
    
    return reg_cost, y_prob, y_pred, activations_x, weights_matrix
In [46]:
# Test feedforward with weights initialization: 
# 1 hidden layer with 8 neurons 
# from rand_weights_init test run & alpha range(5)

for a in range(5):
    reg_cost1, probabilties1, predictions1, activations_x1, weights_matrix1 = feedforward( 
                                        vec_weights = weights_vector, 
                                        n_labels=10, 
                                        X=X_train, y = y_train,
                                        hidden_layer_sizes = [8],
                                        alpha = a)
    print('With alpha {}, the regularized cost is: '.format(a), reg_cost1)
    print('*'*55)

print('Probabilities shape: ', probabilties1.shape)
print('Predictions shape: ', predictions1.shape, 'predictions: ', predictions1)
print('Activations number:', len(activations_x1))
print('Activation x1 shape', activations_x1[0].shape)
print('Activation x2 shape', activations_x1[1].shape)
print('weights 1 shape', weights_matrix1[0].shape)
print('weights 2 shape', weights_matrix1[1].shape)
With alpha 0, the regularized cost is:  2.945120500303429
*******************************************************
With alpha 1, the regularized cost is:  2.9614436955038648
*******************************************************
With alpha 2, the regularized cost is:  2.9777668907043
*******************************************************
With alpha 3, the regularized cost is:  2.9940900859047357
*******************************************************
With alpha 4, the regularized cost is:  3.0104132811051714
*******************************************************
Probabilities shape:  (1077, 10)
Predictions shape:  (1077,) predictions:  [4 4 2 ... 4 0 2]
Activations number: 2
Activation x1 shape (1077, 65)
Activation x2 shape (1077, 9)
weights 1 shape (8, 65)
weights 2 shape (10, 9)
In [47]:
# Test feedforward with weights initialization: 
# 2 hidden layer both with 8 neurons 
# from rand_weights_init test run & alpha range(5)

for a in range(5):
    reg_cost2, probabilties2, predictions2, activations_x2, weights_matrix2 = feedforward( 
                                        vec_weights = weights_vector2, 
                                        hidden_layer_sizes = [8, 8],
                                        n_labels=10, 
                                        X=X_train, y = y_train,
                                        alpha = a)
    print('With alpha {}, the regularized cost is: '.format(a), reg_cost2)
    print('*'*55)

print('Probabilities shape: ', probabilties2.shape)
print('Predictions shape: ', predictions2.shape, 'predictions: ', predictions2)
print('Activations x number:', len(activations_x2))
print('Activation x1 shape', activations_x2[0].shape)
print('Activation x2 shape', activations_x2[1].shape)
print('Activation x3 shape', activations_x2[2].shape)
print('weights 1 shape', weights_matrix2[0].shape)
print('weights 2 shape', weights_matrix2[1].shape)
print('weights 3 shape', weights_matrix2[2].shape)
With alpha 0, the regularized cost is:  3.0412919452900082
*******************************************************
With alpha 1, the regularized cost is:  3.059339249288558
*******************************************************
With alpha 2, the regularized cost is:  3.077386553287108
*******************************************************
With alpha 3, the regularized cost is:  3.0954338572856575
*******************************************************
With alpha 4, the regularized cost is:  3.1134811612842075
*******************************************************
Probabilities shape:  (1077, 10)
Predictions shape:  (1077,) predictions:  [2 2 2 ... 2 2 2]
Activations x number: 3
Activation x1 shape (1077, 65)
Activation x2 shape (1077, 9)
Activation x3 shape (1077, 9)
weights 1 shape (8, 65)
weights 2 shape (8, 9)
weights 3 shape (10, 9)

Backpropagation


The intuition behind the backpropagation algorithm is as follows:

  • Given a training example $(x^{(t)},y^{(t)})$, we will first run a “forward pass” to compute all the activations throughout the network, including the output value of the hypothesis $h_Θ(x)$.

  • Then, for each node $j$ in layer $l$, we would like to compute an “error term” $\delta_j^{(l)}$ that measures how much that node was “responsible” for any errors in our output. For an output node, we can directly measure the difference between the network’s activation and the true target value, and use that to define $\delta_j^{(output)}$. For the hidden units, you will compute $\delta_j^{(l)}$ based on a weighted average of the error terms of the nodes in layer $(l + 1)$.

Derivation of the Sigmoid function (for calculating gradient of the cost function in backpropagation)


  1. Sigmoid function: $ g(x) = \frac{1}{(1+ e^{-x})} $

  2. $ g(x) = \frac{1}{(1+ e^{-x})} = (1+e^{-x})^{-1}$

  3. Apply Chain Rule: $ g'(x) = -1\times(1+e^{-x})^{-2}\times(1+e^{-x})' $

  4. Apply Exponential Rule $ g'(x) = -(1+e^{-x})^{-2}\times(e^{-x}\times(-x)') $

  5. $ g'(x) = -(1+e^{-x})^{-2}\times(e^{-x}\times-1) $

  6. $ g'(x) = \frac{1}{(1+e^{-x})^{2}}\times e^{-x} $

  7. $ g'(x) = \frac{1}{(1+e^{-x})}\times\frac{e^{-x}}{{(1+e^{-x})}} $

  8. $ g'(x) = g(x)\times\frac{1+e^{-x}-1}{{(1+e^{-x})}} $

  9. $ g'(x) = g(x)\times(1-g(x)) $
In [48]:
# Derivative of the Sigmoid function 
def sigmoid_gradient(X):
    return sigmoid(X)*(1-sigmoid(X))
In [49]:
# Sanity check sigmoid_gradient 
sigmoid_gradient(0)
Out[49]:
0.25
In [50]:
"""
Implement the backpropagation algorithm to compute the gradients of weights to obtain 
the gradient for the neural network cost function. 

This function returns the partial derivatives of the cost function with respect to the weights in each hidden layer, 
for later use in fmin_cg function to minimize cost.
"""

def backpropagation(vec_weights, hidden_layer_sizes, n_labels, X, y, alpha):
    
    reg_cost, y_prob, y_pred, activations_x, weights = feedforward(vec_weights, 
                                                                   hidden_layer_sizes, 
                                                                   n_labels, 
                                                                   X, y, alpha)
    

    
    
    n_activations = len(activations_x)
    n_samples = X.shape[0]

    # Turn y into dummy index 
    y = pd.get_dummies(y.astype('category'))
    y = np.array(y, dtype = np.int64)
    
    # Compute 'error term' delta of the output layer
    last_layer_weights_delta = y_prob - y
    
    # Initialze a list of deltas from each hidden layer
    layer_weights_delta = []
    
    # Compute 'error term' delta of the hidden layers
    for n in reversed(range(n_activations)):
        delta = np.dot(last_layer_weights_delta.T, activations_x[n])
        avg_delta = delta/n_samples

        # Compute regularization of the corresponding weights_grad
        weights_temp = weights[n].copy()

        # Replace the first column of weights with 0 to leave out bias column in regularized cost
        weights_temp[:,1] = 0
        reg_term = (alpha/n_samples)*weights_temp.copy()

        # Update weights_grad with regularization term 
        layer_grad_reg = avg_delta+reg_term
        print('Theta gradient {} shape: '.format(n+1), layer_grad_reg.shape)
        
        # Unroll delta and insert it as the first item in weights_gradient list 
        layer_weights_delta.insert(0, np.concatenate(layer_grad_reg))
        
        # Update last_layer_weights_delta
        last_layer_weights_delta = np.dot(last_layer_weights_delta, 
                                          weights[n][:,1:])*sigmoid_gradient(activations_x[n][:,1:])
        
    return np.concatenate(layer_weights_delta)
In [51]:
# Test backpropagation with feedforwd test results
weights_grad = backpropagation(vec_weights = weights_vector2, 
                                        hidden_layer_sizes = [8, 16, 8],
                                        n_labels=10, 
                                        X=X_train, y = y_train,
                                        alpha = 1)
weights_grad.shape
Theta gradient 4 shape:  (10, 9)
Theta gradient 3 shape:  (8, 17)
Theta gradient 2 shape:  (16, 9)
Theta gradient 1 shape:  (8, 65)
Out[51]:
(890,)

Put everything together


Put all the functions we build so far together, as one function that computes the regularized Neural Network model cost and weights gradient

In [52]:
# Create a function to minize that returns the cost
def nn_cost(vec_weights, *args):
    
    hidden_layer_sizes, n_labels, X, y, alpha = args
    reg_cost, y_prob, predictions, activations_x, weights = feedforward(vec_weights, 
                                                                        hidden_layer_sizes,
                                                                        n_labels, 
                                                                        X, y, 
                                                                        alpha)
    
    return reg_cost
In [53]:
# Create a function to minize that returns the weights cost gradient
def nn_cost_gradient(vec_weights, *args):  
    hidden_layer_sizes, n_labels, X, y, alpha = args
    return backpropagation(vec_weights, 
                                       hidden_layer_sizes, 
                                       n_labels, 
                                       X, y, 
                                       alpha)
In [54]:
# Initialize new weights with rand_weights_init and test nn_cost, nn_cost_gradient
test_weights_m, test_weights_vec = rand_weights_init(X_train, 10, [8,16,8])

args = ([8], 10, X_train, y_train, 0)
cost = nn_cost(test_weights_vec, *args)
grad = nn_cost_gradient(test_weights_vec, *args)
print(cost)
Number of weights in matrix:  4
Theta gradient 2 shape:  (10, 9)
Theta gradient 1 shape:  (8, 65)
2.945120500303429

Minimize the nn_cost function using a nonlinear conjugate gradient algorithm


Using the function fmin_cg from SciPy library to minimize our cost function

In [55]:
from scipy.optimize import fmin_cg
from scipy.optimize import minimize
In [62]:
weights_1_m, weights_1_v = rand_weights_init(X_train, 10, [8])
Number of weights in matrix:  2
In [64]:
# Using nn_cost 
optim_weights = fmin_cg(f = nn_cost, x0 = weights_1_v, args = args, maxiter = 100)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.644364
         Iterations: 100
         Function evaluations: 110772
         Gradient evaluations: 181
In [65]:
nn_cost(optim_weights, *args)
Out[65]:
0.6443644765515987
In [66]:
reg_cost_optm, y_prob_optm, predictions_optm, activations_x_optm, weights_optm = feedforward(optim_weights, 
                                                                        [8],
                                                                        10, 
                                                                        X_train, y_train, 
                                                                        alpha=1)
In [67]:
accuracy = (predictions_optm == y_train).mean()
In [68]:
accuracy
Out[68]:
0.649025069637883
In [ ]: