Notebook

CSCI-UA 9473 Introduction to Machine Learning¶

Assignment 3: Convolutional nets, SVM and Robust PCA¶

Given date: April 18 Due date: May 10

Total: 30pts¶

Additional readings (To go further):

Robust principal component analysis? EJ Candès, X Li, Y Ma, J Wright

The assignment is divided into three parts. In the first part, we will go back to neural networks. You will be asked to build and train a convolutional neural network for image classification. In the second part,

Question I: (10pts) conv nets and autonomous driving¶

In this first question, we will use the Keras API to build and train a convolutional neural network to discriminate between four types of road signs. To simplify we will consider the detection of 4 different signs:

A '30 km/h' sign (folder 1)
A 'Stop' sign
A 'Go straight' sign
A 'Keep left' sign

An example of each sign is given below.

In [2]:

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img1 = mpimg.imread('1/00001_00000_00012.png')
plt.subplot(141)
plt.imshow(img1)
plt.axis('off')
plt.subplot(142)
img2 = mpimg.imread('2/00014_00001_00019.png')
plt.imshow(img2)
plt.axis('off')
plt.subplot(143)
img3 = mpimg.imread('3/00035_00008_00023.png')
plt.imshow(img3)
plt.axis('off')
plt.subplot(144)
img4 = mpimg.imread('4/00039_00000_00029.png')
plt.imshow(img4)
plt.axis('off')
plt.show()

Question I.1. (10pts)¶

In this exercise, you need to build and train a convolutional neural network to discriminate between the four images.

Before building the network, you should start by cropping the images so that they all have a common predefined size (take the smallest size across all images)
We will use a Sequential model from Keras but it will be up top you to define the structure of the convolution net. Initialization of the sequential model can be done with the following line

model = Sequential()

I.1.a. Convolutions.¶

We will defintely use convolutional layers. you can add such layers to the model by using the lines

  model.add(Conv2D(num_units, (filter_size1, filter_size2), padding='same',
                       input_shape=(3, IMG_SIZE, IMG_SIZE),
                       activation='relu'))

for the first layer and

    model.add(Conv2D(filters, filter_size, activation, input_shape)

for all the others. 'filters' indicate the number of filters you want to use in the convolutional layer. filter_size is the size of each filter and activation is the usual activation that comes on top of the convolution, i.e. $x_{\text{out}} = \sigma(\text{filter}*\text{input})$. Finally input_shape indicates the size of your input. Note that only the input layer should be given the input size. Subsequent layers will automatically compute the size of their inputs based on previous layers.

I.1.b Pooling Layers¶

On top of the convolutional layers, convolutional neural networks (CNN) also often rely on Pooling layers. The addition of such a layer can be done through the following line

    model.add(MaxPooling2D(pool_size=(filter_sz1, filter_sz2)))

The pooling layers usually come with two parameters: the 'pool size' and the 'stride' parameter. The basic choice for the pool size is (2,2) and the stride is usually set to None (which means it will split the image into non overlapping regions such as in the Figure below). You should however feel free to play a little with those parameters. The MaxPool operator considers a mask of size pool_size which is slided over the image by a number of pixels equal to the stride parameters (in x and y, there are hence two translation parameters). for each position of the mask, the output only retains the max of the pixels appearing in the mask (This idea is illustrated below). One way to understand the effect of the pooling operator is that if the filter detects an edge in a subregion of the image (thus returning at least one large value), although a MaxPooling will reduce the number of parameters, it will keep track of this information.

Adding 'Maxpooling' layers is known to work well in practice.

Although it is a little bit up to you to decide how you want to structure the network, a good start is to add a couple (definitely not exceeding 4) combinations (convolution, convolution, Pooling) with increasing number of units (you do every power of two like 16, 32, 128,...).

I.1.c. Flattening and Fully connected layers¶

Once you have stacked the convolutional and pooling layers, you should flatten the output through a line of the form

    model.add(Flatten())

And add a couple (no need to put more than 2,3) dense fully connected layers through lines of the form

    model.add(Dense(num_units, activation='relu'))

I.1.d. Concluding¶

Since there are four possible signs, you need to finish your network with a dense layer with 4 units. Each of those units should output four number between 0 and 1 representing the likelihood that any of the four sign is detected and such that $p_1 + p_2 + p_3 + p_4 = 1$ (hopefully with one probability much larger than the others). For this reason, a good choice for the final activation function of those four units is the softmax (Why?).

Build your model below.

In [ ]:

model = Sequential()

# construct the model using convolutional layers, dense fully connected layers and 

Question I.2 (3pts). Setting up the optimizer¶

Once you found a good architecture for your network, split the dataset, by retaining about 90% of the images for training and 10% of each folder for test. To train your network in Keras, we need two more steps. The first step is to set up the optimizer. Here again it is a little bit up to you to decide how you want to set up the optimization. Two popular approaches are SGD and ADAM. You will get to choose the learning rate. This rate should however be between 1e-3 and 1e-2. Once you have set up the optimizer, we need to set up the optimization parameters. This includes the loss (we will take it to be the categorical cross entropy which is the extension of the log loss to the multiclass problem).

In [ ]:

from tensorflow.keras.optimizers import SGD
from tensorflow.keras.optimizers import Adam

# set up the optimize here
# Myoptimizer = SGD
# Myoptimizer = Adam

model.compile(loss='categorical_crossentropy',
              optimizer=Myoptimizer,
              metrics=['accuracy'])

Question I.3 (2pts). Optimization¶

The last step is to fit the network to your data. Just as any other function in scikit-learn, we use a call to the function 'fit'. The training of neural networks can be done by splitting the dataset into minibatches and using a different batch at each SGD step. This process is repeated over the whole dataset. A complete screening of the dataset is called an epoch. We can then repeat this idea several times. In keras the number of epochs is stored in the 'epochs' parameter and the batch size is stored in the 'batch_size'.

In [ ]:

batch_size = 32
epochs = 30

model.fit(X, t,batch_size=batch_size,epochs=epochs, validation_split=0.2)

Question II (10pts): Max margin classifiers and outliers¶

Consider the dataset below. We would like to learn a classifier for this dataset that maximizes the margin (i.e. such that the distance between the closest points to the plane is maximized). We have seen that one can solve this problem by means of the constrained formulation

\begin{align*} \min_{\mathbf{\beta}} \quad & \|\mathbf{\beta}\|^2 \\ \text{subject to} \quad & y(\mathbf{x}^{(i)})t^{(i)} \geq 1 \end{align*}

where $y(\mathbf{x}^{(i)}) = \mathbf{\beta}^T\mathbf{x}^{(i)} + \beta_0$. We might sometimes want to use a (softer) unconstrained formulation. in particular, when selecting this option, we can use the following function known as the Hinge loss

\begin{align*} \max(0, 1-t^{(i)}y(\mathbf{x}^{(i)})) = \max(0, 1-t^{(i)}(\mathbf{\beta}^T\mathbf{x}^{(i)}+\beta_0)) \end{align*}

For such a loss, we can derive a softer, unconstrained version of the problem as

\begin{align*} \min_{\mathbf{\beta}} \quad & \|\mathbf{\beta}\|^2 + \frac{C}{N}\sum_{i=1}^N \max(0, 1-t^{(i)}(\mathbf{\beta}^T\mathbf{x}^{(i)}+\beta_0)) \end{align*}

In short we penalize a point, only if this point lies on the wrong side of the plane.

In [7]:

import numpy as np
import matplotlib.pyplot as plt

from scipy.io import loadmat
pointsClass1 = loadmat('KernelPointsEx4class1.mat')['PointsEx4class1']
pointsClass2 = loadmat('KernelPointsEx4class2.mat')['PointsEx4class2']


plt.scatter(pointsClass1[:,0], pointsClass1[:,1], c='r')
plt.scatter(pointsClass2[:,0], pointsClass2[:,1], c='b')
plt.show()

Question II.1 (3pts)¶

Start by completing the function below which should return the value and gradient of the hinge loss at a point $\mathbf{x}^{(i)}$. What is the gradient of the hinge loss?

In [ ]:

def HingeLoss(x):
    
    '''Returns the value and gradient of the hinge 
    loss at the point x'''
    
    return value, gradient

Question II.2 (7pts)¶

Once you have the function, implement a function HingeLossSVC that takes as innput a starting weight vector $\mathbf{\beta}$ and intercept $\beta_0$ as well as the set of training points and a value for the parameter $C$ and returns the maximum margin classifier.

In [ ]:

def HingeLossSVC(beta_init, beta0_init training, C):
    
    '''Returns the maximal margin classifier for the 
    training dataset'''
    
    return beta, beta0

Question III (10pts): Robust PCA for video surveillance¶

Principal Component Analysis (PCA) retains an approximation of an original dataset $X$ by focusing on the largest singular values. Such an order $K$ approximation can be obtained from the singular value decomposition $\boldsymbol U \boldsymbol \Sigma \boldsymbol V^T$ by truncating $\boldsymbol U$ to the first $K$ columns, retaining the $K\times K$ diagonal matrix $\boldsymbol \Sigma_k$ as well as the first $K$ rows of $\boldsymbol V^T$, $\boldsymbol V_k^T$, and writing the approximation as $\boldsymbol U_k \boldsymbol \Sigma_k \boldsymbol V^T_k$. This approach is particularly efficient when each of the feature vectors (or images in this case) are close to each other. When there is sharp variations across images, such as when an object appears, move throughout the images and then dissapears, a simple PCA does not suffice anymore and one might want to extend it to something more robust. The escalator sequence below is an example of such sequence.

In [3]:

import numpy as np
import cv2
import matplotlib.pyplot as plt

# read video

import scipy.io
movie = scipy.io.loadmat('escalator_data.mat')
#frame0 = 
print(np.shape(movie['X'][:,0]))

plt.imshow(movie['X'][:,0].reshape((160,130)).swapaxes(0, 1), cmap='gray')
plt.show()

(20800,)

The idea of Robust PCA is to add a "sparse" component to the traditional PCA decomposition. Given a collection of images that we store as the columns of the matrix $X$, one then looks for a decomposition

\begin{align} \boldsymbol X = \boldsymbol Y + \boldsymbol S \end{align}

Where $Y$ is a matrix which encode the original PCA model, and thus encodes the part of the images that remains approximately constant throughout the sequence, and $\boldsymbol S$ is the sparse part (i.e a sequence of images that are varying through the sequence but only at a precise position in the images, that is to say with most of the pixels being zero). To recover each part one approach is to proceed as follows, see Candes et al.

We let $\mu$ to denote the parameter that controls the amount of dara we want to store in the sparse foreground extraction part, $\boldsymbol S$. The algorithm proceeds as follows

Initialize $Y$, $S$ to $0$

Step 1. Compute the truncated SVD of the matrix $X - S - \mu^{-1}Y$, i.e. let $X - S - \mu^{-1}Y = U\Sigma V^T$. The truncated SVD is then obtained by replacing the diagonal matrix of singular values with the truncation $$ \sigma \leftarrow \text{sign}(\sigma)\max(|\sigma| - \mu, 0) $$

and store it in $L$, $L = SVD_{\mu}(X - S - \mu^{-1}Y)$

Step 2. Apply the thresholding operator $f(x) = \text{sign}(x)\max(|x| - \lambda \mu, 0) $ with threshold $\lambda\mu$ to the entries of the matrix $X - L + \mu^{-1}Y$

Step 3. Update the matrix $Y$ as $Y \leftarrow Y + \mu(X - L - S)$

A good approach to initialize the parameters is to take $lambda = 1/\max(m,n)$ where $\max(m,n)$ is the max number of rows or columns of the data matrix. We can then set $\mu = 0.25*(m*n)/(\sum_{i,j}|X_ij|)$. One can then terminate the algorithm when $\|X-L-S\|_F \leq \delta \|X\|_F$ where $\|X\|_F$ is the Frobenius norm of the matrix and $\delta$ can be taken for example as $10^{-7}$.

Additional indications: if computing the full SVD from linalg is too expensive, you can replace it with the fast randomized PCA from facebook (see fbpca) or a sparse SVD.

Question 3.1. (8pts) Complete the code below which separates the sparse part from the PCA decomposition¶

In [ ]:

import numpy as np 
from __future__ import division
from scipy.sparse.linalg import svds




def robustPCA(X, delta=1e-6, mu=None, maxiter=500):

    '''
    The function should return a PCA like part stored in 'L' with only a few singular values 
    that are non zero and a sparse sequence 'S' in which the images are black except w very 
    limited number of pixels
    '''
    
    
    # Initialize the tuning parameters.
    lam = # put your value for lambda
    if mu is None:
        
        # complete with your value for mu
        
    # Convergence criterion.
    norm = np.sum(X ** 2)

    # Iterate.
    i = 0
    rank = np.min(shape)
    S = np.zeros(shape)
    Y = np.zeros(shape)
    while i < max(maxiter, 1):
        
        
        # Step 1. Compute and truncate the SVD
        
        

        # Step 2. Truncate the entries of X - L + mu^(-1)Y 
        

        # Step 3. Update the matrix Y
        
        

        # Convergence criterion
        err = np.sqrt(np.sum(step ** 2) / norm)
        if err < delta:
            break
        i += 1

    if i >= maxiter:
        break
        
    return L, S

Question 3.2. (2pts)¶

Apply your function to the escalator sequence and display the result on at least one frame. Use subplot to display the extracted background and its corresponding foreground side by side.

In [ ]:

# put your code here