Additional readings (To go further):

- Ian Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning
- Robust principal component analysis? EJ Candès, X Li, Y Ma, J Wright
- Saharon Rosset, Ji Zhu and Trevor Hastie, Margin Maximizing Loss Functions

The assignment is divided into three parts. In the first part, we will go back to neural networks. You will be asked to build and train a convolutional neural network for image classification. In the second part, we will focus on the max margin classifier and study how such a classifier can be learned by means of gradient descent. Finally, in the last part, we will implement a principal component decomposition of a video sequence to extract moving targets from their background.

In this first question, we will use the Keras API to build and train a convolutional neural network to discriminate between four types of road signs. To simplify we will consider the detection of 4 different signs:

- A '30 km/h' sign (folder 1)
- A 'Stop' sign
- A 'Go straight' sign
- A 'Keep left' sign

An example of each sign is given below.

In [2]:

```
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img1 = mpimg.imread('1/00001_00000_00012.png')
plt.subplot(141)
plt.imshow(img1)
plt.axis('off')
plt.subplot(142)
img2 = mpimg.imread('2/00014_00001_00019.png')
plt.imshow(img2)
plt.axis('off')
plt.subplot(143)
img3 = mpimg.imread('3/00035_00008_00023.png')
plt.imshow(img3)
plt.axis('off')
plt.subplot(144)
img4 = mpimg.imread('4/00039_00000_00029.png')
plt.imshow(img4)
plt.axis('off')
plt.show()
```

In this exercise, you need to build and train a convolutional neural network to discriminate between the four images.

Before building the network, you should start by cropping the images so that they all have a common predefined size (take the smallest size across all images)

We will use a

**Sequential model**from Keras but it will be up to you to define the structure of the convolution net. Initialization of the sequential model can be done with the following line

In [ ]:

```
model = Sequential()
```

- We will use a
**convolutional**architecture. you can add convolutional layers to the model by using the following lines

In [ ]:

```
model.add(Conv2D(num_units, (filter_size1, filter_size2), padding='same',
input_shape=(3, IMG_SIZE, IMG_SIZE),
activation='relu'))
```

for the first layer and

In [ ]:

```
model.add(Conv2D(filters, filter_size, activation, input_shape)
```

for all the others. 'filters' indicate the number of filters you want to use in the convolutional layer. filter*size is the size of each filter and activation is the usual activation that comes on top of the convolution, i.e.
$x*{\text{out}} = \sigma(\text{filter}*\text{input})$. Finally input_shape indicates the size of your input. Note that only the input layer should be given the input size. Subsequent layers will automatically compute the size of their inputs based on previous layers.

On top of the convolutional layers, convolutional neural networks (CNN) also often rely on **Pooling layers**. The addition of such a layer can be done through the following line

In [ ]:

```
model.add(MaxPooling2D(pool_size=(filter_sz1, filter_sz2),strides=None))
```

The *pooling layers* usually come with two parameters: the 'pool size' and the 'stride' parameter. The basic choice for the pool size is (2,2) and the stride is usually set to None (which means it will split the image into non overlapping regions such as in the Figure below). You should however feel free to play a little with those parameters. The **MaxPool operator** considers a mask of size 'pool_size' which is slided over the image by a number of pixels equal to the stride parameters (in x and y, there are hence two translation parameters). for each position of the mask, the output only retains the max of the pixels appearing in the mask (This idea is illustrated below). One way to understand the effect of the pooling operator is that if the filter detects an edge in a subregion of the image (thus returning at least one large value), although a MaxPooling will reduce the number of parameters, it will keep track of this information.

Adding 'Maxpooling' layers is known to work well in practice.

Although it is a little bit up to you to decide how you want to structure the network, a good start is to add a couple (definitely not exceeding 4) combinations (convolution, convolution, Pooling) with increasing number of units (you do every power of two like 16, 32, 128,...).

Once you have stacked the convolutional and pooling layers, you should flatten the output through a line of the form

In [ ]:

```
model.add(Flatten())
```

And add a couple (no need to put more than 2,3) dense fully connected layers through lines of the form

In [ ]:

```
model.add(Dense(num_units, activation='relu'))
```

Since there are four possible signs, you need to **finish your network with a dense layer with 4 units**. Each of those units should output four number between 0 and 1 representing the likelihood that any of the four signs is detected and such that $p_1 + p_2 + p_3 + p_4 = 1$ (hopefully with one probability much larger than the others). For this reason, a good choice for the **final activation function** of those four units is the **softmax** (Why?).

Build your model below.

In [ ]:

```
model = Sequential()
# construct the model using convolutional layers, dense fully connected layers and
```

Once you have found a good architecture for your network, split the dataset, by retaining about 90% of the images for training and 10% of each folder for test. To train your network in Keras, we need two more steps. The first step is to set up the optimizer. Here again it is a little bit up to you to decide how you want to set up the optimization. Two popular approaches are **SGD and ADAM**. You will get to choose the learning rate. This rate should however be between 1e-3 and 1e-2. Once you have set up the optimizer, we need to set up the optimization parameters. This includes the loss (we will take it to be the **categorical cross entropy** which is the extension of the log loss to the multiclass problem).

In [ ]:

```
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.optimizers import Adam
# set up the optimize here
# Myoptimizer = SGD
# Myoptimizer = Adam
model.compile(loss='categorical_crossentropy',
optimizer=Myoptimizer,
metrics=['accuracy'])
```

The last step is to fit the network to your data. Just as any function in scikit-learn, we use a call to the function 'fit'. The training of neural networks can be done by splitting the dataset into minibatches and using a different batch at each SGD step. This process is repeated over the whole dataset. A complete screening of the dataset is called an epoch. We can then repeat this idea several times. In keras the number of epochs is stored in the 'epochs' parameter and the batch size is stored in the 'batch_size' parameter.

In [ ]:

```
batch_size = 32
epochs = 30
model.fit(X, t,batch_size=batch_size,epochs=epochs, validation_split=0.2)
```

Consider the dataset below. We would like to learn a classifier for this dataset that maximizes the margin (i.e. such that the distance between the closest points to the classifier is maximized). We have seen that one can solve this problem by means of the constrained formulation

\begin{align*} \min_{\mathbf{\beta}} \quad & \|\mathbf{\beta}\|^2 \\ \text{subject to} \quad & y(\mathbf{x}^{(i)})t^{(i)} \geq 1 \end{align*}where $y(\mathbf{x}^{(i)}) = \mathbf{\beta}^T\mathbf{x}^{(i)} + \beta_0$. We might sometimes want to use a (softer) unconstrained formulation. in particular, when selecting this option, we can use the following function known as the *Hinge loss*

For such a loss, we can derive a softer, unconstrained version of the problem as

\begin{align*} \min_{\mathbf{\beta}} \quad & \|\mathbf{\beta}\|^2 + \frac{C}{N}\sum_{i=1}^N \max(0, 1-t^{(i)}(\mathbf{\beta}^T\mathbf{x}^{(i)}+\beta_0)) \end{align*}In short we penalize a point, only if this point lies on the wrong side of the plane.

In [2]:

```
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
pointsClass1 = loadmat('KernelPointsEx4class1.mat')['PointsEx4class1']
pointsClass2 = loadmat('KernelPointsEx4class2.mat')['PointsEx4class2']
plt.scatter(pointsClass1[:,0], pointsClass1[:,1], c='r')
plt.scatter(pointsClass2[:,0], pointsClass2[:,1], c='b')
plt.show()
```

Start by completing the function below which should return the value and gradient of the hinge loss at a point $\mathbf{x}^{(i)}$. What is the gradient of the hinge loss?

In [ ]:

```
def HingeLoss(x):
'''Returns the value and gradient of the hinge
loss at the point x'''
return value, gradient
```

Once you have the function, implement a function HingeLossSVC that takes as innput a starting weight vector $\mathbf{\beta}$ and intercept $\beta_0$ as well as the set of training points and a value for the parameter $C$ and returns the maximum margin classifier.

In [ ]:

```
def HingeLossSVC(beta_init, beta0_init training, C):
'''Returns the maximal margin classifier for the
training dataset'''
return beta, beta0
```

Principal Component Analysis (PCA) retains an approximation of an original dataset $X$ by focusing on the largest singular values. Such an order $K$ approximation can be obtained from the singular value decomposition $\boldsymbol U \boldsymbol \Sigma \boldsymbol V^T$ by truncating $\boldsymbol U$ to the first $K$ columns, retaining the $K\times K$ diagonal matrix $\boldsymbol \Sigma_k$ as well as the first $K$ rows of $\boldsymbol V^T$, $\boldsymbol V_k^T$, and writing the approximation as $\boldsymbol U_k \boldsymbol \Sigma_k \boldsymbol V^T_k$. This approach is particularly efficient when each of the feature vectors (or images in this case) are close to each other. When there are sharp variations across images, such as when an object appears, move throughout the images and then disapears, a simple PCA does not suffice anymore and one might want to extend it to something more robust. The escalator sequence below is an example of such sequence (the movie can be found on github).

In [3]:

```
import numpy as np
import cv2
import matplotlib.pyplot as plt
# read video
import scipy.io
movie = scipy.io.loadmat('escalator_data.mat')
#frame0 =
print(np.shape(movie['X'][:,0]))
plt.imshow(movie['X'][:,0].reshape((160,130)).swapaxes(0, 1), cmap='gray')
plt.show()
```

The idea of Robust PCA is to add a "sparse" component to the traditional PCA decomposition. Given a collection of images that we store as the columns of the matrix $X$, one then looks for a decomposition

\begin{align} \boldsymbol X = \boldsymbol Y + \boldsymbol S \end{align}Where $Y$ is a matrix which contains the original PCA decomposition, and thus encodes the part of the images that remains approximately constant throughout the sequence (background), and $\boldsymbol S$ is the sparse part (i.e a sequence of images that are varying through the sequence but only at a small number of pixel positions in the images (foreground), in other words most of the pixels in $S$ are assumed to be zero). To recover each part one approach is to proceed as follows, see Candes et al.

We let $\mu$ to denote the parameter that controls the amount of dara we want to store in the sparse foreground extraction part, $\boldsymbol S$. A good choice is to take $\mu = n_1n_2/4\|\mathbf{X}\|_1$ where $\mathbf{X}\in\mathbb{R}^{n_1\times n_2}$ encodes the video surveillance sequence (each frame being stacked as a column of $\mathbf{X}$) and $\|\mathbf{X}\|_1$ denotes the entry wise $\ell_1$ norm of $\mathbf{X}$, $\mathbf{X} = \sum_{ij} |X_{ij}|$

The algorithm then proceeds as follows

**Initialize** $Y$, $S$ to $0$

**Step 1.** Compute the truncated SVD of the matrix $X - S - \mu^{-1}Y$, i.e. let $X - S - \mu^{-1}Y = U\Sigma V^T$. The truncated SVD is then obtained by replacing the diagonal matrix of singular values with the truncation
$$
\sigma \leftarrow \text{sign}(\sigma)\max(|\sigma| - \mu, 0)
$$

and store it in $L$, $L = SVD_{\mu}(X - S - \mu^{-1}Y)$

**Step 2.** Apply the thresholding operator $f(x) = \text{sign}(x)\max(|x| - \lambda \mu, 0) $ with threshold $\lambda\mu$ to the entries of the matrix $X - L + \mu^{-1}Y$ and store the result in $\mathbf{S}$.

**Step 3.** Update the matrix $Y$ as $Y \leftarrow Y + \mu(X - L - S)$

A good approach to initialize the parameters is to take $lambda = 1/\max(n_1,n_2)$ where $\max(n_1,n_2)$ is the max number of rows or columns of the data matrix. One can also terminate the algorithm when $\|X-L-S\|_F \leq \delta \|X\|_F$ where $\|X\|_F$ is the Frobenius norm of the matrix and $\delta$ can be taken for example as $10^{-7}$.

Additional indications: if computing the full SVD from linalg is too expensive, you can replace it with the fast randomized PCA from facebook (see fbpca) or a sparse SVD.

In [ ]:

```
import numpy as np
from __future__ import division
from scipy.sparse.linalg import svds
def robustPCA(X, delta=1e-6, mu=None, maxiter=500):
'''
The function should return a PCA like part stored in 'L' with only a few singular values
that are non zero and a sparse sequence 'S' in which the images are black except w very
limited number of pixels
'''
# Initialize the tuning parameters.
lam = # put your value for lambda
if mu is None:
# complete with your value for mu
# Convergence criterion.
norm = np.sum(X ** 2)
# Iterate.
i = 0
rank = np.min(shape)
S = np.zeros(shape)
Y = np.zeros(shape)
while i < max(maxiter, 1):
# Step 1. Compute and truncate the SVD
# Step 2. Truncate the entries of X - L + mu^(-1)Y
# Step 3. Update the matrix Y
# Convergence criterion
err = np.sqrt(np.sum(step ** 2) / norm)
if err < delta:
break
i += 1
if i >= maxiter:
break
return L, S
```

Apply your function to the escalator sequence and display the result on at least one frame. Use subplot to display the extracted background and its corresponding foreground side by side.

In [ ]:

```
# put your code here
```