**Given date:** June 13

**Due date:** June 21

**Total: 25pts**

As we saw during the lectures, one approach to learning a (binary) linear discriminant is to combine the sigmoid activation function with the linear discriminant $\beta_0 + \boldsymbol{\beta}^T \mathbf{x} = \tilde{\boldsymbol{\beta}}^T\tilde{\mathbf{x}}$, where $\tilde{\boldsymbol{\beta}} = (\beta_0, \boldsymbol{\beta})$ and $\tilde{\mathbf{x}} = (1, \mathbf{x})$. We then assume that the target $t \in \{0, 1\}$ follows a Bernoulli distribution with parameter $\sigma(\tilde{\boldsymbol{\beta}}^T\tilde{\mathbf{x}})$, i.e. we have

$$\left\{\begin{array}{l} P(t = 1|\mathbf{x}) = \sigma(\tilde{\boldsymbol{\beta}}^T\tilde{\mathbf{x}})\\ P(t = 0|\mathbf{x}) = 1-\sigma(\tilde{\boldsymbol{\beta}}^T\tilde{\mathbf{x}})\end{array}\right.$$

Since the samples are independent, the total likelihood is the product of the individual densities,

$$P(\left\{t^{(i)}\right\}_{i=1}^N) = \prod_{i=1}^N \sigma(\tilde{\boldsymbol{\beta}}^T\tilde{\mathbf{x}}^{(i)})^{t^{(i)}}(1-\sigma(\tilde{\boldsymbol{\beta}}^T\tilde{\mathbf{x}}^{(i)}))^{1-t^{(i)}}$$

We can then take the log and compute the derivatives of the resulting expression with respect to each weight $\beta_j$. Implement this approach below. Recall that the derivative of the sigmoid $\sigma(x)$ has a *simple expression*.

In [ ]:

```
# Step 1: define the sigmoid activation and its derivative
def sigmoid(x):
    '''The function should return the sigmoid and its derivative at all the
    entries of x'''
    # complete: compute sig and deriv_sig
    return sig, deriv_sig

def solve_logisticRegression(xi, ti, beta0, maxIter, eta):
    '''The function should return the vector of weights for a
    logistic regression classifier learned through gradient descent
    iterations applied to the log likelihood function'''
    # complete: compute beta
    return beta
```
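
To make the gradient iterations concrete, here is a minimal sketch (not the reference solution). It assumes the data matrix `X` already contains a leading column of ones, uses the identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$, and uses the gradient $\sum_i (\sigma(\tilde{\boldsymbol{\beta}}^T\tilde{\mathbf{x}}^{(i)}) - t^{(i)})\,\tilde{\mathbf{x}}^{(i)}$ of the negative log likelihood. The helper names, the toy data, the step size and the averaging over samples are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    """Return the sigmoid and its derivative, evaluated entrywise on x."""
    sig = 1.0 / (1.0 + np.exp(-x))
    return sig, sig * (1.0 - sig)

def neg_log_likelihood_grad(beta, X, t):
    """Gradient of the negative log likelihood: X^T (sigma(X beta) - t)."""
    p, _ = sigmoid(X @ beta)
    return X.T @ (p - t)

def gradient_descent(X, t, beta0, max_iter=500, eta=0.1):
    """Plain gradient descent; the gradient is averaged over the samples
    (an assumption made here so that eta does not depend on N)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        beta = beta - eta * neg_log_likelihood_grad(beta, X, t) / len(t)
    return beta

# Hypothetical 1D toy data: two overlapping Gaussian clusters
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
X = np.column_stack([np.ones_like(x), x])  # prepend the constant feature
t = np.concatenate([np.zeros(50), np.ones(50)])

beta = gradient_descent(X, t, np.zeros(2))
```

The learned `beta[1]` should be positive here, since larger values of `x` correspond to the class with target 1.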

An interesting aspect of the maximum likelihood estimator in logistic regression (as opposed to other objective functions) is that the Hessian of the negative log likelihood $\ell$ is positive semidefinite (and positive definite as soon as the data matrix has full column rank), so that the objective is convex. We can thus improve the iterations by using a second order method (such as Newton's method) where the simpler gradient iterations $\boldsymbol{\beta}^{k+1}\leftarrow \boldsymbol{\beta}^k - \eta\nabla \ell(\boldsymbol{\beta}^k)$ are replaced by

$$\boldsymbol{\beta}^{k+1}\leftarrow \boldsymbol{\beta}^k - \eta H^{-1}(\boldsymbol{\beta}^k)\nabla \ell(\boldsymbol{\beta}^k)$$

Start by completing the function 'HessianMLE' below, which should return the Hessian of the negative log likelihood.

In [ ]:

```
def HessianMLE(beta, xi, ti):
    '''Function should return the Hessian (see https://en.wikipedia.org/wiki/Hessian_matrix)
    of the negative log likelihood at a particular value of the weights beta'''
    # complete: compute HessianMatrix
    return HessianMatrix
```
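
As a hint on what the Hessian looks like, differentiating the gradient $X^T(\sigma(X\boldsymbol{\beta}) - \mathbf{t})$ once more gives $H = X^T S X$ with $S = \text{diag}\big(p_i(1-p_i)\big)$ and $p_i = \sigma(\tilde{\boldsymbol{\beta}}^T\tilde{\mathbf{x}}^{(i)})$. The sketch below is one possible way to code this, assuming again that `X` includes the column of ones; the function name and the small example matrix are hypothetical.

```python
import numpy as np

def hessian_neg_log_likelihood(beta, X):
    """Hessian X^T S X of the negative log likelihood,
    with S = diag(p_i (1 - p_i)). The targets do not appear in it."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    s = p * (1.0 - p)
    # Scale each row of X by s_i instead of building the N x N matrix S
    return X.T @ (X * s[:, None])

# Small hypothetical example: three samples, intercept + one feature
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
H = hessian_neg_log_likelihood(np.zeros(2), X)
eigenvalues = np.linalg.eigvalsh(H)
```

At `beta = 0` every $p_i$ equals $1/2$, so `H` reduces to $\tfrac{1}{4}X^TX$, and its eigenvalues are strictly positive because the two columns of `X` are linearly independent.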

In [ ]:

```
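def Fisher_scoring(beta0, xi, ti, maxIter, eta):
    '''Function should compute the logistic regression classifier by relying on Fisher scoring.
    Iterates should start at beta0 and be applied with a learning rate eta'''
    beta = beta0
    numIter = 0
    while numIter < maxIter:
        hessian_beta = HessianMLE(beta, xi, ti)
        gradient = None  # complete: gradient of the negative log likelihood at beta
        # if no zero eigenvalue (the tolerance below is only a suggestion)
        if np.all(np.abs(np.linalg.eigvalsh(hessian_beta)) > 1e-12):
            invHessian = None  # complete
        else:
            print('Error')
            break
        beta = beta - eta*np.matmul(invHessian, gradient)
        numIter += 1
    return beta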
```
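
The following sketch illustrates how the second order iterations compare with plain gradient descent on a small, deterministic, non-separable toy problem. It is only an illustration under stated assumptions: the data, the step sizes and all function names are hypothetical, and the Newton step is obtained with `np.linalg.solve` rather than by forming the inverse Hessian explicitly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(beta, X, t):
    """Negative log likelihood, written as sum(log(1 + e^z) - t z)."""
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - t * z)

def grad(beta, X, t):
    return X.T @ (sigmoid(X @ beta) - t)

def hessian(beta, X):
    p = sigmoid(X @ beta)
    return X.T @ (X * (p * (1.0 - p))[:, None])

def newton(X, t, beta0, max_iter=20, eta=1.0):
    beta, history = np.asarray(beta0, dtype=float), []
    for _ in range(max_iter):
        history.append(nll(beta, X, t))
        # Solve H d = g for the Newton direction d
        step = np.linalg.solve(hessian(beta, X), grad(beta, X, t))
        beta = beta - eta * step
    return beta, history

def gradient_descent(X, t, beta0, max_iter=20, eta=0.05):
    beta, history = np.asarray(beta0, dtype=float), []
    for _ in range(max_iter):
        history.append(nll(beta, X, t))
        beta = beta - eta * grad(beta, X, t)
    return beta, history

# Toy data: labels follow the sign of x, with two labels flipped
# near the boundary so that the classes are not linearly separable
x = np.linspace(-3.0, 3.0, 21)
t = (x > 0).astype(float)
t[8], t[12] = 1.0, 0.0          # x[8] = -0.6 and x[12] = 0.6
X = np.column_stack([np.ones_like(x), x])

beta_newton, nll_newton = newton(X, t, np.zeros(2))
beta_gd, nll_gd = gradient_descent(X, t, np.zeros(2))
```

Plotting `nll_newton` and `nll_gd` against the iteration index (for instance with `plt.plot`) gives the kind of comparison asked for in the next question.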

Compare the simple (first order) gradient iterations with the (second order) Fisher iterations for the dataset given below. Plot the evolution of the log likelihood through the iterations, for both methods.

In [6]:

```
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
class1 = scipy.io.loadmat('class1HW1_LR.mat')['class1']
class2 = scipy.io.loadmat('class2HW1_LR.mat')['class2']
targets_class1 = np.ones(np.shape(class1)[0])
targets_class0 = np.zeros(np.shape(class2)[0])  # class2 is assigned target 0
plt.scatter(class1[:,0], class1[:,1], c = 'r')
plt.scatter(class2[:,0], class2[:,1], c = 'b')
plt.show()
```

In this second question, we will use the Keras API to build and train a convolutional neural network to discriminate between road signs. To simplify, we will consider 4 different signs:

- A '30 km/h' sign (folder 1)
- A 'Stop' sign (folder 2)
- A 'Go straight' sign (folder 3)
- A 'Keep left' sign (folder 4)

An example of each sign is given below.

In [2]:

```
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img1 = mpimg.imread('1/00001_00000_00012.png')
plt.subplot(141)
plt.imshow(img1)
plt.axis('off')
plt.subplot(142)
img2 = mpimg.imread('2/00014_00001_00019.png')
plt.imshow(img2)
plt.axis('off')
plt.subplot(143)
img3 = mpimg.imread('3/00035_00008_00023.png')
plt.imshow(img3)
plt.axis('off')
plt.subplot(144)
img4 = mpimg.imread('4/00039_00000_00029.png')
plt.imshow(img4)
plt.axis('off')
plt.show()
```

In this first part, we will set up the convolutional net step by step.

Before building the network, you should start by cropping the images so that they all have a common predefined size (take the smallest size across all images).
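
One possible way to do the cropping is sketched below. It assumes the images are stored as a list of `height x width x channels` NumPy arrays (as returned by `mpimg.imread`) and crops around the center, which is a choice made here and not something imposed by the assignment. The function name is hypothetical.

```python
import numpy as np

def crop_to_common_size(images):
    """Center-crop every H x W x C image to the smallest height and the
    smallest width found in the list, and stack the result in one array."""
    h = min(img.shape[0] for img in images)
    w = min(img.shape[1] for img in images)
    cropped = []
    for img in images:
        top = (img.shape[0] - h) // 2
        left = (img.shape[1] - w) // 2
        cropped.append(img[top:top + h, left:left + w])
    return np.stack(cropped)

# Two hypothetical RGB images of different sizes
batch = crop_to_common_size([np.zeros((30, 32, 3)), np.ones((28, 40, 3))])
```

Here `batch` has shape `(2, 28, 32, 3)`. If a square input is preferred, one could instead crop both dimensions to the smaller of `h` and `w`.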

We will use a **Sequential model** from Keras, but it will be up to you to define the final structure of the network. The construction of a sequential model should be started with the following lines

In [ ]:

```
from tensorflow.keras import Sequential
model = Sequential()
```

We will use a **convolutional** architecture. You can add convolutional layers to the model by using the following lines

In [ ]:

```
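from tensorflow.keras.layers import Conv2D
# channels-last input (height, width, channels), the default data format in tensorflow.keras
model.add(Conv2D(num_units, (filter_size1, filter_size2), padding='same',
                 input_shape=(IMG_SIZE, IMG_SIZE, 3),
                 activation='relu'))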
```

for the first layer and

In [ ]:

```
model.add(Conv2D(filters, filter_size, activation=activation))
```

for all the other layers. The 'filters' parameter indicates the number of filters you want to use in the layer, 'filter_size' encodes the size of each filter, and 'activation' can be used to specify the activation function $\sigma$ that will be applied to the output of the layer, i.e.

$$x_{\text{out}} = \sigma(\text{filter}*\text{input}).$$

Finally, 'input_shape' encodes the size of the input. Note that the input layer is the only layer for which the input size should be explicitly specified. Subsequent layers will automatically compute the size of their inputs based on the outputs of the previous layers.
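
To illustrate the formula $x_{\text{out}} = \sigma(\text{filter}*\text{input})$, here is a minimal NumPy sketch for a single-channel image and a single filter with odd side lengths, using zero padding so that the output keeps the size of the input (the behaviour requested by `padding='same'`) and a ReLU activation. It is a simplified illustration, not how Keras implements the layer, and the function name is hypothetical. Like most deep learning libraries, it slides the filter without flipping it (cross-correlation).

```python
import numpy as np

def conv2d_same_relu(image, kernel):
    """Apply one filter to a 2D image with zero padding, then a ReLU."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))  # zero padding
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)  # ReLU activation

image = np.arange(16.0).reshape(4, 4) - 8.0
identity_kernel = np.zeros((3, 3))
identity_kernel[1, 1] = 1.0
x_out = conv2d_same_relu(image, identity_kernel)
```

With this particular filter the convolution leaves the image unchanged, so `x_out` is simply the ReLU of `image` and has the same `4 x 4` shape.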

On top of the convolutional layers, convolutional neural networks (CNN) also involve **Pooling layers**. The addition of such a layer can be done through the following line

In [ ]:

```
model.add(MaxPooling2D(pool_size=(filter_sz1, filter_sz2),strides=None))
```

The **pooling layers** come with two parameters: the 'pool size' and the 'stride'. The basic choice for the pool size is (2,2), and the stride is usually set to None (in which case it defaults to the pool size, so that the image is split into non-overlapping regions, as in the Figure below). You should however feel free to play a little with those parameters. A **Max Pooling operator** slides a mask of size 'pool_size' over the image, moving it by a number of pixels equal to the stride parameters (one in x and one in y, hence two translation parameters). For each position of the mask, the output is the max of the pixels appearing in the mask (again, see the Figure below). One way to understand the effect of a pooling operator is that when a filter detects an edge in a subregion of the image (thus returning at least one large value), the MaxPooling operation will keep track of this information even though it reduces the resolution.

Adding 'Maxpooling' layers is known to work well in practice for image processing tasks.
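
The effect of a max pooling operator with non-overlapping regions can be reproduced in a few lines of NumPy. The sketch below assumes a single-channel image and a stride equal to the pool size; rows and columns that do not fill a complete region are dropped. The function name is hypothetical.

```python
import numpy as np

def max_pool2d(image, pool_size=(2, 2)):
    """Max pooling over non-overlapping regions of a 2D image."""
    ph, pw = pool_size
    h, w = image.shape[0] // ph, image.shape[1] // pw
    # Group the pixels into (h, ph, w, pw) blocks and take the max per block
    blocks = image[:h * ph, :w * pw].reshape(h, ph, w, pw)
    return blocks.max(axis=(1, 3))

image = np.arange(16).reshape(4, 4)
pooled = max_pool2d(image)
# pooled is [[ 5,  7],
#            [13, 15]]
```

Each entry of `pooled` is the largest pixel of the corresponding `2 x 2` region, so the resolution is halved in each direction while the largest responses are preserved.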

Once you have stacked the convolutional and pooling layers, you should flatten the output through a line of the form

In [ ]:

```
model.add(Flatten())
```

You can then add fully connected (dense) layers on top of the flattened output through a line of the form

In [ ]:

```
model.add(Dense(num_units, activation='relu'))
```

Since there are four possible signs, you need to **finish your network with a dense layer consisting of 4 units**. Each of those units should output a number between 0 and 1 representing the probability that the corresponding sign is detected. Correspondingly, those numbers should satisfy $n_1 + n_2 + n_3 + n_4 = 1$ (hopefully with one $n_i$ larger than the others). For this reason, a good choice for the **final activation function** of those four units is the **softmax** (Why?).
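
As a hint for the question above, the softmax maps any vector of scores $z$ to $n_i = e^{z_i}/\sum_j e^{z_j}$. A small NumPy sketch (the function name and the example scores are hypothetical; subtracting the maximum is a standard trick that avoids overflow without changing the result):

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, -1.0, 0.5, 0.0])  # raw outputs of the 4 units
n = softmax(scores)
```

The entries of `n` are all strictly between 0 and 1, they sum to 1, and the largest score keeps the largest probability.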

Build your model below.

In [ ]:

```
model = Sequential()
# construct the model using convolutional layers, dense fully connected layers and
```

Once you have found a good architecture for your network, split the dataset, by retaining about 90% of the images for training and 10% for test. To train the network in Keras, we need two more steps. The first step will set up the optimizer. Here again it is up to you to decide how you want to set up the optimization. Two popular approaches are **SGD and ADAM**. You will get to choose the learning rate (although it is a good idea to take it between 1e-3 and 1e-2). Once you have set up the optimizer, you need to specify the loss (we will take it to be the **categorical cross entropy** which is the extension of the log loss to the multiclass problem).
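
The categorical cross entropy expects the targets as one-hot vectors and compares them with the softmax outputs through $-\frac{1}{N}\sum_i\sum_k t^{(i)}_k \log n^{(i)}_k$. The sketch below is a plain NumPy illustration of that formula, not the Keras implementation; the function names, the clipping constant and the example labels are assumptions.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels (0, ..., num_classes-1) into one-hot rows."""
    return np.eye(num_classes)[labels]

def categorical_cross_entropy(t_onehot, probs, eps=1e-12):
    """Average cross entropy between one-hot targets and predicted probabilities."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(t_onehot * np.log(probs), axis=1))

labels = np.array([0, 3])
t = one_hot(labels, num_classes=4)

uniform = np.full((2, 4), 0.25)                       # uninformative prediction
loss_uniform = categorical_cross_entropy(t, uniform)  # equals log(4)
loss_perfect = categorical_cross_entropy(t, t)        # equals 0
```

A network that has learned nothing yet should therefore start with a loss close to $\log 4 \approx 1.386$, which is a useful sanity check when looking at the first epochs.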

In [ ]:

```
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.optimizers import Adam
# set up the optimizer here (the learning rates below are only starting points)
# Myoptimizer = SGD(learning_rate=1e-2)
Myoptimizer = Adam(learning_rate=1e-3)
model.compile(loss='categorical_crossentropy',
optimizer=Myoptimizer,
metrics=['accuracy'])
```

Our last step will consist in fitting the network to the training set. Just as for any implementation in scikit-learn, we will rely on the function 'fit'. In image processing tasks, the training of convolutional neural networks is usually done by splitting the dataset into minibatches and using a different batch for each SGD iteration. This process is repeated until the whole dataset has been covered. A complete pass over the dataset is known as an 'epoch'. The complete training step then repeats several epochs. In Keras, the number of epochs is set through the 'epochs' parameter of the function 'fit' and the batch size is set through the 'batch_size' parameter. Plot the evolution of the loss through the SGD iterations.

In [ ]:

```
from sklearn.model_selection import train_test_split
batch_size = 32
epochs = 30
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.1, random_state=1)
history = model.fit(X_train, t_train, batch_size=batch_size, epochs=epochs, validation_split=0.15)  # keep the History object to plot the loss
model.evaluate(X_test, t_test)
```
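
To relate epochs, batch size and the number of SGD iterations, the following sketch counts the updates performed during training. The number of training images (765) is purely hypothetical; the batch size and number of epochs are those of the cell above.

```python
import math

def sgd_iterations(n_samples, batch_size, epochs):
    """Return (iterations per epoch, total iterations) for minibatch SGD.
    The last batch of an epoch may be smaller than batch_size."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

steps, total = sgd_iterations(n_samples=765, batch_size=32, epochs=30)
# steps == 24 and total == 720
```

Note that the object returned by `fit` in Keras stores one loss value per epoch (in its `history` dictionary, under the key `'loss'`), so a plot built from it has one point per epoch rather than one per SGD iteration.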