Notebook

Introduction to Machine Learning, Summer 2022

Assignment 2

Given date: June 13

Due date: June 21

Total: 25pts

Question 1. Logistic regression (15pts)

Question 1.1. Logistic regression (5pts)

As we saw during the lectures, one approach to learning a (binary) linear discriminant is to combine the sigmoid activation function with the linear discriminant $\beta_0 + \mathbf{\beta}^T \mathbf{x}$. Writing $\tilde{\mathbf{\beta}} = (\beta_0, \mathbf{\beta})$ and $\tilde{\mathbf{x}} = (1, \mathbf{x})$, we then assume that the target ($0$ vs $1$) follows a Bernoulli distribution with parameter $\sigma(\tilde{\mathbf{\beta}}^T\tilde{\mathbf{x}})$. Dropping the tildes (i.e. absorbing the bias $\beta_0$ into $\mathbf{\beta}$), we have

$$\left\{\begin{array}{l} P(t = 1|\mathbf{x}) = \sigma(\mathbf{\beta}^T\mathbf{x})\\ P(t = 0|\mathbf{x}) = 1-\sigma(\mathbf{\beta}^T\mathbf{x})\end{array}\right.$$

The likelihood of the whole dataset can then be read off as the product of the independent densities of the samples $(\mathbf{x}^{(i)}, t^{(i)})$,

$$P(\left\{t^{(i)}\right\}_{i=1}^N) = \prod_{i=1}^N \sigma(\mathbf{\beta}^T\mathbf{x}^{(i)})^{t^{(i)}}(1-\sigma(\mathbf{\beta}^T\mathbf{x}^{(i)}))^{1-t^{(i)}}$$

We can then take the log and compute the derivatives of the resulting expression with respect to each weight $\beta_j$. Implement this approach below. Recall that the derivative of the sigmoid $\sigma(x)$ has a simple expression.
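The "simple expression" mentioned above is the identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$. The following is a minimal sketch that checks this identity numerically against a central finite difference. The name `sigmoid_sketch`, the test grid and the step `h` are illustrative choices, not part of the assignment template.

```python
import numpy as np

def sigmoid_sketch(x):
    """Return the sigmoid of x and its derivative, elementwise."""
    x = np.asarray(x, dtype=float)
    sig = 1.0 / (1.0 + np.exp(-x))
    deriv_sig = sig * (1.0 - sig)  # identity: sigma'(x) = sigma(x) * (1 - sigma(x))
    return sig, deriv_sig

# Compare the closed-form derivative with a central finite difference.
x = np.linspace(-4.0, 4.0, 9)
h = 1e-5
sig, deriv_sig = sigmoid_sketch(x)
finite_diff = (sigmoid_sketch(x + h)[0] - sigmoid_sketch(x - h)[0]) / (2 * h)
max_gap = np.max(np.abs(deriv_sig - finite_diff))
print(max_gap)
```

The printed gap should be tiny, which is a quick sanity check worth repeating on your own implementation.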

In [ ]:

# Step 1 define the sigmoid activation and its derivative

def sigmoid(x):

    '''the function should return the sigmoid and its derivative at all the 
    entries of x '''
    
    return sig, deriv_sig

def solve_logisticRegression(xi, ti, beta0, maxIter, eta):
    
    '''The function should return the vector of weights for a 
    logistic regression classifier learned through gradient descent 
    iterations applied to the log likelihood function'''
    
    return beta
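To make the objective concrete, here is a hedged sketch of the negative log likelihood and its gradient, $\nabla \ell(\mathbf{\beta}) = X^T(\sigma(X\mathbf{\beta}) - \mathbf{t})$, followed by a few plain gradient steps. It assumes that the data matrix `X` carries a leading column of ones for the bias. The toy data, the function names and the step size are made up for illustration and are not the expected solution.

```python
import numpy as np

def neg_log_likelihood(beta, X, t):
    """Negative log likelihood; X has a leading column of ones for the bias."""
    z = X @ beta
    # log(1 + exp(z)) - t*z equals -[t*log(sig) + (1-t)*log(1-sig)]
    return np.sum(np.logaddexp(0.0, z) - t * z)

def gradient_nll(beta, X, t):
    """Gradient of the negative log likelihood: X^T (sigma(X beta) - t)."""
    sig = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (sig - t)

# Toy, overlapping 1-D data (made up for illustration).
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
t = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

beta = np.zeros(2)
eta = 0.1
losses = [neg_log_likelihood(beta, X, t)]
for _ in range(200):
    beta = beta - eta * gradient_nll(beta, X, t)
    losses.append(neg_log_likelihood(beta, X, t))
print(losses[0], losses[-1])
```

Minimizing the negative log likelihood is the same as maximizing the log likelihood, so the list `losses` is (up to sign) the kind of curve Question 1.3 asks you to plot.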

Question 1.2. Logistic regression and Fisher scoring (5pts)

An interesting aspect of maximum likelihood estimation in logistic regression (as opposed to other objective functions) is that the Hessian of the negative log likelihood $\ell$ is positive semidefinite (and positive definite when the data matrix has full column rank). We can thus improve the iterations by using a second order method (such as Newton's method), where the simpler gradient iterations $ \mathbf{\boldsymbol \beta}^{k+1}\leftarrow \mathbf{\boldsymbol \beta}^k - \eta\nabla \ell(\mathbf{\boldsymbol \beta}^k)$ are replaced by

$$\mathbf{\boldsymbol \beta}^{k+1}\leftarrow \mathbf{\boldsymbol \beta}^k - \eta H^{-1}({{\boldsymbol \beta}^k})\nabla \ell(\mathbf{\boldsymbol \beta}^k)$$

(see e.g. here for more details). Start by completing the function 'HessianMLE' below, which should return the Hessian of the negative log likelihood.

In [ ]:

def HessianMLE(beta, xi, ti):
    
    '''Function should return the Hessian (see https://en.wikipedia.org/wiki/Hessian_matrix) 
    of the negative log likelihood at a particular value of the weights beta'''
    
    return HessianMatrix
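For the Bernoulli model of Question 1.1, differentiating the gradient $X^T(\sigma(X\mathbf{\beta}) - \mathbf{t})$ once more gives the Hessian $X^T S X$ with $S = \text{diag}\big(\sigma_i(1-\sigma_i)\big)$. The sketch below builds this matrix on made-up data and inspects its eigenvalues. The function name, the random design matrix and the weights are assumptions for illustration only.

```python
import numpy as np

def hessian_nll_sketch(beta, X):
    """Hessian of the negative log likelihood: X^T S X with S = diag(sig * (1 - sig))."""
    sig = 1.0 / (1.0 + np.exp(-(X @ beta)))
    weights = sig * (1.0 - sig)          # diagonal of S, all entries in (0, 1/4]
    return X.T @ (weights[:, None] * X)  # avoids forming the N x N diagonal matrix

# Toy design matrix: a column of ones (bias) plus two features (values are made up).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
beta = np.array([0.3, -1.0, 0.5])

H = hessian_nll_sketch(beta, X)
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)
```

Note that this Hessian does not depend on the targets. Its eigenvalues are nonnegative, and strictly positive when the columns of `X` are linearly independent.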

Then complete the function 'Fisher_scoring' which should learn a logistic regression classifier based on the second order Fisher iterations.

In [ ]:

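def Fisher_scoring(beta0, xi, ti, maxIter, eta):
    
    '''Function should compute the logistic regression classifier by relying on Fisher scoring.
    Iterates should start at beta0 and be applied with a learning rate eta'''

    beta = beta0
    numIter = 0

    while numIter < maxIter:
    
        hessian_beta = HessianMLE(beta, xi, ti)
        gradient = None  # complete: gradient of the negative log likelihood at beta
        
        # if no zero eigenvalue (the tolerance 1e-12 is an arbitrary choice)
        if np.all(np.abs(np.linalg.eigvalsh(hessian_beta)) > 1e-12):
            invHessian = None  # complete
        else:
            print('Error')
            break
        
        beta = beta - eta*np.matmul(invHessian, gradient)
        numIter += 1
    
    return beta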
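As a point of comparison, here is a self-contained sketch of damped Newton iterations on a toy problem. It deliberately departs from the template above in one respect: rather than inverting the Hessian, it solves the linear system $H d = \nabla \ell$ with `np.linalg.solve`, which is usually the numerically safer choice. All names and data are hypothetical.

```python
import numpy as np

def newton_logistic_sketch(X, t, beta0, max_iter=10, eta=1.0):
    """Damped Newton iterations on the negative log likelihood (illustrative only)."""
    beta = np.array(beta0, dtype=float)
    history = []
    for _ in range(max_iter):
        z = X @ beta
        history.append(np.sum(np.logaddexp(0.0, z) - t * z))  # negative log likelihood
        sig = 1.0 / (1.0 + np.exp(-z))
        gradient = X.T @ (sig - t)
        hessian = X.T @ ((sig * (1.0 - sig))[:, None] * X)
        # Solve H d = gradient instead of forming the inverse explicitly.
        step = np.linalg.solve(hessian, gradient)
        beta = beta - eta * step
    return beta, history

# Toy, non-separable 1-D data (made up for illustration).
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
t = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

beta_newton, nll_history = newton_logistic_sketch(X, t, beta0=np.zeros(2))
print(beta_newton, nll_history[-1])
```

On data that is linearly separable the maximum likelihood weights diverge, so a sketch like this is only meaningful when the classes overlap.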

Question 1.3. Comparing the two approaches. (5pts)

Compare the simple (first order) gradient iterations with the (second order) Fisher iterations for the dataset given below. Plot the evolution of the log likelihood through the iterations, for both methods.

In [6]:

import numpy as np
import matplotlib.pyplot as plt

import scipy.io
class1 = scipy.io.loadmat('class1HW1_LR.mat')['class1']
class2 = scipy.io.loadmat('class2HW1_LR.mat')['class2']

targets_class1 = np.ones(np.shape(class1)[0])
targets_class0 = np.zeros(np.shape(class2)[0])  # points of class2 get target 0

plt.scatter(class1[:,0], class1[:,1], c = 'r')
plt.scatter(class2[:,0], class2[:,1], c = 'b')
plt.show()
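Before running either method, the two point clouds have to be merged into a single data matrix and a single target vector. The following sketch shows one way to do this, assuming each class is stored as an array with one point per row and two columns, as the scatter plot above suggests. Synthetic arrays stand in for the `.mat` files, and every name ending in `_demo` is hypothetical.

```python
import numpy as np

# Synthetic stand-ins for the two (N_k, 2) arrays loaded from the .mat files.
rng = np.random.default_rng(1)
class1_demo = rng.normal(loc=[2.0, 2.0], size=(5, 2))
class2_demo = rng.normal(loc=[-2.0, -2.0], size=(7, 2))

# Stack the points, then prepend a column of ones so that beta[0] plays the role of beta_0.
points = np.vstack([class1_demo, class2_demo])
X_demo = np.column_stack([np.ones(points.shape[0]), points])

# Targets: 1 for the first class, 0 for the second, in the same order as the rows of X_demo.
t_demo = np.concatenate([np.ones(class1_demo.shape[0]), np.zeros(class2_demo.shape[0])])
print(X_demo.shape, t_demo.shape)
```

With the bias absorbed this way, the weight vector has three entries for two-dimensional inputs.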

Question 2. Convolutional nets and autonomous driving (10pts)

In this second question, we will use the Keras API to build and train a convolutional neural network that discriminates between road signs. To simplify, we will consider 4 different signs:

A '30 km/h' sign (folder 1)
A 'Stop' sign (folder 2)
A 'Go straight' sign (folder 3)
A 'Keep left' sign (folder 4)

An example of each sign is given below.

In [2]:

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img1 = mpimg.imread('1/00001_00000_00012.png')
plt.subplot(141)
plt.imshow(img1)
plt.axis('off')
plt.subplot(142)
img2 = mpimg.imread('2/00014_00001_00019.png')
plt.imshow(img2)
plt.axis('off')
plt.subplot(143)
img3 = mpimg.imread('3/00035_00008_00023.png')
plt.imshow(img3)
plt.axis('off')
plt.subplot(144)
img4 = mpimg.imread('4/00039_00000_00029.png')
plt.imshow(img4)
plt.axis('off')
plt.show()

Question 2.1. Constructing the network (5pts)

In this first part, we will set up the convolutional net step by step.

Before building the network, you should start by cropping the images so that they all have a common predefined size (take the smallest size across all images).
We will use a Sequential model from Keras, but it will be up to you to define the final structure of the network. The construction of a sequential model starts with the following lines

In [ ]:

from tensorflow.keras import Sequential
model = Sequential()
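For the cropping step mentioned before the model is created, one possible approach is a center crop to the smallest side length found in the dataset. The sketch below illustrates this on random arrays. The choice of a center crop (rather than, say, a corner crop or a resize), the helper `center_crop` and the image sizes are all assumptions made for illustration.

```python
import numpy as np

def center_crop(img, size):
    """Crop a (height, width, channels) array to (size, size, channels) around its center."""
    height, width = img.shape[:2]
    top = (height - size) // 2
    left = (width - size) // 2
    return img[top:top + size, left:left + size]

# Hypothetical images of different sizes (random pixels stand in for the road signs).
rng = np.random.default_rng(0)
images = [rng.random((h, w, 3)) for h, w in [(32, 30), (45, 48), (28, 36)]]

# Smallest side length across all images, used as the common size.
img_size = min(min(img.shape[:2]) for img in images)
X_demo = np.stack([center_crop(img, img_size) for img in images])
print(X_demo.shape)
```

Once all images share the same shape they can be stacked into a single array with one image per entry along the first axis.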

2.1.a. Convolutions.

We will use a convolutional architecture. You can add convolutional layers to the model by using the following lines

In [ ]:

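from tensorflow.keras.layers import Conv2D

# input_shape follows the default channels-last layout: (height, width, channels)
model.add(Conv2D(num_units, (filter_size1, filter_size2), padding='same',
                             input_shape=(IMG_SIZE, IMG_SIZE, 3),
                             activation='relu'))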
                                        

for the first layer and

In [ ]:

model.add(Conv2D(filters, filter_size, activation=activation))

for all the other layers. The 'filters' parameter indicates the number of filters you want to use in the layer. 'filter_size' encodes the size of each filter and 'activation' can be used to specify the activation function that will be applied to the output of the layer, i.e.

$$x_{\text{out}} = \sigma(\text{filter}*\text{input}).$$

Finally, 'input_shape' encodes the size of the input. Note that the input layer is the only layer for which the input size should be explicitly specified. Subsequent layers will automatically compute the size of their inputs based on the outputs of the previous layers.
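To see what $x_{\text{out}} = \sigma(\text{filter}*\text{input})$ does on a concrete input, here is a small NumPy sketch with a single hand-written filter and a ReLU activation. It uses no padding, so the output is smaller than the input, and like most deep learning libraries it does not flip the filter. The image, the filter values and the function name are made up for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 4x4 image with a vertical edge between columns 1 and 2.
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)
# A hand-written 2x2 filter that responds to dark-to-bright horizontal transitions.
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

pre_activation = conv2d_valid(image, kernel)
x_out = np.maximum(pre_activation, 0.0)  # ReLU activation
print(x_out)
```

The output is nonzero only in the middle column, where the filter straddles the edge. In a trained network the filter values are learned rather than written by hand.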

2.1.b. Pooling layers

On top of the convolutional layers, convolutional neural networks (CNN) also involve pooling layers. Such a layer can be added through the following line

In [ ]:

model.add(MaxPooling2D(pool_size=(filter_sz1, filter_sz2),strides=None))

The pooling layers come with two parameters: the 'pool_size' and the 'strides'. The basic choice for the pool size is (2,2) and the stride is usually set to None, in which case it defaults to the pool size and the image is split into non-overlapping regions (as in the Figure below). You should however feel free to play a little with those parameters. A Max Pooling operator slides a mask of size 'pool_size' over the image, moving it by a number of pixels equal to the stride parameters (one in x and one in y, hence two translation parameters). For each position of the mask, the output is the max of the pixels appearing in the mask (again, see the Figure below). One way to understand the effect of a pooling operator is that when a filter detects an edge in a subregion of the image (thus returning at least one large value), the MaxPooling operation keeps track of this information even though it reduces the resolution.

Adding 'Maxpooling' layers is known to work well in practice for image processing tasks.
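The effect of a (2,2) max pooling with non-overlapping regions can be reproduced in a few lines of NumPy. This is only a sketch for a single-channel feature map. The helper name and the numbers are made up.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling (pool size (2, 2), stride equal to the pool size)."""
    h, w = feature_map.shape
    # Drop a trailing row/column if the size is odd, then group pixels into 2x2 blocks.
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]], dtype=float)
pooled = max_pool_2x2(feature_map)
print(pooled)
```

Each entry of `pooled` is the largest value of one 2x2 block, so the resolution is halved in both directions while the strongest responses survive.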

Although it is up to you to decide how you want to structure the network, a good start is to add a few (definitely no more than 4) combinations (convolution, convolution, pooling) with an increasing number of units per layer (you can for example consider a number of units increasing according to powers of 2 such as 16, 32, 64, 128, ...).

2.1.c. Flattening and fully connected layers

Once you have stacked the convolutional and pooling layers, you should flatten the output through a line of the form

In [ ]:

model.add(Flatten())

Then add a couple (no more than 2 or 3) of dense fully connected layers through lines of the form

In [ ]:

model.add(Dense(num_units, activation='relu'))

2.1.d. Concluding

Since there are four possible signs, you need to finish your network with a dense layer consisting of 4 units. Each of those units should output a number between 0 and 1 representing the probability that the corresponding sign is detected. Correspondingly, those numbers should satisfy $n_1 + n_2 + n_3 + n_4 = 1$ (hopefully with one $n_i$ larger than the others). For this reason, a good choice for the final activation function of those four units is the softmax (Why?).
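As a numerical illustration of the constraint $n_1 + n_2 + n_3 + n_4 = 1$, the sketch below applies a softmax to four made-up scores. The scores and the function are for illustration only; in Keras the same effect is obtained by passing activation='softmax' to the last Dense layer.

```python
import numpy as np

def softmax(scores):
    """Map a vector of real scores to positive numbers that sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Made-up outputs of the last dense layer before the activation, one per sign.
scores = np.array([2.0, 0.5, -1.0, 0.1])
probabilities = softmax(scores)
print(probabilities, probabilities.sum())
```

The largest score receives the largest probability, and the ordering of the scores is preserved.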

Build your model below.

In [ ]:

model = Sequential()

# construct the model using convolutional layers, dense fully connected layers and 

Question 2.2. Setting up the optimizer (3pts).

Once you have found a good architecture for your network, split the dataset, retaining about 90% of the images for training and 10% for testing. To train the network in Keras, we need two more steps. The first step sets up the optimizer. Here again, it is up to you to decide how you want to set up the optimization. Two popular approaches are SGD and Adam. You get to choose the learning rate (although it is a good idea to take it between 1e-3 and 1e-2). Once you have set up the optimizer, you need to specify the loss (we will take it to be the categorical cross entropy, which is the extension of the log loss to the multiclass problem).

In [ ]:

from tensorflow.keras.optimizers import SGD
from tensorflow.keras.optimizers import Adam

# set up the optimizer here, e.g.
# Myoptimizer = SGD(learning_rate=1e-2)
# Myoptimizer = Adam(learning_rate=1e-3)

model.compile(loss='categorical_crossentropy',
              optimizer=Myoptimizer,
              metrics=['accuracy'])
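The 'categorical_crossentropy' loss expects one-hot encoded targets, i.e. one row per image with a single 1 in the column of the correct sign. The sketch below spells out the encoding and the loss in NumPy for three made-up predictions. It assumes labels numbered from 0 to 3 (so folder numbers 1 to 4 would need to be shifted by one), and the helper names are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels 0..num_classes-1 into one-hot rows."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def categorical_cross_entropy(targets, predictions):
    """Mean over samples of -sum_k t_k * log(p_k)."""
    return -np.mean(np.sum(targets * np.log(predictions), axis=1))

# Made-up labels and predicted probabilities for three images and four signs.
labels = np.array([0, 2, 3])
targets = one_hot(labels, 4)
predictions = np.array([[0.7, 0.1, 0.1, 0.1],
                        [0.2, 0.2, 0.5, 0.1],
                        [0.25, 0.25, 0.25, 0.25]])
loss = categorical_cross_entropy(targets, predictions)
print(loss)
```

Only the probability assigned to the correct class enters the loss, so confident correct predictions are rewarded and uniform guesses are penalized.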

Question 2.3. Optimization (2pts).

Our last step consists of fitting the network to the training set. Just as for any estimator in scikit-learn, we will rely on the function 'fit'. In image processing tasks, convolutional neural networks are usually trained by splitting the dataset into minibatches and using a different batch for each SGD iteration. This process is repeated until the whole dataset has been covered; one complete pass over the dataset is known as an 'epoch'. The complete training step then runs over several epochs. In Keras, the number of epochs is set through the 'epochs' parameter of the function 'fit' and the batch size through the 'batch_size' parameter. Plot the evolution of the loss through the SGD iterations.

In [ ]:

from sklearn.model_selection import train_test_split
batch_size = 32
epochs = 30
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.1, random_state=1)
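# keep the returned History object: history.history['loss'] holds the training loss per epoch
history = model.fit(X_train, t_train, batch_size=batch_size, epochs=epochs, validation_split=0.15)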
model.evaluate(X_test, t_test)
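To get a feel for how many SGD iterations the settings above correspond to, the arithmetic can be sketched as follows. The dataset size of 1000 images is purely hypothetical, and the exact rounding used by the libraries when splitting may differ by one sample.

```python
import math

n_images = 1000          # hypothetical dataset size, for illustration only
test_size = 0.1
validation_split = 0.15
batch_size = 32
epochs = 30

n_test = round(n_images * test_size)              # images held out for the final evaluation
n_train = n_images - n_test                       # images passed to fit
n_validation = round(n_train * validation_split)  # held out by fit for validation
n_fit = n_train - n_validation                    # images actually used for gradient updates

steps_per_epoch = math.ceil(n_fit / batch_size)   # the last batch may be smaller
total_updates = steps_per_epoch * epochs
print(n_fit, steps_per_epoch, total_updates)
```

With these numbers each epoch consists of 24 minibatch updates, for 720 updates over 30 epochs.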