**Given date: February 20**

**Due Date: March 8**

**Total: 25pts**

In this assignment you will implement the main regularization approaches as well as cross validation. You will study how the OLS criterion that we used in regression can be extended to classification.

Using the lines below load the dataset 'Assignment2_Ex1_xi' and 'Assignment2_Ex1_ti'. Each of the points in the training set is represented by 5 features $x_{i,1}$, $x_{i,2}, \ldots x_{i,5}$. Among those features we want to find those which are the most meaningful to the description of the targets $t_i$. You can think of the targets as expressing for example the probability to develop a particular trait or disease and the features as encoding the expressivity of particular genes. In such a framework the objective would thus mean finding the genes that most influence the particular trait. For this we will implement a Best Subset Selection approach with cross validation. Complete the cell below by implementing the following steps

**1.** For each number of weights (d=1 to 5) compute all the subsets (beta_i, beta_j, ...) of size d of weights.

**2.** Split the training set in K=5 bins, for each bin k=1,...5, learn the weights by using the linear_regression function of scikit learn (do not reimplement gradient descent except if you really have too much time). Learn the weights on the remaining K-1 bins then comute the MSE on bin k.
**3.** Find the optimal subset of coefficients by comparing the MSE and plot the MSE as a function of the number k of weights by averaging the errors over the size k subsets. I.e MSE(1) = (1/5)(MSE(beta0) + MSE(beta1) + ...MSE(beta4))

In [ ]:

```
import numpy as np
import matplotlib.pyplot as plt
xi = np.load('Assignment2_Ex1_xi.npy')
ti = np.load('Assignment2_Ex1_ti.npy')
D = 5 # number of coefficients
K = 5 # number of bins used for cross validation
# Note that K does not have to be equal to D
# (this is a choice we make here but we could have taken any value for K)
# Step 1: Finding the optimal d
MSE = np.zeros((D,1))
for d in np.arange(0,D-1):
# 1) select each of the (D choose d) subset of coefficient and learn a
# model and compute the MSE
# 2) once you have computed the MSE for each of the
# Step 2 plotting the evolution of the average prediction error as a function of the number of coefficient
```

In this second question, we want to predict admission to graduate school based on a collection of features provided by Kaggle including:

- GRE and TOEFL Scores
- University Rating
- Letter of Recommendation Strength
- Undergraduate GPA
- ...

We want to learn a ridge regression model (use the scikit learn model with the fit and predict functions).

Start by splitting the dataset into a training (about 90%) an a test (remaining 10%) parts using a call to the train_test_split function from the model_selection module. Put the test aside for the rest of the exercise.

Now that you are perfectly comfortable with the idea of cross validation, we will also try to evaluate the optimal lambda in the Ridge regression model. For this, you can use an extension of scikit learn Ridge regression model: sklearn.linear_model.RidgeCV. This extension lets you specify an array of $\lambda$ values ($\alpha$ in scikit learn) to try. The best value is then returned through a call to the 'alpha_' attribute of the model (read the documentation for more details). Train the model (both lambda and beta) on the training subset of item 1.

Finally evaluate the prediction of your model on the 10% test set you kept on the side at the beginning.

In [12]:

```
import pandas as pd
df = pd.read_csv("Admission_Predict.csv")
# put your code here
from sklearn.linear_model import RidgeCV
```

An alternative to the simple OLS criterion or to the Ridge regression model, LASSO regression minimizes a combination of a data fidelity term and a penalty on the sum of the absolute values of the regression coefficients, i.e.

\begin{align} \ell(\boldsymbol \beta ) = \frac{1}{N}\sum_{i=1}^N (t^{(i)} - (\boldsymbol{\beta}^T \boldsymbol x^{(i)}))^2 + \lambda \sum_{j=0}^D |\beta_j|, \quad (\text{LASSO}) \end{align}One of the main difficulty with the LASSO lies in the non differentiability of the absolute value which appears in the regularization term. Because of the use of the absolute value, the gradient cannot be computed at 0. Instead of relying on gradient updates, we can instead turn to the constrained formulation

\begin{align} \min & \quad \ell(\boldsymbol \beta ) = \frac{1}{N}\sum_{i=1}^N (t^{(i)} - (\boldsymbol{\beta}^T \boldsymbol x^{(i)}))^2\\ \text{subject to}& \quad \sum_{j=0}^D |\beta_j|\leq t \end{align}The drawback with such a formulation is that we now have to solve a constrained problem. A common approach relies on the use of thresholding algorithms and in particular to the class of so-called *iterative shrinkage-thresholding algorithms (ISTA)*. If we write the OLS objective in matrix form as $\ell(\boldsymbol \beta) = \frac{1}{2N}\|\tilde{\mathbf{X}}\mathbf{\beta} - \mathbf{t}\|_2^2$, Iterative shrinkage-thresholding algorithms are based on the following update :

where $\mathcal{T}$ is the thresholding operator \begin{align} \mathcal{T}_{\alpha}(\mathbf{\beta})_i = \left(|\beta_i| - \alpha\right)_+\text{sign}(\beta_i) \end{align}

Here $\left(|\beta_i| - \alpha\right)_+ = \max\left\{|\beta_i| - \alpha, 0\right\}$ and $\text{sign}(\beta_i)$ denotes the sign of the coefficient $\beta_i$. From the definition above, you can also see that $\lambda \eta$ acts as a threshold on the $\beta_i$. The larger $\lambda$, the more $\beta_i$ will be set to $0$.

In [ ]:

```
def ISTA(beta_init, lamb, eta, X, t):
# function should apply the Iterative Shrinkage
# Thresholding updates, starting from Beta_init and
# for a set of feature vectors stored in matrix X
# with associated targets stored in t.
return beta_ISTA
```

In [23]:

```
X = np.load('Assignment2_Ex32_Xi.npy')
t = np.load('Assignment2_Ex32_ti.npy')
# put your code here
```

We have seen how the OLS objective can be used to learn a regression model. This objective remains in fact absolutely valid in the classification framework. In binary classification, the targets associated to the feature vectors take one of two values (let us say $1$ and $0$ or $+1$ and $-1$). If we want to learn a model that classifies some feature vectors $\mathbf{x}^{(i)}$ as belonging to class $\mathcal{C}_0$ vs $\mathcal{C}_1$ and we are given a training set $C_{0, \text{tr}} = \left\{\mathbf{x}^{(i)}\right\}_{i=1}^{N_0}$ and $C_{1, \text{tr}} = \left\{\mathbf{x}^{(j)}\right\}_{j=1}^{N_1}$, we can try to learn a separating plane $\beta_0 +\beta_1 x_1 + \ldots \beta_D x_D$ such that $\beta_0 +\beta_1 x^{(i)}_1 + \ldots \beta_D x^{(i)}_D =+1 $ for all $x^{(i)}\in C_0$ and $\beta_0 +\beta_1 x^{(j)}_1 + \ldots \beta_D x^{(j)}_D =-1$ for all $x^{(j)}$ in $\mathcal{C}_1$.

For any new point $\mathbf{x}$ of unknown class, we can then compute $\beta_0 +\beta_1x_1 + \ldots +\beta_D x_D$ and classify our point as belonging to $C_0$ if $\beta_0 +\beta_1x_1 + \ldots +\beta_D x_D>0$.

Combine this idea with the linear regression model from scikit learn to learn a linear binary classifier for the 'Titanic' dataset from Kaggle. Start by loading the training and test data from this dataset and then complete the cell below.

In [ ]:

```
import pandas as pd
from sklearn.linear_model import LinearRegression
training_data = pd.read_csv('train.csv')
# Step 1.
# =========================================================================
# Use the linearRegression model from scikit learn with binary
# targets to predict the passengers that will survive and di in the
# case of the sinking of a ship. Start by turning the class targets to
# binary or +1/-1 values. Then turn possible non numeric features to numbers. Finally
# learn the separating plane.
# Step 2.
# =========================================================================
# Validate your model on the test set and compute the fraction of correctly
# classified samples using the function accuracy_score from the sklearn.metrics module
from sklearn.metrics import accuracy_score
training_data = pd.read_csv('test.csv')
```