## Introduction to Machine Learning: Assignment 2

Given date: Monday September 27

Due date: Friday October 15

#### Question 1 Least Absolute Shrinkage and Selection Operator¶

(13pts)

Learning a model through the OLS loss can be done very efficiently through either gradient descent or even through the Normal equations. The same is true for ridge regression. For the Lasso formulation however, the non differentiability of the absolute value at $0$ makes the learning more tricky.

One approach, known as ISTA (see Amir Beck and Marc Teboulle, A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems) consists in combining traditional gradient descent steps with a projection onto the $\ell_1$ norm ball. Concretely, for the LASSO objective

\begin{align} \ell(\boldsymbol \beta) = \|\boldsymbol X\boldsymbol \beta - \boldsymbol t\|^2_2 + \lambda \|\boldsymbol \beta\|_1 \end{align}

where $\boldsymbol \beta = (\beta_1, \beta_2,\ldots, \beta_D)$ (note that we don't include the bias) and the feature vectors $\left\{\boldsymbol x_i\right\}_{i=1}^N$ (corresponding to the rows of the matrix $\boldsymbol X$) as well as the targets $t_i$ are assumed to be centered, i.e. \begin{align} \boldsymbol x_{ij} \leftarrow \boldsymbol x_{ij}- \frac{1}{N}\sum_{i=1}^{N} x_{ij}\\ t_i \leftarrow t_i - \frac{1}{N}\sum_{i=1}^N t_i \end{align}

(Note that this is equivalent to taking $\beta_0 = \frac{1}{N}\sum_{i=1}^N t_i$) The ISTA update takes the form

\begin{align} \boldsymbol \beta^{k+1} \leftarrow \mathcal{T}_{\lambda \eta} (\boldsymbol \beta^{k} - 2\eta \mathbf{X}^T(\mathbf{X}\mathbf{\beta} - \mathbf{t})) \end{align}

where $\mathcal{T}_{\lambda \eta}(\mathbf{x})_i$ is the thresholding operator defined component-wise as

\begin{align} \mathcal{T}_{\lambda t}(\mathbf{\beta})_i = (|\beta_i| - \lambda t)_+ \text{sign}(\beta_i) \end{align}

In the equations above, $\eta$ is an appropriate step size and $(x)_+ = \max(x, 0)$

##### Question 1.1. (5pts)¶

Complete the function 'ISTA' which must return a final estimate for the regression vector $\mathbf{\beta}$ given a feature matrix $\mathbf{X}$, a target vector $\mathbf{t}$ (the function should include the centering steps for $\mathbf{x}_i$ and $t_i$) regularization weight $\lambda$, and the choice for the learning rate $\eta$.

In [ ]:
def ISTA(beta_init, X, t, lbda, eta):

'''The function takes as input an initial guess for beta, a set
of feature vectors stored in X and their corresponding
targets stored in t, a regularization weight lbda,
step size parameter eta and must return the
regression weights following from the minimization of
the LASSO objective'''

beta_LASSO = np.zeros((np.shape(X), 1))

return beta_LASSO


##### Question 1.2. (3pts)¶

Apply your algorithm to the data (in red) given below for polynomial features up to degree 8-10 and for various values of $\lambda$. Display the result on top of the true model (in blue).

In :
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0,1,20)
xtrue = np.linspace(0,1,100)
t_true = 0.1 + 1.3*xtrue

t = 0.1 + 1.3*x

tnoisy  = t+np.random.normal(0,.1,len(x))

plt.scatter(x, tnoisy, c='r')
plt.plot(xtrue, t_true)

plt.show() ##### Question 1.3 FISTA (3pts)¶

It is possible to improve the ISTA updates by combining them with Nesterov accelerated gradient descent. The resulting update, known as FISTA can read, for a constant step size, by letting $\mathbf{y}^{(1)} = {\boldsymbol \beta}^{(0)}$, $\eta^1 = 1$ and then using (for $k \geq 1$)

\begin{align} \left\{ \begin{array}{l} &\boldsymbol{\beta}^{k} = \text{ISTA}(\mathbf{y}^{k})\\ &\eta^{(k+1)} = \frac{1+\sqrt{1+4(\eta^{(k)})^2}}{2}\\ &\mathbf{y}^{(k+1)} = \mathbf{\beta}^{(k)} + \left(\frac{\eta^{(k)} - 1}{\eta^{(k+1)}}\right)\left({\boldsymbol\beta}^{(k)} - {\boldsymbol\beta}^{(k-1)}\right)\end{array}\right. \end{align}

Here $\text{ISTA}$ denotes a single ISTA update.

Complete the function below so that it performs the FISTA iterations. Then apply it to the data given in question 1.2.

In [ ]:
def FISTA(X, t, eta1, beta0, lbda):

'''function should return the solution to the minimization of the
the LASSO objective ||X*beta - t||_2^2 + lambda*||beta||_1

return beta_LASSO

##### Question 1.4. (2pts)¶

Compare the ISTA and FISTA updates by plotting the evolution of the loss $\ell(\mathbf{\beta})$ as a function of the iterations for both approaches. Take a sufficient number of iterations (1000 - 10,000)

In [ ]:
import matplotlib.pyplot as plt

FISTA_loss = np.zeros((max_iter, 1))
ISTA_loss = np.zeros((max_iter, 1))

plt.plot(FISTA_loss)
plt.plot(ISTA_loss)
plt.show()


#### Question 2. Classification Trees¶

(10pts)

When the data is not easily separable, it remains possible to learn a tree based classifier. For a dataset that consists of pairs $\left\{(\mathbf{x}_i, t_i)\right\}_{i=1, \ldots, N}$ where $\mathbf{x}_i = (x_{i1}, x_{i2})\in \mathbb{R}^2$. We start with all the data in a single class. We then consider a split point $(j,s)$ such that it defines the two regions $R_1(j,s) = \left\{x|x_j\leq s\right\}$ and $R_2(j, s) = \left\{x|x_j>s\right\}$

In general we seek the splitting variables $j$ and split point $s$ that solve \begin{align} \min_{j, s} \left[\min_{c_1} \sum_{\mathbf{x}_i\in R_1(j, s)} (t_i - c_1)^2 + \min_{c_2} \sum_{\mathbf{x}_i\in R_2(j, s)} (t_i - c_2)^2\right] \end{align}

For which the inner minimization is solved by \begin{align} \hat{c}_1 = \text{average}(t_i|\mathbf{x}_i \in R_1(j, s)) \end{align} and \begin{align} \hat{c}_2 = \text{average}(t_i|\mathbf{x}_i \in R_2(j, s)) \end{align}

Having found the best split, we partition the data into the corresponding two regions and repeat the splitting process on each of the two regions. Note that $s$ should always be taken amond the training points.

##### Question 2.1. (10pts)¶

Build a classification tree for the dataset given below with a stopping criterion corresponding to a target that is constant (or almost) across regions. Then plot the result with meshgrid and contourf.

In :
import matplotlib.pyplot as plt
import numpy as np
import scipy.io 