CSCI-UA 9473 Introduction to Machine Learning

Assignment 1: Gradient descent

Given date: Monday September 19

Due date: Monday October 3

Total: 15pts

Question 1. (5pts) Local vs global minimas and gradient descent

We consider the following function.

\begin{align} F(x_1, x_2) = 3(1-x_1)^2\exp(-(x_1^2) - (x_2+1)^2)\\ - 10(x_1/5 - x_1^3 - x_2^5)\exp(-x_1^2-x_2^2)\\ - (1/3)\exp(-(x_1+1)^2 - x_2^2) \end{align}

The surface plot of this function is given below together with its contour plot. The function has a single global minimum located near $(0.23, -1.62)$ and shown in red in the contour plot.

We want to implement gradient descent iterations on that function. Starting from a random initial point $(x_1, x_2)$, code the following updates

\begin{align} x_1^{(k+1)} = x_1^{(k)} - \eta * \text{grad}_{x_1} F(x_1, x_2)\\ x_2^{(k+1)} = x_2^{(k)} - \eta * \text{grad}_{x_2} F(x_1, x_2) \end{align}

where $\text{grad}_{x_i}$ represents the gradient of $F(x_1, x_2)$ with respect to $x_i$. Choose a sufficiently small learning rate and plot the iterates (in white) on the contour plot. Repeat your experiments for various initial iterates.

In [15]:
from mpl_toolkits.mplot3d import Axes3D  

import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import numpy as np

fig = plt.figure()
ax = fig.gca(projection='3d')

# Make data.
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
x1, x2 = np.meshgrid(x, y)
F = 3*(1-x1)**2 * np.exp(-(x1**2) - (x2+1)**2)\
   - 10*(np.true_divide(x1,5) - x1**3 - x2**5)*np.exp(-x1**2 - x2**2)\
   - np.true_divide(1,3)*np.exp(-(x1+1)**2 - x2**2)

# Plot the surface.
surf = ax.plot_surface(x1, x2, F, linewidth=0, alpha=1, cmap = 'viridis')
In [27]:
fig1, ax = plt.subplots(constrained_layout=True)
contour = ax.contourf(x1, x2, F,cmap = 'viridis')
plt.scatter(0.23, -1.62,c='r',marker='X')
In [ ]:
# put your solution here

Question 2. (5pts) Regression through the normal equations

We consider the simple regression problem below, similar to the one discussed in class. Find the model that minimizes the sum of squares loss

\begin{align} \ell(\boldsymbol \beta) = \frac{1}{2N}\sum_{i=1}^N (t_{\text{noisy}}^{(i)} - (\beta_0 +\beta_1 x^{(i)}))^2 \end{align}

using the Normal Equations. To do this:

  • Start by building the matrix $\tilde{\boldsymbol X}$ with \begin{align} \tilde{\boldsymbol X} = \left[\begin{array}{cc} 1 & x^{(1)} \\ \vdots & \vdots \\ 1 & x^{(N)} \end{array}\right] \end{align}
  • Then compute the matrix $\tilde{\boldsymbol X}^T\tilde{\boldsymbol X}$ and the vector $\tilde{\boldsymbol X}^T\boldsymbol t$ where $\boldsymbol t = [t_{\text{noisy}}^{(1)}, \ldots , t^{(N)}_{\text{noisy}}]^T$

  • Finally solve the equations through

\begin{align} \boldsymbol \beta_{\text{OLS}} = \left(\tilde{\boldsymbol X}^T\tilde{\boldsymbol X}\right)^{-1}(\tilde{\boldsymbol X}^T\boldsymbol t) \end{align}

using the function np.linalg.inv from the linear algebra package. Plot the result in green on top of the plot below and compare with the true (blue) (unknown) model.

In [29]:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0,5,10)

noise = np.random.normal(0,.3,len(x))
beta_true = np.random.normal(0,1,2)

t = beta_true[0] + beta_true[1]*x

tnoisy = t+noise

plt.scatter(x, tnoisy, c='r')
plt.plot(x, t)
In [ ]:
# put your code here

Question 3 [5pts]. Successive orthogonalization

We once again consider the dataset from Question 2 but this time with 2 additional features $x_2$ and $x_3$ given below

In [33]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(0,5,10)

Xtilde = np.vstack((np.ones((1, len(x))), x))
Xtilde = np.vstack((Xtilde, 2*x))
Xtilde = np.vstack((Xtilde, 4*x +1)).T

noise = np.random.normal(0,.3,len(x))
beta_true = np.random.normal(0,1,4)

t = beta_true[0] + beta_true[1]*Xtilde[:,0] + beta_true[2]*Xtilde[:,1] +\

tnoisy = t+noise

Question 3a [1pt]

Using the linalg package from numpy, check the determinant of the matrix $\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$. What do you notice?

Question 3b [4pts]

We now want to rely on successive orthogonalization to derive the regression coefficients $\mathbf{\beta} = [\beta_0, \beta_1,\ldots \beta_D]$ where $D=3$ of a linear model $y(x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_D x_D$.

Let $\tilde{\mathbf{X}}$ denote the feature matrix, with first column being the vector $[1,1,...,1]$, second column being the first feature $\mathbf{x}_1$, third column being the second feature $\mathbf{x}_2$ and so on.

Starting with $\mathbf{z}_0 = \mathbf{x}_0$ (the first column of $\tilde{\mathbf{X}}$), successive orthogonalization of the columns $\mathbf{x}_j$ of $\tilde{\mathbf{X}}$ can be encoded as follows

  1. Initialize $\mathbf{z}_0 = \mathbf{x}_0 = [1,1,\ldots, 1]^T$
  1. For j=1,2,..., D

    Compute the coefficient $\hat{\gamma}_{\ell,j}$ for $\ell=0,..., j-1$ as

    $$\hat{\gamma}_{\ell, j} = \frac{\langle \mathbf{z}_\ell, \mathbf{x}_j\rangle }{\langle \mathbf{z}_\ell, \mathbf{z}_\ell\rangle}$$

    And define the new column $\mathbf{z}_j$ as

    $$\mathbf{z}_j = \mathbf{x}_j - \sum_{k=0}^{j-1}\hat{\gamma}_{kj}\mathbf{z}_k\quad \quad (*)$$

Once we have all the $\mathbf{z}_j's$ we can compute the projection of the (noisy) target vector $\mathbf{t}_{\varepsilon}$ ('t_noisy') onto each $\mathbf{z}_j$ as $\alpha_j = \langle \mathbf{z}_j, \mathbf{t}_{\varepsilon}\rangle/\langle\mathbf{z}_j,\mathbf{z}_j \rangle$

This then gives us a regression model on the $\mathbf{z}_j$:

\begin{align*} y(\mathbf{z}) = \alpha_0 + \alpha_1 z_1 + \ldots + \alpha_D z_D \end{align*}

where each $z_j$ is a function of the original features (given by the recursion $(*)$)

What we want however is a linear model _on the original features $x_j$_ (not the $z_j$)

Since the $\mathbf{z}_{D}$ is the only $\mathbf{z}_j$ that contains $\mathbf{c}_{D}$, $\beta_{D}$ is given directly from $\alpha_D$ (i.e. $\beta_D = \alpha_D$)

$$\beta_{D} = \alpha_D = \frac{\langle \mathbf{z}_{D}, \mathbf{t}_{\varepsilon}\rangle }{\langle\mathbf{z}_{D},\mathbf{z}_{D} \rangle} $$

The remaining $\beta_j$ can be derived by using the recursion (*) on the $\mathbf{z}_j$, projecting the target vector $\mathbf{t}_{\varepsilon}$ on $\mathbf{z}_j$ and expressing $\mathbf{z}_j$ from the recursion as a linear combination of the $\mathbf{c}_j$

Implement successive orthogonalization on the $3$ feature matrix given above and returns the vector of regression weights $\mathbf{\beta}$.

In [ ]: