Given date : March 30
Due date : April 17
This exercise contains a pen-and-paper part and a coding part. You should submit the pen-and-paper part either in LaTeX or as a picture of your written solution, attached to the Assignment folder.
We consider the dataset given below. This dataset was generated from a Gaussian distribution with a given mean $\mathbf{\mu} = (\mu_1, \mu_2)$ and covariance matrix $\mathbf{\Sigma} = \left[\begin{array}{cc} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{array}\right]$. We would like to recover the mean and variance from the data. In order to do this, use the following steps:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy.io import loadmat
X = loadmat('dataNotebook1_Ex1.mat')['X']
plt.scatter(X[:,0], X[:,1])
plt.show()
x1 = np.linspace(0, 1.85, 100)
x2 = np.linspace(0.25, 2.5, 100)
xx1, xx2 = np.meshgrid(x1, x2)
from scipy.stats import multivariate_normal
xmesh = np.vstack((xx1.flatten(), xx2.flatten())).T
mu1 = # complete with your value
mu2 = # complete with your value
sigma1 = # complete with your value
sigma2 = # complete with your value
sigma = np.zeros((2,2))
sigma[0,0] = sigma1**2
sigma[1,1] = sigma2**2
y = multivariate_normal.pdf(xmesh, mean=[mu1, mu2], cov=sigma)
plt.scatter(X[:,0], X[:,1])
plt.contourf(xx1, xx2, np.reshape(y, (100, 100)), cmap=cm.viridis, alpha=0.5)
plt.show()
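For the pen-and-paper part, recall that the maximum-likelihood estimates of a Gaussian's mean and (diagonal) variances are simply the sample mean and sample variance. A minimal numpy sketch on synthetic data (the "true" values 1.0, 1.5, 0.3, 0.4 below are made up for illustration, not the values hidden in the dataset):

```python
import numpy as np

# Synthetic 2D Gaussian data with a diagonal covariance (illustrative values only)
rng = np.random.default_rng(0)
mu_true = np.array([1.0, 1.5])
sigma_true = np.array([0.3, 0.4])
X = rng.normal(loc=mu_true, scale=sigma_true, size=(5000, 2))

# Maximum-likelihood estimates: sample mean and sample standard deviation
mu_hat = X.mean(axis=0)     # estimates (mu1, mu2)
sigma_hat = X.std(axis=0)   # estimates (sigma1, sigma2); ddof=0 is the MLE

print(mu_hat, sigma_hat)
```

With enough samples these estimates land close to the generating parameters, which is exactly the check the contour plot above performs visually.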
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
X = loadmat('MidTermAssignment_dataEx2.mat')['MidTermAssignment_dataEx2']
plt.scatter(X[:,0], X[:,1])
plt.show()
Solve the $\ell_2$ regularized linear regression problem through the normal equations (be careful to take the $\ell_2$ regularization into account). Then double-check your solution by comparing it with the regression function from scikit-learn. Plot the result below.
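As a sketch of the normal-equations route (on made-up data, with a hypothetical $\lambda = 0.1$ and no intercept for simplicity): the regularized solution is $\mathbf{\beta} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{t}$, and one can verify it satisfies the optimality condition $\mathbf{X}^T(\mathbf{X}\mathbf{\beta} - \mathbf{t}) + \lambda\mathbf{\beta} = \mathbf{0}$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))   # made-up design matrix
t = rng.normal(size=50)        # made-up targets
lam = 0.1                      # hypothetical regularization strength

# Regularized normal equations: beta = (X^T X + lam*I)^(-1) X^T t
beta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ t)

# Sanity check: the gradient of the ridge objective vanishes at beta
grad = X.T @ (X @ beta - t) + lam * beta
print(np.max(np.abs(grad)))    # should be ~0
```

Scikit-learn's `Ridge(alpha=lam, fit_intercept=False)` should return the same coefficients, which is the double-check the exercise asks for.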
2.3. Kernel Ridge regression. Given the 'Normal Equations' solution to the regularized regression model, we now want to turn the regression model into a formulation over kernels.
2.3.1. Start by showing (one line) that this solution can be written as
$$\mathbf{\beta} = \mathbf{X}^T\left(\mathbf{K} + \lambda\mathbf{I}_N\right)^{-1}\mathbf{t}$$where $\mathbf{K}$ is the kernel matrix defined from the scalar product of the prototypes, i.e. $\mathbf{K}_{i,j} = \kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = (\mathbf{x}^{(i)})^T(\mathbf{x}^{(j)})$.
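The identity behind 2.3.1 is the matrix fact $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_d)^{-1}\mathbf{X}^T = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I}_N)^{-1}$, which can be checked numerically on random data (dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 30, 5
X = rng.normal(size=(N, d))
t = rng.normal(size=N)
lam = 0.5   # hypothetical lambda

# Primal solution: (X^T X + lam I_d)^(-1) X^T t
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

# Dual solution: X^T (K + lam I_N)^(-1) t with K = X X^T (linear kernel)
K = X @ X.T
beta_dual = X.T @ np.linalg.solve(K + lam * np.eye(N), t)

print(np.allclose(beta_primal, beta_dual))   # True
```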
2.3.2. Given this, the classifier can be written as $f(\mathbf{x}) = \mathbf{\beta}^T\mathbf{x} = \sum_{i=1}^N \alpha_i \kappa(\mathbf{x}, \mathbf{x}_i)$. What are the coefficients $\alpha_i$ in this case?
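In the same toy setting, taking $\mathbf{\alpha} = (\mathbf{K} + \lambda\mathbf{I}_N)^{-1}\mathbf{t}$, the kernel prediction $\sum_i \alpha_i \kappa(\mathbf{x}, \mathbf{x}_i)$ matches the linear prediction $\mathbf{\beta}^T\mathbf{x}$; a numerical check with a linear kernel:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 30, 5
X = rng.normal(size=(N, d))
t = rng.normal(size=N)
lam = 0.5   # hypothetical lambda

K = X @ X.T                                   # linear kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(N), t)
beta = X.T @ alpha                            # beta = X^T alpha

x_new = rng.normal(size=d)                    # a new test point
pred_kernel = np.sum(alpha * (X @ x_new))     # sum_i alpha_i * kappa(x_new, x_i)
pred_linear = beta @ x_new                    # beta^T x

print(np.allclose(pred_kernel, pred_linear))  # True
```

The point of the dual form is that only inner products $\kappa(\mathbf{x}, \mathbf{x}_i)$ are needed, so the linear kernel can be swapped for any other kernel, such as the text kernel below.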
2.3.3. We will apply this idea to text data. Using kernels with text data is interesting because it is usually easier to compare documents than to find appropriate features to represent them. The file 'headlines_train.txt' contains a few headlines, some of them about finance, others about weather forecasting. Use the first group of lines below to load those headlines and their associated targets (1/0).
# Start by loading the file using the lines below
import numpy as np
f = open('headlines_train.txt', "r")
lines = f.readlines()
f.close()
sentences = ['Start']
target = [0]
for l in np.arange(len(lines)-2):
    if l%2 == 0:
        lines_tmp = lines[l]
        lines_tmp = lines_tmp[:-1]
        sentences.append(lines_tmp)
        if lines_tmp[-1] == ' ':
            target.append(float(lines_tmp[-2]))
        else:
            target.append(float(lines_tmp[-1]))
sentences = sentences[1:]
target = target[1:]
2.3.4. Now use the lines below to define the kernel. The kernel is built by generating a TF-IDF vector for each sentence and comparing those vectors through a cosine similarity measure. The variable 'kernel' contains the kernel matrix, i.e. $\kappa(i,j) = \frac{\phi_i^T\phi_j}{\|\phi_i\|\|\phi_j\|}$ where $\phi_i$ encodes the TF-IDF vector of sentence $i$.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(max_features=100, stop_words='english', decode_error='ignore')
tfidf = vect.fit_transform(sentences)
from sklearn.metrics.pairwise import cosine_similarity
kernel = cosine_similarity(tfidf)
import matplotlib.pyplot as plt
plt.imshow(kernel)
plt.show()
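As a sanity check that `cosine_similarity` implements $\kappa(i,j) = \phi_i^T\phi_j/(\|\phi_i\|\|\phi_j\|)$, the same kernel can be recomputed with plain numpy: normalize each row to unit length, then take the Gram matrix (the vectors below are made up stand-ins for TF-IDF rows):

```python
import numpy as np

rng = np.random.default_rng(4)
Phi = np.abs(rng.normal(size=(6, 10)))   # made-up nonnegative "TF-IDF-like" vectors

# Normalize each row to unit length; the kernel is then a plain Gram matrix
norms = np.linalg.norm(Phi, axis=1, keepdims=True)
Phi_n = Phi / norms
kernel_np = Phi_n @ Phi_n.T

print(np.allclose(np.diag(kernel_np), 1.0))   # each sentence is fully similar to itself
```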
2.3.5. Once you have the kernel matrix, compute the weights $\alpha$ of the classifier $y(\mathbf{x}) = \sum_{i\in \mathcal{D}}\alpha_i \kappa(\mathbf{x}, \mathbf{x}_i)$.
# compute the alpha weights
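A possible sketch for this step, assuming the training kernel matrix and targets are stored in `kernel` and `target` as above (here rebuilt as a small synthetic stand-in, with a hypothetical $\lambda = 0.1$):

```python
import numpy as np

# Synthetic stand-in for the training kernel and targets
rng = np.random.default_rng(5)
Phi = np.abs(rng.normal(size=(8, 12)))
Phi = Phi / np.linalg.norm(Phi, axis=1, keepdims=True)
kernel = Phi @ Phi.T                        # cosine-similarity kernel, as in the notebook
target = rng.integers(0, 2, size=8).astype(float)

lam = 0.1                                   # hypothetical regularization strength
alpha = np.linalg.solve(kernel + lam * np.eye(len(target)), target)

# Check: alpha solves (K + lam I) alpha = t
print(np.allclose((kernel + lam * np.eye(len(target))) @ alpha, target))
```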
2.3.6. Now that you have the weights, we want to apply the classifier to a few new headlines. Those headlines are stored in the file 'headlines_test.txt'. Use the lines below to load those sentences and compute their TF-IDF representation, which will then be fed to the classifier $y(\mathbf{x}) = \sum_{i\in \mathcal{D}}\alpha_i \kappa(\mathbf{x}, \mathbf{x}_i)$.
# Start by loading the file using the lines below
import numpy as np
f = open('headlines_test.txt', "r")
lines = f.readlines()
f.close()
sentences_test = ['Start']
for l in np.arange(len(lines)):
    if l%2 == 0:
        lines_tmp = lines[l]
        lines_tmp = lines_tmp[:-1]
        sentences_test.append(lines_tmp)
sentences_test = sentences_test[1:]
tfidf_test = vect.transform(sentences_test)
test_F = np.hstack((tfidf_test.todense(), np.zeros((4, 100-np.shape(tfidf_test.todense())[1]))))
2.3.7. Once you have the TF-IDF representations stored in the matrix test_F (size 4 by 100 features), you need the values $\kappa(\mathbf{x}, \mathbf{x}_i)$ to evaluate the final classifier $y(\mathbf{x}) = \sum_{i\in \mathcal{D}}\alpha_i \kappa(\mathbf{x}, \mathbf{x}_i)$ and hence the target of each new sentence. To get them, compute the cosine similarity of the new "test" TF-IDF vectors with the "training" TF-IDF vectors which you computed earlier. Each of those cosine similarities gives you an entry of $\kappa(\mathbf{x}, \mathbf{x}_i)$ (here $\mathbf{x}$ denotes any of the fixed test sentences). Once you have those similarities, compute the target from your $\alpha$ values as $t(\mathbf{x}) = \sum_{i\in \text{train}} \alpha_i\kappa(\mathbf{x}, \mathbf{x}_i)$. Print those targets below.
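Under the same synthetic stand-in (hypothetical train/test TF-IDF matrices `train_F`, `test_F` and weights `alpha`, none of which are the real notebook values), the test-time step can be sketched as: normalize rows, form the test-versus-train cosine similarities, and multiply by $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(6)
train_F = np.abs(rng.normal(size=(8, 100)))   # stand-in for the training TF-IDF matrix
test_F = np.abs(rng.normal(size=(4, 100)))    # stand-in for the 4-by-100 test matrix
alpha = rng.normal(size=8)                    # stand-in for the learned weights

# Cosine similarity of each test vector with each training vector
train_n = train_F / np.linalg.norm(train_F, axis=1, keepdims=True)
test_n = test_F / np.linalg.norm(test_F, axis=1, keepdims=True)
K_test = test_n @ train_n.T                   # shape (4, 8): entries kappa(x, x_i)

# Predicted targets: t(x) = sum_i alpha_i * kappa(x, x_i)
t_pred = K_test @ alpha
print(t_pred.shape)                           # one target per test sentence
```

With the real data, thresholding `t_pred` at 0.5 would recover the finance/weather (1/0) labels.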