Given date : March 30
Due date : April 17
This exercise contains a pen-and-paper part and a coding part. You should submit the pen-and-paper part either in LaTeX or as a picture of your written solution, attached to the Assignment folder.
We consider the dataset given below. This dataset was generated from a Gaussian distribution with a given mean $\mathbf{\mu} = (\mu_1, \mu_2)$ and covariance matrix $\mathbf{\Sigma} = \left[\begin{array}{cc} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{array}\right]$. We would like to recover the mean and variance from the data. In order to do this, use the following steps:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy.io import loadmat
X = loadmat('dataNotebook1_Ex1.mat')['X']
plt.scatter(X[:,0], X[:,1])
plt.show()
x1 = np.linspace(0, 1.85, 100)
x2 = np.linspace(0.25, 2.5, 100)
xx1, xx2 = np.meshgrid(x1, x2)
from scipy.stats import multivariate_normal
xmesh = np.vstack((xx1.flatten(), xx2.flatten())).T
mu1 = # complete with your value
mu2 = # complete with your value
sigma1 = # complete with your value
sigma2 = # complete with your value
sigma = np.zeros((2,2))
sigma[0,0] = sigma1**2
sigma[1,1] = sigma2**2
y = multivariate_normal.pdf(xmesh, mean=[mu1, mu2], cov=sigma)
plt.scatter(X[:,0], X[:,1])
plt.contourf(xx1, xx2, np.reshape(y, (100, 100)), cmap=cm.viridis, alpha=0.5)
plt.show()
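For the pen-and-paper part, recall that the maximum-likelihood estimates of a Gaussian's mean and (diagonal) variances are simply the sample mean and sample variance. A minimal numpy sketch on synthetic data (the "true" values 1.0, 1.5, 0.3, 0.4 below are made up for illustration, not the values hidden in the dataset):

```python
import numpy as np

# Synthetic 2D Gaussian data with a diagonal covariance (illustrative values only)
rng = np.random.default_rng(0)
mu_true = np.array([1.0, 1.5])
sigma_true = np.array([0.3, 0.4])
X = rng.normal(loc=mu_true, scale=sigma_true, size=(5000, 2))

# Maximum-likelihood estimates: sample mean and sample standard deviation
mu_hat = X.mean(axis=0)     # estimates (mu1, mu2)
sigma_hat = X.std(axis=0)   # estimates (sigma1, sigma2); ddof=0 is the MLE

print(mu_hat, sigma_hat)
```

With enough samples these estimates land close to the generating parameters, which is exactly the check the contour plot above performs visually.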
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
X = loadmat('MidTermAssignment_dataEx2.mat')['MidTermAssignment_dataEx2']
plt.scatter(X[:,0], X[:,1])
plt.show()
Solve the $\ell_2$ regularized linear regression problem through the normal equations (be careful to take the $\ell_2$ regularization into account). Then double-check your solution by comparing it with the regression function from scikit-learn. Plot the result below.
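As a sketch of the normal-equations route (on made-up data, with a hypothetical $\lambda = 0.1$ and no intercept for simplicity): the regularized solution is $\mathbf{\beta} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{t}$, and one can verify it satisfies the optimality condition $\mathbf{X}^T(\mathbf{X}\mathbf{\beta} - \mathbf{t}) + \lambda\mathbf{\beta} = \mathbf{0}$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))   # made-up design matrix
t = rng.normal(size=50)        # made-up targets
lam = 0.1                      # hypothetical regularization strength

# Regularized normal equations: beta = (X^T X + lam*I)^(-1) X^T t
beta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ t)

# Sanity check: the gradient of the ridge objective vanishes at beta
grad = X.T @ (X @ beta - t) + lam * beta
print(np.max(np.abs(grad)))    # should be ~0
```

Scikit-learn's `Ridge(alpha=lam, fit_intercept=False)` should return the same coefficients, which is the double-check the exercise asks for.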
2.3. Kernel Ridge regression. Given the 'Normal Equations' solution to the regularized regression model, we now want to turn the regression model into a formulation over kernels.
2.3.1. Start by showing (one line) that this solution can be written as
$$\mathbf{\beta} = \mathbf{X}^T\left(\mathbf{K} + \lambda\mathbf{I}_N\right)^{-1}\mathbf{t}$$where $\mathbf{K}$ is the kernel matrix defined from the scalar product of the prototypes, i.e. $\mathbf{K}_{i,j} = \kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = (\mathbf{x}^{(i)})^T(\mathbf{x}^{(j)})$.
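The identity behind 2.3.1 is the matrix fact $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_d)^{-1}\mathbf{X}^T = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I}_N)^{-1}$, which can be checked numerically on random data (dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 30, 5
X = rng.normal(size=(N, d))
t = rng.normal(size=N)
lam = 0.5   # hypothetical lambda

# Primal solution: (X^T X + lam I_d)^(-1) X^T t
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

# Dual solution: X^T (K + lam I_N)^(-1) t with K = X X^T (linear kernel)
K = X @ X.T
beta_dual = X.T @ np.linalg.solve(K + lam * np.eye(N), t)

print(np.allclose(beta_primal, beta_dual))   # True
```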
2.3.2. Given this, the classifier can be written as $f(\mathbf{x}) = \mathbf{\beta}^T\mathbf{x} = \sum_{i=1}^N \alpha_i \kappa(\mathbf{x}, \mathbf{x}_i)$. What are the coefficients $\alpha_i$ in this case?
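In the same toy setting, taking $\mathbf{\alpha} = (\mathbf{K} + \lambda\mathbf{I}_N)^{-1}\mathbf{t}$, the kernel prediction $\sum_i \alpha_i \kappa(\mathbf{x}, \mathbf{x}_i)$ matches the linear prediction $\mathbf{\beta}^T\mathbf{x}$; a numerical check with a linear kernel:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 30, 5
X = rng.normal(size=(N, d))
t = rng.normal(size=N)
lam = 0.5   # hypothetical lambda

K = X @ X.T                                   # linear kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(N), t)
beta = X.T @ alpha                            # beta = X^T alpha

x_new = rng.normal(size=d)                    # a new test point
pred_kernel = np.sum(alpha * (X @ x_new))     # sum_i alpha_i * kappa(x_new, x_i)
pred_linear = beta @ x_new                    # beta^T x

print(np.allclose(pred_kernel, pred_linear))  # True
```

The point of the dual form is that only inner products $\kappa(\mathbf{x}, \mathbf{x}_i)$ are needed, so the linear kernel can be swapped for any other kernel, such as the text kernel below.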
2.3.3. We will apply this idea to text data. Using kernels with text data is interesting because it is usually easier to compare documents than to find appropriate features to represent them. The file 'headlines_train.txt' contains a few headlines, some of them about finance, others about weather forecasting. Use the first group of lines below to load those headlines and their associated targets (1/0).
# Start by loading the file using the lines below
import numpy as np
f = open('headlines_train.txt', "r")
lines = f.readlines()
f.close()
sentences = ['Start']
target = [0]
for l in np.arange(len(lines)-2):
    if l%2 == 0:
        lines_tmp = lines[l]
        lines_tmp = lines_tmp[:-1]
        sentences.append(lines_tmp)
        if lines_tmp[-1] == ' ':
            target.append(float(lines_tmp[-2]))
        else:
            target.append(float(lines_tmp[-1]))
sentences = sentences[1:]
target = target[1:]
2.3.4. Now use the lines below to define the kernel. The kernel is built by generating a TF-IDF vector for each sentence and comparing those vectors through a cosine similarity measure. The variable 'kernel' contains the kernel matrix, i.e. $\kappa(i,j) = \frac{\phi_i^T\phi_j}{\|\phi_i\|\|\phi_j\|}$ where $\phi_i$ encodes the TF-IDF vector of sentence $i$.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(max_features=100, stop_words='english', decode_error='ignore')
tfidf = vect.fit_transform(sentences)
from sklearn.metrics.pairwise import cosine_similarity
kernel = cosine_similarity(tfidf)
import matplotlib.pyplot as plt
plt.imshow(kernel)
plt.show()
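As a sanity check that `cosine_similarity` implements $\kappa(i,j) = \phi_i^T\phi_j/(\|\phi_i\|\|\phi_j\|)$, the same kernel can be recomputed with plain numpy: normalize each row to unit length, then take the Gram matrix (the vectors below are made up stand-ins for TF-IDF rows):

```python
import numpy as np

rng = np.random.default_rng(4)
Phi = np.abs(rng.normal(size=(6, 10)))   # made-up nonnegative "TF-IDF-like" vectors

# Normalize each row to unit length; the kernel is then a plain Gram matrix
norms = np.linalg.norm(Phi, axis=1, keepdims=True)
Phi_n = Phi / norms
kernel_np = Phi_n @ Phi_n.T

print(np.allclose(np.diag(kernel_np), 1.0))   # each sentence is fully similar to itself
```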
2.3.5. Once you have the kernel matrix, compute the weights $\alpha$ of the classifier $y(\mathbf{x}) = \sum_{i\in \mathcal{D}}\alpha_i \kappa(\mathbf{x}, \mathbf{x}_i)$.
# compute the alpha weights
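A possible sketch for this step, assuming the training kernel matrix and targets are stored in `kernel` and `target` as above (here rebuilt as a small synthetic stand-in, with a hypothetical $\lambda = 0.1$):

```python
import numpy as np

# Synthetic stand-in for the training kernel and targets
rng = np.random.default_rng(5)
Phi = np.abs(rng.normal(size=(8, 12)))
Phi = Phi / np.linalg.norm(Phi, axis=1, keepdims=True)
kernel = Phi @ Phi.T                        # cosine-similarity kernel, as in the notebook
target = rng.integers(0, 2, size=8).astype(float)

lam = 0.1                                   # hypothetical regularization strength
alpha = np.linalg.solve(kernel + lam * np.eye(len(target)), target)

# Check: alpha solves (K + lam I) alpha = t
print(np.allclose((kernel + lam * np.eye(len(target))) @ alpha, target))
```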
2.3.6. Now that you have the weights, we want to apply the classifier to a few new headlines. Those headlines are stored in the file 'headlines_test.txt'. Use the lines below to load those sentences and compute their TF-IDF representation, which will then be fed to the classifier $y(\mathbf{x}) = \sum_{i\in \mathcal{D}}\alpha_i \kappa(\mathbf{x}, \mathbf{x}_i)$.
# Start by loading the file using the lines below
import numpy as np
f = open('headlines_test.txt', "r")
lines = f.readlines()
f.close()
sentences_test = ['Start']
for l in np.arange(len(lines)):
    if l%2 == 0:
        lines_tmp = lines[l]
        lines_tmp = lines_tmp[:-1]
        sentences_test.append(lines_tmp)
sentences_test = sentences_test[1:]
tfidf_test = vect.transform(sentences_test)
test_F = np.hstack((tfidf_test.todense(), np.zeros((4, 100-np.shape(tfidf_test.todense())[1]))))
2.3.7. Once you have the TF-IDF representations stored in the matrix test_F (size 4 by 100 features), you need the values $\kappa(\mathbf{x}, \mathbf{x}_i)$ to evaluate the final classifier $y(\mathbf{x}) = \sum_{i\in \mathcal{D}}\alpha_i \kappa(\mathbf{x}, \mathbf{x}_i)$ and hence the target of each new sentence. To get them, compute the cosine similarity of the new "test" TF-IDF vectors with the "training" TF-IDF vectors which you computed earlier. Each of those cosine similarities gives you an entry of $\kappa(\mathbf{x}, \mathbf{x}_i)$ (here $\mathbf{x}$ denotes any of the fixed test sentences). Once you have those similarities, compute the target from your $\alpha$ values as $t(\mathbf{x}) = \sum_{i\in \text{train}} \alpha_i\kappa(\mathbf{x}, \mathbf{x}_i)$. Print those targets below.
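Under the same synthetic stand-in (hypothetical train/test TF-IDF matrices `train_F`, `test_F` and weights `alpha`, none of which are the real notebook values), the test-time step can be sketched as: normalize rows, form the test-versus-train cosine similarities, and multiply by $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(6)
train_F = np.abs(rng.normal(size=(8, 100)))   # stand-in for the training TF-IDF matrix
test_F = np.abs(rng.normal(size=(4, 100)))    # stand-in for the 4-by-100 test matrix
alpha = rng.normal(size=8)                    # stand-in for the learned weights

# Cosine similarity of each test vector with each training vector
train_n = train_F / np.linalg.norm(train_F, axis=1, keepdims=True)
test_n = test_F / np.linalg.norm(test_F, axis=1, keepdims=True)
K_test = test_n @ train_n.T                   # shape (4, 8): entries kappa(x, x_i)

# Predicted targets: t(x) = sum_i alpha_i * kappa(x, x_i)
t_pred = K_test @ alpha
print(t_pred.shape)                           # one target per test sentence
```

With the real data, thresholding `t_pred` at 0.5 would recover the finance/weather (1/0) labels.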