What is the rank of the matrix $\begin{pmatrix}1 & 1 & 1 \\ 2 & 2 & 2 \\ 1 & 2 & 1\end{pmatrix}$? Why? Find its skeleton decomposition.
Find the rank of the matrix $a_{ij} = (i+j)^2$, $i,j=1,\dots,n$. Find its skeleton decomposition. Note: use the interpretation of the rank as the minimal number of summands with separated variables $i$ and $j$.
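As a quick numerical sanity check (not a substitute for the proof), the matrix can be built with NumPy broadcasting and its rank confirmed:

```python
import numpy as np

n = 6                               # any n >= 3 exhibits the general pattern
i = np.arange(1, n + 1)
A = (i[:, None] + i[None, :]) ** 2  # a_ij = (i + j)^2 via broadcasting

# (i + j)^2 = i^2 + 2*i*j + j^2 is a sum of three terms with
# separated variables, so the rank cannot exceed 3
print(np.linalg.matrix_rank(A))     # → 3
```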
Interesting remark: matrix norms of the form $$\|A\|^{\text{Schatten}}_p = \left(\sigma_1^p(A) + \dots + \sigma_r^p(A)\right)^{1/p}$$ are called Schatten norms, where $\sigma_i(A)$ are the singular values of $A$. Important cases are $p=2$ - the Frobenius norm, $p=\infty$ - the second (spectral) norm and $p=1$ - the nuclear norm.
Prove that $\|A\|_2 = \sigma_1(A)$ and $\|A\|_F = \sqrt{\sigma_1^2(A) + \dots + \sigma_r^2(A)}$. Note: unitary invariance of the second and Frobenius norms might be helpful.
Find the distance between $A$ and its truncated SVD in the second and in the Frobenius norms.
Suppose $A^* = A$. Find the exact relation between the eigenvalues and singular values of $A$.
Is the matrix $\begin{pmatrix}1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 1\end{pmatrix}$ diagonalizable? Why?
Prove that $\lambda(A)\in \mathbb{R}^1$ if $A$ is Hermitian, $\lambda(A)$ is purely imaginary if $A$ is skew-Hermitian, and $|\lambda(A)|=1$ if $A$ is unitary.
Is it possible to solve a 1 000 000 $\times$ 1 000 000 dense linear system on your laptop within one month via the LU decomposition, provided that the LU decomposition is not given?
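For intuition, a rough back-of-the-envelope estimate can be scripted; the sustained flop rate below is an assumed, hypothetical figure for a laptop:

```python
# Dense LU factorization costs about (2/3) * n^3 flops.
n = 1_000_000
flops = (2.0 / 3.0) * n ** 3        # ~6.7e17 flops
rate = 1e10                         # assumed ~10 Gflop/s sustained (hypothetical)
days = flops / rate / 86400
print(round(days))                  # hundreds of days, well over a month

# Memory is an even harder limit: just storing the matrix takes
memory_tb = n ** 2 * 8 / 1e12       # 8 TB in double precision
```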
What is the complexity of solving a system with a matrix given by its LU or QR decomposition? Explain the answer.
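To illustrate the point: once the factors are known, each new right-hand side costs only two triangular substitutions. A minimal SciPy sketch (the matrix and sizes here are arbitrary test data):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)

lu, piv = lu_factor(A)        # O(n^3), done once
x = lu_solve((lu, piv), b)    # O(n^2) per right-hand side
print(np.allclose(A @ x, b))
```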
Implement the QR factorization using Householder reflections and the modified Gram-Schmidt algorithm. Make sure that they work on a random matrix. To measure orthogonality use $\|G - I\|$ as in Problem Set 1.
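A minimal sketch of both routines (NumPy only; edge cases such as rank-deficient columns are ignored, and the function names are this sketch's own):

```python
import numpy as np

def householder_qr(A):
    """QR via Householder reflections: R = H_{n-1}...H_0 A, Q = H_0...H_{n-1}."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])  # sign choice avoids cancellation
        v /= np.linalg.norm(v)
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])   # apply reflector from the left
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)     # accumulate Q from the right

    return Q, R

def mgs_qr(A):
    """QR via modified Gram-Schmidt (column-oriented variant)."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(A[:, k])
        Q[:, k] = A[:, k] / R[k, k]
        # orthogonalize all remaining columns against q_k immediately
        R[k, k+1:] = Q[:, k] @ A[:, k+1:]
        A[:, k+1:] -= np.outer(Q[:, k], R[k, k+1:])
    return Q, R
```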
Compare the Gram-Schmidt (from Problem Set 1), modified Gram-Schmidt and Householder algorithms on $5\times5$, $10\times 10$ and $20\times 20$ Hilbert matrices.
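The loss of orthogonality can be measured along these lines; as a stand-in for the Householder variant this sketch uses `np.linalg.qr` (LAPACK's QR is Householder-based), and `mgs_q` is a compact modified Gram-Schmidt:

```python
import numpy as np
from scipy.linalg import hilbert

def mgs_q(A):
    """Orthogonal factor from modified Gram-Schmidt (minimal sketch)."""
    A = A.astype(float).copy()
    n = A.shape[1]
    Q = np.zeros_like(A)
    for k in range(n):
        Q[:, k] = A[:, k] / np.linalg.norm(A[:, k])
        A[:, k+1:] -= np.outer(Q[:, k], Q[:, k] @ A[:, k+1:])
    return Q

for n in (5, 10, 20):
    H = hilbert(n)
    Q_mgs = mgs_q(H)
    Q_hh, _ = np.linalg.qr(H)
    print(n,
          np.linalg.norm(Q_mgs.T @ Q_mgs - np.eye(n)),  # grows with cond(H)
          np.linalg.norm(Q_hh.T @ Q_hh - np.eye(n)))    # stays near machine eps
```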
First of all, create the PageRank matrix $A$ that corresponds to the following graph. Make sure that your matrix is stochastic. What is the largest eigenvalue of this matrix?
Implement the power method and plot the relative errors ${|\lambda_{k+1} - 1|}$ (since you know that the exact $\lambda$ is $1$) and $\frac{\|x_{k+1} - x_{k}\|}{\|x_{k+1}\|}$ (since you have no information about $x$) as a function of $k$, where $k$ is the iteration number. DO NOT FORGET TO USE LOGARITHMIC SCALE!
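A sketch of the iteration itself (plotting is left out; the function name and interface are this sketch's own choices):

```python
import numpy as np

def power_method(A, num_iters=100, seed=0):
    """Plain power iteration; records the eigenvalue/vector estimate at each step."""
    rng = np.random.default_rng(seed)
    x = rng.random(A.shape[0])
    x /= np.linalg.norm(x)
    lambdas, xs = [], [x]
    for _ in range(num_iters):
        y = A @ x
        lam = x @ y                # Rayleigh quotient estimate of the eigenvalue
        x = y / np.linalg.norm(y)
        lambdas.append(lam)
        xs.append(x)
    return np.array(lambdas), xs
```

With the two histories in hand, `plt.semilogy` over $|\lambda_{k+1}-1|$ and the relative change of $x_k$ gives the required log-scale plots.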
Create the PageRank matrix $A$ that corresponds to the following graph:
Run the power method and plot the relative errors ${|\lambda_{k+1} - 1|}$ and $\frac{\|x_{k+1} - x_{k}\|}{\|x_{k+1}\|}$ as a function of $k$. Why do you observe the absence of convergence?
Hint: think about the multiplicity of $\lambda=1$.
In order to avoid this problem Larry Page and Sergey Brin proposed to use the following regularization technique: $$ A_d = dA + \frac{1-d}{N} \begin{pmatrix} 1 & \dots & 1 \\ \vdots & & \vdots \\ 1 & \dots & 1 \end{pmatrix}, $$ where $d$ is a parameter (typically $d=0.85$) called the damping factor, and $A$ is of size $N\times N$. Now $A_d$ is a matrix whose largest eigenvalue has multiplicity 1. Recall that computing the eigenvector of the PageRank matrix corresponding to the largest eigenvalue has the following interpretation: consider a person who starts in a random node of the graph (i.e. opens a random web page); at each step he follows one of the outgoing edges uniformly at random (i.e. opens one of the links). So this person randomly walks through the graph, and the eigenvector we are looking for is exactly his stationary distribution: for each node it tells you the probability of visiting this particular node. Therefore, if the person has started from a part of the graph which is not connected with the other part, he will never get there. In the regularized model the person at each step follows one of the outgoing links with probability $d$ OR visits a random node from the whole graph with probability $(1-d)$.
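Since the all-ones block is rank-one, a matvec with $A_d$ never needs the dense matrix: the ones-matrix times $x$ is just $\sum_i x_i$ in every entry. A sketch (the function name is made up here):

```python
import numpy as np

def pagerank_matvec(A, x, d=0.85):
    """Multiply by A_d = d*A + (1-d)/N * ones(N, N) without forming
    the dense rank-one part; A may be a NumPy array or a SciPy sparse matrix."""
    N = x.shape[0]
    # the scalar (1-d)/N * x.sum() broadcasts to every entry of the result
    return d * A.dot(x) + (1.0 - d) / N * x.sum()
```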
Let us now find the most significant articles in the Simple English Wikipedia according to the PageRank model. We provide you with the adjacency matrix of the Simple Wikipedia articles (file simple_wiki_matrix.mat; the matrix can be accessed by the key 'W') and the dictionary that maps an article id to its name on Wikipedia (file simple_wiki_dict.pickle).
Normalize each column of the adjacency matrix to get a matrix from the PageRank model. Check that the obtained matrix is stochastic.
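One way to do the normalization while keeping the matrix sparse (a sketch; dangling columns, i.e. pages with no outgoing links, are left as zeros here and deserve separate treatment):

```python
import numpy as np
from scipy import sparse

def to_stochastic(W):
    """Scale each column of a sparse adjacency matrix by its column sum."""
    W = sparse.csr_matrix(W, dtype=float)
    col_sums = np.asarray(W.sum(axis=0)).ravel()
    # reciprocal of nonzero column sums; zero columns get scale 0
    scale = np.divide(1.0, col_sums, out=np.zeros_like(col_sums),
                      where=col_sums > 0)
    return W @ sparse.diags(scale)   # right-multiplication scales columns
```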
Plot the relative errors ${|\lambda_{k+1} - 1|}$ and $\frac{\|x_{k+1} - x_{k}\|}{\|x_{k+1}\|}$ as a function of $k$ for different $d$. Note that your matrix contains a lot of zeros and is stored in the sparse format, hence np.dot(A, x) will not work. However, sparse matrices have the method .dot(), so A.dot(x) works fine.
Print the names of the top-5 articles for $d=0.85$.
The aim of this task is to build a face classifier. There are 40 persons in the database. Every person is represented by 10 photos with slightly different facial expressions.
Download the database of faces from here
Create training sample:
Import the first 9 images for each face ($9\times 40$ images in total). Represent these pictures as a matrix $F$ with $9\times 40$ columns, where each column is a reshaped 2D picture. Note: use np.reshape to reshape a matrix into a column.
Calculate and plot the mean face. Subtract it from each column of the matrix $F$.
Calculate the SVD of the shifted matrix $F$ and truncate the obtained representation: $F_r = U_r S_r V_r^T$.
Here $U_r$ is a matrix with $r$ columns - a basis in the space of faces. $W_r = S_r V_r^T$ is the matrix of coefficients in the basis $U_r$. Note that now every image is represented by a small number of coefficients in the basis $U_r$.
Plot vectors in $U_r$ using subplots. Make sure that you get face-like images. Now you know what eigenfaces are =)
Import the testing set, which is the rest of the photos. Find their coefficients in the basis $U_r$.
Compare the obtained vectors of coefficients to the vectors in $W_r$ using cosine similarity and classify the testing faces. As an output, give the indices of the faces that were misclassified when $r=5$.
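The classification step might be sketched as follows (the function name and label convention are this sketch's own; coefficient vectors are stored as columns):

```python
import numpy as np

def classify_faces(W_train, W_test, labels_train):
    """Nearest-neighbour classification by cosine similarity between
    coefficient vectors (columns of W_train and W_test)."""
    def normalize(W):
        return W / np.linalg.norm(W, axis=0, keepdims=True)
    # entry (i, j) is the cosine between training column i and test column j
    sims = normalize(W_train).T @ normalize(W_test)
    return labels_train[np.argmax(sims, axis=0)]
```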
Recommender systems are gaining more and more popularity. They're used in a broad range of areas: music, movies, e-commerce, social and professional networks, and online news, to name a few. Web services like Pandora, Netflix and Amazon build and deploy their own recommender engines in order to make customers happier and generate additional revenue. Those companies, and many others, take the problem of building a high-quality recommender system very seriously. One of many examples of the great interest in this area coming from industry is the famous Netflix competition with a \$1 million prize for the winners.
In this task you'll build a very simple yet powerful engine for a recommender system. Given the MovieLens 10M data, you'll implement an SVD-based model of a movie recommendation system. The SVD-based approach belongs to a family of collaborative filtering algorithms that use matrix factorization (MF). While there are many sophisticated algorithms for MF, pure SVD remains one of the top performers. The main idea behind these algorithms is to represent each user and each movie as a vector in some low-dimensional feature space. That low-dimensional space (or *latent factors space*) shows what features a user likes and which of them are present in a movie. Interpreting those features is a separate and hard task, and we will simply build the latent factors using SVD without focusing on their intrinsic meaning. With this model the "likeability" of a movie for a particular user is estimated by a weighted inner product of their latent factors. The model should also respond fast and produce recommendations right after a new user has demonstrated some of his preferences. Recomputing the SVD for this task may take a prohibitively long time, which is why we'll use the folding-in technique for making fresh recommendations quickly.
You're given the convenience functions get_movielens_data and split_data.
Be aware that these sets are disjoint, i.e. users from the training set are not in the testing set and vice versa. This means that test users will be "new" (i.e. unseen before) for the trained model. In order to produce reasonable recommendations for these "new" users we will use folding-in (see task 2).
Note: downloading the dataset may take a couple of minutes. If you already have a copy of the MovieLens 10M data on your computer, you may force the program to use it by specifying the local_file argument of the get_movielens_data function. The local_file argument should be a full path to the zip file.
import pandas as pd
import numpy as np
import scipy as sp
from scipy import sparse
from collections import namedtuple
from io import BytesIO, StringIO
from zipfile import ZipFile
from requests import get

def get_movielens_data(local_file=None):
    '''Downloads movielens data, normalizes users and movies ids,
    returns data in sparse CSR format.
    '''
    if not local_file:
        print('Downloading data...')
        zip_file_url = 'http://files.grouplens.org/datasets/movielens/ml-10m.zip'
        zip_response = get(zip_file_url)
        zip_contents = BytesIO(zip_response.content)  # the response body is bytes
        print('Done.')
    else:
        zip_contents = local_file

    print('Loading data into memory...')
    with ZipFile(zip_contents) as zfile:
        zdata = zfile.read('ml-10M100K/ratings.dat').decode('utf-8')

    delimiter = ';'
    zdata = zdata.replace('::', delimiter)  # makes data compatible with the pandas c-engine
    ml_data = pd.read_csv(StringIO(zdata), sep=delimiter, header=None, engine='c',
                          names=['userid', 'movieid', 'rating', 'timestamp'],
                          usecols=['userid', 'movieid', 'rating'])

    # normalize indices to avoid gaps
    ml_data['movieid'] = ml_data.groupby('movieid', sort=False).grouper.group_info[0]
    ml_data['userid'] = ml_data.groupby('userid', sort=False).grouper.group_info[0]

    # build sparse user-movie matrix
    data_shape = ml_data[['userid', 'movieid']].max() + 1
    data_matrix = sp.sparse.csr_matrix((ml_data['rating'],
                                        (ml_data['userid'], ml_data['movieid'])),
                                       shape=data_shape, dtype=np.float64)
    print('Done.')
    return data_matrix
def split_data(data, test_ratio=0.2):
    '''Randomly splits data into training and testing datasets. Default ratio is 80%/20%.
    Returns datasets in namedtuple format for convenience. Usage:
    train_data, test_data = split_data(data_matrix)
    or
    movielens_data = split_data(data_matrix)
    and later in code:
    do smth with movielens_data.train
    do smth with movielens_data.test
    '''
    num_users = data.shape[0]
    idx = np.zeros((num_users,), dtype=bool)
    # the sample size passed to np.random.choice must be an integer
    sel = np.random.choice(num_users, int(test_ratio * num_users), replace=False)
    np.put(idx, sel, True)
    Movielens_data = namedtuple('MovieLens10M', ['train', 'test'])
    movielens_data = Movielens_data(train=data[~idx, :], test=data[idx, :])
    return movielens_data
Build a representation of users and movies in the latent factors space with the help of SVD.
Why can't you use numpy.linalg on your computer for this task? Use scipy.sparse.linalg.svds instead. Keep in mind that scipy returns the singular values in ascending order (see the docs), so you will have to reverse them; the reversed arrays are not contiguous, use np.ascontiguousarray to fix that.
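The two gotchas above can be handled in a couple of lines; here a small random sparse matrix stands in for the rating matrix:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# small random sparse matrix as a stand-in for the user-movie matrix
M = sparse_random(100, 80, density=0.05, format='csr', random_state=0)

U, s, Vt = svds(M, k=10)            # singular values come back ascending
U = np.ascontiguousarray(U[:, ::-1])  # reverse and restore contiguity
s = s[::-1]
Vt = np.ascontiguousarray(Vt[::-1, :])
print(np.all(np.diff(s) <= 0))      # now in descending order
```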
Evaluation of the model is done by splitting the test dataset into 2 subsets:
Overall performance is measured by the total number of correct predictions made by the model on the test set.
Your tasks:
Hint: .nonzero() or .indices might be helpful, as well as the numpy.in1d function.

A new user can be considered as an update to the original matrix (appending a new row). Appending a row to the original matrix corresponds to appending a row to the users' latent factors matrix in the SVD decomposition. We can formulate the relation between these two updates as follows (see here for details and the picture above; for a single user $q = 1$): $$ p^T = x^TVS^{-1} $$ where $p$ is the update to the latent factors matrix and $x$ is the update to the original user-movie matrix (e.g. the user's preferences or behaviour history). Then, to compute recommendations for the new user, we simply restore the part of the original matrix corresponding to the update: $$ r^T = p^TSV^T = x^TVV^T $$ where $r$ is the recommendations vector we're looking for: $$ r = VV^Tx $$
Note that the matrix $P = VV^T$ satisfies $P^2 = P$, i.e. $P$ is a projector. This means that our folding-in procedure can be naturally described as a projection of the user preferences onto the latent factors space.
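A folding-in step is then just two thin matrix-vector products; a minimal sketch, assuming `Vt` is the $k \times n$ factor from the truncated SVD:

```python
import numpy as np

def fold_in(Vt, x):
    """Recommendations for a new user: r = V V^T x, computed as two
    cheap products with the thin factors instead of forming V V^T."""
    return Vt.T @ (Vt @ x)
```

Because $VV^T$ is a projector, applying `fold_in` to its own output returns the same vector.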
initialize total_score with 0
for each user in test data:
    rating_history <-- all but the last N movies rated by the user
    evaluation_set <-- the last N movies rated by the user
    initialize user_preferences_vector with 1s, using rating_history as the indices of non-zero values
    top-N recommendations <-- folding-in applied to user_preferences_vector
    correct_predictions <-- number of intersections of the recommendations and evaluation_set
    total_score <-- total_score + correct_predictions
return total_score
You may use a different implementation at your will; the only two criteria are:
Optionally: you may want to test your parameters with different data splittings in order to minimize the risk of local effects. You're also free to modify your code to produce better results. Report what modifications you've made and what effect they had, if any.