Recommendation system¶

$$ \newcommand{\eg}{{\it e.g.}} \newcommand{\ie}{{\it i.e.}} \newcommand{\argmin}{\operatornamewithlimits{argmin}} \newcommand{\mc}{\mathcal} \newcommand{\mb}{\mathbb} \newcommand{\mf}{\mathbf} \newcommand{\minimize}{{\text{minimize}}} \newcommand{\diag}{{\text{diag}}} \newcommand{\cond}{{\text{cond}}} \newcommand{\rank}{{\text{rank }}} \newcommand{\range}{{\mathcal{R}}} \newcommand{\null}{{\mathcal{N}}} \newcommand{\tr}{{\text{trace }}} \newcommand{\dom}{{\text{dom}}} \newcommand{\dist}{{\text{dist}}} \newcommand{\R}{\mathbf{R}} \newcommand{\B}{\mathbf{B}} \newcommand{\SM}{\mathbf{S}} \newcommand{\ball}{\mathcal{B}} \newcommand{\bmat}[1]{\begin{bmatrix}#1\end{bmatrix}} \newcommand{\loss}{\ell} \newcommand{\eloss}{\mc{L}} \newcommand{\abs}[1]{| #1 |} \newcommand{\norm}[1]{\| #1 \|} \newcommand{\tp}{T} $$

EE786: Optimization applications, Kyung Hee University.

Jong-Han Kim (jonghank@khu.ac.kr)

The following cell downloads a csv file that contains the evaluation scores of 9,000+ movies rated by 600+ customers. It is made of 100,000+ rows where every row explains that the user userId gave rating points to the movie movieId.

Your job is to use the given data set to design a movie recommendation system that is able to suggest a specific user several movies that s/he has not watched and will probably like.

In [ ]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('http://jonghank.github.io/ee786/files/ratings.csv')

The first five rows:

In [ ]:

df.head()

Out[ ]:

	userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931

and the last five rows:

In [ ]:

df.tail()

Out[ ]:

	userId	movieId	rating	timestamp
100831	610	166534	4.0	1493848402
100832	610	168248	5.0	1493850091
100833	610	168250	5.0	1494273047
100834	610	168252	5.0	1493846352
100835	610	170875	3.0	1493846415

For your information, the title of the movie corresponding to a specific movieId can be found from the following.

In [ ]:

md = pd.read_csv('http://jonghank.github.io/ee786/files/movies.csv')

The first five rows,

In [ ]:

md.head()

Out[ ]:

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy

and the last five.

In [ ]:

md.tail()

Out[ ]:

	movieId	title	genres
9737	193581	Black Butler: Book of the Atlantic (2017)	Action\|Animation\|Comedy\|Fantasy
9738	193583	No Game No Life: Zero (2017)	Animation\|Comedy\|Fantasy
9739	193585	Flint (2017)	Drama
9740	193587	Bungo Stray Dogs: Dead Apple (2018)	Action\|Animation
9741	193609	Andrew Dice Clay: Dice Rules (1991)	Comedy

Now we express the above rating information by a binary matrix $B\in\B^{m\times n}$ and a real matrix $R\in\R^{m\times n}$, where $m$ and $n$ represent the number of the users and the numer of the evaluated movies, respectively.

Each row in $B$ tells which movies the specific user has evaluated, and $R$ tells the evaluation scores for those combinations specified by $B$. So $B_{ij}=1$ when user $i$ gave the evaluation score $R_{ij}$ to the movie $j$,and $B_{ij}=0$ if the user $i$ did not evaluated the movie $j$.

In the code below, B and R contains $B$ and $R$. In addition, T contains the list of movie ID's. Those matrices contain 611 users' rating scores on 193,610 movies. So the rating matrix $R$ for the first 1,000 movieIDs looks like below.

In [ ]:

import scipy.sparse as ssp

userId  = df['userId'].values
movieId = df['movieId'].values
rating  = df['rating'].values

movieIdList = md['movieId'].values
titleList   = md['title'].values

R = ssp.csr_matrix( (rating,(userId,movieId)) )
B = ssp.csr_matrix( (np.ones_like(rating),(userId,movieId)), dtype='int' )
userIdList  = np.arange(B.shape[0], dtype='int')
titleIdList = np.arange(B.shape[1], dtype='int')

m, n = B.shape

plt.figure(figsize=(10,10))
plt.imshow(R[:,:1000].toarray(), cmap='gray', vmin=0, vmax=5)
plt.xlabel('Movie id'), plt.ylabel('User id')
plt.title(f'Rating table: {m} users, {n} movies')
plt.show()

The above image is extremely sparse and it contains lots of zero columns; there are actually a bunch of movies which did not receive any evaluation from anyone. So will cut the columns (movies) with few evaluations away, and similarly cut the rows (users) who marked few evaluations. This will reduce the size of the matrix, while conserving the dominant information.

The following removes the movies with less than 60 evaluations, and then removes the users who marked less than 40 evaluations. This leaves the reduced size $R$ and $B$ which represent rating information on 328 movies from 288 users. The selected user IDs are contained in selectedUserNumbers and the selected movie Titles are contained in selectedMovieTitles.

In [ ]:

minRatesPerPerson, minRatesPerMovie = 40, 60

selectedMovies = np.sum(B, axis=0)>minRatesPerMovie
selectedMovies = np.asarray(selectedMovies).flatten()

R = R[:, selectedMovies]
B = B[:, selectedMovies]
titleIdList = titleIdList[selectedMovies]

selectedUsers = np.sum(B, axis=1)>minRatesPerPerson
selectedUsers = np.asarray(selectedUsers).flatten()

R = R[selectedUsers, :]
B = B[selectedUsers, :]
userIdList = userIdList[selectedUsers]

selectedUserNumbers = userIdList
selectedMovieTitles = [ md['title'][md['movieId']==i] for i in titleIdList ]

m, n = R.shape

plt.figure(figsize=(10,10))
plt.imshow(R.toarray(), cmap='gray', vmin=0, vmax=5)
plt.xlabel('Movie id'), plt.ylabel('User id')
plt.title(f'Rating table: {m} users, {n} movies')
plt.show()

We now randomly erase ~10% of the evaluation data so that we can use them as a validation set. The remaining data will be used as a "training" set. This is shown below.

In [ ]:

B = B.toarray()
R = R.toarray()
B_true = np.copy(B)
R_true = np.copy(R)

for i in range(m):
  for j in range(n):
    if B[i,j]>0:
      if np.random.rand()>0.9:
        B[i,j] = 0
        R[i,j] = 0
      
plt.figure(figsize=(20,10))
plt.subplot(121)
plt.imshow(R, cmap='gray', vmin=0, vmax=5)
plt.xlabel('Movie id'), plt.ylabel('User id')
plt.title(f'Rating table (training set): {m} users, {n} movies')
plt.subplot(122)
plt.imshow(R_true-R, cmap='gray', vmin=0, vmax=5)
plt.xlabel('Movie id'), plt.ylabel('User id')
plt.title(f'Rating table (validation set): {m} users, {n} movies')
plt.show()      

OK, now everything is ready. We have $B, R\in\R^{288\times 328}$ which represent the evaluation information on 328 movies from 288 users. Note the following observations:

The rating matrix $R$ is quite sparse.
A known entry $R_{ij}$ where $B_{ij}=1$ explains how much the user $i$ liked the movie $j$.
If $B_{ij}=0$, user $i$'s preference on the movie $j$, $R_{ij}$, is unknown. If we can fill it in with some nice $\hat{R}_{ij}$, then we can use it as a prediction on how much the user $i$ will like the movie $j$, and finally we will be able to recommend the user $i$ with the movies in the order of high $\hat{R}_{ij}$'s.

So the problem is to find some preference matrix $\hat{R}$ such that $\hat{R}_{ij} = R_{ij}$ if $B_{ij}=1$. Then how? Another observation is here.

The sparse rating matrix $R$ looks like a low rank matrix with some elements missing.
The preference matrix, $\hat{R}$, which is a reconstruction from $R$, should be a low rank matrix since it is generally believed that a user's preference on movies is usually determined by a few dominant factors, such as user's age, sex, movie's genre, director, actress/actor, et cetera.

Now the problem boils down to finding a low rank matrix $\hat{R}$ such that $\hat{R}_{ij} = R_{ij}$ if $B_{ij}=1$, or

$$ \begin{aligned} \underset{P}{\minimize} \quad & \rank P \\ \text{subject to} \quad & P_{ij} = R_{ij}, \quad & \text{if }B_{ij}=0 \end{aligned} $$

The $\rank$ of a matrix is not a convex function. An obvious counterexample is with

$$ A=\bmat{1 & 0 \\ 0 & 0}, \quad B=\bmat{0 & 0 \\ 0 & 1} $$

where $\rank A = \rank B = 1$, but

$$ \rank\frac{A+B}{2} = 2 > \frac{1}{2}\left( \rank A + \rank B \right) = 1 $$

We relax the above nonconvex problem into convex one, by replacing the $\rank$ function by the nuclear norm. The nuclear norm of a matrix is defined by the sum of the nonzero singular values

$$ \|P\|_* = \sum_{i=1}^r \sigma_i(P) $$

where $r=\rank P$.

The intuition behind this is that $\|P\|_*$ is the convex envelope (the best pointwise convex approximation) of $\rank P$ for $\|P\|\le1$. Note the analogy with the cardinality function and the $\ell_1$ norm for vectors.

The $\ell_1$ norm of a vector:

is the sum of the magnitude of nonzero elements
is the convex envelope of the cardinality function, which deals with the sparsity

The nuclear norm of a matrix:

is the sum of the nonzero singular values
is the convex envelope of the rank function, which deals with the sparsity of the singular values.

So the relaxation is like solving the $\ell_1$-norm minimization when we wanted sparse solutions. Solve the relaxed problem, and we are good if the solution is (in many cases in practice) sparse. Now our problem is is cast as follows.

$$ \begin{aligned} \underset{P}{\minimize} \quad & \| P \|_* \\ \text{subject to} \quad & P_{ij} = R_{ij}, \qquad \text{if }B_{ij}=0 \end{aligned} $$

It turns out that the above is equivalent to

$$ \begin{aligned} \underset{P, X, Y}{\minimize} \quad & \frac{1}{2}\left( \tr X + \tr Y \right) \\ \text{subject to} \quad & \bmat{X & P \\ P^T & Y} \succeq 0,\\ \quad & P_{ij} = R_{ij}, \qquad \text{if }B_{ij}=0 \end{aligned} $$

(Problem 1) Use cvxpy to solve the above.

In [ ]:

# your code here

import cvxpy as cp

P = cp.Variable((m,n))
X = cp.Variable((m,m), symmetric=True)
Y = cp.Variable((n,n), symmetric=True)

obj = cp.trace(X) + cp.trace(Y) + 1.0*cp.sum_squares(P[B>0] - R[B>0])
obj = cp.Minimize(obj)
constr = [ cp.bmat([[X, P], [P.T, Y]]) >> 0 ]      
prob = cp.Problem(obj, constr)
prob.solve(verbose=True)

----------------------------------------------------------------------------
	SCS v2.1.0 - Splitting Conic Solver
	(c) Brendan O'Donoghue, Stanford University, 2012
----------------------------------------------------------------------------
Lin-sys: sparse-direct, nnz in A = 215431
eps = 1.00e-05, alpha = 1.50, max_iters = 5000, normalize = 1, scale = 1.00
acceleration_lookback = 10, rho_x = 1.00e-03
Variables n = 190037, constraints m = 215432
Cones:	linear vars: 1
	soc vars: 25395, soc blks: 1
	sd vars: 190036, sd blks: 1
Setup time: 4.81e-01s
----------------------------------------------------------------------------
 Iter | pri res | dua res | rel gap | pri obj | dua obj | kap/tau | time (s)
----------------------------------------------------------------------------
     0| 4.60e+21  5.73e+21  1.00e+00 -8.35e+25  3.31e+25  2.94e+25  3.69e-01 
   100| 7.57e-04  1.05e-03  6.33e-05  4.91e+03  4.91e+03  2.78e-12  4.51e+01 
   200| 8.34e-06  1.38e-05  7.08e-06  4.91e+03  4.91e+03  8.97e-13  8.75e+01 
   220| 1.39e-06  2.25e-06  1.28e-06  4.91e+03  4.91e+03  2.66e-12  9.59e+01 
----------------------------------------------------------------------------
Status: Solved
Timing: Solve time: 9.59e+01s
	Lin-sys: nnz in L factor: 620900, avg solve time: 6.04e-03s
	Cones: avg projection time: 3.52e-01s
	Acceleration: avg step time: 6.24e-02s
----------------------------------------------------------------------------
Error metrics:
dist(s, K) = 1.3050e-09, dist(y, K*) = 1.2201e-09, s'y/|s||y| = 3.2821e-13
primal res: |Ax + s - b|_2 / (1 + |b|_2) = 1.3900e-06
dual res:   |A'y + c|_2 / (1 + |c|_2) = 2.2532e-06
rel gap:    |c'x + b'y| / (1 + |c'x| + |b'y|) = 1.2839e-06
----------------------------------------------------------------------------
c'x = 4909.0787, -b'y = 4909.0661
============================================================================

Out[ ]:

4909.078674109426

(Problem 2) Display the reconstructed preference matrix, $P$. Compare it with the rating table, $R$.

In [ ]:

#your code here

P_opt = np.round(np.clip(2*P.value, 0, 10))/2        

plt.figure(figsize=(20,10))
plt.subplot(121)
plt.imshow(P_opt, cmap='gray', vmin=0, vmax=5)
plt.xlabel('Movie id'), plt.ylabel('User id')
plt.title('Preference prediction')
plt.subplot(122)
plt.imshow(R, cmap='gray', vmin=0, vmax=5)
plt.xlabel('Movie id'), plt.ylabel('User id')
plt.title('Rating table')
plt.show()

(Problem 3) Show the singular values of $P$, $X$, and $Y$. Report the number of dominant factors that contribute to an individual’s tastes or preferences on movies.

In [ ]:

# your code here

UP, SP, VP = np.linalg.svd(P.value)
UX, SX, VX = np.linalg.svd(X.value)
UY, SY, VY = np.linalg.svd(Y.value)

plt.figure(figsize=(10,10))
plt.subplot(211)
plt.semilogy(SP, label=r'$\sigma(P)$')
plt.semilogy(SX, label=r'$\sigma(X)$')
plt.semilogy(SY, label=r'$\sigma(Y)$')
plt.title('Singular values (linear scale)')
plt.grid(), plt.legend()
plt.subplot(212)
plt.plot(SP, label=r'$\sigma(P)$')
plt.plot(SX, label=r'$\sigma(X)$')
plt.plot(SY, label=r'$\sigma(Y)$')
plt.title('Singular values (log scale)')
plt.grid(), plt.legend(), plt.ylim(0,100)
plt.show()

(Problem 4) Randomly choose 10 users, and compare the preference prediction ($P$) with the training data points (in $R$) and the validation data points (in $R_\text{true}-R$).

In [ ]:

# your code here

users = np.random.randint(m, size=(10,))
movies = np.arange(n)
for u in users:
    idx_trn = B[u,:]>0
    idx_val = np.bitwise_and(B_true[u,:]>0, B[u,:]==0)
    plt.figure(figsize=(10,4))
    plt.plot(P_opt[u,:], 'o:', label='Preference prediction')
    plt.plot(movies[idx_trn], R[u,idx_trn], 'kd', markersize=7, label='Training set')
    plt.plot(movies[idx_val], R_true[u,idx_val], 'r*', markersize=12, label='Validation set')
    plt.ylabel(f'user #{u}')
    plt.ylim(0,5.5), plt.grid(), plt.legend()
    plt.show()

(Problem 5) Plot the histogram of the prediction error on the validation set. Compute the prediction RMSE on the validation set. Present a short interpretation on what you've observed.

In [ ]:

# your code here

validation_set = np.where(B_true-B>0)
error_validation = P.value[validation_set] - R_true[validation_set]
RMSE_validation = np.std(error_validation)
print(f'Validation RMSE: {RMSE_validation}')

plt.figure(figsize=(10,6))
plt.hist(error_validation, bins=np.arange(-4,4.5,0.5))
plt.grid()
plt.xlabel('Prediction error')
plt.ylabel('Frequency')
plt.title('Prediction accuracy')
plt.show()

Validation RMSE: 0.8149164995984761

This roughly implies that approximately 2/3 of the prediction error is less than 0.8.