The following cell downloads a csv
file that contains the evaluation scores of 9,000+ movies rated by 600+ customers. It is made of 100,000+ rows where every row explains that the user userId
gave rating
points to the movie movieId
.
Your job is to use the given data set to design a movie recommendation system that is able to suggest a specific user several movies that s/he has not watched and will probably like.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('http://jonghank.github.io/ee786/files/ratings.csv')
The first five rows:
df.head()
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 1 | 4.0 | 964982703 |
1 | 1 | 3 | 4.0 | 964981247 |
2 | 1 | 6 | 4.0 | 964982224 |
3 | 1 | 47 | 5.0 | 964983815 |
4 | 1 | 50 | 5.0 | 964982931 |
and the last five rows:
df.tail()
userId | movieId | rating | timestamp | |
---|---|---|---|---|
100831 | 610 | 166534 | 4.0 | 1493848402 |
100832 | 610 | 168248 | 5.0 | 1493850091 |
100833 | 610 | 168250 | 5.0 | 1494273047 |
100834 | 610 | 168252 | 5.0 | 1493846352 |
100835 | 610 | 170875 | 3.0 | 1493846415 |
For your information, the title of the movie corresponding to a specific movieId
can be found from the following.
md = pd.read_csv('http://jonghank.github.io/ee786/files/movies.csv')
The first five rows,
md.head()
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
and the last five.
md.tail()
movieId | title | genres | |
---|---|---|---|
9737 | 193581 | Black Butler: Book of the Atlantic (2017) | Action|Animation|Comedy|Fantasy |
9738 | 193583 | No Game No Life: Zero (2017) | Animation|Comedy|Fantasy |
9739 | 193585 | Flint (2017) | Drama |
9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | Action|Animation |
9741 | 193609 | Andrew Dice Clay: Dice Rules (1991) | Comedy |
Now we express the above rating information by a binary matrix $B\in\B^{m\times n}$ and a real matrix $R\in\R^{m\times n}$, where $m$ and $n$ represent the number of the users and the numer of the evaluated movies, respectively.
Each row in $B$ tells which movies the specific user has evaluated, and $R$ tells the evaluation scores for those combinations specified by $B$. So $B_{ij}=1$ when user $i$ gave the evaluation score $R_{ij}$ to the movie $j$,and $B_{ij}=0$ if the user $i$ did not evaluated the movie $j$.
In the code below, B
and R
contains $B$ and $R$. In addition, T
contains the list of movie ID's. Those matrices contain 611 users' rating scores on 193,610 movies. So the rating matrix $R$ for the first 1,000 movieIDs looks like below.
import scipy.sparse as ssp
userId = df['userId'].values
movieId = df['movieId'].values
rating = df['rating'].values
movieIdList = md['movieId'].values
titleList = md['title'].values
R = ssp.csr_matrix( (rating,(userId,movieId)) )
B = ssp.csr_matrix( (np.ones_like(rating),(userId,movieId)), dtype='int' )
userIdList = np.arange(B.shape[0], dtype='int')
titleIdList = np.arange(B.shape[1], dtype='int')
m, n = B.shape
plt.figure(figsize=(10,10))
plt.imshow(R[:,:1000].toarray(), cmap='gray', vmin=0, vmax=5)
plt.xlabel('Movie id'), plt.ylabel('User id')
plt.title(f'Rating table: {m} users, {n} movies')
plt.show()
The above image is extremely sparse and it contains lots of zero columns; there are actually a bunch of movies which did not receive any evaluation from anyone. So will cut the columns (movies) with few evaluations away, and similarly cut the rows (users) who marked few evaluations. This will reduce the size of the matrix, while conserving the dominant information.
The following removes the movies with less than 60 evaluations, and then removes the users who marked less than 40 evaluations. This leaves the reduced size $R$ and $B$ which represent rating information on 328 movies from 288 users. The selected user IDs are contained in selectedUserNumbers
and the selected movie Titles are contained in selectedMovieTitles
.
minRatesPerPerson, minRatesPerMovie = 40, 60
selectedMovies = np.sum(B, axis=0)>minRatesPerMovie
selectedMovies = np.asarray(selectedMovies).flatten()
R = R[:, selectedMovies]
B = B[:, selectedMovies]
titleIdList = titleIdList[selectedMovies]
selectedUsers = np.sum(B, axis=1)>minRatesPerPerson
selectedUsers = np.asarray(selectedUsers).flatten()
R = R[selectedUsers, :]
B = B[selectedUsers, :]
userIdList = userIdList[selectedUsers]
selectedUserNumbers = userIdList
selectedMovieTitles = [ md['title'][md['movieId']==i] for i in titleIdList ]
m, n = R.shape
plt.figure(figsize=(10,10))
plt.imshow(R.toarray(), cmap='gray', vmin=0, vmax=5)
plt.xlabel('Movie id'), plt.ylabel('User id')
plt.title(f'Rating table: {m} users, {n} movies')
plt.show()
We now randomly erase ~10% of the evaluation data so that we can use them as a validation set. The remaining data will be used as a "training" set. This is shown below.
B = B.toarray()
R = R.toarray()
B_true = np.copy(B)
R_true = np.copy(R)
for i in range(m):
for j in range(n):
if B[i,j]>0:
if np.random.rand()>0.9:
B[i,j] = 0
R[i,j] = 0
plt.figure(figsize=(20,10))
plt.subplot(121)
plt.imshow(R, cmap='gray', vmin=0, vmax=5)
plt.xlabel('Movie id'), plt.ylabel('User id')
plt.title(f'Rating table (training set): {m} users, {n} movies')
plt.subplot(122)
plt.imshow(R_true-R, cmap='gray', vmin=0, vmax=5)
plt.xlabel('Movie id'), plt.ylabel('User id')
plt.title(f'Rating table (validation set): {m} users, {n} movies')
plt.show()
OK, now everything is ready. We have $B, R\in\R^{288\times 328}$ which represent the evaluation information on 328 movies from 288 users. Note the following observations:
So the problem is to find some preference matrix $\hat{R}$ such that $\hat{R}_{ij} = R_{ij}$ if $B_{ij}=1$. Then how? Another observation is here.
Now the problem boils down to finding a low rank matrix $\hat{R}$ such that $\hat{R}_{ij} = R_{ij}$ if $B_{ij}=1$, or
$$ \begin{aligned} \underset{P}{\minimize} \quad & \rank P \\ \text{subject to} \quad & P_{ij} = R_{ij}, \quad & \text{if }B_{ij}=0 \end{aligned} $$The $\rank$ of a matrix is not a convex function. An obvious counterexample is with
$$ A=\bmat{1 & 0 \\ 0 & 0}, \quad B=\bmat{0 & 0 \\ 0 & 1} $$where $\rank A = \rank B = 1$, but
$$ \rank\frac{A+B}{2} = 2 > \frac{1}{2}\left( \rank A + \rank B \right) = 1 $$We relax the above nonconvex problem into convex one, by replacing the $\rank$ function by the nuclear norm. The nuclear norm of a matrix is defined by the sum of the nonzero singular values
$$ \|P\|_* = \sum_{i=1}^r \sigma_i(P) $$where $r=\rank P$.
The intuition behind this is that $\|P\|_*$ is the convex envelope (the best pointwise convex approximation) of $\rank P$ for $\|P\|\le1$. Note the analogy with the cardinality function and the $\ell_1$ norm for vectors.
The $\ell_1$ norm of a vector:
The nuclear norm of a matrix:
So the relaxation is like solving the $\ell_1$-norm minimization when we wanted sparse solutions. Solve the relaxed problem, and we are good if the solution is (in many cases in practice) sparse. Now our problem is is cast as follows.
$$ \begin{aligned} \underset{P}{\minimize} \quad & \| P \|_* \\ \text{subject to} \quad & P_{ij} = R_{ij}, \qquad \text{if }B_{ij}=0 \end{aligned} $$It turns out that the above is equivalent to
$$ \begin{aligned} \underset{P, X, Y}{\minimize} \quad & \frac{1}{2}\left( \tr X + \tr Y \right) \\ \text{subject to} \quad & \bmat{X & P \\ P^T & Y} \succeq 0,\\ \quad & P_{ij} = R_{ij}, \qquad \text{if }B_{ij}=0 \end{aligned} $$(Problem 1) Use cvxpy
to solve the above.
# your code here
import cvxpy as cp
P = cp.Variable((m,n))
X = cp.Variable((m,m), symmetric=True)
Y = cp.Variable((n,n), symmetric=True)
obj = cp.trace(X) + cp.trace(Y) + 1.0*cp.sum_squares(P[B>0] - R[B>0])
obj = cp.Minimize(obj)
constr = [ cp.bmat([[X, P], [P.T, Y]]) >> 0 ]
prob = cp.Problem(obj, constr)
prob.solve(verbose=True)
---------------------------------------------------------------------------- SCS v2.1.0 - Splitting Conic Solver (c) Brendan O'Donoghue, Stanford University, 2012 ---------------------------------------------------------------------------- Lin-sys: sparse-direct, nnz in A = 215431 eps = 1.00e-05, alpha = 1.50, max_iters = 5000, normalize = 1, scale = 1.00 acceleration_lookback = 10, rho_x = 1.00e-03 Variables n = 190037, constraints m = 215432 Cones: linear vars: 1 soc vars: 25395, soc blks: 1 sd vars: 190036, sd blks: 1 Setup time: 4.81e-01s ---------------------------------------------------------------------------- Iter | pri res | dua res | rel gap | pri obj | dua obj | kap/tau | time (s) ---------------------------------------------------------------------------- 0| 4.60e+21 5.73e+21 1.00e+00 -8.35e+25 3.31e+25 2.94e+25 3.69e-01 100| 7.57e-04 1.05e-03 6.33e-05 4.91e+03 4.91e+03 2.78e-12 4.51e+01 200| 8.34e-06 1.38e-05 7.08e-06 4.91e+03 4.91e+03 8.97e-13 8.75e+01 220| 1.39e-06 2.25e-06 1.28e-06 4.91e+03 4.91e+03 2.66e-12 9.59e+01 ---------------------------------------------------------------------------- Status: Solved Timing: Solve time: 9.59e+01s Lin-sys: nnz in L factor: 620900, avg solve time: 6.04e-03s Cones: avg projection time: 3.52e-01s Acceleration: avg step time: 6.24e-02s ---------------------------------------------------------------------------- Error metrics: dist(s, K) = 1.3050e-09, dist(y, K*) = 1.2201e-09, s'y/|s||y| = 3.2821e-13 primal res: |Ax + s - b|_2 / (1 + |b|_2) = 1.3900e-06 dual res: |A'y + c|_2 / (1 + |c|_2) = 2.2532e-06 rel gap: |c'x + b'y| / (1 + |c'x| + |b'y|) = 1.2839e-06 ---------------------------------------------------------------------------- c'x = 4909.0787, -b'y = 4909.0661 ============================================================================
4909.078674109426
(Problem 2) Display the reconstructed preference matrix, $P$. Compare it with the rating table, $R$.
#your code here
P_opt = np.round(np.clip(2*P.value, 0, 10))/2
plt.figure(figsize=(20,10))
plt.subplot(121)
plt.imshow(P_opt, cmap='gray', vmin=0, vmax=5)
plt.xlabel('Movie id'), plt.ylabel('User id')
plt.title('Preference prediction')
plt.subplot(122)
plt.imshow(R, cmap='gray', vmin=0, vmax=5)
plt.xlabel('Movie id'), plt.ylabel('User id')
plt.title('Rating table')
plt.show()
(Problem 3) Show the singular values of $P$, $X$, and $Y$. Report the number of dominant factors that contribute to an individual’s tastes or preferences on movies.
# your code here
UP, SP, VP = np.linalg.svd(P.value)
UX, SX, VX = np.linalg.svd(X.value)
UY, SY, VY = np.linalg.svd(Y.value)
plt.figure(figsize=(10,10))
plt.subplot(211)
plt.semilogy(SP, label=r'$\sigma(P)$')
plt.semilogy(SX, label=r'$\sigma(X)$')
plt.semilogy(SY, label=r'$\sigma(Y)$')
plt.title('Singular values (linear scale)')
plt.grid(), plt.legend()
plt.subplot(212)
plt.plot(SP, label=r'$\sigma(P)$')
plt.plot(SX, label=r'$\sigma(X)$')
plt.plot(SY, label=r'$\sigma(Y)$')
plt.title('Singular values (log scale)')
plt.grid(), plt.legend(), plt.ylim(0,100)
plt.show()
(Problem 4) Randomly choose 10 users, and compare the preference prediction ($P$) with the training data points (in $R$) and the validation data points (in $R_\text{true}-R$).
# your code here
users = np.random.randint(m, size=(10,))
movies = np.arange(n)
for u in users:
idx_trn = B[u,:]>0
idx_val = np.bitwise_and(B_true[u,:]>0, B[u,:]==0)
plt.figure(figsize=(10,4))
plt.plot(P_opt[u,:], 'o:', label='Preference prediction')
plt.plot(movies[idx_trn], R[u,idx_trn], 'kd', markersize=7, label='Training set')
plt.plot(movies[idx_val], R_true[u,idx_val], 'r*', markersize=12, label='Validation set')
plt.ylabel(f'user #{u}')
plt.ylim(0,5.5), plt.grid(), plt.legend()
plt.show()
(Problem 5) Plot the histogram of the prediction error on the validation set. Compute the prediction RMSE on the validation set. Present a short interpretation on what you've observed.
# your code here
validation_set = np.where(B_true-B>0)
error_validation = P.value[validation_set] - R_true[validation_set]
RMSE_validation = np.std(error_validation)
print(f'Validation RMSE: {RMSE_validation}')
plt.figure(figsize=(10,6))
plt.hist(error_validation, bins=np.arange(-4,4.5,0.5))
plt.grid()
plt.xlabel('Prediction error')
plt.ylabel('Frequency')
plt.title('Prediction accuracy')
plt.show()
Validation RMSE: 0.8149164995984761
This roughly implies that approximately 2/3 of the prediction error is less than 0.8.