%pylab inline
import pandas as pd
import numpy as np
from __future__ import division
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
import logging
logger = logging.getLogger()
Populating the interactive namespace from numpy and matplotlib
two broad groups:
Content-based systems
focus on the properities of items.
Collaborative filtering systems
focus on the relationship between users and items.
# Example 9.1
M = pd.DataFrame(index=['A', 'B', 'C', 'D'], columns=['HP1', 'HP2', 'HP3', 'TW', 'SW1', 'SW2', 'SW3'])
M.loc['A', ['HP1', 'TW', 'SW1']] = [4, 5, 1]
M.iloc[1, 0:3] = [5, 5, 4]
M.iloc[2, 3:-1] = [2, 4, 5]
M.iloc[3, [1, -1]] = [3, 3]
M_9_1 = M
M_9_1
HP1 | HP2 | HP3 | TW | SW1 | SW2 | SW3 | |
---|---|---|---|---|---|---|---|
A | 4 | NaN | NaN | 5 | 1 | NaN | NaN |
B | 5 | 5 | 4 | NaN | NaN | NaN | NaN |
C | NaN | NaN | NaN | 2 | 4 | 5 | NaN |
D | NaN | 3 | NaN | NaN | NaN | NaN | 3 |
In practice, the matrix would be even sparser, with the typical user rating only a tiny fraction of all avalibale items.
the goal of a recommendation system is: to predict the blanks in the utility matrix.
physical institutions | online institutions |
---|---|
provide only the most popular items | provide the entire range of items |
the long tail force online institutions to recommend items to individual users:
It's no possible to present all avaliable items to the user.
Neither can we expect users to have heared of each of the items they might like.
Product Recommendations
Movie Recommendations
News Articles
how to discovery the value users place on items:
We can ask users to rate items.
cons: users are unwilling to do, and so samples are biased by very little fraction of peoples.
We can make inferences from users' behavior.
eg: items purchased/viewed/rated.
a record representing important characteristics of items.
for Documents
idea: find the identification of words that characterize the topic of a document.
namely, we expect a sets of words to express the subjects or main ideas of the document.
to measure the similarity of two documents, the distance measures we could use are:
for Images
invite users to tag the items.
cons: users are unwilling to do $\implies$ there are not enough tags (bias).
feature is discrete. $\to$ boolean value.
feature is numerical. $\to$ normalization.
create vectors with the same components of item profiles to describe the user's preferences.
It could be derived from utility matrix and item profiles.
normalizate untility matrix. ($[-1,1]$ for cosine distance).
value in user profiles = utility value * corresponding item vectors.
# example 9.4
users_name = ['U', 'V']
items_name = ['F{}'.format(x) for x in range(4)]
features_name = ['Julia Roberts', 'others']
# utility matrix
M_uti = pd.DataFrame([
[3, 4, 5, 0],
[6, 2, 3, 5]
],
index=users_name,
columns=items_name
)
M_uti
F0 | F1 | F2 | F3 | |
---|---|---|---|---|
U | 3 | 4 | 5 | 0 |
V | 6 | 2 | 3 | 5 |
# item profile
M_item = pd.DataFrame(index=items_name, columns=features_name)
M_item.loc[:, features_name[0]] = 1
M_item = M_item.fillna(value=0)
M_item
Julia Roberts | others | |
---|---|---|
F0 | 1 | 0 |
F1 | 1 | 0 |
F2 | 1 | 0 |
F3 | 1 | 0 |
M_uti.apply(lambda x: x - np.mean(x), axis=1)
F0 | F1 | F2 | F3 | |
---|---|---|---|---|
U | 0 | 1 | 2 | -3 |
V | 2 | -2 | -1 | 1 |
M_user = M_uti.fillna(value=0).dot(M_item) / 4 #average = sum/len
M_user
Julia Roberts | others | |
---|---|---|
U | 3 | 0 |
V | 4 | 0 |
to estimate:
$$M_{utility}[user, item] = cosineDistant(M_{user}, M_{item})$$
the more similar, the higher probility to recommend.
classification algorithms:
Recommend or Not (machine learning):
one decision per user $\to$ take too long time to construct.
be used only for relatively small problem size.
# exercise 9.2.1
raw_data = [
[3.06, 2.68, 2.92],
[500, 320, 640],
[6, 4, 6]
]
M_item = pd.DataFrame(raw_data, index=['Processor Speed', 'Disk Size', 'Main-Memory Size'], columns=['A', 'B', 'C'])
# items: A, B, C; features: Processor Speed, Disk Size, ...
M_item
A | B | C | |
---|---|---|---|
Processor Speed | 3.06 | 2.68 | 2.92 |
Disk Size | 500.00 | 320.00 | 640.00 |
Main-Memory Size | 6.00 | 4.00 | 6.00 |
# exercise 9.2.1
# (d)
M_item.apply(lambda x: x / np.mean(x), axis=1)
A | B | C | |
---|---|---|---|
Processor Speed | 1.060046 | 0.928406 | 1.011547 |
Disk Size | 1.027397 | 0.657534 | 1.315068 |
Main-Memory Size | 1.125000 | 0.750000 | 1.125000 |
# exercise 9.2.2
# (a)
M_item.apply(lambda x: x - np.mean(x), axis=1)
A | B | C | |
---|---|---|---|
Processor Speed | 0.173333 | -0.206667 | 0.033333 |
Disk Size | 13.333333 | -166.666667 | 153.333333 |
Main-Memory Size | 0.666667 | -1.333333 | 0.666667 |
# exercise 9.2.3
M_uti = pd.DataFrame([[4, 2, 5]], index=['user'], columns=['A', 'B', 'C'])
M_uti
A | B | C | |
---|---|---|---|
user | 4 | 2 | 5 |
# (a)
M_uti_nor = M_uti.apply(lambda x: x - np.mean(x), axis=1)
M_uti_nor
A | B | C | |
---|---|---|---|
user | 0.333333 | -1.666667 | 1.333333 |
# (b)
M_user = M_item.dot(M_uti_nor.T) / 3
M_user
user | |
---|---|
Processor Speed | 0.148889 |
Disk Size | 162.222222 |
Main-Memory Size | 1.111111 |
logger.setLevel('WARN')
def create_user_profile(utility_matrix, item_profile):
"""Create user profile by combining utility matrix with item profile in 9.2.5 ."""
assert np.array_equal(utility_matrix.columns, item_profile.columns), \
"utility matrix should keep same columns name with item profile."
logger.info('utility_matrix: \n{}\n'.format(utility_matrix))
M_uti_notnull = np.ones(utility_matrix.shape)
M_uti_notnull[utility_matrix.isnull().values] = 0
logger.info('utility_matrix_isnull: \n{}\n'.format(M_uti_notnull))
logger.info('utility_matrix: \n{}\n'.format(item_profile))
M_item_notnull = np.ones(item_profile.shape)
M_item_notnull[item_profile.isnull().values] = 0
logger.info('utility_matrix_isnull: \n{}\n'.format(M_item_notnull))
utility_matrix = utility_matrix.fillna(value=0)
item_profile = item_profile.fillna(value=0)
M_user = item_profile.dot(utility_matrix.T).values / np.dot(M_item_notnull, M_uti_notnull.T)
M_user[np.isinf(M_user)] = np.nan # solve: divide zero
logger.info('M_user: \n{}\n'.format(M_user))
return pd.DataFrame(M_user, index=item_profile.index, columns=utility_matrix.index)
M_uti = pd.DataFrame([[4, 2, 5], [1, np.nan, 3]], index=['userA', 'userB'], columns=['A', 'B', 'C'])
M_uti_nor = M_uti.apply(lambda x: x - np.mean(x), axis=1)
print('utility matrix: \n{}\n'.format(M_uti_nor))
print('item profile: \n{}\n'.format(M_item))
create_user_profile(M_uti_nor, M_item)
utility matrix: A B C userA 0.333333 -1.666667 1.333333 userB -1.000000 NaN 1.000000 item profile: A B C Processor Speed 3.06 2.68 2.92 Disk Size 500.00 320.00 640.00 Main-Memory Size 6.00 4.00 6.00
userA | userB | |
---|---|---|
Processor Speed | 0.148889 | -0.07 |
Disk Size | 162.222222 | 70.00 |
Main-Memory Size | 1.111111 | 0.00 |
# Fig 9.4
M_9_1
HP1 | HP2 | HP3 | TW | SW1 | SW2 | SW3 | |
---|---|---|---|---|---|---|---|
A | 4 | NaN | NaN | 5 | 1 | NaN | NaN |
B | 5 | 5 | 4 | NaN | NaN | NaN | NaN |
C | NaN | NaN | NaN | 2 | 4 | 5 | NaN |
D | NaN | 3 | NaN | NaN | NaN | NaN | 3 |
# rounding the data
M_round = M_9_1.copy()
M_round[M_9_1 <= 2] = np.nan
M_round[M_9_1 > 2] = 1
M_round
HP1 | HP2 | HP3 | TW | SW1 | SW2 | SW3 | |
---|---|---|---|---|---|---|---|
A | 1 | NaN | NaN | 1 | NaN | NaN | NaN |
B | 1 | 1 | 1 | NaN | NaN | NaN | NaN |
C | NaN | NaN | NaN | NaN | 1 | 1 | NaN |
D | NaN | 1 | NaN | NaN | NaN | NaN | 1 |
# normalizing ratings
M_norm = M_9_1.apply(lambda x: x - np.mean(x), axis=1)
M_norm
HP1 | HP2 | HP3 | TW | SW1 | SW2 | SW3 | |
---|---|---|---|---|---|---|---|
A | 0.666667 | NaN | NaN | 1.666667 | -2.333333 | NaN | NaN |
B | 0.333333 | 0.333333 | -0.666667 | NaN | NaN | NaN | NaN |
C | NaN | NaN | NaN | -1.666667 | 0.333333 | 1.333333 | NaN |
D | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0 |
We can use information about users to recommend items, whereas even if we find pairs of similar items, it takes an additional step in order to recommend items to users.
find $n$ similar users $\to$ recommend item $I$ to user $U$.
normalize the utility matrix first.
\begin{align}
M[U,I] &= Ave(M[U,:]) + Ave(M[0:n,I] - Ave(M[0:n,I])) \\
&\approx Ave(M[U,:]) + Std(M[0:n,I])
\end{align}
find $m$ similar items $\to$ recommend item $I$ to user $U$.
$$M[U,I] = Ave(M[U,0:m])$$
in order to recommend items to user $U$, we need to find all or most of entry in $M[U,:]$.
tradeoff:
user-item: find similar users, directly get all predict values of all potent items.
item-item: find similar items, we need calculate all items one by one (additional step) to fill $M[U,:]$.
item-item similarity often provides more reliable information due to the simplicity of items (genre).
precompute preferred items for each user.
utility matrix evolves slowly $\implies$ compute it infrequently and assume that it remains fixed between recomputations.
Items tend to be classifiable in simple terms (eg: genre), whereas the individuals are complex.
Hierachical approach is prefered:
leave many cluster unmerged at first.
cluster items, and average the corresponding value in utility matrix.
cluster users, and average as well.
repeat several times if we like.
Predict $M[U,I]:
$U \in C$, and $I \in D$.
predict:
\begin{equation} M[U,I] = \begin{cases} M_{revised}[C,D] & \quad \text{if existed.} \\ \text{estimate using similar users/items} & \quad \text{otherwise} \end{cases} \end{equation}# Fig 9.8
raw_data = [
[4, 5, np.nan, 5, 1, np.nan, 3, 2],
[np.nan, 3, 4, 3, 1, 2, 1, np.nan],
[2, np.nan, 1, 3, np.nan, 4, 5, 3]
]
import string
M_uti = pd.DataFrame(raw_data, index=list(string.uppercase[:3]), columns=list(string.lowercase[:8]))
M_uti
a | b | c | d | e | f | g | h | |
---|---|---|---|---|---|---|---|---|
A | 4 | 5 | NaN | 5 | 1 | NaN | 3 | 2 |
B | NaN | 3 | 4 | 3 | 1 | 2 | 1 | NaN |
C | 2 | NaN | 1 | 3 | NaN | 4 | 5 | 3 |
logger.setLevel('WARN')
# exercise 9.3.1
from scipy.spatial.distance import jaccard, cosine
from itertools import combinations
def calc_distance_among_matrix(M, func_dis):
for c in list(combinations(M.index, 2)):
logger.info('c: {}'.format(c))
u, v = M.loc[c[0]], M.loc[c[1]]
logger.info('\n u:{},\n v:{}\n'.format(u.values,v.values))
print('{} {}: {}'.format(c, func_dis.__name__, func_dis(u,v)))
# (a)
calc_distance_among_matrix(M_uti.notnull(), jaccard)
('A', 'B') jaccard: 0.5 ('A', 'C') jaccard: 0.5 ('B', 'C') jaccard: 0.5
# (b)
calc_distance_among_matrix(M_uti.fillna(value=0), cosine)
('A', 'B') cosine: 0.398959235991 ('A', 'C') cosine: 0.385081306188 ('B', 'C') cosine: 0.486129880223
# (c)
M_tmp = M_uti.copy()
M_tmp[M_uti < 3] = 0
M_tmp[M_uti >= 3] = 1
calc_distance_among_matrix(M_tmp, jaccard)
('A', 'B') jaccard: 0.714285714286 ('A', 'C') jaccard: 0.75 ('B', 'C') jaccard: 0.875
# (d)
calc_distance_among_matrix(M_tmp.fillna(value=0), cosine)
('A', 'B') cosine: 0.42264973081 ('A', 'C') cosine: 0.5 ('B', 'C') cosine: 0.711324865405
# (e)
M_uti_nor = M_uti.apply(lambda x: x - np.mean(x), axis=1)
M_uti_nor
a | b | c | d | e | f | g | h | |
---|---|---|---|---|---|---|---|---|
A | 0.666667 | 1.666667 | NaN | 1.666667 | -2.333333 | NaN | -0.333333 | -1.333333 |
B | NaN | 0.666667 | 1.666667 | 0.666667 | -1.333333 | -0.333333 | -1.333333 | NaN |
C | -1.000000 | NaN | -2.000000 | 0.000000 | NaN | 1.000000 | 2.000000 | 0.000000 |
# (f)
calc_distance_among_matrix(M_uti_nor.fillna(value=0), cosine)
('A', 'B') cosine: 0.415693452532 ('A', 'C') cosine: 1.11547005384 ('B', 'C') cosine: 1.73957399695
# exercise 9.3.2
#todo
UV-decomposition: $$M = U \times V$$
measure: RMSE (root-mean-square error)
Preprocessing of the matrix $M$.
normalization:
Initializing $U$ and $V$.
choice: gives the elements of $UV$ the average of the nonblank elements of $M$.
$\implies$ the element of $U$ and $V$ should be $\sqrt{a/d}$,
where $a$ is the average nonblank element of $M$, $d$ is the lengths of the short sides of $U$ and $V$.
local minima contains global minima:
Performing the Optimization.
different optimization path:
choose a permutation of the elements and follow that order for every round.
Gradient Descent $\to$ stochastic gradient descent.
Converging to a Minimum.
track the amount of improvement in the RMSE obtained.
stop condition:
solutions:
optimized by only moving the value of a component a fraction of the way from its current value toward its optimized value.
Stop before the process has converged.
Take several different $UV$ decompositions, and average their predicts.
# exercise 9.4.6
#todo
some facts:
CineMatch was not a very good algorithm.
UV-decomposition alorithm given a 7% improvement over CineMatch when couped with normalization and a few other tricks.
Combing different algorithms is a preferred strategy.
Genre and other information in IMDB was no useful.
Time of rating turned out to be useful: upward or downward slope with time.
todo: read the papers introduced in the chapter.