Collaborative filtering with side information¶

This IPython notebook illustrates the usage of the cmfrec Python package for collective matrix factorization using the MovieLens-100k data, consisting of ratings from users about movies + user demographic information + movie genres.

Collective matrix factorization is a technique for collaborative filtering with additional information about the users and items, based on low-rank joint factorization of different matrices with shared factors. For more details, see the paper: Singh, A. P., & Gordon, G. J. (2008, August). Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 650-658). ACM.

Sections¶

1. Basic model - only movie ratings

1.1 Loading the ratings data
1.2 Train and test split
1.3 Fitting and evaluating the model

2. Adding movie genres

2.1 Loading the movie genres info
2.2 Fitting and evaluating model with genres

3. Adding movie genres and user demographic info

3.1 Loading the user demographic info
3.2 Getting user region from zip codes
3.3 Fitting and evaluating the full model

4. Comparing recommendations

1. Basic model - only movie ratings¶

As a starting point, I'll first try the basic low-rank factorization model using ratings data alone - that is, trying to minimize the following function:

$$ Loss\:(U, V) = \lVert X - UV^T\rVert^2\: + \:\lambda\: (\lVert U \rVert^2 + \lVert V \rVert^2) $$

Where $U$ and $V$ are lower-dimensional matrices mapping users and items into a latent space - this is the classic model popularized by Funk. The predicted rating from this model for a given user $i$ and movie $j$ can be calculated as $U[i,:]*V[j,:]^T$
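
To make the formula concrete, here is a minimal NumPy sketch of the loss and of the prediction rule. It is only an illustration, not how the cmfrec package computes things internally: the function names `mf_loss` and `predict_rating` are made up for this example, and I'm assuming that the squared error is summed only over the ratings that are actually observed (missing ones are marked as NaN).

```python
import numpy as np

def mf_loss(X, U, V, lam):
    """Regularized squared error over the observed (non-NaN) entries of X."""
    observed = ~np.isnan(X)
    # Missing ratings contribute nothing to the error term (an assumption of this sketch)
    residual = np.where(observed, X - U @ V.T, 0.0)
    return np.sum(residual ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))

def predict_rating(U, V, i, j):
    """Predicted rating of user i for item j: dot product of their factor rows."""
    return U[i, :] @ V[j, :]

# Toy data: 3 users, 2 items, 2 latent factors
X = np.array([[5.0, np.nan],
              [np.nan, 3.0],
              [4.0, 1.0]])
U = np.array([[1.0, 2.0], [0.5, 1.0], [2.0, 0.0]])
V = np.array([[2.0, 1.0], [1.0, 0.5]])

print(predict_rating(U, V, 0, 0))   # 1*2 + 2*1 = 4.0
print(mf_loss(X, U, V, lam=0.1))    # squared errors 1+4+0+1 = 6, plus 0.1*16.5
```

Fitting the model means searching for the $U$ and $V$ that make this number as small as possible.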

1.1 Loading the ratings data¶

Here I'll load the MovieLens-100k ratings data, which can be downloaded from the link presented at the beginning:

In [1]:

import pandas as pd, time
from datetime import datetime

ratings=pd.read_table('D:\\Downloads\\movielens\\ml-100k\\ml-100k\\u.data',sep='\t',engine='python',names=['UserId','ItemId','Rating','Timestamp'])
ratings['Timestamp']=ratings.Timestamp.map(lambda x: datetime(*time.localtime(x)[:6])).map(lambda x: pd.to_datetime(x))
ratings=ratings.sort_values(['UserId','ItemId']).reset_index(drop=True)
ratings.head()

Out[1]:

	UserId	ItemId	Rating	Timestamp
0	1	1	5	1997-09-23 01:02:38
1	1	2	3	1997-10-15 08:26:11
2	1	3	4	1997-11-03 09:42:40
3	1	4	3	1997-10-15 08:25:19
4	1	5	3	1998-03-13 03:15:12

1.2 Train and test split¶

In order to evaluate the model, I'll create a train and test set split to use throughout the whole notebook. As this kind of model can only recommend items that were in the training set to users who also were in the training set, I'll make the test set contain only elements that were present in the train set.

In order to make this more realistic, I'll make it a temporal split, i.e. splitting the ratings into those that were submitted before and after a certain time cutoff.

In [2]:

time_cutoff='1998-01-01'
train=ratings.loc[ratings.Timestamp<=time_cutoff]
test=ratings.loc[ratings.Timestamp>time_cutoff]
users_train=set(list(train.UserId))
items_train=set(list(train.ItemId))
test=test.loc[test.UserId.map(lambda x: x in users_train)]
test=test.loc[test.ItemId.map(lambda x: x in items_train)]
print(train.shape)
print(test.shape)

(52884, 4)
(5835, 4)

Note that this is a very small sample; in a typical setting you would have 3 or 4 orders of magnitude more data. Nevertheless, this smallish data is enough to see a difference between models.

In [3]:

print(len(users_train))
print(len(items_train))

529
1493

1.3 Fitting and evaluating the model¶

Traditionally, recommendations have been evaluated by their cross-validated RMSE (root mean squared error), but this is not really a good metric, as lower values might not translate into better-liked recommendations. There are many additional metrics that can be used, but to keep this example simple, I'll evaluate the rating that users would have given to the Top-5 recommendations from this model and compare this to recommending the best-rated items and to random recommendations.

In [4]:

from cmfrec import CMF
import numpy as np

# Number of latent factors
k=40

# Regularization parameter
reg=10

# Fitting the model
rec=CMF(k=k, reg_param=reg)
rec.fit(train, random_seed=12345)

# Making predictions
test['Predicted']=test.apply(lambda x: rec.predict(x['UserId'],x['ItemId']),axis=1)

******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

Evaluating Hold-Out RMSE (the hyperparameters had already been somewhat tuned by cross-validation)

In [5]:

np.sqrt(np.mean((test.Predicted-test.Rating)**2))

Out[5]:

1.2647716220817762

Basic evaluation of this model:

In [6]:

avg_ratings=train.groupby('ItemId')['Rating'].mean().to_frame().rename(columns={"Rating":"AvgRating"})
test2=pd.merge(test,avg_ratings,left_on='ItemId',right_index=True,how='left')

print('Average movie rating:',test2.groupby('UserId')['Rating'].mean().mean())
print('Average rating for top-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=True).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for top-5 recommendations of best-rated movies:',test2.sort_values(['UserId','AvgRating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('----------------------')
print('Average rating for top-5 recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 (non-)recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=True).groupby('UserId')['Rating'].head(5).mean())

Average movie rating: 3.5602718818211856
Average rating for top-5 rated by each user: 4.5298621745788665
Average rating for bottom-5 rated by each user: 2.246554364471669
Average rating for top-5 recommendations of best-rated movies: 4.029096477794793
----------------------
Average rating for top-5 recommendations from this model: 4.016845329249617
Average rating for bottom-5 (non-)recommendations from this model: 3.116385911179173

The recommendations from this model are not bad, but the average rating of its Top-5 (4.017) doesn't manage to beat simply recommending the best-rated movies (4.029)! This is not surprising given the small size of the ratings data, though.
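
Since the same Top-5 evaluation is repeated for every model in this notebook, it could be wrapped in a small helper. The sketch below is one way to do it; `avg_rating_top_n` is a name I made up for illustration, and it assumes a data frame with the columns UserId and Rating plus one score column to rank by, as in the evaluation above.

```python
import pandas as pd

def avg_rating_top_n(df, score_col, n=5, ascending=False):
    """Mean actual rating over each user's n best (or worst) rows ranked by score_col."""
    ranked = df.sort_values(['UserId', score_col], ascending=ascending)
    # head(n) on the grouped column keeps the first n rows of every user
    return ranked.groupby('UserId')['Rating'].head(n).mean()

# Toy example with two users and three rated movies each
toy = pd.DataFrame({
    'UserId':    [1, 1, 1, 2, 2, 2],
    'Rating':    [5, 3, 1, 4, 2, 2],
    'Predicted': [4.5, 2.0, 3.0, 1.0, 3.5, 3.0],
})
print(avg_rating_top_n(toy, 'Predicted', n=2))   # ratings picked: 5, 1, 2, 2
print(avg_rating_top_n(toy, 'Rating', n=2))      # best achievable: 5, 3, 4, 2
```

With the notebook's data, `avg_rating_top_n(test2, 'Predicted')` would correspond to the "top-5 recommendations from this model" line, and passing `ascending=True` to the bottom-5 line.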

2. Adding movie genres¶

The previous model can be extended by adding some additional information about the movies - this can be done by also factorizing the movie-genre matrix and sharing the item-factor matrix in the factorization of the user-item ratings. Now the model becomes:

$$ Loss\:(U, V, Z) = \lVert X - UV^T\rVert^2\: + \:\lVert M-VZ^T \rVert^2\: + \:\lambda\: (\lVert U \rVert^2 + \lVert V \rVert^2 + \lVert Z \rVert^2) $$

Where $U$, $V$ and $Z$ are lower-dimensional matrices mapping users, items and genres into a latent space. The predicted rating from this model for a given user $i$ and movie $j$ is still calculated the same as before: $U[i,:]*V[j,:]^T$. However, we can intuitively think that an item-factor matrix that also represents genres might be better than one that does not, and less likely to overfit, as these factors are not so free.

The matrix $V$, however, doesn't need to be exactly the same in both terms - we can also add some additional factors that appear in only one factorization, resulting in the following formula:

$$ Loss\:(U, V, Z) = \lVert X - UV_{main}^T\rVert^2\: + \:\lVert M-V_{sec}Z^T \rVert^2\: + \:\lambda\: (\lVert U \rVert^2 + \lVert V \rVert^2 + \lVert Z \rVert^2) $$

Where $ V_{main} = V_{[\cdot,\:1\:to\:k_{main} + k_{shared}]}$ and $V_{sec} = V_{[\cdot,\:k_{main} +1 \:to\: k_{main} + k_{shared} + k_{sec}]}$ - that is, with items as the rows of $V$, the first $k_{main}$ columns are used only for the ratings, the next $k_{shared}$ columns are shared by both factorizations, and the last $k_{sec}$ columns are used only for the genres.
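
The slicing can be sketched in NumPy as follows. This is only an illustration under my own assumptions: items are the rows of $V$, its columns are laid out as [ratings-only | shared | genres-only], and the sizes are toy values rather than the ones used later in this notebook.

```python
import numpy as np

# Toy sizes, chosen only for illustration
k_main, k_shared, k_sec = 2, 3, 1
n_items = 4

rng = np.random.default_rng(0)
V = rng.normal(size=(n_items, k_main + k_shared + k_sec))

V_main = V[:, :k_main + k_shared]   # columns multiplied with U to reconstruct the ratings
V_sec = V[:, k_main:]               # columns multiplied with Z to reconstruct the genres

# The middle k_shared columns appear in both slices: these are the shared factors
shared_in_main = V_main[:, k_main:]
shared_in_sec = V_sec[:, :k_shared]
print(V_main.shape, V_sec.shape)
print(np.array_equal(shared_in_main, shared_in_sec))
```

Because both slices are views of the same matrix, any update to the shared columns affects both the ratings reconstruction and the genre reconstruction, which is what ties the two factorizations together.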

2.1 Loading the movie genres info¶

The MovieLens-100k data also comes with a file containing movie information that we can use to enhance the model - note that the package requires the item side information to have a column named ItemId when you pass it to the API. If your data doesn't require any reindexing, you can also pass it as a numpy array and set the option reindex to False.

In [7]:

colnames=['ItemId','Title','ReleaseDate','Sep','Link']+['genre'+str(i) for i in range(19)]
genres=pd.read_table('D:\\Downloads\\movielens\\ml-100k\\ml-100k\\u.item',sep="|",engine='python',names=colnames)

# will save the movie titles for later
movie_id_to_title={i.ItemId:i.Title for i in genres.itertuples()}

genres=genres[['ItemId']+['genre'+str(i) for i in range(19)]]
genres.head()

Out[7]:

	ItemId	genre1	genre2	genre3	genre4	genre5	genre6	genre8	genre16
0	1	0	0	1	1	1	0	0	0
1	2	1	1	0	0	0	0	0	1
2	3	0	0	0	0	0	0	0	1
3	4	1	0	0	0	1	0	1	0
4	5	0	0	0	0	0	1	1	1

2.2 Fitting and evaluating model with genres¶

These hyperparameters (number of factors and regularization) were also somewhat tuned beforehand:

In [8]:

# Number of latent factors
k=30
k_main=10
k_sec=10

# Regularization parameter
reg=10

# Fitting the model
rec2=CMF(k=k, k_main=k_main, k_item=k_sec, reg_param=reg)
rec2.fit(train, genres, random_seed=10000)

# Making predictions
test['Predicted']=test.apply(lambda x: rec2.predict(x['UserId'],x['ItemId']),axis=1)

RMSE now:

In [9]:

np.sqrt(np.mean((test.Predicted-test.Rating)**2))

Out[9]:

1.2610262136540786

Same evaluation as before:

In [10]:

test2=pd.merge(test,avg_ratings,left_on='ItemId',right_index=True,how='left')

print('Average movie rating:',test2.groupby('UserId')['Rating'].mean().mean())
print('Average rating for top-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=True).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for top-5 recommendations of best-rated movies:',test2.sort_values(['UserId','AvgRating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('----------------------')
print('Average rating for top-5 recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 (non-)recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=True).groupby('UserId')['Rating'].head(5).mean())

Average movie rating: 3.5602718818211856
Average rating for top-5 rated by each user: 4.5298621745788665
Average rating for bottom-5 rated by each user: 2.246554364471669
Average rating for top-5 recommendations of best-rated movies: 4.029096477794793
----------------------
Average rating for top-5 recommendations from this model: 4.03062787136294
Average rating for bottom-5 (non-)recommendations from this model: 3.113323124042879

Now we see a bit of an improvement - it's not large, but this time the personalized recommendations get a higher average rating than recommending the best-rated movies (4.031 vs. 4.029), with as little as 50k ratings.

Knowing only these generic genres shouldn't be a complete game changer, so a small gain is expected.

3. Adding movie genres and user demographic info¶

The previous model can be extended to incorporate user information in the same way as it added movie genres:

$$ Loss\:(U, V, Z, P) = \lVert X - UV^T\rVert^2\: + \:\lVert M-VZ^T \rVert^2\: + \:\lVert Q-UP^T \rVert^2\: + \:\lambda\: (\lVert U \rVert^2 + \lVert V \rVert^2 + \lVert Z \rVert^2 + \lVert P \rVert^2) $$

Where $Q$ is the user attribute matrix and $P$ is the new attribute-factor matrix - same as before, some of the factors can be shared and some be specific to one factorization.
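
As a rough illustration of what is being minimized, the sketch below evaluates this loss with NumPy for the simple case in which all factors are shared and all matrices are fully observed. The function name `collective_loss` and the toy matrices are assumptions made for this example; the package's actual objective may differ in details such as weights and the handling of missing entries.

```python
import numpy as np

def sq_norm(A):
    """Sum of squared entries (squared Frobenius norm)."""
    return np.sum(A ** 2)

def collective_loss(X, M, Q, U, V, Z, P, lam):
    """Three reconstruction errors that share U and V, plus L2 regularization."""
    fit = sq_norm(X - U @ V.T) + sq_norm(M - V @ Z.T) + sq_norm(Q - U @ P.T)
    return fit + lam * (sq_norm(U) + sq_norm(V) + sq_norm(Z) + sq_norm(P))

# Toy factors: 3 users, 4 items, 5 genres, 6 user attributes, 2 latent factors
U, V = np.ones((3, 2)), np.ones((4, 2))
Z, P = np.ones((5, 2)), np.ones((6, 2))

# Data built exactly from the factors, so only the regularization term remains
X, M, Q = U @ V.T, V @ Z.T, U @ P.T
print(collective_loss(X, M, Q, U, V, Z, P, lam=0.5))   # 0.5 * (6 + 8 + 10 + 12) = 18.0
```

Note how $U$ appears in both the ratings term and the user-attribute term, and $V$ in both the ratings term and the genre term: the side information constrains the same factors that produce the predicted ratings.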

Intuitively, since in a typical setting there are usually more users than items (not in this particular example though), and each user has on average fewer rated movies than movies have users rating them, it would be logical to assume that detailed user information should be more valuable than detailed item information.

3.1 Loading the user demographic info¶

The MovieLens-100k data also comes with user demographic information - same as before, the data frame passed to the package API should have a column named UserId:

In [11]:

user_info=pd.read_table('D:\\Downloads\\movielens\\ml-100k\\ml-100k\\u.user',sep="|",engine='python',
                        names=['UserId','Age','Gender','Occupation','Zipcode'])
user_info.head()

Out[11]:

	UserId	Age	Gender	Occupation	Zipcode
0	1	24	M	technician	85711
1	2	53	F	other	94043
2	3	23	M	writer	32067
3	4	24	M	technician	43537
4	5	33	F	other	15213

This time, unfortunately, not all the information can be used as it is in the file. The zip code can still provide valuable information if we can link it to a broader geographical area. As these are mostly US users, I'll try to link it to US regions here.

In order to do so, I’m using a publicly available table mapping zip codes to states, another one mapping state names to their abbreviations, and finally classifying the states into regions according to usual definitions.

3.2 Getting user region from zip codes¶

In [12]:

import re

zipcode_abbs=pd.read_csv("D:\\Downloads\\movielens\\zips\\states.csv")
zipcode_abbs_dct={z.State:z.Abbreviation for z in zipcode_abbs.itertuples()}
us_regs_table=[
    ('New England', 'Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont'),
    ('Middle Atlantic', 'Delaware, Maryland, New Jersey, New York, Pennsylvania'),
    ('South', 'Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, Missouri, North Carolina, South Carolina, Tennessee, Virginia, West Virginia'),
    ('Midwest', 'Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin'),
    ('Southwest', 'Arizona, New Mexico, Oklahoma, Texas'),
    ('West', 'Alaska, California, Colorado, Hawaii, Idaho, Montana, Nevada, Oregon, Utah, Washington, Wyoming')
    ]
us_regs_table=[(x[0],[i.strip() for i in x[1].split(",")]) for x in us_regs_table]
us_regs_dct=dict()
for r in us_regs_table:
    for s in r[1]:
        us_regs_dct[zipcode_abbs_dct[s]]=r[0]
        
zipcode_info=pd.read_csv("D:\\Downloads\\movielens\\free-zipcode-database.csv")
zipcode_info=zipcode_info.groupby('Zipcode').first().reset_index()
# assign through .loc on the frame itself: chained assignment (df['col'].loc[mask]=...) may not update it
is_us=zipcode_info.Country=="US"
zipcode_info.loc[~is_us,'State']='UnknownOrNonUS'
zipcode_info['Region']=zipcode_info['State'].copy()
zipcode_info.loc[is_us,'Region']=zipcode_info.loc[is_us,'Region'].map(lambda x: us_regs_dct.get(x,'UsOther'))
zipcode_info=zipcode_info[['Zipcode', 'Region']]
zipcode_info.head()

def process_zip(zp):
    # non-numeric zip codes become missing values (np.int was removed from NumPy, so use the builtin int)
    try:
        return int(zp)
    except (ValueError, TypeError):
        return np.nan

user_info["Zipcode"]=user_info.Zipcode.map(process_zip)
user_info=pd.merge(user_info,zipcode_info,on='Zipcode',how='left')
user_info['Region']=user_info.Region.fillna('UnknownOrNonUS')

user_info=pd.get_dummies(user_info[['UserId','Age','Gender','Occupation','Region']])
users_w_side_info=set(list(user_info.UserId))
ratings=ratings.loc[ratings.UserId.map(lambda x: x in users_w_side_info)]

user_info.head()

C:\Users\david\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (11) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Out[12]:

	UserId	Age	Gender_F	Gender_M	...	Occupation_technician	Occupation_writer	Region_Middle Atlantic	Region_Midwest	Region_South	Region_Southwest	Region_West
0	1	24	0	1	...	1	0	0	0	0	1	0
1	2	53	1	0	...	0	0	0	0	0	0	1
2	3	23	0	1	...	0	1	0	0	1	0	0
3	4	24	0	1	...	1	0	0	1	0	0	0
4	5	33	1	0	...	0	0	1	0	0	0	0

5 rows × 33 columns

3.3 Fitting and evaluating the full model¶

Adding explicit information gives the latent factors a more solid base, so fewer of them are needed the more side info there is available.

In [13]:

# Number of latent factors
k=30
k_main=5
k_genre=5
k_demo=5

# Regularization parameter
reg=50

# This time I'll weight the ratings matrix higher
w_main=4

# Fitting the model
rec3=CMF(k=k, k_main=k_main, k_item=k_genre, k_user=k_demo, w_main=w_main, reg_param=reg)
rec3.fit(train, genres, user_info, random_seed=32545)

# Making predictions
test['Predicted']=test.apply(lambda x: rec3.predict(x['UserId'],x['ItemId']),axis=1)

Same metrics as before:

In [14]:

np.sqrt(np.mean((test.Predicted-test.Rating)**2))

Out[14]:

1.2433900285807755

In [15]:

test2=pd.merge(test,avg_ratings,left_on='ItemId',right_index=True,how='left')

print('Average movie rating:',test2.groupby('UserId')['Rating'].mean().mean())
print('Average rating for top-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=True).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for top-5 recommendations of best-rated movies:',test2.sort_values(['UserId','AvgRating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('----------------------')
print('Average rating for top-5 recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 (non-)recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=True).groupby('UserId')['Rating'].head(5).mean())

Average movie rating: 3.5602718818211856
Average rating for top-5 rated by each user: 4.5298621745788665
Average rating for bottom-5 rated by each user: 2.246554364471669
Average rating for top-5 recommendations of best-rated movies: 4.029096477794793
----------------------
Average rating for top-5 recommendations from this model: 4.062787136294028
Average rating for bottom-5 (non-)recommendations from this model: 3.120980091883614

This time the improvement is bigger: the hold-out RMSE drops from 1.261 to 1.243 and the average rating of the Top-5 recommendations rises from 4.031 to 4.063 - just by adding the most basic demographic information!

4. Comparing recommendations¶

Now let's see what each of these models recommends to some randomly picked users, along with the overall item popularity:

In [16]:

# aggregate statistics
avg_movie_rating=train.groupby('ItemId')['Rating'].mean()
num_ratings_per_movie=train.groupby('ItemId')['Rating'].size()

# function to print recommended lists more nicely
def print_reclist(reclist):
    list_w_info=[str(m+1)+") - "+movie_id_to_title[reclist[m]]+\
        " - Average Rating: "+str(np.round(avg_movie_rating[reclist[m]],2))+\
        " - Number of ratings: "+str(num_ratings_per_movie[reclist[m]]) for m in range(len(reclist))]
    print("\n".join(list_w_info))

In [17]:

# user 1
reclist1=rec.top_n(UserId=1, n=20)
reclist2=rec2.top_n(UserId=1, n=20)
reclist3=rec3.top_n(UserId=1, n=20)

print('Recommendations from ratings-only model:')
print_reclist(reclist1)
print("------")
print('Recommendations from ratings + genre model:')
print_reclist(reclist2)
print("------")
print('Recommendations from ratings + genre + demographics model:')
print_reclist(reclist3)

Recommendations from ratings-only model:
1) - Fargo (1996) - Average Rating: 4.23 - Number of ratings: 301
2) - Star Wars (1977) - Average Rating: 4.34 - Number of ratings: 335
3) - Toy Story (1995) - Average Rating: 3.92 - Number of ratings: 267
4) - Graduate, The (1967) - Average Rating: 4.19 - Number of ratings: 134
5) - Big Night (1996) - Average Rating: 3.86 - Number of ratings: 103
6) - Wrong Trousers, The (1993) - Average Rating: 4.59 - Number of ratings: 68
7) - Princess Bride, The (1987) - Average Rating: 4.23 - Number of ratings: 184
8) - 12 Angry Men (1957) - Average Rating: 4.3 - Number of ratings: 71
9) - Antonia's Line (1995) - Average Rating: 4.07 - Number of ratings: 43
10) - Shawshank Redemption, The (1994) - Average Rating: 4.56 - Number of ratings: 174
11) - Dead Man Walking (1995) - Average Rating: 3.94 - Number of ratings: 185
12) - Chasing Amy (1997) - Average Rating: 3.77 - Number of ratings: 119
13) - Raiders of the Lost Ark (1981) - Average Rating: 4.32 - Number of ratings: 238
14) - Full Monty, The (1997) - Average Rating: 4.06 - Number of ratings: 135
15) - Godfather, The (1972) - Average Rating: 4.28 - Number of ratings: 237
16) - Empire Strikes Back, The (1980) - Average Rating: 4.24 - Number of ratings: 201
17) - Swingers (1996) - Average Rating: 3.93 - Number of ratings: 91
18) - Chasing Amy (1997) - Average Rating: 4.03 - Number of ratings: 68
19) - Blade Runner (1982) - Average Rating: 4.11 - Number of ratings: 151
20) - Nikita (La Femme Nikita) (1990) - Average Rating: 4.13 - Number of ratings: 61
------
Recommendations from ratings + genre model:
1) - Toy Story (1995) - Average Rating: 3.92 - Number of ratings: 267
2) - Fargo (1996) - Average Rating: 4.23 - Number of ratings: 301
3) - Star Wars (1977) - Average Rating: 4.34 - Number of ratings: 335
4) - Shawshank Redemption, The (1994) - Average Rating: 4.56 - Number of ratings: 174
5) - Princess Bride, The (1987) - Average Rating: 4.23 - Number of ratings: 184
6) - Big Night (1996) - Average Rating: 3.86 - Number of ratings: 103
7) - Dead Man Walking (1995) - Average Rating: 3.94 - Number of ratings: 185
8) - Raiders of the Lost Ark (1981) - Average Rating: 4.32 - Number of ratings: 238
9) - Graduate, The (1967) - Average Rating: 4.19 - Number of ratings: 134
10) - Wrong Trousers, The (1993) - Average Rating: 4.59 - Number of ratings: 68
11) - Full Monty, The (1997) - Average Rating: 4.06 - Number of ratings: 135
12) - Godfather, The (1972) - Average Rating: 4.28 - Number of ratings: 237
13) - Chasing Amy (1997) - Average Rating: 3.77 - Number of ratings: 119
14) - 12 Angry Men (1957) - Average Rating: 4.3 - Number of ratings: 71
15) - Antonia's Line (1995) - Average Rating: 4.07 - Number of ratings: 43
16) - Chasing Amy (1997) - Average Rating: 4.03 - Number of ratings: 68
17) - Empire Strikes Back, The (1980) - Average Rating: 4.24 - Number of ratings: 201
18) - Monty Python and the Holy Grail (1974) - Average Rating: 4.14 - Number of ratings: 183
19) - Return of the Jedi (1983) - Average Rating: 3.99 - Number of ratings: 300
20) - Welcome to the Dollhouse (1995) - Average Rating: 3.86 - Number of ratings: 69
------
Recommendations from ratings + genre + demographics model:
1) - Fargo (1996) - Average Rating: 4.23 - Number of ratings: 301
2) - Star Wars (1977) - Average Rating: 4.34 - Number of ratings: 335
3) - Shawshank Redemption, The (1994) - Average Rating: 4.56 - Number of ratings: 174
4) - Toy Story (1995) - Average Rating: 3.92 - Number of ratings: 267
5) - Wrong Trousers, The (1993) - Average Rating: 4.59 - Number of ratings: 68
6) - Chasing Amy (1997) - Average Rating: 3.77 - Number of ratings: 119
7) - Raiders of the Lost Ark (1981) - Average Rating: 4.32 - Number of ratings: 238
8) - Princess Bride, The (1987) - Average Rating: 4.23 - Number of ratings: 184
9) - Godfather, The (1972) - Average Rating: 4.28 - Number of ratings: 237
10) - Swingers (1996) - Average Rating: 3.93 - Number of ratings: 91
11) - Empire Strikes Back, The (1980) - Average Rating: 4.24 - Number of ratings: 201
12) - Big Night (1996) - Average Rating: 3.86 - Number of ratings: 103
13) - Full Monty, The (1997) - Average Rating: 4.06 - Number of ratings: 135
14) - 12 Angry Men (1957) - Average Rating: 4.3 - Number of ratings: 71
15) - Monty Python and the Holy Grail (1974) - Average Rating: 4.14 - Number of ratings: 183
16) - Pulp Fiction (1994) - Average Rating: 4.16 - Number of ratings: 225
17) - Chasing Amy (1997) - Average Rating: 4.03 - Number of ratings: 68
18) - Dead Man Walking (1995) - Average Rating: 3.94 - Number of ratings: 185
19) - Return of the Jedi (1983) - Average Rating: 3.99 - Number of ratings: 300
20) - Nikita (La Femme Nikita) (1990) - Average Rating: 4.13 - Number of ratings: 61

In [18]:

# user 943
reclist1=rec.top_n(UserId=943, n=20)
reclist2=rec2.top_n(UserId=943, n=20)
reclist3=rec3.top_n(UserId=943, n=20)

print('Recommendations from ratings-only model:')
print_reclist(reclist1)
print("------")
print('Recommendations from ratings + genre model:')
print_reclist(reclist2)
print("------")
print('Recommendations from ratings + genre + demographics model:')
print_reclist(reclist3)

Recommendations from ratings-only model:
1) - Godfather, The (1972) - Average Rating: 4.28 - Number of ratings: 237
2) - Fargo (1996) - Average Rating: 4.23 - Number of ratings: 301
3) - Shawshank Redemption, The (1994) - Average Rating: 4.56 - Number of ratings: 174
4) - Star Wars (1977) - Average Rating: 4.34 - Number of ratings: 335
5) - Courage Under Fire (1996) - Average Rating: 3.58 - Number of ratings: 137
6) - Rock, The (1996) - Average Rating: 3.72 - Number of ratings: 227
7) - Raiders of the Lost Ark (1981) - Average Rating: 4.32 - Number of ratings: 238
8) - Time to Kill, A (1996) - Average Rating: 3.67 - Number of ratings: 138
9) - Return of the Jedi (1983) - Average Rating: 3.99 - Number of ratings: 300
10) - People vs. Larry Flynt, The (1996) - Average Rating: 3.69 - Number of ratings: 123
11) - Trainspotting (1996) - Average Rating: 3.93 - Number of ratings: 164
12) - Mission: Impossible (1996) - Average Rating: 3.37 - Number of ratings: 209
13) - Apollo 13 (1995) - Average Rating: 3.92 - Number of ratings: 160
14) - Dead Man Walking (1995) - Average Rating: 3.94 - Number of ratings: 185
15) - Independence Day (ID4) (1996) - Average Rating: 3.47 - Number of ratings: 258
16) - Lone Star (1996) - Average Rating: 3.98 - Number of ratings: 119
17) - Rumble in the Bronx (1995) - Average Rating: 3.45 - Number of ratings: 116
18) - River Wild, The (1994) - Average Rating: 3.23 - Number of ratings: 84
19) - Truth About Cats & Dogs, The (1996) - Average Rating: 3.51 - Number of ratings: 170
20) - Broken Arrow (1996) - Average Rating: 3.04 - Number of ratings: 158
------
Recommendations from ratings + genre model:
1) - Fargo (1996) - Average Rating: 4.23 - Number of ratings: 301
2) - Godfather, The (1972) - Average Rating: 4.28 - Number of ratings: 237
3) - Shawshank Redemption, The (1994) - Average Rating: 4.56 - Number of ratings: 174
4) - Star Wars (1977) - Average Rating: 4.34 - Number of ratings: 335
5) - Courage Under Fire (1996) - Average Rating: 3.58 - Number of ratings: 137
6) - Raiders of the Lost Ark (1981) - Average Rating: 4.32 - Number of ratings: 238
7) - Rock, The (1996) - Average Rating: 3.72 - Number of ratings: 227
8) - Time to Kill, A (1996) - Average Rating: 3.67 - Number of ratings: 138
9) - Return of the Jedi (1983) - Average Rating: 3.99 - Number of ratings: 300
10) - Mission: Impossible (1996) - Average Rating: 3.37 - Number of ratings: 209
11) - People vs. Larry Flynt, The (1996) - Average Rating: 3.69 - Number of ratings: 123
12) - Trainspotting (1996) - Average Rating: 3.93 - Number of ratings: 164
13) - Apollo 13 (1995) - Average Rating: 3.92 - Number of ratings: 160
14) - Dead Man Walking (1995) - Average Rating: 3.94 - Number of ratings: 185
15) - Lone Star (1996) - Average Rating: 3.98 - Number of ratings: 119
16) - Independence Day (ID4) (1996) - Average Rating: 3.47 - Number of ratings: 258
17) - Rumble in the Bronx (1995) - Average Rating: 3.45 - Number of ratings: 116
18) - River Wild, The (1994) - Average Rating: 3.23 - Number of ratings: 84
19) - Broken Arrow (1996) - Average Rating: 3.04 - Number of ratings: 158
20) - Truth About Cats & Dogs, The (1996) - Average Rating: 3.51 - Number of ratings: 170
------
Recommendations from ratings + genre + demographics model:
1) - Godfather, The (1972) - Average Rating: 4.28 - Number of ratings: 237
2) - Star Wars (1977) - Average Rating: 4.34 - Number of ratings: 335
3) - Fargo (1996) - Average Rating: 4.23 - Number of ratings: 301
4) - Shawshank Redemption, The (1994) - Average Rating: 4.56 - Number of ratings: 174
5) - Rock, The (1996) - Average Rating: 3.72 - Number of ratings: 227
6) - Raiders of the Lost Ark (1981) - Average Rating: 4.32 - Number of ratings: 238
7) - Courage Under Fire (1996) - Average Rating: 3.58 - Number of ratings: 137
8) - Return of the Jedi (1983) - Average Rating: 3.99 - Number of ratings: 300
9) - Time to Kill, A (1996) - Average Rating: 3.67 - Number of ratings: 138
10) - Mission: Impossible (1996) - Average Rating: 3.37 - Number of ratings: 209
11) - Trainspotting (1996) - Average Rating: 3.93 - Number of ratings: 164
12) - Apollo 13 (1995) - Average Rating: 3.92 - Number of ratings: 160
13) - People vs. Larry Flynt, The (1996) - Average Rating: 3.69 - Number of ratings: 123
14) - Dead Man Walking (1995) - Average Rating: 3.94 - Number of ratings: 185
15) - Independence Day (ID4) (1996) - Average Rating: 3.47 - Number of ratings: 258
16) - Rumble in the Bronx (1995) - Average Rating: 3.45 - Number of ratings: 116
17) - Lone Star (1996) - Average Rating: 3.98 - Number of ratings: 119
18) - Truth About Cats & Dogs, The (1996) - Average Rating: 3.51 - Number of ratings: 170
19) - Broken Arrow (1996) - Average Rating: 3.04 - Number of ratings: 158
20) - Happy Gilmore (1996) - Average Rating: 3.24 - Number of ratings: 93
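
The lists above are long, so it can be hard to judge by eye how much the models actually disagree. One simple way to put a number on it is the Jaccard overlap between two recommendation lists; the helper `list_overlap` below is a hypothetical function written for this illustration, not part of the package, and it ignores the order of the items.

```python
def list_overlap(list_a, list_b):
    """Jaccard overlap: shared items divided by all distinct items in both lists."""
    a, b = set(list_a), set(list_b)
    if not (a | b):
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)

# Toy example with made-up item IDs
print(list_overlap([1, 2, 3, 4], [3, 4, 5, 6]))   # 2 shared out of 6 distinct items
print(list_overlap([1, 2, 3], [3, 2, 1]))         # same items in a different order
```

In this notebook it could be called as `list_overlap(reclist1, reclist3)` to compare the ratings-only model against the full model for a given user. A value close to 1 means that the side information mostly reorders the same movies rather than surfacing new ones.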
