movielens_df: pd.DataFrame = load_movielens()
movielens_df.head(5)
index | user_id | movie_title | rating |
---|---|---|---|
36649 | User 742 | Jerry Maguire (1996) | 4 |
2478 | User 908 | Usual Suspects, The (1995) | 3 |
82838 | User 758 | Real Genius (1985) | 4 |
69729 | User 393 | Things to Do in Denver when You're Dead (1995) | 3 |
36560 | User 66 | Jerry Maguire (1996) | 4 |
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate, train_test_split
# Step 1: create a Reader.
# A Reader tells Surprise the lower and upper bounds of the rating scale.
# MovieLens ratings range from 1 to 5.
reader = Reader(rating_scale=(1, 5))
# Step 2: create a new Dataset instance with a DataFrame and the reader
# The DataFrame needs to have 3 columns in this specific order: [user_id, product_id, rating]
data = Dataset.load_from_df(movielens_df, reader)
# Step 3: hold out 25% of the dataset for testing; the remaining 75% is the trainset
trainset, testset = train_test_split(data, test_size=.25)
# Step 4: train a new SVD with 100 latent features (number was chosen arbitrarily)
model = SVD(n_factors=100)
model.fit(trainset)
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1130ead68>
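The held-out testset is not used again in this notebook. One common use for it is measuring prediction error. In Surprise, `model.test(testset)` returns predictions carrying the true rating (`r_ui`) and the estimate (`est`), and `surprise.accuracy.rmse` reports the error. The sketch below only illustrates the metric itself in plain Python; the `rmse` helper and the toy pairs are assumptions made for illustration, not part of the notebook.

```python
import math
from typing import Iterable, Tuple


def rmse(pairs: Iterable[Tuple[float, float]]) -> float:
    """Root-mean-square error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    if not squared_errors:
        raise ValueError("need at least one (true, estimated) pair")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy (true, estimated) pairs for illustration only, not real model output
toy_pairs = [(4.0, 3.0), (3.0, 3.0), (5.0, 4.0), (2.0, 2.0)]
toy_rmse = rmse(toy_pairs)  # sqrt((1 + 0 + 1 + 0) / 4)
```

A lower value means the estimates are closer to the ratings users actually gave.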
Surprise's SVD stores the product (item) matrix in the model.qi attribute.
model.qi.shape
(596, 100)
The matrix has n_factors columns (we chose 100), and every row is the latent vector of one movie. How do we map every movie to its vector?
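from typing import Any, Dict

# Note: `_raw2inner_id_items` is a private attribute of the trainset;
# `trainset.to_inner_iid(raw_id)` is the public way to look up a single item.
item_to_row_idx: Dict[Any, int] = model.trainset._raw2inner_id_items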
# `display()` is a utility function to make `item_to_row_idx` more readable
display(item_to_row_idx)
Movie name | model.qi row idx |
---|---|
Lion King, The (1994) | 0 |
African Queen, The (1951) | 1 |
Day the Earth Stood Still, The (1951) | 2 |
Fried Green Tomatoes (1991) | 3 |
Blues Brothers, The (1980) | 4 |
toy_story_row_idx: int = item_to_row_idx['Toy Story (1995)']
model.qi[toy_story_row_idx]
array([-0.00889267, -0.03901101, -0.19582206, -0.06800691, 0.11612643, -0.0133471 , -0.0067134 , 0.00288335, 0.18905863, -0.01727417, -0.05463992, 0.03962723, -0.01882104, 0.01020398, -0.02117866, 0.16177179, -0.04796802, 0.01428753, 0.13078113, -0.02725028, 0.12102731, 0.07361403, -0.03889315, 0.21971317, 0.10844565, -0.02779188, -0.06676929, 0.06646453, -0.00768229, -0.14992161, -0.07929755, 0.00377584, -0.18182449, -0.07932236, 0.0837675 , -0.08436358, 0.10939826, -0.21550487, -0.00997129, -0.14068558, -0.07365779, -0.06704182, 0.01132891, 0.10421864, 0.11748961, 0.07426254, 0.09342114, 0.01356848, -0.0250024 , 0.12239668, -0.20936433, -0.22866096, -0.04916814, 0.0842263 , -0.1353041 , -0.03717908, -0.17404182, 0.02941116, 0.04993152, 0.06490656, -0.05549422, -0.10358558, 0.00789368, 0.09439441, -0.07726498, -0.08448086, 0.08246883, 0.17941641, 0.01990596, -0.02759331, 0.06862457, -0.12098117, -0.03077882, 0.08178186, 0.10700504, -0.01529634, -0.00385706, 0.04940254, 0.28700017, -0.0197356 , 0.02827431, 0.13303162, -0.05905182, -0.0673481 , 0.0471547 , -0.01943226, 0.09228729, 0.12408544, 0.07230831, 0.09700075, -0.14674701, 0.03890628, 0.00311309, -0.02259477, 0.00057669, -0.01448026, -0.00467238, -0.20787822, -0.19006575, 0.05603329])
print(f"Every product has {model.qi[toy_story_row_idx].shape[0]} features")
Every product has 100 features
Compute the rating prediction for a given user and movie with model.predict(). Pick a user and a movie, and estimate the rating that user would give the movie.
# Refresher: ratings data-frame.
movielens_df.head(2)
index | user_id | movie_title | rating |
---|---|---|---|
49469 | User 437 | Monty Python and the Holy Grail (1974) | 3 |
12181 | User 85 | Butch Cassidy and the Sundance Kid (1969) | 4 |
a_user = "User 196"
a_product = "Toy Story (1995)"
model.predict(a_user, a_product)
Prediction(uid='User 196', iid='Toy Story (1995)', r_ui=None, est=4.103242838730761, details={'was_impossible': False})
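Where does `est` come from? Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item latent vectors, then clips the result to the Reader's rating scale. On a fitted model these pieces live in `model.trainset.global_mean`, `model.bu`, `model.bi`, `model.pu` and `model.qi`. The sketch below reproduces the arithmetic in plain Python; `predict_rating` and the toy numbers are hypothetical and only illustrate the formula for a user and item that were both seen during training.

```python
from typing import Sequence, Tuple


def predict_rating(
    global_mean: float,
    user_bias: float,
    item_bias: float,
    user_factors: Sequence[float],
    item_factors: Sequence[float],
    rating_scale: Tuple[float, float] = (1.0, 5.0),
) -> float:
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u, clipped to the scale."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + dot
    low, high = rating_scale
    return min(high, max(low, raw))


# Toy numbers for illustration only (not taken from the trained model)
# 3.5 + 0.2 + 0.3 + (0.5 * 0.4 + -1.0 * 0.1) is roughly 4.1
est = predict_rating(3.5, 0.2, 0.3, [0.5, -1.0], [0.4, 0.1])

# A raw value of 6.5 falls outside the 1 to 5 scale and is clipped to 5.0
clipped = predict_rating(3.5, 1.0, 1.0, [1.0], [1.0])
```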
Two products are "similar" when the cosine distance between their vectors is close to 0.
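The `cosine_distance` helper used in this section is not defined in this part of the notebook. A minimal sketch, assuming it is the standard cosine distance (1 minus the cosine similarity), which ranges from 0 for vectors pointing the same way to 2 for opposite vectors. It is written in plain Python, so it should also accept rows of `model.qi`; the toy vectors are illustrative only.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors (assumed definition)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D vectors, not real model output
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # very close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # 1: unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # 2: opposite directions
```

Because the measure depends only on direction, two movies count as similar when their latent features have the same relative proportions, regardless of vector length.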
# Fetch row indices for Star Wars, Return of the Jedi and Aladdin
starwars_idx = model.trainset._raw2inner_id_items['Star Wars (1977)']
roj_idx = model.trainset._raw2inner_id_items['Return of the Jedi (1983)']
aladdin_idx = model.trainset._raw2inner_id_items['Aladdin (1992)']
# Get the latent vectors for all three movies
starwars_vector = model.qi[starwars_idx]
return_of_jedi_vector = model.qi[roj_idx]
aladdin_vector = model.qi[aladdin_idx]
# Distance between Star Wars and Return of the Jedi
cosine_distance(starwars_vector, return_of_jedi_vector)
0.29566718216988797
# Distance between Star Wars and Aladdin
cosine_distance(starwars_vector, aladdin_vector)
0.8587662155892206
def get_top_similarities(movie_title: str, model: SVD) -> pd.DataFrame:
    """Returns the top 5 most similar movies to a specified movie"""
    ...
get_top_similarities('Star Wars (1977)', model)
index | vector cosine distance | movie title |
---|---|---|
0 | 0.000000 | Star Wars (1977) |
1 | 0.262668 | Empire Strikes Back, The (1980) |
2 | 0.295667 | Return of the Jedi (1983) |
3 | 0.435423 | Raiders of the Lost Ark (1981) |
get_top_similarities('Pulp Fiction (1994)', model)
index | vector cosine distance | movie title |
---|---|---|
0 | 0.000000 | Pulp Fiction (1994) |
1 | 0.514664 | Ed Wood (1994) |
2 | 0.658022 | Trainspotting (1996) |
3 | 0.659555 | From Dusk Till Dawn (1996) |
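The body of `get_top_similarities` is elided above. One way such a ranking might work is sketched below, assuming every item's vector is compared with the query item's vector and the results are sorted by cosine distance. The `top_k_by_cosine` function, its plain-dict input and the toy vectors are hypothetical; the notebook's version takes the fitted model and returns a DataFrame instead.

```python
import math
from typing import Dict, List, Sequence, Tuple


def _cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k_by_cosine(
    query_title: str,
    vectors: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[float, str]]:
    """Return the k (distance, title) pairs closest to the query, query included."""
    query = vectors[query_title]
    scored = [(_cosine_distance(query, vec), title) for title, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return scored[:k]


# Toy 2-D vectors for illustration only, not real latent features
toy_vectors = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
    "D": [-1.0, 0.0],
}
closest = top_k_by_cosine("A", toy_vectors, k=3)
```

As in the tables above, the query item comes first with a distance of 0, followed by its nearest neighbours.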
SVD is a powerful technique for generating recommendations.
Latent features can be used in many different ways, for example to predict ratings or to find similar products.
Once the latent features are generated, collaborative filtering becomes platform agnostic: the vectors are plain lists of numbers, so they are very portable.
Surprise has a really low barrier to entry.
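To make the portability point concrete, here is a minimal sketch that serializes item vectors to JSON and loads them back without Surprise. The `item_vectors` dict and its values are toy data made up for illustration; with a real model, each row of `model.qi` is a NumPy array and would first need converting with `.tolist()`.

```python
import json
from typing import Dict, List

# Toy vectors for illustration only; real ones would come from rows of model.qi
item_vectors: Dict[str, List[float]] = {
    "Movie A": [0.12, -0.05, 0.33],
    "Movie B": [-0.20, 0.41, 0.07],
}

# Serialize to a plain JSON string that any language or service can read
payload = json.dumps(item_vectors)

# Another system can load the vectors without Surprise installed
restored = json.loads(payload)
```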