Deep Learning for Coders with fastai & PyTorch - Collaborative Filtering Deep Dive - Recommender systems work differently from classic DL classifiers. They are mostly used on known data; no prediction is expected for previously unseen inputs, as with the bear classifier. Yes, there is a generalization process, but all users and items are still known to the model. What is not known at the beginning of training are the latent factors. The model learns these latent factors, and then the recommender is ready.
(Photo: my daughter at the IKEA very close to our home.)
import fastbook
fastbook.setup_book()
from fastbook import *
%config Completer.use_jedi = False
Collaborative filtering modules:
from fastai.collab import *
from fastai.tabular.all import *
Downloading and extracting the MovieLens 100k data via fastai's URLs registry:
path = untar_data(URLs.ML_100k)
Naming the columns and reading the first five rows:
ratings = pd.read_csv(path/'u.data',delimiter = '\t', header= None, engine='python',names=['user','movie','rating','timestamp'])
ratings.head()
| | user | movie | rating | timestamp |
|---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 |
1 | 186 | 302 | 3 | 891717742 |
2 | 22 | 377 | 1 | 878887116 |
3 | 244 | 51 | 2 | 880606923 |
4 | 166 | 346 | 1 | 886397596 |
How do we recommend movies? Assume each movie has three properties: science-fiction(ness), action, and old(ness).
The Last Skywalker is sci-fi and action, and not old.
last_skywalker = np.array([0.98,0.9,-0.9])
And a user who likes sci-fi and action movies and not so old movies would like this.
user1= np.array([.9,.8,-.6])
If we multiply these two vectors elementwise and sum the result, we get:
(user1*last_skywalker).sum()
2.1420000000000003
This is our matching score. It is positive, which shows there is a match between the movie and user1.
casablanca = np.array([-.99,-.33,.8])
(user1*casablanca).sum()
-1.635
This time the score is negative: there is no match.
We can pick an arbitrary number of parameters for these vectors. Above we used three; there could be many more. We call them latent factors. We start training with random parameters and learn them from the ratings given by users.
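To make that concrete, here is a minimal sketch (my own, not from the book) of one SGD step on a single (user, movie, rating) triple with a squared-error loss:

```python
import torch

# random initial latent factors for one user and one movie
n_factors = 3
user_f  = torch.randn(n_factors, requires_grad=True)
movie_f = torch.randn(n_factors, requires_grad=True)
rating  = torch.tensor(4.0)  # the known rating we learn from

pred = (user_f * movie_f).sum()      # same dot-product match score as above
loss = (pred - rating) ** 2          # squared error against the real rating
loss.backward()                      # gradients w.r.t. both factor vectors

lr = 0.1
with torch.no_grad():
    user_f  -= lr * user_f.grad      # nudge the factors toward the rating
    movie_f -= lr * movie_f.grad
```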
Note: It is not always easy to use the data as it comes. In this dataset, the movie id and the movie title are not in the same table.
movies = pd.read_csv(path/'u.item', delimiter='|', engine= 'python',header=None,encoding='latin1', usecols=(0,1),names=('movie','title'))
movies.head()
| | movie | title |
|---|---|---|
0 | 1 | Toy Story (1995) |
1 | 2 | GoldenEye (1995) |
2 | 3 | Four Rooms (1995) |
3 | 4 | Get Shorty (1995) |
4 | 5 | Copycat (1995) |
Let's bring ratings and movies together (movie id is the merge key).
ratings=ratings.merge(movies)
ratings.head()
| | user | movie | rating | timestamp | title |
|---|---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
1 | 63 | 242 | 3 | 875747190 | Kolya (1996) |
2 | 226 | 242 | 5 | 883888671 | Kolya (1996) |
3 | 154 | 242 | 3 | 879138235 | Kolya (1996) |
4 | 306 | 242 | 5 | 876503793 | Kolya (1996) |
For the DataLoaders, we use CollabDataLoaders. By default this DataLoader uses the first column for the user and the second for the item; in our situation we must change the default item column, because our item will be title.
dls=CollabDataLoaders.from_df(ratings, item_name='title',bs=64)
dls.show_batch()
| | user | title | rating |
|---|---|---|---|
0 | 581 | Brassed Off (1996) | 3 |
1 | 864 | Jaws (1975) | 4 |
2 | 873 | Contact (1997) | 3 |
3 | 58 | Wings of Desire (1987) | 3 |
4 | 497 | Hard Target (1993) | 2 |
5 | 892 | Jungle Book, The (1994) | 4 |
6 | 43 | Santa Clause, The (1994) | 3 |
7 | 751 | Strictly Ballroom (1992) | 4 |
8 | 894 | Mighty Aphrodite (1995) | 4 |
9 | 390 | Spitfire Grill, The (1996) | 5 |
dls.classes['user'][:15]
(#15) ['#na#',1,2,3,4,5,6,7,8,9...]
dls.classes['title'][:15]
(#15) ['#na#',"'Til There Was You (1997)",'1-900 (1994)','101 Dalmatians (1996)','12 Angry Men (1957)','187 (1997)','2 Days in the Valley (1996)','20,000 Leagues Under the Sea (1954)','2001: A Space Odyssey (1968)','3 Ninjas: High Noon At Mega Mountain (1998)'...]
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5
user_factors = torch.randn(n_users,n_factors)
movie_factors = torch.randn(n_movies, n_factors)
Tip: This is how one_hot works.
one_hot(0,5)
tensor([1, 0, 0, 0, 0], dtype=torch.uint8)
one_hot_3 = one_hot(3,n_users).float()
one_hot_3[:10]
tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])
and multiply it by user_factors (a matrix multiplication):
user_factors.t() @ one_hot_3
tensor([ 0.4286, 0.8374, -0.5413, -1.6935, 0.1618])
This might look a bit daunting, but it is not. Basically, we want to utilize PyTorch more and Python less. PyTorch is very good at matrix multiplication; Python is not. With this matrix multiplication we can look up every row of the latent-factor tensor in one move. Otherwise we would have to use a regular Python loop with indexing, which is very slow.
This is the Python (indexing) version:
user_factors[3]
tensor([ 0.4286, 0.8374, -0.5413, -1.6935, 0.1618])
Same result. Great. Multiplying by a one-hot vector is equivalent to indexing into the matrix; an Embedding layer does exactly this lookup, while the gradients are computed as if we had done the matrix multiplication.
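A quick sanity check of that claim (my own experiment, with made-up sizes): one vectorized lookup of a whole batch of indices versus a Python loop doing the same work.

```python
import time
import torch

big_factors = torch.randn(100_000, 5)
idxs = torch.randint(0, 100_000, (50_000,))

t0 = time.perf_counter()
fast = big_factors[idxs]                                  # one vectorized lookup
t1 = time.perf_counter()
slow = torch.stack([big_factors[int(i)] for i in idxs])   # plain Python loop
t2 = time.perf_counter()

assert torch.equal(fast, slow)
print(f'vectorized: {t1 - t0:.4f}s  loop: {t2 - t1:.4f}s')
```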
At this point the book has a section on OOP; if you want to learn OOP, check the original book at page 260 (3rd release) or the course notebook.
class DotProduct(Module):
def __init__(self, n_users, n_movies, n_factors):
self.user_factors = Embedding(n_users, n_factors)
self.movie_factors = Embedding(n_movies, n_factors)
def forward(self, x):
users = self.user_factors(x[:,0])
movies = self.movie_factors(x[:,1])
return (users * movies).sum(dim=1)
Important: This forward method is a bit confusing, but I guess what happens is this: x is a batch from the merged DataFrame above (which became part of the dls), so the first column is the user id and the second is the movie id. Check this part:
ratings=ratings.merge(movies)
ratings.head()
x,y = dls.one_batch()
x.shape
torch.Size([64, 2])
x[0]
tensor([804, 763])
The first is the user id, the second the movie id.
y[0]
tensor([5], dtype=torch.int8)
This must be the rating.
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 1.382019 | 1.291539 | 00:06 |
1 | 1.064109 | 1.072716 | 00:06 |
2 | 0.977546 | 0.980324 | 00:06 |
3 | 0.869058 | 0.885319 | 00:06 |
4 | 0.803102 | 0.871484 | 00:07 |
Not bad, but we can force our model's predictions into the range 0-5:
class DotProduct(Module):
def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
self.user_factors = Embedding(n_users, n_factors)
self.movie_factors = Embedding(n_movies, n_factors)
self.y_range = y_range
def forward(self, x):
users = self.user_factors(x[:,0])
movies = self.movie_factors(x[:,1])
return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
The dls already has ratings in this range as the dependent variable, and fastai provides a function for exactly this: sigmoid_range. (We use 5.5 rather than 5 as the upper bound because a sigmoid never quite reaches its maximum, so we leave room for the model to predict a full 5.)
doc(sigmoid_range)
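As far as I can tell from the docs, sigmoid_range is essentially a scaled sigmoid; a minimal sketch:

```python
import torch

def sigmoid_range_sketch(x, low, high):
    # squash x into (low, high): sigmoid gives (0, 1), then rescale and shift
    return torch.sigmoid(x) * (high - low) + low
```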
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.991997 | 0.972494 | 00:08 |
1 | 0.856079 | 0.889023 | 00:08 |
2 | 0.677160 | 0.858434 | 00:07 |
3 | 0.455940 | 0.864097 | 00:07 |
4 | 0.371842 | 0.868755 | 00:07 |
Slightly better results.
Sometimes a user gives low (or high) ratings based on subjective preference, even when others think a movie is very good, and some movies are simply liked more or less than average. Let's add a new parameter for that: a bias. A bias shifts the prediction up or down independently of the latent factors.
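A tiny worked example with made-up bias values, reusing user1 and last_skywalker from earlier:

```python
dot = (user1 * last_skywalker).sum()   # 2.142, the raw match score from above
user_bias  = -0.5                      # a harsh rater: shifts everything down
movie_bias =  0.8                      # a crowd-pleaser: shifts everything up
score = dot + user_bias + movie_bias   # 2.442: biases adjust the raw match
```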
class DotProductBias(Module):
def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
self.user_factors = Embedding(n_users, n_factors)
self.user_bias = Embedding(n_users, 1)
self.movie_factors = Embedding(n_movies, n_factors)
self.movie_bias = Embedding(n_movies, 1)
self.y_range = y_range
def forward(self, x):
users = self.user_factors(x[:,0])
movies = self.movie_factors(x[:,1])
res = (users * movies).sum(dim=1, keepdim=True)
res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
return sigmoid_range(res, *self.y_range)
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.950526 | 0.923924 | 00:08 |
1 | 0.811043 | 0.851933 | 00:08 |
2 | 0.609098 | 0.852216 | 00:08 |
3 | 0.400987 | 0.877794 | 00:08 |
4 | 0.289632 | 0.884916 | 00:08 |
The training loss goes down faster and faster, but the validation loss does not follow: a sign of overfitting.
from the book:
Weight decay, or L2 regularization, consists in adding to your loss function the sum of all the weights squared. Why do that? Because when we compute the gradients, it will add a contribution to them that will encourage the weights to be as small as possible.
Why would it prevent overfitting? The idea is that the larger the coefficients are, the sharper canyons we will have in the loss function. If we take the basic example of a parabola, `y = a * (x**2)`, the larger `a` is, the narrower the parabola (see the plot below). So, letting our model learn high parameters might cause it to fit all the data points in the training set with an overcomplex function that has very sharp changes, which will lead to overfitting.
Limiting our weights from growing too much is going to hinder the training of the model, but it will yield a state where it generalizes better. Going back to the theory briefly, weight decay (or just `wd`) is a parameter that controls that sum of squares we add to our loss (assuming `parameters` is a tensor of all parameters):
loss_with_wd = loss + wd * (parameters**2).sum()
In practice, though, it would be very inefficient (and maybe numerically unstable) to compute that big sum and add it to the loss. If you remember a little bit of high school math, you might recall that the derivative of `p**2` with respect to `p` is `2*p`, so adding that big sum to our loss is exactly the same as doing:
parameters.grad += wd * 2 * parameters
In practice, since `wd` is a parameter that we choose, we can just make it twice as big, so we don't even need the `*2` in this equation. To use weight decay in fastai, just pass `wd` in your call to `fit` or `fit_one_cycle`. But first, a quick plot of those parabolas for several values of `a`:
x = np.linspace(-2,2,100)
a_s = [1,2,5,10,50]
ys = [a * x**2 for a in a_s]
_,ax = plt.subplots(figsize=(8,6))
for a,y in zip(a_s,ys): ax.plot(x,y, label=f'a={a}')
ax.set_ylim([0,5])
ax.legend();
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.977155 | 0.931246 | 00:08 |
1 | 0.870503 | 0.858914 | 00:07 |
2 | 0.746876 | 0.823343 | 00:07 |
3 | 0.573917 | 0.810015 | 00:07 |
4 | 0.480767 | 0.810702 | 00:07 |
The training loss is not as good, but this time the validation loss is far better.
class T(Module):
def __init__(self): self.a = torch.ones(3)
L(T().parameters())
(#0) []
There are no parameters: by definition, parameters must be trainable.
type(torch.ones(3)[0])
torch.Tensor
Note: this is a Tensor, not a Parameter, so it is not trainable: no gradient tracking.
from the book:
To tell `Module` that we want to treat a tensor as a parameter, we have to wrap it in the `nn.Parameter` class. This class doesn't actually add any functionality (other than automatically calling `requires_grad_` for us). It's only used as a "marker" to show what to include in `parameters`:
class T(Module):
def __init__(self): self.a = nn.Parameter(torch.ones(3))
L(T().parameters())
(#1) [Parameter containing: tensor([1., 1., 1.], requires_grad=True)]
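We can verify the requires_grad difference directly (a quick check of my own):

```python
import torch
from torch import nn

t = torch.ones(3)
p = nn.Parameter(torch.ones(3))
t.requires_grad, p.requires_grad   # (False, True): only the Parameter is trainable
```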
and
class T(Module):
def __init__(self): self.a = nn.Linear(1, 3, bias=False)
t = T()
L(t.parameters())
(#1) [Parameter containing: tensor([[-0.0643], [-0.8105], [ 0.1346]], requires_grad=True)]
type(t.a.weight)
torch.nn.parameter.Parameter
Note: This is a parameter
type(t.a.weight.data)
torch.Tensor
Note: This is not.
Embedding
def create_params(size):
return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
doc(create_params)
class DotProductBias(Module):
def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
self.user_factors = create_params([n_users, n_factors])
self.user_bias = create_params([n_users])
self.movie_factors = create_params([n_movies, n_factors])
self.movie_bias = create_params([n_movies])
self.y_range = y_range
def forward(self, x):
users = self.user_factors[x[:,0]]
movies = self.movie_factors[x[:,1]]
res = (users*movies).sum(dim=1)
res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
return sigmoid_range(res, *self.y_range)
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.925678 | 0.936475 | 00:08 |
1 | 0.820004 | 0.864908 | 00:08 |
2 | 0.718156 | 0.818959 | 00:08 |
3 | 0.589835 | 0.812693 | 00:08 |
4 | 0.462965 | 0.813873 | 00:08 |
Very similar results.
The movies with the lowest bias in the model:
movie_bias=learn.model.movie_bias.squeeze()
idxs=movie_bias.argsort()[0:5]
[dls.classes['title'][i] for i in idxs]
['Children of the Corn: The Gathering (1996)', 'Crow: City of Angels, The (1996)', 'Jury Duty (1995)', 'Mortal Kombat: Annihilation (1997)', 'Cable Guy, The (1996)']
from the book:
Think about what this means. What it's saying is that for each of these movies, even when a user is very well matched to its latent factors (which, as we will see in a moment, tend to represent things like level of action, age of movie, and so forth), they still generally don't like it. We could have simply sorted the movies directly by their average rating, but looking at the learned bias tells us something much more interesting. It tells us not just whether a movie is of a kind that people tend not to enjoy watching, but that people tend not to like watching it even if it is of a kind that they would otherwise enjoy! By the same token, here are the movies with the highest bias:
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]
['Titanic (1997)', "Schindler's List (1993)", 'Star Wars (1977)', 'Shawshank Redemption, The (1994)', 'Rear Window (1954)']
from the book:
So, for instance, even if you don't normally enjoy detective movies, you might enjoy LA Confidential!
It is not quite so easy to directly interpret the embedding matrices. There are just too many factors for a human to look at. But there is a technique that can pull out the most important underlying directions in such a matrix, called principal component analysis (PCA). We will not be going into this in detail in this book, because it is not particularly important for you to understand to be a deep learning practitioner, but if you are interested then we suggest you check out the fast.ai course Computational Linear Algebra for Coders. The plot below shows what our movies look like based on two of the strongest PCA components.
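For the curious, here is a minimal sketch of what the pca call below computes, assuming the fastbook helper works roughly like textbook PCA via SVD:

```python
import torch

def pca_sketch(x, k=2):
    x = x - x.mean(0)            # mean-center each latent factor
    U, S, V = torch.svd(x.t())   # columns of U are the principal directions
    return x @ U[:, :k]          # coordinates along the top-k directions
```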
g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
movie_w = learn.model.movie_factors[top_idxs].cpu().detach()
movie_pca = movie_w.pca(3)
fac0,fac1,fac2 = movie_pca.t()
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]
plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()
Let's try changing the X axis.
g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
movie_w = learn.model.movie_factors[top_idxs].cpu().detach()
movie_pca = movie_w.pca(3)
fac0,fac1,fac2 = movie_pca.t()
idxs = list(range(50))
X = fac1[idxs]
Y = fac2[idxs]
plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()
Very interesting to study how the picture changes.
The same thing with the fastai collab_learner:
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.940206 | 0.939116 | 00:07 |
1 | 0.886674 | 0.867349 | 00:07 |
2 | 0.750853 | 0.824058 | 00:07 |
3 | 0.610644 | 0.811104 | 00:07 |
4 | 0.497868 | 0.812237 | 00:07 |
learn.model
EmbeddingDotBias( (u_weight): Embedding(944, 50) (i_weight): Embedding(1665, 50) (u_bias): Embedding(944, 1) (i_bias): Embedding(1665, 1) )
movie_bias = learn.model.i_bias.weight.squeeze()
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]
["Schindler's List (1993)", 'Titanic (1997)', 'Shawshank Redemption, The (1994)', 'Rear Window (1954)', 'Star Wars (1977)']
Similar results.
Basically, if two movies have similar latent factors (embedding vectors), the movies themselves are similar. Below we find the movie whose latent factors are closest to those of The Silence of the Lambs.
movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]
'Body Snatcher, The (1945)'
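Cosine similarity is just the dot product of the two vectors divided by the product of their norms; a quick check (my own) against the built-in:

```python
import torch
from torch import nn

a, b = torch.randn(50), torch.randn(50)
manual  = (a @ b) / (a.norm() * b.norm())
builtin = nn.CosineSimilarity(dim=0)(a, b)
assert torch.allclose(manual, builtin)
```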
Read the whole section in the original book at page 270 (3rd release) or in the course notebook.
First, fastai can recommend the right embedding sizes (numbers of latent factors):
embs = get_emb_sz(dls)
embs
[(944, 74), (1665, 102)]
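These sizes come from fastai's rule of thumb (emb_sz_rule, as I read the source); it reproduces the numbers above:

```python
def emb_sz_rule(n_cat):
    # fastai's empirical heuristic for embedding width, capped at 600
    return min(600, round(1.6 * n_cat**0.56))

emb_sz_rule(944), emb_sz_rule(1665)   # (74, 102), matching get_emb_sz above
```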
class CollabNN(Module):
def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
self.user_factors = Embedding(*user_sz)
self.item_factors = Embedding(*item_sz)
self.layers = nn.Sequential(
nn.Linear(user_sz[1]+item_sz[1], n_act),
nn.ReLU(),
nn.Linear(n_act, 1))
self.y_range = y_range
def forward(self, x):
embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
x = self.layers(torch.cat(embs, dim=1))
return sigmoid_range(x, *self.y_range)
model = CollabNN(*embs)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.959750 | 0.937144 | 00:09 |
1 | 0.919930 | 0.894244 | 00:08 |
2 | 0.857974 | 0.870025 | 00:08 |
3 | 0.814399 | 0.854047 | 00:08 |
4 | 0.763636 | 0.860031 | 00:08 |
The above is possible (again) with collab_learner in one step; just use use_nn=True.
learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100,50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.989683 | 0.959795 | 00:10 |
1 | 0.902582 | 0.904747 | 00:10 |
2 | 0.864139 | 0.879289 | 00:10 |
3 | 0.824376 | 0.847727 | 00:10 |
4 | 0.790679 | 0.850178 | 00:10 |
from the book:
Although the results of EmbeddingNN are a bit worse than the dot product approach (which shows the power of carefully constructing an architecture for a domain), it does allow us to do something very important: we can now directly incorporate other user and movie information, date and time information, or any other information that may be relevant to the recommendation. That's exactly what TabularModel does. In fact, we've now seen that EmbeddingNN is just a TabularModel, with n_cont=0 and out_sz=1. So, we'd better spend some time learning about TabularModel, and how to use it to get great results! We'll do that in the next chapter.
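A sketch of that relationship, paraphrasing my reading of the fastai source (TabularModel comes from fastai.tabular.all, already imported above):

```python
class EmbeddingNNSketch(TabularModel):
    # what the quote describes: a TabularModel with no continuous
    # variables and a single output
    def __init__(self, emb_szs, layers, **kwargs):
        super().__init__(emb_szs=emb_szs, n_cont=0, out_sz=1, layers=layers, **kwargs)
```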