In this notebook, we'll use human-made music playlists to learn song embeddings. We'll treat a playlist as if it's a sentence and the songs it contains as words. We feed that to the word2vec algorithm which then learns embeddings for every song we have. These embeddings can then be used to recommend similar songs.
This technique is used by Spotify, Airbnb, Alibaba, and others, where recommendations of this kind account for a large share of user activity, media consumption, and/or sales (in the case of Alibaba). The dataset we'll use was collected by Shuo Chen from Cornell University. It contains playlists from hundreds of radio stations around the US.
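To make the playlist-as-sentence analogy concrete, here is a minimal sketch of which songs count as context for which when a fixed window slides over a playlist. The helper name `context_pairs` and the toy playlist are illustrative assumptions, not part of the dataset or of gensim's internals, and the sketch ignores details such as the randomly shrunk window and negative sampling.

```python
def context_pairs(playlist, window=2):
    """Return (center, context) pairs for songs at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(playlist):
        lo = max(0, i - window)
        hi = min(len(playlist), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a song is not its own context
                pairs.append((center, playlist[j]))
    return pairs

# Hypothetical playlist of four song IDs (strings, like the words of a sentence)
toy_playlist = ["0", "1", "2", "3"]
pairs = context_pairs(toy_playlist, window=1)
print(pairs)
```

Songs that keep appearing in each other's windows across many playlists end up with similar embeddings, which is what makes the embeddings usable for recommendation.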
!wget -q https://www.cs.cornell.edu/~shuochen/lme/dataset.tar.gz
!tar -xf dataset.tar.gz
import numpy as np
import pandas as pd
import gensim
from gensim.models import Word2Vec
from urllib import request
import warnings
warnings.filterwarnings('ignore')
with open("/content/dataset/yes_complete/train.txt", 'r') as f:
    # skipping first 2 lines as they contain only metadata
    lines = f.read().split('\n')[2:]
# select playlists with at least 2 songs, a minimum threshold for sequence learning
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])
Playlist #1: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] Playlist #2: ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117', '118', '119', '120', '121', '122', '123', '50', '70', '71', '124', '17', '85', '14', '82', '48', '125', '47', '46', '72', '53', '25', '73', '4', '126', '59', '74', '20', '43', '127', '128', '129', '13', '82', '48', '130', '131', '132', '133', '134', '135', '136', '137', '59', '46', '138', '43', '20', '139', '140', '73', '57', '70', '141', '3', '1', '74', '142', '143', '144', '145', '48', '13', '25', '146', '50', '147', '126', '59', '20', '148', '149', '150', '151', '152', '56', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '60', '176', '51', '177', '178', '179', '180', '181', '182', '183', '184', '185', '57', '186', '187', '188', '189', '190', '191', '46', '192', '193', '194', '195', '196', '197', '198', '25', '199', '200', '49', '201', '100', '202', '203', '204', '205', '206', '207', '32', '208', '209', '210']
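The parsing step above can be tried on a small in-memory string before touching the real file. This is a sketch under assumptions: the function name `parse_playlists`, its parameters, and the sample text are made up for illustration. Only the two-line header and the whitespace-separated song IDs mirror the code above.

```python
def parse_playlists(text, header_lines=2, min_songs=2):
    """Split raw text into playlists (lists of song-ID strings)."""
    lines = text.split("\n")[header_lines:]  # drop the metadata header
    # keep only playlists long enough to provide context for learning
    return [line.split() for line in lines if len(line.split()) >= min_songs]

# Hypothetical file content: 2 header lines, then one playlist per line
sample = "meta line 1\nmeta line 2\n0 1 2 \n3\n\n4 5\n"
toy_playlists = parse_playlists(sample)
print(toy_playlists)
```

The single-song line and the blank lines are dropped, so every remaining playlist has at least one neighbor for each song.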
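Our dataset is now in the shape the Word2Vec model expects as input. We pass the dataset to the model and set the following key parameters: size (the embedding dimension, 32 here), window (how many neighboring songs on each side count as context, up to 20), negative (the number of negative samples drawn per training example, 50), min_count (songs that appear fewer times than this are dropped; 1 keeps every song), and workers (the number of training threads):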
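# workers must be positive: workers=-1 starts no training threads, leaving the embeddings random (size is named vector_size in gensim >= 4.0)
model = Word2Vec(playlists, size=32, window=20, negative=50, min_count=1, workers=4)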
The model is now trained. Every song has an embedding. We only have song IDs, though, no titles or other info. Let's grab the song information file.
!head /content/dataset/yes_complete/song_hash.txt
0 Gucci Time (w\/ Swizz Beatz) Gucci Mane
1 Aston Martin Music (w\/ Drake & Chrisette Michelle) Rick Ross
2 Get Back Up (w\/ Chris Brown) T.I.
3 Hot Toddy (w\/ Jay-Z & Ester Dean) Usher
4 Whip My Hair Willow
5 Down On Me (w\/ 50 Cent) Jeremih
6 Black And Yellow Wiz Khalifa
7 Blowing Me Kisses Soulja Boy
8 Lay It Down Lloyd
9 Good For My Money (w\/ Lloyd) Baby Bash
with open("/content/dataset/yes_complete/song_hash.txt", 'r') as f:
    songs_file = f.read().split('\n')
# each line holds tab-separated fields: song id, title, artist
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')
songs_df.head()
id | title | artist |
---|---|---|
0 | Gucci Time (w\/ Swizz Beatz) | Gucci Mane |
1 | Aston Martin Music (w\/ Drake & Chrisette Mich... | Rick Ross |
2 | Get Back Up (w\/ Chris Brown) | T.I. |
3 | Hot Toddy (w\/ Jay-Z & Ester Dean) | Usher |
4 | Whip My Hair | Willow |
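The song lookup can also be sketched without pandas, which makes the tab-separated layout explicit and shows one way to guard against blank or malformed lines (for example a trailing empty line at the end of the file). The function name `parse_song_hash` is a hypothetical helper, and the sample rows reuse two songs from the output above.

```python
def parse_song_hash(text):
    """Map song id -> {'title', 'artist'} from tab-separated lines."""
    songs = {}
    for line in text.split("\n"):
        fields = line.rstrip().split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        song_id, title, artist = fields
        songs[song_id] = {"title": title, "artist": artist}
    return songs

# Two rows in the assumed id<TAB>title<TAB>artist layout, plus a trailing newline
sample = "4\tWhip My Hair\tWillow\n6\tBlack And Yellow\tWiz Khalifa\n"
toy_songs = parse_song_hash(sample)
print(toy_songs["4"]["title"], "by", toy_songs["4"]["artist"])
```

Keying the dictionary by the string ID matches how the playlists store songs, so no integer conversion is needed for lookups.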
songs_df.iloc[[1,10,100]]
id | title | artist |
---|---|---|
1 | Aston Martin Music (w\/ Drake & Chrisette Mich... | Rick Ross |
10 | Shake It | Elephant Man |
100 | I'm Yours | Jason Mraz |
songs_df[songs_df.artist == 'Rush'].head()
id | title | artist |
---|---|---|
1861 | Tom Sawyer | Rush |
2640 | Red Barchetta | Rush |
2655 | Fly By Night | Rush |
2691 | Freewill | Rush |
2748 | Limelight | Rush |
!head /content/dataset/yes_complete/tag_hash.txt
0, rock
1, pop
2, favorites
3, alternative
4, love
5, male vocalists
6, american
7, indie
8, classic rock
9, awesome
with open("/content/dataset/yes_complete/tag_hash.txt", 'r') as f:
    tags_file = f.read().split('\n')
# each line has the form "<tag id>, <tag name>"
tags = [s.rstrip().split(',') for s in tags_file]
tag_name = {a:b.strip() for a,b in tags}
# songs without tags are marked with '#' in tags.txt
tag_name['#'] = 'no tag'
print('Tag name for tag id {} is "{}"\n'.format('10', tag_name['10']))
print('Tag name for tag id {} is "{}"\n'.format('80', tag_name['80']))
print('There are total {} tags'.format(len(tag_name.items())))
Tag name for tag id 10 is "jazz"
Tag name for tag id 80 is "rhythm and blues"
There are total 251 tags
!head /content/dataset/yes_complete/tags.txt
154
20 35 40 65 72 130 154 193
154
1 49
1 6 21 35 49 65 78 80 141 154
21 35 38 49 65 72 114 141 154
1 5 6 21 33 49 63 65 72 87 98 110 141 147 154 197
49 65 72 141 197
11 35 154
#
with open("/content/dataset/yes_complete/tags.txt", 'r') as f:
    song_tags = f.read().split('\n')
# line i of tags.txt lists the tag ids of song i
song_tags = [s.split(' ') for s in song_tags]
song_tags = {a:b for a,b in enumerate(song_tags)}
def tags_for_song(song_id=0):
    tag_ids = song_tags[int(song_id)]
    return [tag_name[tag_id] for tag_id in tag_ids]
print('Tags for song "{}" : {}\n'.format(songs_df.iloc[0].title, tags_for_song(0)))
Tags for song "Gucci Time (w\/ Swizz Beatz)" : ['wjlb-fm']
def recommend(song_id=0, topn=5):
    # song info for the query song
    song_info = songs_df.iloc[song_id]
    query_tags = [', '.join(tags_for_song(song_id))]
    query_song = pd.DataFrame({'title':song_info.title,
                               'artist':song_info.artist,
                               'tags':query_tags})
    # similar songs: most_similar returns (song id as string, similarity) pairs
    similar_songs = [int(i) for i, _ in model.wv.most_similar(positive=str(song_id), topn=topn)]
    # copy so that adding a column does not write into a slice of songs_df
    recommendations = songs_df.iloc[similar_songs].copy()
    recommendations['tags'] = [tags_for_song(i) for i in similar_songs]
    recommendations = pd.concat([query_song, recommendations])
    axis_name = ['Query'] + ['Recommendation '+str((i+1)) for i in range(topn)]
    # recommendations.index = axis_name
    recommendations = recommendations.style.set_table_styles([{'selector': 'th', 'props': [('background-color', 'gray')]}])
    return recommendations
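The `most_similar` call inside `recommend` ranks songs by the cosine similarity of their embeddings. The sketch below reproduces that ranking idea on hand-written 2-dimensional vectors. The names `cosine`, `most_similar_toy`, and `toy_embeddings` are illustrative assumptions, and the vectors are invented rather than learned.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_toy(embeddings, query_id, topn=2):
    """Rank every other song by cosine similarity to the query song."""
    query = embeddings[query_id]
    scored = [(sid, cosine(query, vec))
              for sid, vec in embeddings.items() if sid != query_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Invented embeddings: "b" points almost the same way as "a", "d" the opposite way
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
ranked = most_similar_toy(toy_embeddings, "a", topn=3)
print([sid for sid, _ in ranked])
```

Because cosine similarity depends only on direction, two songs can be close neighbors even if their vectors have very different lengths.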
recs = recommend(10)
recs
  | title | artist | tags |
---|---|---|---|
Query | Shake It | Elephant Man | no tag |
Recommendation 1 | Pudrete | Banda MS | ['no tag'] |
Recommendation 2 | Let Me Know | Roisin Murphy | ['rock', 'pop', 'favorites', 'love', 'female vocalists', '00s', 'dance', 'favourites', 'cool', 'chillout', 'electronic', 'sexy', 'british', 'upbeat', 'sad', 'seen live', 'indie pop', 'love it', 'electronica', 'female', 'good stuff', 'uk', 'lovely', 'disco', 'electro', 'favorite artists', '2007'] |
Recommendation 3 | In This Lifetime | The Psycho Realm | ['no tag'] |
Recommendation 4 | Take A Bow | Rihanna | ['pop', 'love', 'american', 'beautiful', 'soul', 'female vocalists', '00s', 'mellow', 'favorite', 'dance', 'favourites', 'cool', 'chillout', 'rnb', 'sexy', 'female vocalist', 'hip-hop', 'love songs', 'sad', 'hip hop', 'ballad', 'piano', 'memories', 'relaxing', 'love at first listen', 'female', 'r&b', 'slow', 'sweet', 'love song', 'soft', 'rb', 'r and b', 'emo', '<3', 'slow jams', 'major key tonality', 'guilty pleasures', 'emotional', '2008', 'a subtle use of vocal harmony', 'cute'] |
Recommendation 5 | Get Right | Jennifer Lopez | ['pop', 'favorites', 'love', 'american', 'soul', 'female vocalists', '00s', 'dance', 'singer-songwriter', '90s', 'favourites', 'cool', 'catchy', 'rnb', 'sexy', 'fun', 'party', 'happy', 'female vocalist', 'hip-hop', 'funk', 'upbeat', 'hip hop', 'female', 'funky', 'r&b', '2000s', 'latin', 'energetic', 'top 40', 'vocal', 'female vocals', 'english', 'urban', 'uplifting', 'r and b', 'hardcore', 'guilty pleasures', 'guilty pleasure', 'hiphop', 'new york', 'sing along', 'feelgood'] |
recommend(song_id=19563)
  | title | artist | tags |
---|---|---|---|
0 | Paranoid Android | Radiohead | rock, pop, favorites, alternative, love, male vocalists, indie, classic rock, awesome, beautiful, mellow, alternative rock, favorite, chill, 90s, classic, favourites, chillout, indie rock, guitar, favorite songs, male vocalist, electronic, loved, british, favourite, soundtrack, amazing, sad, favourite songs, great song, ballad, melancholy, epic, experimental, psychedelic, memories, electronica, love at first listen, fucking awesome, progressive rock, great, best, nostalgia, melancholic, fav, good stuff, uk, great lyrics, ambient, perfect, psychedelic rock, dark, britpop, brilliant, alternative punk, progressive, emotional, masterpiece, best songs ever, rockin, genius, all time favourites, alt rock, 1990s |
43036 | Que Te Quieran Mas Que Yo | Marco Antonio Solis | ['no tag'] |
64157 | Paryer And Meditation | Jessica Williams | ['no tag'] |
65275 | Hallelujah Goat | Volbeat | ['rock', 'awesome', 'hard rock', 'metal', 'heavy metal', 'good', 'rock and roll'] |
66070 | You're My Christmas Present | Jimmy Beaumont & The Skyliners | ['christmas'] |
16550 | Jump Start | Nils | ['no tag'] |
recommend(song_id=842)
  | title | artist | tags |
---|---|---|---|
0 | California Love (w\/ Dr. Dre & Roger Troutman) | 2Pac | favorites, love, oldies, dance, 90s, classic, loved, party, hip-hop, hip hop, rap, fav, old school, songs i absolutely love, hiphop, acclaimed music top 3000, california, 1990s |
20597 | Monk'n Around | Ryan Cohan | ['no tag'] |
41172 | Nadie Como Tu | Ramon Ayala Y Sus Bravos Del Norte | ['no tag'] |
44549 | Crash | The Primitives | ['rock', 'pop', 'favorites', 'alternative', 'indie', 'female vocalists', '80s', '90s', 'cool', 'catchy', 'party', 'favourite', 'happy', 'female vocalist', 'upbeat', 'soundtrack', 'indie pop', 'memories', 'female', 'new wave', 'uplifting', 'britpop', 'major key tonality', "80's", '1980s', 'acclaimed music top 3000', 'rockin'] |
53636 | Always There For You | Stryper | ['80s', 'hard rock', 'heavy metal', 'christian', 'christian rock'] |
48409 | - | - | ['no tag'] |
recommend(song_id=3822)
  | title | artist | tags |
---|---|---|---|
0 | Billie Jean | Michael Jackson | rock, pop, favorites, alternative, love, male vocalists, american, classic rock, awesome, beautiful, soul, oldies, favorite, 80s, dance, singer-songwriter, 90s, classic, favourites, cool, 70s, catchy, favorite songs, rnb, male vocalist, electronic, sexy, loved, fun, party, favourite, pop rock, funk, amazing, usa, rhythm and blues, memories, funky, r&b, best, nostalgia, energetic, top 40, old school, nice, english, rb, urban, groovy, disco, perfect, guilty pleasures, 80's, brilliant, my favorites, guilty pleasure, 1980s, retro, masterpiece, motown, classics, best songs ever, classic soul, legend |
48164 | Heartbeat | Could Nothings | ['no tag'] |
30835 | Steve's Tune | Steve Lambert | ['no tag'] |
11901 | So Much Trouble In The World | Bob Marley & The Wailers | ['mellow', 'chill', 'classic', 'favourites', 'cool', '70s', 'favourite', 'sad', 'great song', 'summer', 'relax', 'best', 'drjazzmrfunkmusic', 'reggae', 'the best', 'legend'] |
70557 | Carmelita | Warren Zevon | ['rock', 'classic rock', 'mellow', 'singer-songwriter', '70s', 'hard rock', 'guitar', 'male vocalist', 'folk', 'acoustic', 'progressive rock', 'vocal', 'folk rock', 'americana', 'radioparadise', 'covers', 'perfect', 'radio paradise', 'my pop music', 'southern rock'] |
14550 | Heavy | Collective Soul | ['rock', 'pop', 'favorites', 'alternative', 'male vocalists', 'awesome', '00s', 'alternative rock', '90s', 'hard rock', 'loved', 'upbeat', 'great song', 'memories', 'progressive rock', 'grunge', 'faves', 'heavy', 'a subtle use of vocal harmony', 'favs'] |
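The tags column lets us judge each recommendation by eye. One possible way to turn that comparison into a number is the Jaccard overlap between the query song's tags and a recommendation's tags. This is only a sketch: `tag_overlap` is a hypothetical helper that is not part of the notebook, and treating 'no tag' as an empty tag set is an assumption.

```python
def tag_overlap(query_tags, rec_tags, ignore=("no tag",)):
    """Jaccard overlap of two tag lists; 0.0 when either side has no real tags."""
    q = set(query_tags) - set(ignore)
    r = set(rec_tags) - set(ignore)
    if not q or not r:
        return 0.0
    return len(q & r) / len(q | r)

# Hypothetical tag lists in the same format tags_for_song returns
query = ["rock", "pop", "favorites"]
print(tag_overlap(query, ["rock", "pop", "love"]))
print(tag_overlap(query, ["no tag"]))
```

A score near 1 means the two songs share most of their tags, while recommendations with unrelated or missing tags score near 0.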