Song Embeddings - Skipgram Recommender¶

In this notebook, we'll use human-made music playlists to learn song embeddings. We'll treat a playlist as if it's a sentence and the songs it contains as words. We feed that to the word2vec algorithm which then learns embeddings for every song we have. These embeddings can then be used to recommend similar songs.

toc: true
badges: true
comments: true
categories: [Word2Vec, Embedding, Music, Sequence]
author: "Jay Alammar"
image:

This technique is used by Spotify, AirBnB, Alibaba, and others. It accounts for a vast portion of their user activity, user media consumption, and/or sales (in the case of Alibaba). The dataset we'll use was collected by Shuo Chen from Cornell University. The dataset contains playlists from hundreds of radio stations from around the US.

Downloading data¶

In [ ]:

!wget -q https://www.cs.cornell.edu/~shuochen/lme/dataset.tar.gz
!tar -xf dataset.tar.gz

Setup¶

In [ ]:

import numpy as np
import pandas as pd
import gensim 
from gensim.models import Word2Vec
from urllib import request

In [ ]:

import warnings
warnings.filterwarnings('ignore')

Training dataset¶

In [ ]:

with open("/content/dataset/yes_complete/train.txt", 'r') as f:
  # skipping first 2 lines as they contain only metadata
  lines = f.read().split('\n')[2:]
  # select playlists with at least 2 songs, a minimum threshold for sequence learning 
  playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

In [ ]:

print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117', '118', '119', '120', '121', '122', '123', '50', '70', '71', '124', '17', '85', '14', '82', '48', '125', '47', '46', '72', '53', '25', '73', '4', '126', '59', '74', '20', '43', '127', '128', '129', '13', '82', '48', '130', '131', '132', '133', '134', '135', '136', '137', '59', '46', '138', '43', '20', '139', '140', '73', '57', '70', '141', '3', '1', '74', '142', '143', '144', '145', '48', '13', '25', '146', '50', '147', '126', '59', '20', '148', '149', '150', '151', '152', '56', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '60', '176', '51', '177', '178', '179', '180', '181', '182', '183', '184', '185', '57', '186', '187', '188', '189', '190', '191', '46', '192', '193', '194', '195', '196', '197', '198', '25', '199', '200', '49', '201', '100', '202', '203', '204', '205', '206', '207', '32', '208', '209', '210']

Training Word2vec¶

Our dataset is now in the shape the the Word2Vec model expects as input. We pass the dataset to the model, and set the following key parameters:

size: Embedding size for the songs.
window: word2vec algorithm parameter -- maximum distance between the current and predicted word (song) within a sentence
negative: word2vec algorithm parameter -- Number of negative examples to use at each training step that the model needs to identify as noise

In [ ]:

model = Word2Vec(playlists, size=32, window=20, negative=50, min_count=1, workers=-1)

The model is now trained. Every song has an embedding. We only have song IDs, though, no titles or other info. Let's grab the song information file.

Prepare songs metadata¶

Title and artist¶

In [ ]:

!head /content/dataset/yes_complete/song_hash.txt

0 	Gucci Time (w\/ Swizz Beatz)	Gucci Mane
1 	Aston Martin Music (w\/ Drake & Chrisette Michelle)	Rick Ross
2 	Get Back Up (w\/ Chris Brown)	T.I.
3 	Hot Toddy (w\/ Jay-Z & Ester Dean)	Usher
4 	Whip My Hair	Willow
5 	Down On Me (w\/ 50 Cent)	Jeremih
6 	Black And Yellow	Wiz Khalifa
7 	Blowing Me Kisses	Soulja Boy
8 	Lay It Down	Lloyd
9 	Good For My Money (w\/ Lloyd)	Baby Bash

In [ ]:

with open("/content/dataset/yes_complete/song_hash.txt", 'r') as f:
  songs_file = f.read().split('\n')
  songs = [s.rstrip().split('\t') for s in songs_file]

In [ ]:

songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')
songs_df.head()

Out[ ]:

	title	artist
id
0	Gucci Time (w\/ Swizz Beatz)	Gucci Mane
1	Aston Martin Music (w\/ Drake & Chrisette Mich...	Rick Ross
2	Get Back Up (w\/ Chris Brown)	T.I.
3	Hot Toddy (w\/ Jay-Z & Ester Dean)	Usher
4	Whip My Hair	Willow

In [ ]:

songs_df.iloc[[1,10,100]]

Out[ ]:

	title	artist
id
1	Aston Martin Music (w\/ Drake & Chrisette Mich...	Rick Ross
10	Shake It	Elephant Man
100	I'm Yours	Jason Mraz

In [ ]:

songs_df[songs_df.artist == 'Rush'].head()

Out[ ]:

	title	artist
id
1861	Tom Sawyer	Rush
2640	Red Barchetta	Rush
2655	Fly By Night	Rush
2691	Freewill	Rush
2748	Limelight	Rush

Tags¶

In [ ]:

!head /content/dataset/yes_complete/tag_hash.txt

0, rock
1, pop
2, favorites
3, alternative
4, love
5, male vocalists
6, american
7, indie
8, classic rock
9, awesome

In [ ]:

with open("/content/dataset/yes_complete/tag_hash.txt", 'r') as f:
  tags_file = f.read().split('\n')
  tags = [s.rstrip().split(',') for s in tags_file]
  tag_name = {a:b.strip() for a,b in tags}
  tag_name['#'] = 'no tag'

In [ ]:

print('Tag name for tag id {} is "{}"\n'.format('10', tag_name['10']))
print('Tag name for tag id {} is "{}"\n'.format('80', tag_name['80']))
print('There are total {} tags'.format(len(tag_name.items())))

Tag name for tag id 10 is "jazz"

Tag name for tag id 80 is "rhythm and blues"

There are total 251 tags

In [ ]:

!head /content/dataset/yes_complete/tags.txt

154
20 35 40 65 72 130 154 193
154
1 49
1 6 21 35 49 65 78 80 141 154
21 35 38 49 65 72 114 141 154
1 5 6 21 33 49 63 65 72 87 98 110 141 147 154 197
49 65 72 141 197
11 35 154
#

In [ ]:

with open("/content/dataset/yes_complete/tags.txt", 'r') as f:
  song_tags = f.read().split('\n')
  song_tags = [s.split(' ') for s in song_tags]
  song_tags = {a:b for a,b in enumerate(song_tags)}

In [ ]:

def tags_for_song(song_id=0):
  tag_ids = song_tags[int(song_id)]
  return [tag_name[tag_id] for tag_id in tag_ids]

In [ ]:

print('Tags for song "{}" : {}\n'.format(songs_df.iloc[0].title, tags_for_song(0)))

Tags for song "Gucci Time (w\/ Swizz Beatz)" : ['wjlb-fm']

In [ ]:

def recommend(song_id=0, topn=5):
  # song info
  song_info = songs_df.iloc[song_id]
  song_tags = [', '.join(tags_for_song(song_id))]
  query_song = pd.DataFrame({'title':song_info.title,
                             'artist':song_info.artist,
                             'tags':song_tags})

  # similar songs
  similar_songs = np.array(model.wv.most_similar(positive=str(song_id), topn=topn))[:,0]
  recommendations = songs_df.iloc[similar_songs]
  recommendations['tags'] = [tags_for_song(i) for i in similar_songs]

  recommendations = pd.concat([query_song, recommendations])

  axis_name = ['Query'] + ['Recommendation '+str((i+1)) for i in range(topn)]
  # recommendations.index = axis_name
  recommendations = recommendations.style.set_table_styles([{'selector': 'th', 'props': [('background-color', 'gray')]}])
  
  return recommendations

In [ ]:

recs = recommend(10)
recs

Out[ ]:

	title	artist	tags
Query	Shake It	Elephant Man	no tag
Recommendation 1	Pudrete	Banda MS	['no tag']
Recommendation 2	Let Me Know	Roisin Murphy	['rock', 'pop', 'favorites', 'love', 'female vocalists', '00s', 'dance', 'favourites', 'cool', 'chillout', 'electronic', 'sexy', 'british', 'upbeat', 'sad', 'seen live', 'indie pop', 'love it', 'electronica', 'female', 'good stuff', 'uk', 'lovely', 'disco', 'electro', 'favorite artists', '2007']
Recommendation 3	In This Lifetime	The Psycho Realm	['no tag']
Recommendation 4	Take A Bow	Rihanna	['pop', 'love', 'american', 'beautiful', 'soul', 'female vocalists', '00s', 'mellow', 'favorite', 'dance', 'favourites', 'cool', 'chillout', 'rnb', 'sexy', 'female vocalist', 'hip-hop', 'love songs', 'sad', 'hip hop', 'ballad', 'piano', 'memories', 'relaxing', 'love at first listen', 'female', 'r&b', 'slow', 'sweet', 'love song', 'soft', 'rb', 'r and b', 'emo', '<3', 'slow jams', 'major key tonality', 'guilty pleasures', 'emotional', '2008', 'a subtle use of vocal harmony', 'cute']
Recommendation 5	Get Right	Jennifer Lopez	['pop', 'favorites', 'love', 'american', 'soul', 'female vocalists', '00s', 'dance', 'singer-songwriter', '90s', 'favourites', 'cool', 'catchy', 'rnb', 'sexy', 'fun', 'party', 'happy', 'female vocalist', 'hip-hop', 'funk', 'upbeat', 'hip hop', 'female', 'funky', 'r&b', '2000s', 'latin', 'energetic', 'top 40', 'vocal', 'female vocals', 'english', 'urban', 'uplifting', 'r and b', 'hardcore', 'guilty pleasures', 'guilty pleasure', 'hiphop', 'new york', 'sing along', 'feelgood']

Paranoid Android - Radiohead¶

In [ ]:

recommend(song_id=19563)

Out[ ]:

	title	artist	tags
0	Paranoid Android	Radiohead	rock, pop, favorites, alternative, love, male vocalists, indie, classic rock, awesome, beautiful, mellow, alternative rock, favorite, chill, 90s, classic, favourites, chillout, indie rock, guitar, favorite songs, male vocalist, electronic, loved, british, favourite, soundtrack, amazing, sad, favourite songs, great song, ballad, melancholy, epic, experimental, psychedelic, memories, electronica, love at first listen, fucking awesome, progressive rock, great, best, nostalgia, melancholic, fav, good stuff, uk, great lyrics, ambient, perfect, psychedelic rock, dark, britpop, brilliant, alternative punk, progressive, emotional, masterpiece, best songs ever, rockin, genius, all time favourites, alt rock, 1990s
43036	Que Te Quieran Mas Que Yo	Marco Antonio Solis	['no tag']
64157	Paryer And Meditation	Jessica Williams	['no tag']
65275	Hallelujah Goat	Volbeat	['rock', 'awesome', 'hard rock', 'metal', 'heavy metal', 'good', 'rock and roll']
66070	You're My Christmas Present	Jimmy Beaumont & The Skyliners	['christmas']
16550	Jump Start	Nils	['no tag']

California Love - 2Pac¶

In [ ]:

recommend(song_id=842)

Out[ ]:

	title	artist	tags
0	California Love (w\/ Dr. Dre & Roger Troutman)	2Pac	favorites, love, oldies, dance, 90s, classic, loved, party, hip-hop, hip hop, rap, fav, old school, songs i absolutely love, hiphop, acclaimed music top 3000, california, 1990s
20597	Monk'n Around	Ryan Cohan	['no tag']
41172	Nadie Como Tu	Ramon Ayala Y Sus Bravos Del Norte	['no tag']
44549	Crash	The Primitives	['rock', 'pop', 'favorites', 'alternative', 'indie', 'female vocalists', '80s', '90s', 'cool', 'catchy', 'party', 'favourite', 'happy', 'female vocalist', 'upbeat', 'soundtrack', 'indie pop', 'memories', 'female', 'new wave', 'uplifting', 'britpop', 'major key tonality', "80's", '1980s', 'acclaimed music top 3000', 'rockin']
53636	Always There For You	Stryper	['80s', 'hard rock', 'heavy metal', 'christian', 'christian rock']
48409	-	-	['no tag']

Billie Jean - Michael Jackson¶

In [ ]:

recommend(song_id=3822)

Out[ ]:

	title	artist	tags
0	Billie Jean	Michael Jackson	rock, pop, favorites, alternative, love, male vocalists, american, classic rock, awesome, beautiful, soul, oldies, favorite, 80s, dance, singer-songwriter, 90s, classic, favourites, cool, 70s, catchy, favorite songs, rnb, male vocalist, electronic, sexy, loved, fun, party, favourite, pop rock, funk, amazing, usa, rhythm and blues, memories, funky, r&b, best, nostalgia, energetic, top 40, old school, nice, english, rb, urban, groovy, disco, perfect, guilty pleasures, 80's, brilliant, my favorites, guilty pleasure, 1980s, retro, masterpiece, motown, classics, best songs ever, classic soul, legend
48164	Heartbeat	Could Nothings	['no tag']
30835	Steve's Tune	Steve Lambert	['no tag']
11901	So Much Trouble In The World	Bob Marley & The Wailers	['mellow', 'chill', 'classic', 'favourites', 'cool', '70s', 'favourite', 'sad', 'great song', 'summer', 'relax', 'best', 'drjazzmrfunkmusic', 'reggae', 'the best', 'legend']
70557	Carmelita	Warren Zevon	['rock', 'classic rock', 'mellow', 'singer-songwriter', '70s', 'hard rock', 'guitar', 'male vocalist', 'folk', 'acoustic', 'progressive rock', 'vocal', 'folk rock', 'americana', 'radioparadise', 'covers', 'perfect', 'radio paradise', 'my pop music', 'southern rock']
14550	Heavy	Collective Soul	['rock', 'pop', 'favorites', 'alternative', 'male vocalists', 'awesome', '00s', 'alternative rock', '90s', 'hard rock', 'loved', 'upbeat', 'great song', 'memories', 'progressive rock', 'grunge', 'faves', 'heavy', 'a subtle use of vocal harmony', 'favs']

Song Embeddings - Skipgram Recommender¶

Downloading data¶

Setup¶

Training dataset¶

Training Word2vec¶

Prepare songs metadata¶

Title and artist¶

Tags¶

Recommend¶

Paranoid Android - Radiohead¶

California Love - 2Pac¶

Billie Jean - Michael Jackson¶