Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson, EPFL LTS2.
From raw_*.csv, this notebook generates:
tracks.csv: per-track / album / artist metadata.
genres.csv: genre hierarchy.
echonest.csv: cleaned Echonest features.
A companion script, creation.py:
raw_tracks.csv, raw_albums.csv, raw_artists.csv and raw_genres.csv.
.zip archives.
import os
import ast
import pickle
import IPython.display as ipd
import numpy as np
import pandas as pd
import utils
import creation
AUDIO_DIR = os.environ.get('AUDIO_DIR')
BASE_DIR = os.path.abspath(os.path.dirname(AUDIO_DIR))
FMA_FULL = os.path.join(BASE_DIR, 'fma_full')
FMA_LARGE = os.path.join(BASE_DIR, 'fma_large')
.mp3 by HTTPS for each track id (only if we don't have it already).
Todo:
track_image_file, album_image_file, artist_image_file). Beware the quality.
Dataset update:
# ./creation.py metadata
# ./creation.py data /path/to/fma/fma_full
# ./creation.py clips /path/to/fma
#!cat creation.py
# converters={'genres': ast.literal_eval}
tracks = pd.read_csv('raw_tracks.csv', index_col=0)
albums = pd.read_csv('raw_albums.csv', index_col=0)
artists = pd.read_csv('raw_artists.csv', index_col=0)
genres = pd.read_csv('raw_genres.csv', index_col=0)
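# Use a context manager so that the pickle file is closed after loading.
with open('not_found.pickle', 'rb') as f:
    not_found = pickle.load(f)
def get_fs_tids(audio_dir):
    # Collect track ids from the file names found in leaf directories.
    tids = []
    for _, dirnames, files in os.walk(audio_dir):
        if dirnames == []:
            tids.extend(int(file[:-4]) for file in files)
    return tids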
audio_tids = get_fs_tids(FMA_FULL)
clips_tids = get_fs_tids(FMA_LARGE)
print('tracks: {} collected ({} not found, {} max id)'.format(
    len(tracks), len(not_found['tracks']), tracks.index.max()))
print('albums: {} collected ({} not found, {} in tracks)'.format(
    len(albums), len(not_found['albums']), len(tracks['album_id'].unique())))
print('artists: {} collected ({} not found, {} in tracks)'.format(
    len(artists), len(not_found['artists']), len(tracks['artist_id'].unique())))
print('genres: {} collected'.format(len(genres)))
print('audio: {} collected ({} not found, {} not in tracks)'.format(
    len(audio_tids), len(not_found['audio']), len(set(audio_tids).difference(tracks.index))))
print('clips: {} collected ({} not found, {} not in tracks)'.format(
    len(clips_tids), len(not_found['clips']), len(set(clips_tids).difference(tracks.index))))
assert sum(tracks.index.isin(audio_tids)) + len(not_found['audio']) == len(tracks)
assert sum(tracks.index.isin(clips_tids)) + len(not_found['clips']) == sum(tracks.index.isin(audio_tids))
assert len(clips_tids) + len(not_found['clips']) + len(not_found['audio']) == len(tracks)
tracks: 109727 collected (45594 not found, 155320 max id)
albums: 15234 collected (480 not found, 15714 in tracks)
artists: 16916 collected (250 not found, 17166 in tracks)
genres: 164 collected
audio: 110668 collected (180 not found, 1121 not in tracks)
clips: 109261 collected (286 not found, 0 not in tracks)
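As a rough illustration of how get_fs_tids derives track ids from file names, the sketch below builds a small temporary tree and walks it. The directory layout (a three-digit folder holding six-digit .mp3 names) is an assumption made for the example, not something this notebook states.

```python
import os
import tempfile


def get_fs_tids(audio_dir):
    """Collect track ids from files named like 000002.mp3 in leaf directories."""
    tids = []
    for _, dirnames, files in os.walk(audio_dir):
        if not dirnames:  # only leaf directories hold audio files
            tids.extend(int(name[:-4]) for name in files)
    return tids


# Hypothetical layout for illustration: <root>/000/000002.mp3
with tempfile.TemporaryDirectory() as root:
    for tid in (2, 5, 1010):
        name = '{:06d}'.format(tid)
        folder = os.path.join(root, name[:3])
        os.makedirs(folder, exist_ok=True)
        open(os.path.join(folder, name + '.mp3'), 'w').close()
    found = sorted(get_fs_tids(root))

print(found)
```

Note that the slice name[:-4] assumes a four-character extension such as .mp3; any other file in a leaf directory would raise a ValueError or yield a wrong id.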
N = 5
ipd.display(tracks.head(N))
ipd.display(albums.head(N))
ipd.display(artists.head(N))
ipd.display(genres.head(N))
album_id | album_title | album_url | artist_id | artist_name | artist_url | artist_website | license_image_file | license_image_file_large | license_parent_id | ... | track_information | track_instrumental | track_interest | track_language_code | track_listens | track_lyricist | track_number | track_publisher | track_title | track_url | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
track_id | |||||||||||||||||||||
2 | 1.0 | AWOL - A Way Of Life | http://freemusicarchive.org/music/AWOL/AWOL_-_... | 1 | AWOL | http://freemusicarchive.org/music/AWOL/ | http://www.AzillionRecords.blogspot.com | http://i.creativecommons.org/l/by-nc-sa/3.0/us... | http://fma-files.s3.amazonaws.com/resources/im... | 5.0 | ... | NaN | 0 | 4656 | en | 1293 | NaN | 3 | NaN | Food | http://freemusicarchive.org/music/AWOL/AWOL_-_... |
3 | 1.0 | AWOL - A Way Of Life | http://freemusicarchive.org/music/AWOL/AWOL_-_... | 1 | AWOL | http://freemusicarchive.org/music/AWOL/ | http://www.AzillionRecords.blogspot.com | http://i.creativecommons.org/l/by-nc-sa/3.0/us... | http://fma-files.s3.amazonaws.com/resources/im... | 5.0 | ... | NaN | 0 | 1470 | en | 514 | NaN | 4 | NaN | Electric Ave | http://freemusicarchive.org/music/AWOL/AWOL_-_... |
5 | 1.0 | AWOL - A Way Of Life | http://freemusicarchive.org/music/AWOL/AWOL_-_... | 1 | AWOL | http://freemusicarchive.org/music/AWOL/ | http://www.AzillionRecords.blogspot.com | http://i.creativecommons.org/l/by-nc-sa/3.0/us... | http://fma-files.s3.amazonaws.com/resources/im... | 5.0 | ... | NaN | 0 | 1933 | en | 1151 | NaN | 6 | NaN | This World | http://freemusicarchive.org/music/AWOL/AWOL_-_... |
10 | 6.0 | Constant Hitmaker | http://freemusicarchive.org/music/Kurt_Vile/Co... | 6 | Kurt Vile | http://freemusicarchive.org/music/Kurt_Vile/ | http://kurtvile.com | http://i.creativecommons.org/l/by-nc-nd/3.0/88... | http://fma-files.s3.amazonaws.com/resources/im... | NaN | ... | NaN | 0 | 54881 | en | 50135 | NaN | 1 | NaN | Freeway | http://freemusicarchive.org/music/Kurt_Vile/Co... |
20 | 4.0 | Niris | http://freemusicarchive.org/music/Chris_and_Ni... | 4 | Nicky Cook | http://freemusicarchive.org/music/Chris_and_Ni... | NaN | http://i.creativecommons.org/l/by-nc-nd/3.0/88... | http://fma-files.s3.amazonaws.com/resources/im... | NaN | ... | NaN | 0 | 978 | en | 361 | NaN | 3 | NaN | Spiritual Level | http://freemusicarchive.org/music/Chris_and_Ni... |
5 rows × 38 columns
album_comments | album_date_created | album_date_released | album_engineer | album_favorites | album_handle | album_image_file | album_images | album_information | album_listens | album_producer | album_title | album_tracks | album_type | album_url | artist_name | artist_url | tags | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
album_id | ||||||||||||||||||
1 | 0 | 11/26/2008 01:44:45 AM | 1/05/2009 | NaN | 4 | AWOL_-_A_Way_Of_Life | https://freemusicarchive.org/file/images/album... | [{'image_id': '1955', 'image_file': 'https://f... | <p></p> | 6073 | NaN | AWOL - A Way Of Life | 7 | Album | http://freemusicarchive.org/music/AWOL/AWOL_-_... | AWOL | http://freemusicarchive.org/music/AWOL/ | [] |
100 | 0 | 11/26/2008 01:55:44 AM | 1/09/2009 | NaN | 0 | On_Opaque_Things | https://freemusicarchive.org/file/images/album... | [{'image_id': '4403', 'image_file': 'https://f... | NaN | 5613 | NaN | On Opaque Things | 4 | Album | http://freemusicarchive.org/music/Bird_Names/O... | Bird Names | http://freemusicarchive.org/music/Bird_Names/ | [] |
1000 | 0 | 12/04/2008 09:28:49 AM | 10/26/2008 | NaN | 0 | DMBQ_Live_at_2008_Record_Fair_on_WFMU_Record_F... | https://freemusicarchive.org/file/images/album... | [{'image_id': '31997', 'image_file': 'https://... | <p>http://blog.wfmu.org/freeform/2008/10/what-... | 1092 | NaN | DMBQ Live at 2008 Record Fair on WFMU Record F... | 4 | Live Performance | http://freemusicarchive.org/music/DMBQ/DMBQ_Li... | DMBQ | http://freemusicarchive.org/music/DMBQ/ | [] |
10000 | 0 | 9/05/2011 04:42:57 PM | NaN | NaN | 0 | Live_at_CKUT_on_Montreal_Sessions_1434 | https://freemusicarchive.org/file/images/album... | [{'image_id': '12266', 'image_file': 'https://... | <p>Live Set on the Montreal Session February 2... | 1001 | NaN | Live at CKUT on Montreal Sessions | 1 | Radio Program | http://freemusicarchive.org/music/Sundrips/Liv... | Sundrips | http://freemusicarchive.org/music/Sundrips/ | [] |
10001 | 0 | 9/06/2011 12:02:58 AM | 1/01/2006 | NaN | 0 | Grounds_Dream_Cosmic_Love | https://freemusicarchive.org/file/images/album... | [{'image_id': '24091', 'image_file': 'https://... | <p>Recorded in Linnavuori, Finland, 2005 (with... | 504 | NaN | Ground's Dream Cosmic Love | 1 | Album | http://freemusicarchive.org/music/Uton/Grounds... | Uton | http://freemusicarchive.org/music/Uton/ | [] |
artist_active_year_begin | artist_active_year_end | artist_associated_labels | artist_bio | artist_comments | artist_contact | artist_date_created | artist_donation_url | artist_favorites | artist_flattr_name | ... | artist_location | artist_longitude | artist_members | artist_name | artist_paypal_name | artist_related_projects | artist_url | artist_website | artist_wikipedia_page | tags | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
artist_id | |||||||||||||||||||||
1 | 2006.0 | NaN | NaN | <p>A Way Of Life, A Collective of Hip-Hop from... | 0 | Brown Bum aka Choke | 11/26/2008 01:42:32 AM | NaN | 9 | NaN | ... | New Jersey | -74.405661 | Sajje Morocco,Brownbum,ZawidaGod,Custodian of ... | AWOL | NaN | The list of past projects is 2 long but every1... | http://freemusicarchive.org/music/AWOL/ | http://www.AzillionRecords.blogspot.com | NaN | ['awol'] |
10 | NaN | NaN | Mistletone, Marriage Records | <p>"Lucky Dragons" means any recorded or perfo... | 3 | Lukey Dargons | 11/26/2008 01:43:35 AM | http://glaciersofnice.com/shop/ | 111 | NaN | ... | Los Angeles, CA | -118.243685 | Luke Fischbeck\nSarah Rara | Lucky Dragons | NaN | NaN | http://freemusicarchive.org/music/Lucky_Dragons/ | http://hawksandsparrows.org/ | NaN | ['lucky dragons'] |
100 | 2004.0 | NaN | Captcha Records (HBSP-2X), Pickled Egg (Europe) | <p><span style="font-family:Verdana, Geneva, A... | 1 | Chris Kalis | 11/26/2008 02:05:22 AM | NaN | 8 | NaN | ... | Chicago, IL | -87.629798 | Chris Kalis, Harry Brenner, Scott McGaughey, B... | Chandeliers | NaN | Killer Whales, \nMichael Columbia\nMandate\nMr... | http://freemusicarchive.org/music/Chandeliers/ | thechandeliers.com | NaN | ['chandeliers'] |
1000 | NaN | NaN | NaN | <p><a href="http://marzipanmarzipan.com">Marzi... | 0 | NaN | 12/04/2008 09:24:35 AM | NaN | 0 | NaN | ... | NaN | 12.567380 | NaN | Marzipan Marzipan | NaN | NaN | http://freemusicarchive.org/music/Marzipan_Mar... | https://soundcloud.com/marzipanmarzipan | NaN | [] |
10000 | NaN | NaN | NaN | <p><span style="font-family:'Times New Roman',... | 0 | NaN | 1/21/2011 02:11:31 PM | NaN | 1 | NaN | ... | NaN | NaN | Jack Hertz\nPHOBoS\nBlue Hell | Jack Hertz, PHOBoS, Blue Hell | NaN | NaN | http://freemusicarchive.org/music/Jack_Hertz_P... | http://surrism.phonoethics.com/surrism-phonoet... | NaN | ['jack hertz phobos blue hell'] |
5 rows × 24 columns
genre_color | genre_handle | genre_parent_id | genre_title | |
---|---|---|---|---|
genre_id | ||||
1 | #006666 | Avant-Garde | 38.0 | Avant-Garde |
2 | #CC3300 | International | NaN | International |
3 | #000099 | Blues | NaN | Blues |
4 | #990099 | Jazz | NaN | Jazz |
5 | #8A8A65 | Classical | NaN | Classical |
Todo:
artist_wikipedia_page, remove html markup in free-form text.
ffmpeg -i input.mp3 -f null -
df, column = tracks, 'tags'
null = sum(df[column].isnull())
print('{} null, {} non-null'.format(null, df.shape[0] - null))
df[column].value_counts().head(10)
0 null, 109727 non-null
[]    85881
['interiors c1964', 'existential', 'hardcore-punk', 'pop-punk', 'punk-rock', 'internet boyfriend', 'rew starr', 'public domain', 'creative commons', 'microsong challenge']    314
['classwar karaoke']    239
['all styles experimental']    215
['improvisation', 'not normal music', 'all styles experimental']    195
['era 1']    176
['all styles experimental', 'harsh noise', 'not normal music']    150
['music is a belief', 'chary', 'nishad', 'uju', 'ibiene', 'nazeem', 'deepu', 'maneet', 'azedine', 'mohammad']    140
['new zealand']    140
['improvisation', 'all styles experimental', 'not normal music']    128
Name: tags, dtype: int64
drop = [
    'license_image_file', 'license_image_file_large', 'license_parent_id', 'license_url',  # keep title only
    'track_file', 'track_image_file',  # used to download only
    'track_url', 'album_url', 'artist_url',  # only relevant on website
    'track_copyright_c', 'track_copyright_p',  # present for ~1000 tracks only
    # 'track_composer', 'track_lyricist', 'track_publisher',  # present for ~4000, <1000 and <2000 tracks
    'track_disc_number',  # different from 1 for <1000 tracks
    'track_explicit', 'track_explicit_notes',  # present for <4000 tracks
    'track_instrumental'  # ~6000 tracks have a 1, there is an instrumental genre
]
tracks.drop(drop, axis=1, inplace=True)
tracks.rename(columns={'license_title': 'track_license', 'tags': 'track_tags'}, inplace=True)
tracks['track_duration'] = tracks['track_duration'].map(creation.convert_duration)
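def convert_datetime(df, column, format=None):
    # Parse a string column into datetime64, modifying df in place.
    df[column] = pd.to_datetime(df[column], infer_datetime_format=True, format=format)
convert_datetime(tracks, 'track_date_created')
convert_datetime(tracks, 'track_date_recorded')
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which recent pandas versions may apply to a temporary copy.
tracks['album_id'] = tracks['album_id'].fillna(-1)
tracks['track_bit_rate'] = tracks['track_bit_rate'].fillna(-1)
tracks = tracks.astype({'album_id': int, 'track_bit_rate': int})
def convert_genres(genres):
    # Parse the string form of a list of genre dicts into a list of int ids.
    genres = ast.literal_eval(genres)
    return [int(genre['genre_id']) for genre in genres]
tracks['track_genres'] = tracks['track_genres'].fillna('[]')
tracks['track_genres'] = tracks['track_genres'].map(convert_genres)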
tracks.columns
Index(['album_id', 'album_title', 'artist_id', 'artist_name', 'artist_website', 'track_license', 'track_tags', 'track_bit_rate', 'track_comments', 'track_composer', 'track_date_created', 'track_date_recorded', 'track_duration', 'track_favorites', 'track_genres', 'track_information', 'track_interest', 'track_language_code', 'track_listens', 'track_lyricist', 'track_number', 'track_publisher', 'track_title'], dtype='object')
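The genre conversion above can be tried in isolation. This is a minimal sketch: the raw strings are hypothetical, and the only key the parser relies on is genre_id (the genre_title key is an assumption added for readability).

```python
import ast


def convert_genres(raw):
    """Parse the string form of a list of genre dicts into a list of int ids."""
    return [int(genre['genre_id']) for genre in ast.literal_eval(raw)]


# Hypothetical raw cell values; the real CSV may carry other keys per genre.
raw = "[{'genre_id': '21', 'genre_title': 'Hip-Hop'}, {'genre_id': '10', 'genre_title': 'Pop'}]"
parsed = convert_genres(raw)
empty = convert_genres('[]')  # missing values are filled with '[]' beforehand
print(parsed, empty)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for strings read from a CSV file.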
drop = [
    'artist_name', 'album_url', 'artist_url',  # in tracks already (though it can be different)
    'album_handle',
    'album_image_file', 'album_images',  # todo: shall be downloaded
    # 'album_producer', 'album_engineer',  # present for ~2400 albums only
]
albums.drop(drop, axis=1, inplace=True)
albums.rename(columns={'tags': 'album_tags'}, inplace=True)
convert_datetime(albums, 'album_date_created')
convert_datetime(albums, 'album_date_released')
albums.columns
Index(['album_comments', 'album_date_created', 'album_date_released', 'album_engineer', 'album_favorites', 'album_information', 'album_listens', 'album_producer', 'album_title', 'album_tracks', 'album_type', 'album_tags'], dtype='object')
drop = [
    'artist_website', 'artist_url',  # in tracks already (though it can be different)
    'artist_handle',
    'artist_image_file', 'artist_images',  # todo: shall be downloaded
    'artist_donation_url', 'artist_paypal_name', 'artist_flattr_name',  # ~1600 & ~400 & ~70, not relevant
    'artist_contact',  # ~1500, not very useful data
    # 'artist_active_year_begin', 'artist_active_year_end',  # ~1400, ~500 only
    # 'artist_associated_labels',  # ~1000
    # 'artist_related_projects',  # only ~800, but can be combined with bio
]
artists.drop(drop, axis=1, inplace=True)
artists.rename(columns={'tags': 'artist_tags'}, inplace=True)
convert_datetime(artists, 'artist_date_created')
for column in ['artist_active_year_begin', 'artist_active_year_end']:
    # Treat a year of 0 as missing, then parse the float year (e.g. 2006.0).
    artists[column] = artists[column].replace(0.0, np.nan)
    convert_datetime(artists, column, format='%Y.0')
artists.columns
Index(['artist_active_year_begin', 'artist_active_year_end', 'artist_associated_labels', 'artist_bio', 'artist_comments', 'artist_date_created', 'artist_favorites', 'artist_latitude', 'artist_location', 'artist_longitude', 'artist_members', 'artist_name', 'artist_related_projects', 'artist_wikipedia_page', 'artist_tags'], dtype='object')
not_found['albums'].remove(None)
not_found['albums'].append(-1)
not_found['albums'] = [int(i) for i in not_found['albums']]
not_found['artists'] = [int(i) for i in not_found['artists']]
tracks = tracks.merge(albums, left_on='album_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))
n = sum(tracks['album_title_dup'].isnull())
print('{} tracks without extended album information ({} tracks without album_id)'.format(
    n, sum(tracks['album_id'] == -1)))
assert sum(tracks['album_id'].isin(not_found['albums'])) == n
assert sum(tracks['album_title'] != tracks['album_title_dup']) == n
tracks.drop('album_title_dup', axis=1, inplace=True)
assert not any('dup' in col for col in tracks.columns)
3674 tracks without extended album information (1041 tracks without album_id)
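The left merge with a '_dup' suffix, and the consistency checks built on it, can be illustrated on toy data. The frames and values below are invented for the example; only the merge arguments mirror the notebook.

```python
import pandas as pd

# Toy data: album 3 is referenced by a track but absent from the albums table.
tracks = pd.DataFrame(
    {'album_id': [1, 1, 3], 'album_title': ['A', 'A', 'C']},
    index=pd.Index([10, 11, 12], name='track_id'))
albums = pd.DataFrame(
    {'album_title': ['A'], 'album_listens': [42]},
    index=[1])

merged = tracks.merge(albums, left_on='album_id', right_index=True,
                      how='left', suffixes=('', '_dup'))

# Rows with no match get NaN in every column coming from the right table.
n_missing = int(merged['album_title_dup'].isnull().sum())
# NaN never equals a string, so mismatches count exactly the unmatched rows
# as long as the two sources agree everywhere else.
n_mismatch = int((merged['album_title'] != merged['album_title_dup']).sum())
merged = merged.drop(columns='album_title_dup')
print(n_missing, n_mismatch)
```

If the two counts differ, at least one matched row carries a different title in the two sources, which is what the assertions in the notebook guard against.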
# Album artist can be different than track artist. Keep track artist.
#tracks[tracks['artist_name'] != tracks['artist_name_dup']].select(lambda x: 'artist_name' in x, axis=1)
tracks = tracks.merge(artists, left_on='artist_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))
n = sum(tracks['artist_name_dup'].isnull())
print('{} tracks without extended artist information'.format(n))
assert sum(tracks['artist_id'].isin(not_found['artists'])) == n
assert sum(tracks['artist_name'] != tracks[('artist_name_dup')]) == n
tracks.drop('artist_name_dup', axis=1, inplace=True)
assert not any('dup' in col for col in tracks.columns)
974 tracks without extended artist information
columns = []
for name in tracks.columns:
    # 'album_date_created' --> ('album', 'date_created')
    names = name.split('_')
    columns.append((names[0], '_'.join(names[1:])))
tracks.columns = pd.MultiIndex.from_tuples(columns)
assert all(label in ['track', 'album', 'artist'] for label in tracks.columns.get_level_values(0))
# Todo: fill other columns ?
# Assign the filled columns back instead of relying on fillna(inplace=True)
# on a column selection, which recent pandas versions may apply to a copy.
tracks['album', 'tags'] = tracks['album', 'tags'].fillna('[]')
tracks['artist', 'tags'] = tracks['artist', 'tags'].fillna('[]')
columns = [('album', 'favorites'), ('album', 'comments'), ('album', 'listens'), ('album', 'tracks'),
           ('artist', 'favorites'), ('artist', 'comments')]
for column in columns:
    tracks[column] = tracks[column].fillna(-1)
columns = {column: int for column in columns}
tracks = tracks.astype(columns)
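The prefix-based column split used earlier in this section turns flat names into a two-level column index. A small sketch on an invented one-row frame; it uses split('_', 1), which should give the same result as splitting on every underscore and re-joining the tail whenever the name contains at least one underscore.

```python
import pandas as pd

flat = pd.DataFrame({'track_title': ['Food'],
                     'album_date_created': ['2008-11-26'],
                     'artist_name': ['AWOL']})

# Split each name at the first underscore: ('album', 'date_created'), etc.
flat.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split('_', 1)) for name in flat.columns])

top_levels = sorted(set(flat.columns.get_level_values(0)))
album_columns = list(flat['album'].columns)
print(top_levels, album_columns)
```

Selecting flat['album'] then returns only the album columns with the prefix stripped, which is how the later cells address tracks['album', 'tags'] and similar.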
Todo: duplicates (metadata and audio)
def keep(index, df):
    # Select rows by label or boolean mask and report how many were dropped.
    old = len(df)
    df = df.loc[index]
    new = len(df)
    print('{} lost, {} left'.format(old - new, new))
    return df
tracks = keep(tracks.index, tracks)
0 lost, 109727 left
# Audio not found or could not be trimmed.
tracks = keep(tracks.index.difference(not_found['audio']), tracks)
tracks = keep(tracks.index.difference(not_found['clips']), tracks)
180 lost, 109547 left
286 lost, 109261 left
Errors from the features.py script.
# Feature extraction failed.
FAILED = [1440, 26436, 28106, 29166, 29167, 29168, 29169, 29170, 29171, 29172,
          29173, 29179, 38903, 43903, 56757, 57603, 59361, 62095, 62954, 62956,
          62957, 62959, 62971, 75461, 80015, 86079, 92345, 92346, 92347, 92348,
          92349, 92350, 92351, 92352, 92353, 92354, 92355, 92356, 92357, 92358,
          92359, 92360, 92361, 96426, 104623, 106719, 109714, 114448, 114501, 114528,
          115235, 117759, 118003, 118004, 127827, 130296, 130298, 131076, 135804, 136486,
          144769, 144770, 144771, 144773, 144774, 144775, 144776, 144777, 144778, 152204,
          154923]
tracks = keep(tracks.index.difference(FAILED), tracks)
71 lost, 109190 left
# License forbids redistribution.
tracks = keep(tracks['track', 'license'] != 'FMA-Limited: Download Only', tracks)
print('{} licenses'.format(len(tracks[('track', 'license')].unique())))
2616 lost, 106574 left
114 licenses
#sum(tracks['track', 'title'].duplicated())
genres.drop(['genre_handle', 'genre_color'], axis=1, inplace=True)
genres.rename(columns={'genre_parent_id': 'parent', 'genre_title': 'title'}, inplace=True)
# A parent of 0 marks a root genre.
genres['parent'] = genres['parent'].fillna(0)
genres = genres.astype({'parent': int})
# 13 (Easy Listening) has parent 126 which is missing
# --> a root genre on the website, although not in the genre menu
genres.at[13, 'parent'] = 0
# 580 (Abstract Hip-Hop) has parent 1172 which is missing
# --> listed as child of Hip-Hop on the website
genres.at[580, 'parent'] = 21
# 810 (Nu-Jazz) has parent 51 which is missing
# --> listed as child of Easy Listening on website
genres.at[810, 'parent'] = 13
# 763 (Holiday) has parent 763 which is itself
# --> listed as child of Sound Effects on website
genres.at[763, 'parent'] = 16
# Todo: should novelty be under Experimental? It is alone on website.
# Genre 806 (hiphop) should not exist. Replace it by 21 (Hip-Hop).
print('{} tracks have genre 806'.format(
    sum(tracks['track', 'genres'].map(lambda genres: 806 in genres))))
def change_genre(genres):
    return [genre if genre != 806 else 21 for genre in genres]
tracks['track', 'genres'] = tracks['track', 'genres'].map(change_genre)
genres.drop(806, inplace=True)
34 tracks have genre 806
def get_parent(genre, track_all_genres=None):
    # Walk up the hierarchy and return the root genre (whose parent is 0).
    parent = genres.at[genre, 'parent']
    if track_all_genres is not None:
        track_all_genres.append(genre)
    return genre if parent == 0 else get_parent(parent, track_all_genres)
# Get all genres, i.e. all genres encountered when walking from leaves to roots.
def get_all_genres(track_genres):
    track_all_genres = list()
    for genre in track_genres:
        get_parent(genre, track_all_genres)
    return list(set(track_all_genres))
tracks['track', 'genres_all'] = tracks['track', 'genres'].map(get_all_genres)
# Number of tracks per genre.
def count_genres(subset=tracks.index):
    count = pd.Series(0, index=genres.index)
    for _, track_all_genres in tracks.loc[subset, ('track', 'genres_all')].items():
        for genre in track_all_genres:
            count[genre] += 1
    return count
genres['#tracks'] = count_genres()
genres[genres['#tracks'] == 0]
parent | title | #tracks | |
---|---|---|---|
genre_id | |||
175 | 86 | Bollywood | 0 |
178 | 4 | Be-Bop | 0 |
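The walk from a genre up to its root can be sketched without pandas. The parent table below is a toy hierarchy chosen for illustration (the child-to-parent links are assumptions, not read from genres.csv), and the walk is written as a loop rather than the recursion used above.

```python
# Toy hierarchy (child -> parent, 0 marks a root); links are illustrative only.
PARENT = {38: 0, 1: 38, 6: 38, 10: 0, 76: 10, 17: 0, 103: 17}


def ancestors(genre):
    """Return the genre followed by its ancestors, ending at the root."""
    path = [genre]
    while PARENT[path[-1]] != 0:
        path.append(PARENT[path[-1]])
    return path


def all_genres(track_genres):
    """Union of every genre met when walking from each genre up to its root."""
    found = set()
    for genre in track_genres:
        found.update(ancestors(genre))
    return sorted(found)


path = ancestors(1)
union = all_genres([76, 103])
print(path, union)
```

A genre listed as its own parent would make such a walk loop forever, and a parent missing from the table would raise a KeyError, which is presumably why the hierarchy is patched by hand before this step.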
def get_top_genre(track_genres):
    # Title of the root genre if all genres of the track share one root, else NaN.
    top_genres = set(genres.at[genres.at[genre, 'top_level'], 'title'] for genre in track_genres)
    return top_genres.pop() if len(top_genres) == 1 else np.nan
# Top-level genre.
genres['top_level'] = genres.index.map(get_parent)
tracks['track', 'genre_top'] = tracks['track', 'genres'].map(get_top_genre)
genres.head(10)
parent | title | #tracks | top_level | |
---|---|---|---|---|
genre_id | ||||
1 | 38 | Avant-Garde | 8693 | 38 |
2 | 0 | International | 5271 | 2 |
3 | 0 | Blues | 1752 | 3 |
4 | 0 | Jazz | 4126 | 4 |
5 | 0 | Classical | 4106 | 5 |
6 | 38 | Novelty | 914 | 38 |
7 | 20 | Comedy | 217 | 20 |
8 | 0 | Old-Time / Historic | 868 | 8 |
9 | 0 | Country | 1987 | 9 |
10 | 0 | Pop | 13845 | 10 |
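The rule behind genre_top (a track gets a top-level genre only when all of its genres share a single root) can be shown on a toy mapping. The root titles below appear in the tables of this notebook, but the child-to-root links are assumptions made for the example.

```python
import math

# Toy mapping genre id -> root genre id, and root id -> title (illustrative).
TOP_LEVEL = {21: 21, 10: 10, 76: 10, 17: 17, 103: 17}
TITLE = {21: 'Hip-Hop', 10: 'Pop', 17: 'Folk'}


def top_genre(track_genres):
    """Return the single root title, or NaN when there are zero or several roots."""
    roots = {TITLE[TOP_LEVEL[genre]] for genre in track_genres}
    return roots.pop() if len(roots) == 1 else math.nan


single = top_genre([10, 76])      # both genres sit under the same root
ambiguous = top_genre([76, 103])  # two different roots
print(single, ambiguous)
```

Tracks with an ambiguous or missing root end up with NaN, so filtering on a non-null genre_top later removes them from the single-label subsets.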
Main characteristic: the full set with clips trimmed to a manageable size.
Main characteristic: clean metadata (includes 1 top-level genre) and quality audio.
fma_medium = pd.DataFrame(tracks)
# Missing meta-information.
# Missing extended album and artist information.
fma_medium = keep(~fma_medium['album', 'id'].isin(not_found['albums']), fma_medium)
fma_medium = keep(~fma_medium['artist', 'id'].isin(not_found['artists']), fma_medium)
# Untitled track or album.
fma_medium = keep(~fma_medium['track', 'title'].isnull(), fma_medium)
fma_medium = keep(fma_medium['track', 'title'].map(lambda x: 'untitled' in x.lower()) == False, fma_medium)
fma_medium = keep(fma_medium['album', 'title'].map(lambda x: 'untitled' in x.lower()) == False, fma_medium)
# One tag is often just the artist name. Tags too scarce for tracks and albums.
#keep(fma_medium['artist', 'tags'].map(len) >= 2, fma_medium)
# Too scarce.
#fma_medium = keep(~fma_medium['album', 'information'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'bio'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'website'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'wikipedia_page'].isnull(), fma_medium)
# Too scarce.
#fma_medium = keep(~fma_medium['artist', 'location'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'latitude'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'longitude'].isnull(), fma_medium)
3529 lost, 103045 left
598 lost, 102447 left
1 lost, 102446 left
674 lost, 101772 left
65 lost, 101707 left
# Technical quality.
# Todo: sample rate
fma_medium = keep(fma_medium['track', 'bit_rate'] > 100000, fma_medium)
# Choosing standard bit rates discards all VBR.
#fma_medium = keep(fma_medium['track', 'bit_rate'].isin([320000, 256000, 192000, 160000, 128000]), fma_medium)
1326 lost, 100381 left
fma_medium = keep(fma_medium['track', 'duration'] >= 60, fma_medium)
fma_medium = keep(fma_medium['track', 'duration'] <= 600, fma_medium)
fma_medium = keep(fma_medium['album', 'tracks'] >= 1, fma_medium)
fma_medium = keep(fma_medium['album', 'tracks'] <= 50, fma_medium)
4736 lost, 95645 left
5399 lost, 90246 left
466 lost, 89780 left
5353 lost, 84427 left
# Lower popularity bound.
fma_medium = keep(fma_medium['track', 'listens'] >= 100, fma_medium)
fma_medium = keep(fma_medium['track', 'interest'] >= 200, fma_medium)
fma_medium = keep(fma_medium['album', 'listens'] >= 1000, fma_medium);
# Favorites and comments are very scarce.
#fma_medium = keep(fma_medium['artist', 'favorites'] >= 1, fma_medium)
4941 lost, 79486 left
1064 lost, 78422 left
1769 lost, 76653 left
# Targeted genre classification.
fma_medium = keep(~fma_medium['track', 'genre_top'].isnull(), fma_medium);
#keep(fma_medium['track', 'genres'].map(len) == 1, fma_medium);
42495 lost, 34158 left
# Adjust size with popularity measure. Should be of better quality.
N_TRACKS = 25000
# Observations
# * More albums killed than artists --> be sure not to kill diversity
# * Favorites and preterites genres differently --> do it per genre?
# Normalization
# * mean, median, std, max
# * tracks per album or artist
# Test
# * 4/5 of same tracks were selected with various set of measures
# * <5% diff with max and mean
popularity_measures = [('track', 'listens'), ('track', 'interest')] # ('album', 'listens')
# ('track', 'favorites'), ('track', 'comments'),
# ('album', 'favorites'), ('album', 'comments'),
# ('artist', 'favorites'), ('artist', 'comments'),
normalization = {measure: fma_medium[measure].max() for measure in popularity_measures}
def popularity_measure(track):
return sum(track[measure] / normalization[measure] for measure in popularity_measures)
fma_medium['popularity_measure'] = fma_medium.apply(popularity_measure, axis=1)
fma_medium = keep(fma_medium.sort_values('popularity_measure', ascending=False).index[:N_TRACKS], fma_medium)
9158 lost, 25000 left
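The popularity score above is a sum of measures, each divided by its maximum, followed by a top-N cut. A vectorized sketch on invented numbers (the notebook applies a row-wise function instead; the column names here are simplified to a flat index):

```python
import pandas as pd

df = pd.DataFrame({'listens': [100, 400, 200, 50],
                   'interest': [1000, 500, 2000, 100]},
                  index=[1, 2, 3, 4])

measures = ['listens', 'interest']
# Divide each measure by its maximum so that both lie in (0, 1], then add them.
score = (df[measures] / df[measures].max()).sum(axis=1)
# Keep the ids of the two highest-scoring rows.
top = list(score.sort_values(ascending=False).index[:2])
print(top)
```

Normalizing by the maximum keeps the measures on comparable scales, but a single outlier in one measure shrinks every other value of that measure, which may be why the notes above also list mean, median and std as candidates.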
tmp = genres[genres['parent'] == 0].reset_index().set_index('title')
tmp['#tracks_medium'] = fma_medium['track', 'genre_top'].value_counts()
tmp.sort_values('#tracks_medium', ascending=False)
genre_id | parent | #tracks | top_level | #tracks_medium | |
---|---|---|---|---|---|
title | |||||
Rock | 12 | 0 | 32923 | 12 | 7103 |
Electronic | 15 | 0 | 34413 | 15 | 6314 |
Experimental | 38 | 0 | 38154 | 38 | 2251 |
Hip-Hop | 21 | 0 | 8389 | 21 | 2201 |
Folk | 17 | 0 | 12706 | 17 | 1519 |
Instrumental | 1235 | 0 | 14938 | 1235 | 1350 |
Pop | 10 | 0 | 13845 | 10 | 1186 |
International | 2 | 0 | 5271 | 2 | 1018 |
Classical | 5 | 0 | 4106 | 5 | 619 |
Old-Time / Historic | 8 | 0 | 868 | 8 | 510 |
Jazz | 4 | 0 | 4126 | 4 | 384 |
Country | 9 | 0 | 1987 | 9 | 178 |
Soul-RnB | 14 | 0 | 1499 | 14 | 154 |
Spoken | 20 | 0 | 1876 | 20 | 118 |
Blues | 3 | 0 | 1752 | 3 | 74 |
Easy Listening | 13 | 0 | 730 | 13 | 21 |
Main characteristic: genre balanced (and echonest features).
Choices:
Todo:
N_GENRES = 8
N_TRACKS = 1000
top_genres = tmp.sort_values('#tracks_medium', ascending=False)[:N_GENRES].index
fma_small = pd.DataFrame(fma_medium)
fma_small = keep(fma_small['track', 'genre_top'].isin(top_genres), fma_small)
2058 lost, 22942 left
for genre in top_genres:
    subset = fma_small[fma_small['track', 'genre_top'] == genre]
    # Drop all but the N_TRACKS most popular tracks of this genre.
    drop = subset.sort_values('popularity_measure').index[:-N_TRACKS]
    fma_small.drop(drop, inplace=True)
assert len(fma_small) == N_GENRES * N_TRACKS
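The genre-balanced selection can also be expressed with a group-wise head. This is a sketch on toy data with simplified column names, not the notebook's exact procedure: it keeps the most populated genres, then the most popular tracks within each.

```python
import pandas as pd

df = pd.DataFrame({
    'genre': ['Rock', 'Rock', 'Rock', 'Pop', 'Pop', 'Jazz'],
    'popularity': [0.9, 0.1, 0.5, 0.3, 0.8, 0.7],
}, index=[1, 2, 3, 4, 5, 6])

n_genres, n_tracks = 2, 2
# Most populated genres first; ties are not handled specially in this sketch.
top_genres = df['genre'].value_counts().index[:n_genres]
subset = df[df['genre'].isin(top_genres)]
# Keep the n_tracks most popular tracks within each genre.
balanced = (subset.sort_values('popularity', ascending=False)
                  .groupby('genre').head(n_tracks))
print(sorted(balanced.index))
```

The final size equals n_genres * n_tracks only if every selected genre holds at least n_tracks tracks, which is the condition the assertion in the notebook verifies.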
SUBSETS = ('small', 'medium', 'large')
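# An empty Series with an ordered categorical dtype creates an all-missing column.
# (The astype('category', categories=...) form is no longer accepted by recent pandas.)
tracks['set', 'subset'] = pd.Series(dtype=pd.CategoricalDtype(categories=SUBSETS, ordered=True))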
tracks.loc[tracks.index, ('set', 'subset')] = 'large'
tracks.loc[fma_medium.index, ('set', 'subset')] = 'medium'
tracks.loc[fma_small.index, ('set', 'subset')] = 'small'
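Storing the subset as an ordered categorical is what later allows a comparison such as subset <= 'medium' to select nested subsets. A minimal sketch with invented labels:

```python
import pandas as pd

SUBSETS = ('small', 'medium', 'large')
subset_dtype = pd.CategoricalDtype(categories=SUBSETS, ordered=True)

subset = pd.Series(['large', 'small', 'medium', 'large'], dtype=subset_dtype)

# With an ordered categorical, "<= 'medium'" selects small and medium tracks,
# which mirrors the nesting of small within medium within large.
in_medium = (subset <= 'medium').tolist()
in_large = (subset <= 'large').tolist()
print(in_medium, in_large)
```

Each track therefore stores only the smallest subset it belongs to, and membership in the larger ones follows from the category order.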
echonest = pd.read_csv('raw_echonest.csv', index_col=0, header=[0, 1, 2])
echonest = keep(~echonest['echonest', 'temporal_features'].isnull().any(axis=1), echonest)
echonest = keep(~echonest['echonest', 'audio_features'].isnull().any(axis=1), echonest)
echonest = keep(~echonest['echonest', 'social_features'].isnull().any(axis=1), echonest)
echonest = keep(echonest.index.isin(tracks.index), echonest);
keep(echonest.index.isin(fma_medium.index), echonest);
keep(echonest.index.isin(fma_small.index), echonest);
0 lost, 14511 left
205 lost, 14306 left
239 lost, 14067 left
938 lost, 13129 left
7848 lost, 5281 left
11835 lost, 1294 left
Take into account:
for genre in genres.index:
    # One boolean indicator column per genre title.
    tracks['genre', genres.at[genre, 'title']] = tracks['track', 'genres_all'].map(lambda genres: genre in genres)
SPLITS = ('training', 'test', 'validation')
PERCENTAGES = (0.8, 0.1, 0.1)
# An empty Series with a categorical dtype creates an all-missing column.
tracks['set', 'split'] = pd.Series(dtype=pd.CategoricalDtype(categories=SPLITS))
for subset in SUBSETS:
    tracks_subset = tracks['set', 'subset'] <= subset
    # Consider only top-level genres for small and medium.
    genre_list = list(tracks.loc[tracks_subset, ('track', 'genre_top')].unique())
    if subset == 'large':
        genre_list = list(genres['title'])
    while True:
        if len(genre_list) == 0:
            break
        # Choose most constrained genre, i.e. genre with the least unassigned artists.
        tracks_unsplit = tracks['set', 'split'].isnull()
        count = tracks[tracks_subset & tracks_unsplit].set_index(('artist', 'id'), append=True)['genre']
        # np.bool was removed from NumPy; the builtin bool is equivalent here.
        count = count.groupby(level=1).sum().astype(bool).sum()
        # idxmin returns the genre label; np.argmin on a Series returns a position.
        genre = count[genre_list].idxmin()
        genre_list.remove(genre)
        # Given genre, select artists.
        tracks_genre = tracks['genre', genre] == 1
        artists = tracks.loc[tracks_genre & tracks_subset & tracks_unsplit, ('artist', 'id')].value_counts()
        #print('-->', genre, len(artists))
        current = {split: np.sum(tracks_genre & tracks_subset & (tracks['set', 'split'] == split)) for split in SPLITS}
        # Assign artists with most tracks first.
        for artist, count in artists.items():
            choice = np.argmin([current[split] / percentage for split, percentage in zip(SPLITS, PERCENTAGES)])
            current[SPLITS[choice]] += count
            #assert tracks.loc[tracks['artist', 'id'] == artist, ('set', 'split')].isnull().all()
            tracks.loc[tracks['artist', 'id'] == artist, ('set', 'split')] = SPLITS[choice]
# Tracks without genre can only serve as unlabeled data for training, e.g. for semi-supervised algorithms.
no_genres = tracks['track', 'genres_all'].map(lambda genres: len(genres) == 0)
no_split = tracks['set', 'split'].isnull()
assert not (no_split & ~no_genres).any()
tracks.loc[no_split, ('set', 'split')] = 'training'
# Not needed any more.
tracks.drop('genre', axis=1, level=0, inplace=True)
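The inner loop of the split (assign each artist, largest first, to the split that is furthest below its target share) can be isolated in plain Python. This sketch covers a single genre only and uses invented track counts; the notebook additionally orders genres from most to least constrained and carries the counts over between genres.

```python
SPLITS = ('training', 'test', 'validation')
PERCENTAGES = (0.8, 0.1, 0.1)


def split_artists(track_counts):
    """Greedily assign artists (most tracks first) to the most under-filled split."""
    current = {split: 0 for split in SPLITS}
    assignment = {}
    for artist, count in sorted(track_counts.items(), key=lambda kv: -kv[1]):
        # Fill level of each split relative to its target share.
        fill = [current[s] / p for s, p in zip(SPLITS, PERCENTAGES)]
        choice = SPLITS[fill.index(min(fill))]
        current[choice] += count
        assignment[artist] = choice
    return assignment, current


# Hypothetical artist -> number of tracks.
assignment, sizes = split_artists({'a': 40, 'b': 30, 'c': 10, 'd': 10, 'e': 5, 'f': 5})
print(assignment, sizes)
```

Because whole artists are assigned, no artist appears in more than one split, at the price of proportions that only approximate 80/10/10 when a few artists hold many tracks, as in this toy input.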
for dataset in 'tracks', 'genres', 'echonest':
    # Sort rows and columns so that the exported files have a stable layout.
    eval(dataset).sort_index(axis=0, inplace=True)
    eval(dataset).sort_index(axis=1, inplace=True)
    params = dict(float_format='%.10f') if dataset == 'echonest' else dict()
    eval(dataset).to_csv(dataset + '.csv', **params)
# ./creation.py normalize /path/to/fma
# ./creation.py zips /path/to/fma
tracks = utils.load('tracks.csv')
tracks.dtypes
album   comments             int64
        date_created         datetime64[ns]
        date_released        datetime64[ns]
        engineer             object
        favorites            int64
        id                   int64
        information          category
        listens              int64
        producer             object
        tags                 object
        title                object
        tracks               int64
        type                 category
artist  active_year_begin    datetime64[ns]
        active_year_end      datetime64[ns]
        associated_labels    object
        bio                  category
        comments             int64
        date_created         datetime64[ns]
        favorites            int64
        id                   int64
        latitude             float64
        location             object
        longitude            float64
        members              object
        name                 object
        related_projects     object
        tags                 object
        website              object
        wikipedia_page       object
set     split                object
        subset               category
track   bit_rate             int64
        comments             int64
        composer             object
        date_created         datetime64[ns]
        date_recorded        datetime64[ns]
        duration             int64
        favorites            int64
        genre_top            category
        genres               object
        genres_all           object
        information          object
        interest             int64
        language_code        object
        license              category
        listens              int64
        lyricist             object
        number               int64
        publisher            object
        tags                 object
        title                object
dtype: object
N = 5
ipd.display(tracks['track'].head(N))
ipd.display(tracks['album'].head(N))
ipd.display(tracks['artist'].head(N))
bit_rate | comments | composer | date_created | date_recorded | duration | favorites | genre_top | genres | genres_all | information | interest | language_code | license | listens | lyricist | number | publisher | tags | title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
track_id | ||||||||||||||||||||
2 | 256000 | 0 | NaN | 2008-11-26 01:48:12 | 2008-11-26 | 168 | 2 | Hip-Hop | [21] | [21] | NaN | 4656 | en | Attribution-NonCommercial-ShareAlike 3.0 Inter... | 1293 | NaN | 3 | NaN | [] | Food |
3 | 256000 | 0 | NaN | 2008-11-26 01:48:14 | 2008-11-26 | 237 | 1 | Hip-Hop | [21] | [21] | NaN | 1470 | en | Attribution-NonCommercial-ShareAlike 3.0 Inter... | 514 | NaN | 4 | NaN | [] | Electric Ave |
5 | 256000 | 0 | NaN | 2008-11-26 01:48:20 | 2008-11-26 | 206 | 6 | Hip-Hop | [21] | [21] | NaN | 1933 | en | Attribution-NonCommercial-ShareAlike 3.0 Inter... | 1151 | NaN | 6 | NaN | [] | This World |
10 | 192000 | 0 | Kurt Vile | 2008-11-25 17:49:06 | 2008-11-26 | 161 | 178 | Pop | [10] | [10] | NaN | 54881 | en | Attribution-NonCommercial-NoDerivatives (aka M... | 50135 | NaN | 1 | NaN | [] | Freeway |
20 | 256000 | 0 | NaN | 2008-11-26 01:48:56 | 2008-01-01 | 311 | 0 | NaN | [76, 103] | [17, 10, 76, 103] | NaN | 978 | en | Attribution-NonCommercial-NoDerivatives (aka M... | 361 | NaN | 3 | NaN | [] | Spiritual Level |
comments | date_created | date_released | engineer | favorites | id | information | listens | producer | tags | title | tracks | type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
track_id | |||||||||||||
2 | 0 | 2008-11-26 01:44:45 | 2009-01-05 | NaN | 4 | 1 | <p></p> | 6073 | NaN | [] | AWOL - A Way Of Life | 7 | Album |
3 | 0 | 2008-11-26 01:44:45 | 2009-01-05 | NaN | 4 | 1 | <p></p> | 6073 | NaN | [] | AWOL - A Way Of Life | 7 | Album |
5 | 0 | 2008-11-26 01:44:45 | 2009-01-05 | NaN | 4 | 1 | <p></p> | 6073 | NaN | [] | AWOL - A Way Of Life | 7 | Album |
10 | 0 | 2008-11-26 01:45:08 | 2008-02-06 | NaN | 4 | 6 | NaN | 47632 | NaN | [] | Constant Hitmaker | 2 | Album |
20 | 0 | 2008-11-26 01:45:05 | 2009-01-06 | NaN | 2 | 4 | <p> "spiritual songs" from Nicky Cook</p> | 2710 | NaN | [] | Niris | 13 | Album |
active_year_begin | active_year_end | associated_labels | bio | comments | date_created | favorites | id | latitude | location | longitude | members | name | related_projects | tags | website | wikipedia_page | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
track_id | |||||||||||||||||
2 | 2006-01-01 | NaT | NaN | <p>A Way Of Life, A Collective of Hip-Hop from... | 0 | 2008-11-26 01:42:32 | 9 | 1 | 40.058324 | New Jersey | -74.405661 | Sajje Morocco,Brownbum,ZawidaGod,Custodian of ... | AWOL | The list of past projects is 2 long but every1... | [awol] | http://www.AzillionRecords.blogspot.com | NaN |
3 | 2006-01-01 | NaT | NaN | <p>A Way Of Life, A Collective of Hip-Hop from... | 0 | 2008-11-26 01:42:32 | 9 | 1 | 40.058324 | New Jersey | -74.405661 | Sajje Morocco,Brownbum,ZawidaGod,Custodian of ... | AWOL | The list of past projects is 2 long but every1... | [awol] | http://www.AzillionRecords.blogspot.com | NaN |
5 | 2006-01-01 | NaT | NaN | <p>A Way Of Life, A Collective of Hip-Hop from... | 0 | 2008-11-26 01:42:32 | 9 | 1 | 40.058324 | New Jersey | -74.405661 | Sajje Morocco,Brownbum,ZawidaGod,Custodian of ... | AWOL | The list of past projects is 2 long but every1... | [awol] | http://www.AzillionRecords.blogspot.com | NaN |
10 | NaT | NaT | Mexican Summer, Richie Records, Woodsist, Skul... | <p><span style="font-family:Verdana, Geneva, A... | 3 | 2008-11-26 01:42:55 | 74 | 6 | NaN | NaN | NaN | Kurt Vile, the Violators | Kurt Vile | NaN | [philly, kurt vile] | http://kurtvile.com | NaN |
20 | 1990-01-01 | 2011-01-01 | NaN | <p>Songs written by: Nicky Cook</p>\n<p>VOCALS... | 2 | 2008-11-26 01:42:52 | 10 | 4 | 51.895927 | Colchester England | 0.891874 | Nicky Cook\n | Nicky Cook | NaN | [instrumentals, experimental pop, post punk, e... | NaN | NaN |