FMA: A Dataset For Music Analysis

Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson, EPFL LTS2.

Creation

From raw_*.csv, this notebook generates:

tracks.csv: per-track / album / artist metadata.
genres.csv: genre hierarchy.
echonest.csv: cleaned Echonest features.

A companion script, creation.py, is used to:

Query the API and store metadata in raw_tracks.csv, raw_albums.csv, raw_artists.csv and raw_genres.csv.
Download the audio for each track.
Trim the audio to 30s clips.
Normalize the permissions and modification / access times.
Create the .zip archives.

In [1]:

import os
import ast
import pickle

import IPython.display as ipd
import numpy as np
import pandas as pd

import utils
import creation

In [2]:

AUDIO_DIR = os.environ.get('AUDIO_DIR')
BASE_DIR = os.path.abspath(os.path.dirname(AUDIO_DIR))
FMA_FULL = os.path.join(BASE_DIR, 'fma_full')
FMA_LARGE = os.path.join(BASE_DIR, 'fma_large')

1 Retrieve metadata and audio from FMA

Crawl the track, album and artist metadata through the FMA API.
Download the original .mp3 over HTTPS for each track id (only if we don't have it already).

Todo:

Scrape curators.
Download images (track_image_file, album_image_file, artist_image_file). Beware of the quality.
Verify checksum for some random tracks.

Dataset update:

To add new tracks: iterate only from the largest known track id to the most recent one.
To update user data: we need to get all tracks again.
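
The incremental update for new tracks can be sketched as follows. This is only an illustration and not the logic of creation.py: `new_track_ids` is a hypothetical helper, track ids are assumed to be increasing integers, and the most recent id is assumed to be known beforehand.

```python
def new_track_ids(known_ids, most_recent_id):
    """Return candidate track ids newer than any id already collected.

    Hypothetical helper: some candidates may not exist on the server,
    in which case they would simply be recorded as not found.
    """
    start = max(known_ids, default=0) + 1
    return range(start, most_recent_id + 1)


candidates = new_track_ids([2, 3, 5, 10, 20], most_recent_id=25)
print(list(candidates))
```

Updating user data (listens, favorites, comments) cannot use this shortcut, since those values change for existing ids as well.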

In [3]:

# ./creation.py metadata
# ./creation.py data /path/to/fma/fma_full
# ./creation.py clips /path/to/fma

#!cat creation.py

In [4]:

# converters={'genres': ast.literal_eval}
tracks = pd.read_csv('raw_tracks.csv', index_col=0)
albums = pd.read_csv('raw_albums.csv', index_col=0)
artists = pd.read_csv('raw_artists.csv', index_col=0)
genres = pd.read_csv('raw_genres.csv', index_col=0)

with open('not_found.pickle', 'rb') as f:
    not_found = pickle.load(f)

In [5]:

def get_fs_tids(audio_dir):
    tids = []
    for _, dirnames, files in os.walk(audio_dir):
        if dirnames == []:
            tids.extend(int(file[:-4]) for file in files)
    return tids

audio_tids = get_fs_tids(FMA_FULL)
clips_tids = get_fs_tids(FMA_LARGE)

In [6]:

print('tracks: {} collected ({} not found, {} max id)'.format(
    len(tracks), len(not_found['tracks']), tracks.index.max()))
print('albums: {} collected ({} not found, {} in tracks)'.format(
    len(albums), len(not_found['albums']), len(tracks['album_id'].unique())))
print('artists: {} collected ({} not found, {} in tracks)'.format(
    len(artists), len(not_found['artists']), len(tracks['artist_id'].unique())))
print('genres: {} collected'.format(len(genres)))
print('audio: {} collected ({} not found, {} not in tracks)'.format(
    len(audio_tids), len(not_found['audio']), len(set(audio_tids).difference(tracks.index))))
print('clips: {} collected ({} not found, {} not in tracks)'.format(
    len(clips_tids), len(not_found['clips']), len(set(clips_tids).difference(tracks.index))))
assert sum(tracks.index.isin(audio_tids)) + len(not_found['audio']) == len(tracks)
assert sum(tracks.index.isin(clips_tids)) + len(not_found['clips']) == sum(tracks.index.isin(audio_tids))
assert len(clips_tids) + len(not_found['clips']) + len(not_found['audio']) == len(tracks)

tracks: 109727 collected (45594 not found, 155320 max id)
albums: 15234 collected (480 not found, 15714 in tracks)
artists: 16916 collected (250 not found, 17166 in tracks)
genres: 164 collected
audio: 110668 collected (180 not found, 1121 not in tracks)
clips: 109261 collected (286 not found, 0 not in tracks)

In [7]:

N = 5
ipd.display(tracks.head(N))
ipd.display(albums.head(N))
ipd.display(artists.head(N))
ipd.display(genres.head(N))

	album_id	album_title	album_url	artist_id	artist_name	artist_url	artist_website	license_image_file	license_image_file_large	license_parent_id	...	track_information	track_instrumental	track_interest	track_language_code	track_listens	track_lyricist	track_number	track_publisher	track_title	track_url
track_id
2	1.0	AWOL - A Way Of Life	http://freemusicarchive.org/music/AWOL/AWOL_-_...	1	AWOL	http://freemusicarchive.org/music/AWOL/	http://www.AzillionRecords.blogspot.com	http://i.creativecommons.org/l/by-nc-sa/3.0/us...	http://fma-files.s3.amazonaws.com/resources/im...	5.0	...	NaN	0	4656	en	1293	NaN	3	NaN	Food	http://freemusicarchive.org/music/AWOL/AWOL_-_...
3	1.0	AWOL - A Way Of Life	http://freemusicarchive.org/music/AWOL/AWOL_-_...	1	AWOL	http://freemusicarchive.org/music/AWOL/	http://www.AzillionRecords.blogspot.com	http://i.creativecommons.org/l/by-nc-sa/3.0/us...	http://fma-files.s3.amazonaws.com/resources/im...	5.0	...	NaN	0	1470	en	514	NaN	4	NaN	Electric Ave	http://freemusicarchive.org/music/AWOL/AWOL_-_...
5	1.0	AWOL - A Way Of Life	http://freemusicarchive.org/music/AWOL/AWOL_-_...	1	AWOL	http://freemusicarchive.org/music/AWOL/	http://www.AzillionRecords.blogspot.com	http://i.creativecommons.org/l/by-nc-sa/3.0/us...	http://fma-files.s3.amazonaws.com/resources/im...	5.0	...	NaN	0	1933	en	1151	NaN	6	NaN	This World	http://freemusicarchive.org/music/AWOL/AWOL_-_...
10	6.0	Constant Hitmaker	http://freemusicarchive.org/music/Kurt_Vile/Co...	6	Kurt Vile	http://freemusicarchive.org/music/Kurt_Vile/	http://kurtvile.com	http://i.creativecommons.org/l/by-nc-nd/3.0/88...	http://fma-files.s3.amazonaws.com/resources/im...	NaN	...	NaN	0	54881	en	50135	NaN	1	NaN	Freeway	http://freemusicarchive.org/music/Kurt_Vile/Co...
20	4.0	Niris	http://freemusicarchive.org/music/Chris_and_Ni...	4	Nicky Cook	http://freemusicarchive.org/music/Chris_and_Ni...	NaN	http://i.creativecommons.org/l/by-nc-nd/3.0/88...	http://fma-files.s3.amazonaws.com/resources/im...	NaN	...	NaN	0	978	en	361	NaN	3	NaN	Spiritual Level	http://freemusicarchive.org/music/Chris_and_Ni...

5 rows × 38 columns

	album_comments	album_date_created	album_date_released	album_engineer	album_favorites	album_handle	album_image_file	album_images	album_information	album_listens	album_producer	album_title	album_tracks	album_type	album_url	artist_name	artist_url	tags
album_id
1	0	11/26/2008 01:44:45 AM	1/05/2009	NaN	4	AWOL_-_A_Way_Of_Life	https://freemusicarchive.org/file/images/album...	[{'image_id': '1955', 'image_file': 'https://f...	<p></p>	6073	NaN	AWOL - A Way Of Life	7	Album	http://freemusicarchive.org/music/AWOL/AWOL_-_...	AWOL	http://freemusicarchive.org/music/AWOL/	[]
100	0	11/26/2008 01:55:44 AM	1/09/2009	NaN	0	On_Opaque_Things	https://freemusicarchive.org/file/images/album...	[{'image_id': '4403', 'image_file': 'https://f...	NaN	5613	NaN	On Opaque Things	4	Album	http://freemusicarchive.org/music/Bird_Names/O...	Bird Names	http://freemusicarchive.org/music/Bird_Names/	[]
1000	0	12/04/2008 09:28:49 AM	10/26/2008	NaN	0	DMBQ_Live_at_2008_Record_Fair_on_WFMU_Record_F...	https://freemusicarchive.org/file/images/album...	[{'image_id': '31997', 'image_file': 'https://...	<p>http://blog.wfmu.org/freeform/2008/10/what-...	1092	NaN	DMBQ Live at 2008 Record Fair on WFMU Record F...	4	Live Performance	http://freemusicarchive.org/music/DMBQ/DMBQ_Li...	DMBQ	http://freemusicarchive.org/music/DMBQ/	[]
10000	0	9/05/2011 04:42:57 PM	NaN	NaN	0	Live_at_CKUT_on_Montreal_Sessions_1434	https://freemusicarchive.org/file/images/album...	[{'image_id': '12266', 'image_file': 'https://...	<p>Live Set on the Montreal Session February 2...	1001	NaN	Live at CKUT on Montreal Sessions	1	Radio Program	http://freemusicarchive.org/music/Sundrips/Liv...	Sundrips	http://freemusicarchive.org/music/Sundrips/	[]
10001	0	9/06/2011 12:02:58 AM	1/01/2006	NaN	0	Grounds_Dream_Cosmic_Love	https://freemusicarchive.org/file/images/album...	[{'image_id': '24091', 'image_file': 'https://...	<p>Recorded in Linnavuori, Finland, 2005 (with...	504	NaN	Ground's Dream Cosmic Love	1	Album	http://freemusicarchive.org/music/Uton/Grounds...	Uton	http://freemusicarchive.org/music/Uton/	[]

	artist_active_year_begin	artist_active_year_end	artist_associated_labels	artist_bio	artist_comments	artist_contact	artist_date_created	artist_donation_url	artist_favorites	artist_flattr_name	...	artist_location	artist_longitude	artist_members	artist_name	artist_paypal_name	artist_related_projects	artist_url	artist_website	artist_wikipedia_page	tags
artist_id
1	2006.0	NaN	NaN	<p>A Way Of Life, A Collective of Hip-Hop from...	0	Brown Bum aka Choke	11/26/2008 01:42:32 AM	NaN	9	NaN	...	New Jersey	-74.405661	Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...	AWOL	NaN	The list of past projects is 2 long but every1...	http://freemusicarchive.org/music/AWOL/	http://www.AzillionRecords.blogspot.com	NaN	['awol']
10	NaN	NaN	Mistletone, Marriage Records	<p>"Lucky Dragons" means any recorded or perfo...	3	Lukey Dargons	11/26/2008 01:43:35 AM	http://glaciersofnice.com/shop/	111	NaN	...	Los Angeles, CA	-118.243685	Luke Fischbeck\nSarah Rara	Lucky Dragons	NaN	NaN	http://freemusicarchive.org/music/Lucky_Dragons/	http://hawksandsparrows.org/	NaN	['lucky dragons']
100	2004.0	NaN	Captcha Records (HBSP-2X), Pickled Egg (Europe)	<p><span style="font-family:Verdana, Geneva, A...	1	Chris Kalis	11/26/2008 02:05:22 AM	NaN	8	NaN	...	Chicago, IL	-87.629798	Chris Kalis, Harry Brenner, Scott McGaughey, B...	Chandeliers	NaN	Killer Whales, \nMichael Columbia\nMandate\nMr...	http://freemusicarchive.org/music/Chandeliers/	thechandeliers.com	NaN	['chandeliers']
1000	NaN	NaN	NaN	<p><a href="http://marzipanmarzipan.com">Marzi...	0	NaN	12/04/2008 09:24:35 AM	NaN	0	NaN	...	NaN	12.567380	NaN	Marzipan Marzipan	NaN	NaN	http://freemusicarchive.org/music/Marzipan_Mar...	https://soundcloud.com/marzipanmarzipan	NaN	[]
10000	NaN	NaN	NaN	<p><span style="font-family:'Times New Roman',...	0	NaN	1/21/2011 02:11:31 PM	NaN	1	NaN	...	NaN	NaN	Jack Hertz\nPHOBoS\nBlue Hell	Jack Hertz, PHOBoS, Blue Hell	NaN	NaN	http://freemusicarchive.org/music/Jack_Hertz_P...	http://surrism.phonoethics.com/surrism-phonoet...	NaN	['jack hertz phobos blue hell']

5 rows × 24 columns

	genre_color	genre_handle	genre_parent_id	genre_title
genre_id
1	#006666	Avant-Garde	38.0	Avant-Garde
2	#CC3300	International	NaN	International
3	#000099	Blues	NaN	Blues
4	#990099	Jazz	NaN	Jazz
5	#8A8A65	Classical	NaN	Classical

2 Format metadata

Todo:

Sanitize values, e.g. list of words for tags, valid links in artist_wikipedia_page, remove html markup in free-form text.
- Clean tags. E.g. some tags are just artist names.
Fill metadata about encoding: length, number of samples, sample rate, bit rate, channels (mono/stereo), bit depth (16 bits?).
Update duration from audio:
- 2624 is marked as 05:05:50 (18350s) although it is reported as 00:21:15.15 by ffmpeg.
- 112067: 3714s --> 01:59:55.06, 112808: 3718s --> 01:59:59.56
- ffmpeg warning: Estimating duration from bitrate, this may be inaccurate
- Solution: decode the complete mp3 with ffmpeg -i input.mp3 -f null -
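
To compare the durations quoted above, both notations can be converted to seconds. The sketch below is an assumption-laden illustration: `to_seconds` is a hypothetical helper and is not `creation.convert_duration`, whose implementation is not shown in this notebook.

```python
def to_seconds(timestamp):
    """Convert 'SS', 'MM:SS' or 'HH:MM:SS[.ff]' to seconds (hypothetical helper)."""
    seconds = 0.0
    for part in timestamp.split(':'):
        # Each new field is worth 1/60 of the previous one.
        seconds = 60 * seconds + float(part)
    return seconds


metadata = to_seconds('05:05:50')     # duration stored in the metadata of track 2624
decoded = to_seconds('00:21:15.15')   # duration reported by ffmpeg for the same track
print(metadata, decoded, metadata - decoded)
```

A large difference between the two values flags tracks whose metadata duration should be replaced by the decoded one.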

In [8]:

df, column = tracks, 'tags'
null = sum(df[column].isnull())
print('{} null, {} non-null'.format(null, df.shape[0] - null))
df[column].value_counts().head(10)

0 null, 109727 non-null

Out[8]:

[]                                                                                                                                                                             85881
['interiors c1964', 'existential', 'hardcore-punk', 'pop-punk', 'punk-rock', 'internet boyfriend', 'rew starr', 'public domain', 'creative commons', 'microsong challenge']      314
['classwar karaoke']                                                                                                                                                             239
['all styles experimental']                                                                                                                                                      215
['improvisation', 'not normal music', 'all styles experimental']                                                                                                                 195
['era 1']                                                                                                                                                                        176
['all styles experimental', 'harsh noise', 'not normal music']                                                                                                                   150
['music is a belief', 'chary', 'nishad', 'uju', 'ibiene', 'nazeem', 'deepu', 'maneet', 'azedine', 'mohammad']                                                                    140
['new zealand']                                                                                                                                                                  140
['improvisation', 'all styles experimental', 'not normal music']                                                                                                                 128
Name: tags, dtype: int64

2.1 Tracks

In [9]:

drop = [
    'license_image_file', 'license_image_file_large', 'license_parent_id', 'license_url',  # keep title only
    'track_file', 'track_image_file',  # used to download only
    'track_url', 'album_url', 'artist_url',  # only relevant on website
    'track_copyright_c', 'track_copyright_p',  # present for ~1000 tracks only
    # 'track_composer', 'track_lyricist', 'track_publisher',  # present for ~4000, <1000 and <2000 tracks
    'track_disc_number',  # different from 1 for <1000 tracks
    'track_explicit', 'track_explicit_notes',  # present for <4000 tracks
    'track_instrumental'  # ~6000 tracks have a 1, there is an instrumental genre
]
tracks.drop(drop, axis=1, inplace=True)
tracks.rename(columns={'license_title': 'track_license', 'tags': 'track_tags'}, inplace=True)

In [10]:

tracks['track_duration'] = tracks['track_duration'].map(creation.convert_duration)

In [11]:

def convert_datetime(df, column, format=None):
    df[column] = pd.to_datetime(df[column], infer_datetime_format=True, format=format)
convert_datetime(tracks, 'track_date_created')
convert_datetime(tracks, 'track_date_recorded')

In [12]:

tracks['album_id'].fillna(-1, inplace=True)
tracks['track_bit_rate'].fillna(-1, inplace=True)
tracks = tracks.astype({'album_id': int, 'track_bit_rate': int})

In [13]:

def convert_genres(genres):
    genres = ast.literal_eval(genres)
    return [int(genre['genre_id']) for genre in genres]

tracks['track_genres'].fillna('[]', inplace=True)
tracks['track_genres'] = tracks['track_genres'].map(convert_genres)
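
The conversion above can be tried on a single value. In this sketch the raw string is made up for illustration: only the `genre_id` key is implied by the code, and the `genre_title` key is an assumption about what else the API might return.

```python
import ast


def convert_genres(genres):
    # The CSV stores the list of genres as its Python literal representation.
    genres = ast.literal_eval(genres)
    return [int(genre['genre_id']) for genre in genres]


raw = "[{'genre_id': '21', 'genre_title': 'Hip-Hop'}]"  # 'genre_title' is assumed
print(convert_genres(raw))
print(convert_genres('[]'))  # what a missing value becomes after fillna('[]')
```

Filling missing values with '[]' first matters because a missing value is not a string that `ast.literal_eval` can parse.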

In [14]:

tracks.columns

Out[14]:

Index(['album_id', 'album_title', 'artist_id', 'artist_name', 'artist_website',
       'track_license', 'track_tags', 'track_bit_rate', 'track_comments',
       'track_composer', 'track_date_created', 'track_date_recorded',
       'track_duration', 'track_favorites', 'track_genres',
       'track_information', 'track_interest', 'track_language_code',
       'track_listens', 'track_lyricist', 'track_number', 'track_publisher',
       'track_title'],
      dtype='object')

2.2 Albums

In [15]:

drop = [
    'artist_name', 'album_url', 'artist_url',  # in tracks already (though it can be different)
    'album_handle',
    'album_image_file', 'album_images',  # todo: shall be downloaded
    #'album_producer', 'album_engineer',  # present for ~2400 albums only
]
albums.drop(drop, axis=1, inplace=True)
albums.rename(columns={'tags': 'album_tags'}, inplace=True)

In [16]:

convert_datetime(albums, 'album_date_created')
convert_datetime(albums, 'album_date_released')

In [17]:

albums.columns

Out[17]:

Index(['album_comments', 'album_date_created', 'album_date_released',
       'album_engineer', 'album_favorites', 'album_information',
       'album_listens', 'album_producer', 'album_title', 'album_tracks',
       'album_type', 'album_tags'],
      dtype='object')

2.3 Artists

In [18]:

drop = [
    'artist_website', 'artist_url',  # in tracks already (though it can be different)
    'artist_handle',
    'artist_image_file', 'artist_images',  # todo: shall be downloaded
    'artist_donation_url', 'artist_paypal_name', 'artist_flattr_name',  # ~1600 & ~400 & ~70, not relevant
    'artist_contact',  # ~1500, not very useful data
    # 'artist_active_year_begin', 'artist_active_year_end',  # ~1400, ~500 only
    # 'artist_associated_labels',  # ~1000
    # 'artist_related_projects',  # only ~800, but can be combined with bio
]
artists.drop(drop, axis=1, inplace=True)
artists.rename(columns={'tags': 'artist_tags'}, inplace=True)

In [19]:

convert_datetime(artists, 'artist_date_created')
for column in ['artist_active_year_begin', 'artist_active_year_end']:
    artists[column].replace(0.0, np.nan, inplace=True)
    convert_datetime(artists, column, format='%Y.0')

In [20]:

artists.columns

Out[20]:

Index(['artist_active_year_begin', 'artist_active_year_end',
       'artist_associated_labels', 'artist_bio', 'artist_comments',
       'artist_date_created', 'artist_favorites', 'artist_latitude',
       'artist_location', 'artist_longitude', 'artist_members', 'artist_name',
       'artist_related_projects', 'artist_wikipedia_page', 'artist_tags'],
      dtype='object')

2.4 Merge DataFrames

In [21]:

not_found['albums'].remove(None)
not_found['albums'].append(-1)
not_found['albums'] = [int(i) for i in not_found['albums']]
not_found['artists'] = [int(i) for i in not_found['artists']]

In [22]:

tracks = tracks.merge(albums, left_on='album_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))

n = sum(tracks['album_title_dup'].isnull())
print('{} tracks without extended album information ({} tracks without album_id)'.format(
    n, sum(tracks['album_id'] == -1)))
assert sum(tracks['album_id'].isin(not_found['albums'])) == n
assert sum(tracks['album_title'] != tracks['album_title_dup']) == n

tracks.drop('album_title_dup', axis=1, inplace=True)
assert not any('dup' in col for col in tracks.columns)

3674 tracks without extended album information (1041 tracks without album_id)
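
The merge pattern used here can be reproduced on toy data. The frames below are invented for illustration (only the column names follow the notebook). The sketch shows how a left merge with `suffixes=('', '_dup')` keeps the left column untouched and how unmatched rows show up as NaN in the columns coming from the right frame.

```python
import pandas as pd

toy_tracks = pd.DataFrame(
    {'album_id': [1, 1, 7, -1], 'album_title': ['A', 'A', 'B', None]},
    index=pd.Index([2, 3, 10, 20], name='track_id'))
toy_albums = pd.DataFrame(
    {'album_title': ['A'], 'album_listens': [6073]},
    index=pd.Index([1], name='album_id'))

merged = toy_tracks.merge(toy_albums, left_on='album_id', right_index=True,
                          how='left', suffixes=('', '_dup'))

# Albums 7 and -1 are absent from toy_albums: their rows get NaN in the
# columns coming from the right frame, including the duplicated title.
missing = merged['album_title_dup'].isnull()
print(merged)
print('{} tracks without extended album information'.format(missing.sum()))
```

Counting the nulls in the duplicated column is what allows the assertions to cross-check the merge against the list of albums that were not found.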

In [23]:

# Album artist can be different than track artist. Keep track artist.
#tracks[tracks['artist_name'] != tracks['artist_name_dup']].select(lambda x: 'artist_name' in x, axis=1)

In [24]:

tracks = tracks.merge(artists, left_on='artist_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))

n = sum(tracks['artist_name_dup'].isnull())
print('{} tracks without extended artist information'.format(n))
assert sum(tracks['artist_id'].isin(not_found['artists'])) == n
assert sum(tracks['artist_name'] != tracks['artist_name_dup']) == n

tracks.drop('artist_name_dup', axis=1, inplace=True)
assert not any('dup' in col for col in tracks.columns)

974 tracks without extended artist information

In [25]:

columns = []
for name in tracks.columns:
    names = name.split('_')
    columns.append((names[0], '_'.join(names[1:])))
tracks.columns = pd.MultiIndex.from_tuples(columns)
assert all(label in ['track', 'album', 'artist'] for label in tracks.columns.get_level_values(0))
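
The effect of the two-level column index can be seen on a one-row toy frame (values invented for illustration). This sketch uses `str.partition` rather than the split and join of the cell above; both cut the name at the first underscore only.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'Food', 'some date', 'AWOL']],
    columns=['album_id', 'track_title', 'track_date_created', 'artist_name'])

# ('track', 'date_created'): only the first underscore separates the levels.
df.columns = pd.MultiIndex.from_tuples(
    [(name.partition('_')[0], name.partition('_')[2]) for name in df.columns])

print(df['track'])         # all columns of the 'track' group
print(df['album', 'id'])   # a single column, addressed by both levels
```

Grouping columns this way is what makes selections such as `tracks['track', 'genres']` possible in the rest of the notebook.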

In [26]:

# Todo: fill other columns ?
tracks['album', 'tags'].fillna('[]', inplace=True)
tracks['artist', 'tags'].fillna('[]', inplace=True)

columns = [('album', 'favorites'), ('album', 'comments'), ('album', 'listens'), ('album', 'tracks'),
           ('artist', 'favorites'), ('artist', 'comments')]
for column in columns:
    tracks[column].fillna(-1, inplace=True)
columns = {column: int for column in columns}
tracks = tracks.astype(columns)

3 Data cleaning

Todo: duplicates (metadata and audio)

In [27]:

def keep(index, df):
    old = len(df)
    df = df.loc[index]
    new = len(df)
    print('{} lost, {} left'.format(old - new, new))
    return df

tracks = keep(tracks.index, tracks)

0 lost, 109727 left

In [28]:

# Audio not found or could not be trimmed.
tracks = keep(tracks.index.difference(not_found['audio']), tracks)
tracks = keep(tracks.index.difference(not_found['clips']), tracks)

180 lost, 109547 left
286 lost, 109261 left

Errors raised by the features.py script:

IndexError('index 0 is out of bounds for axis 0 with size 0',)
- ffmpeg: Header missing
- ffmpeg: Could not find codec parameters for stream 0 (Audio: mp3, 0 channels, s16p): unspecified frame size. Consider increasing the value for the 'analyzeduration' and 'probesize' options
- tids: 117759
NoBackendError()
- ffmpeg: Format mp3 detected only with low score of 1, misdetection possible!
- tids: 80015, 115235
UserWarning('Trying to estimate tuning from empty frequency set.',)
- librosa error
- tids: 1440, 26436, 38903, 57603, 62095, 62954, 62956, 62957, 62959, 62971, 86079, 96426, 104623, 106719, 109714, 114501, 114528, 118003, 118004, 127827, 130298, 130296, 131076, 135804, 154923
ParameterError('Filter pass-band lies beyond Nyquist',)
- librosa error
- tids: 152204, 28106, 29166, 29167, 29169, 29168, 29170, 29171, 29172, 29173, 29179, 43903, 56757, 59361, 75461, 92346, 92345, 92347, 92349, 92350, 92351, 92353, 92348, 92352, 92354, 92355, 92356, 92358, 92359, 92361, 92360, 114448, 136486, 144769, 144770, 144771, 144773, 144774, 144775, 144778, 144776, 144777

In [29]:

# Feature extraction failed.
FAILED = [1440, 26436, 28106, 29166, 29167, 29168, 29169, 29170, 29171, 29172,
          29173, 29179, 38903, 43903, 56757, 57603, 59361, 62095, 62954, 62956,
          62957, 62959, 62971, 75461, 80015, 86079, 92345, 92346, 92347, 92348,
          92349, 92350, 92351, 92352, 92353, 92354, 92355, 92356, 92357, 92358,
          92359, 92360, 92361, 96426, 104623, 106719, 109714, 114448, 114501, 114528,
          115235, 117759, 118003, 118004, 127827, 130296, 130298, 131076, 135804, 136486,
          144769, 144770, 144771, 144773, 144774, 144775, 144776, 144777, 144778, 152204,
          154923]
tracks = keep(tracks.index.difference(FAILED), tracks)

71 lost, 109190 left

In [30]:

# License forbids redistribution.
tracks = keep(tracks['track', 'license'] != 'FMA-Limited: Download Only', tracks)
print('{} licenses'.format(len(tracks[('track', 'license')].unique())))

2616 lost, 106574 left
114 licenses

In [31]:

#sum(tracks['track', 'title'].duplicated())

4 Genres

In [32]:

genres.drop(['genre_handle', 'genre_color'], axis=1, inplace=True)
genres.rename(columns={'genre_parent_id': 'parent', 'genre_title': 'title'}, inplace=True)

In [33]:

genres['parent'].fillna(0, inplace=True)
genres = genres.astype({'parent': int})

In [34]:

# 13 (Easy Listening) has parent 126 which is missing
# --> a root genre on the website, although not in the genre menu
genres.at[13, 'parent'] = 0

# 580 (Abstract Hip-Hop) has parent 1172 which is missing
# --> listed as child of Hip-Hop on the website
genres.at[580, 'parent'] = 21

# 810 (Nu-Jazz) has parent 51 which is missing
# --> listed as child of Easy Listening on website
genres.at[810, 'parent'] = 13

# 763 (Holiday) has parent 763 which is itself
# --> listed as child of Sound Effects on website
genres.at[763, 'parent'] = 16

# Todo: should novelty be under Experimental? It is alone on website.

In [35]:

# Genre 806 (hiphop) should not exist. Replace it by 21 (Hip-Hop).
print('{} tracks have genre 806'.format(
    sum(tracks['track', 'genres'].map(lambda genres: 806 in genres))))
def change_genre(genres):
    return [genre if genre != 806 else 21 for genre in genres]
tracks['track', 'genres'] = tracks['track', 'genres'].map(change_genre)
genres.drop(806, inplace=True)

34 tracks have genre 806

In [36]:

def get_parent(genre, track_all_genres=None):
    parent = genres.at[genre, 'parent']
    if track_all_genres is not None:
        track_all_genres.append(genre)
    return genre if parent == 0 else get_parent(parent, track_all_genres)

# Get all genres, i.e. all genres encountered when walking from leaves to roots.
def get_all_genres(track_genres):
    track_all_genres = list()
    for genre in track_genres:
        get_parent(genre, track_all_genres)
    return list(set(track_all_genres))

tracks['track', 'genres_all'] = tracks['track', 'genres'].map(get_all_genres)
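
The walk from a genre up to its root can be illustrated without pandas. This is a minimal sketch on a toy hierarchy, written iteratively instead of recursively. The dictionary `PARENT` and the helper names are hypothetical, while the four parent relations are taken from the cells above.

```python
# Toy hierarchy as {genre_id: parent_id}; 0 marks a root, as in the notebook.
# 1 (Avant-Garde) is a child of 38, and 580 (Abstract Hip-Hop) a child of 21.
PARENT = {1: 38, 38: 0, 21: 0, 580: 21}


def ancestors(genre):
    """Return the genre followed by its ancestors, root last."""
    path = [genre]
    while PARENT[path[-1]] != 0:
        path.append(PARENT[path[-1]])
    return path


def all_genres(track_genres):
    """Union of the leaf-to-root paths of all genres of a track."""
    return sorted({g for genre in track_genres for g in ancestors(genre)})


print(ancestors(580))        # the last element is the top-level genre
print(all_genres([1, 580]))
```

A genre that is its own parent, such as 763 before the correction above, would never reach a root, which is one reason the hierarchy is repaired before it is walked.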

In [37]:

# Number of tracks per genre.
def count_genres(subset=tracks.index):
    count = pd.Series(0, index=genres.index)
    for _, track_all_genres in tracks.loc[subset, ('track', 'genres_all')].items():
        for genre in track_all_genres:
            count[genre] += 1
    return count

genres['#tracks'] = count_genres()
genres[genres['#tracks'] == 0]

Out[37]:

	parent	title	#tracks
genre_id
175	86	Bollywood	0
178	4	Be-Bop	0

In [38]:

def get_top_genre(track_genres):
    top_genres = set(genres.at[genres.at[genre, 'top_level'], 'title'] for genre in track_genres)
    return top_genres.pop() if len(top_genres) == 1 else np.nan

# Top-level genre.
genres['top_level'] = genres.index.map(get_parent)
tracks['track', 'genre_top'] = tracks['track', 'genres'].map(get_top_genre)

In [39]:

genres.head(10)

Out[39]:

	parent	title	#tracks	top_level
genre_id
1	38	Avant-Garde	8693	38
2	0	International	5271	2
3	0	Blues	1752	3
4	0	Jazz	4126	4
5	0	Classical	4106	5
6	38	Novelty	914	38
7	20	Comedy	217	20
8	0	Old-Time / Historic	868	8
9	0	Country	1987	9
10	0	Pop	13845	10

5 Subsets: large, medium, small

5.1 Large

Main characteristic: the full set with clips trimmed to a manageable size.

5.2 Medium

Main characteristic: clean metadata (includes 1 top-level genre) and quality audio.

In [40]:

fma_medium = pd.DataFrame(tracks)

In [41]:

# Missing meta-information.

# Missing extended album and artist information.
fma_medium = keep(~fma_medium['album', 'id'].isin(not_found['albums']), fma_medium)
fma_medium = keep(~fma_medium['artist', 'id'].isin(not_found['artists']), fma_medium)

# Untitled track or album.
fma_medium = keep(~fma_medium['track', 'title'].isnull(), fma_medium)
fma_medium = keep(~fma_medium['track', 'title'].map(lambda x: 'untitled' in x.lower()), fma_medium)
fma_medium = keep(~fma_medium['album', 'title'].map(lambda x: 'untitled' in x.lower()), fma_medium)

# One tag is often just the artist name. Tags too scarce for tracks and albums.
#keep(fma_medium['artist', 'tags'].map(len) >= 2, fma_medium)

# Too scarce.
#fma_medium = keep(~fma_medium['album', 'information'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'bio'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'website'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'wikipedia_page'].isnull(), fma_medium)

# Too scarce.
#fma_medium = keep(~fma_medium['artist', 'location'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'latitude'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'longitude'].isnull(), fma_medium)

3529 lost, 103045 left
598 lost, 102447 left
1 lost, 102446 left
674 lost, 101772 left
65 lost, 101707 left

In [42]:

# Technical quality.
# Todo: sample rate
fma_medium = keep(fma_medium['track', 'bit_rate'] > 100000, fma_medium)

# Choosing standard bit rates discards all VBR.
#fma_medium = keep(fma_medium['track', 'bit_rate'].isin([320000, 256000, 192000, 160000, 128000]), fma_medium)

1326 lost, 100381 left

In [43]:

fma_medium = keep(fma_medium['track', 'duration'] >= 60, fma_medium)
fma_medium = keep(fma_medium['track', 'duration'] <= 600, fma_medium)

fma_medium = keep(fma_medium['album', 'tracks'] >= 1, fma_medium)
fma_medium = keep(fma_medium['album', 'tracks'] <= 50, fma_medium)

4736 lost, 95645 left
5399 lost, 90246 left
466 lost, 89780 left
5353 lost, 84427 left

In [44]:

# Lower popularity bound.
fma_medium = keep(fma_medium['track', 'listens'] >= 100, fma_medium)
fma_medium = keep(fma_medium['track', 'interest'] >= 200, fma_medium)
fma_medium = keep(fma_medium['album', 'listens'] >= 1000, fma_medium);

# Favorites and comments are very scarce.
#fma_medium = keep(fma_medium['artist', 'favorites'] >= 1, fma_medium)

4941 lost, 79486 left
1064 lost, 78422 left
1769 lost, 76653 left

In [45]:

# Targeted genre classification.
fma_medium = keep(~fma_medium['track', 'genre_top'].isnull(), fma_medium);
#keep(fma_medium['track', 'genres'].map(len) == 1, fma_medium);

42495 lost, 34158 left

In [46]:

# Adjust size with popularity measure. Should be of better quality.
N_TRACKS = 25000

# Observations
# * More albums killed than artists --> be sure not to kill diversity
# * Favors some genres and penalizes others --> do it per genre?
# Normalization
# * mean, median, std, max
# * tracks per album or artist
# Test
# * 4/5 of same tracks were selected with various set of measures
# * <5% diff with max and mean

popularity_measures = [('track', 'listens'), ('track', 'interest')]  # ('album', 'listens')
# ('track', 'favorites'), ('track', 'comments'),
# ('album', 'favorites'), ('album', 'comments'),
# ('artist', 'favorites'), ('artist', 'comments'),

normalization = {measure: fma_medium[measure].max() for measure in popularity_measures}
def popularity_measure(track):
    return sum(track[measure] / normalization[measure] for measure in popularity_measures)
fma_medium['popularity_measure'] = fma_medium.apply(popularity_measure, axis=1)
fma_medium = keep(fma_medium.sort_values('popularity_measure', ascending=False).index[:N_TRACKS], fma_medium)

9158 lost, 25000 left
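
The popularity measure can be written in vectorized form. The sketch below is an equivalent formulation rather than the cell's row-wise `apply`, applied to the listens and interest values of the five tracks displayed earlier. The flat column names and the value of `N` are chosen for the example.

```python
import pandas as pd

toy = pd.DataFrame({'listens': [1293, 514, 1151, 50135, 361],
                    'interest': [4656, 1470, 1933, 54881, 978]},
                   index=pd.Index([2, 3, 5, 10, 20], name='track_id'))

# Divide each measure by its maximum so that both lie in [0, 1], then sum.
popularity = (toy / toy.max()).sum(axis=1)

# Keep the N most popular tracks.
N = 3
selected = popularity.sort_values(ascending=False).index[:N]
print(popularity)
print(list(selected))
```

Because each measure is divided by its maximum, a single very popular track compresses the scores of all others, which is one motivation for the alternative normalizations (mean, median, std) listed in the comments.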

In [47]:

tmp = genres[genres['parent'] == 0].reset_index().set_index('title')
tmp['#tracks_medium'] = fma_medium['track', 'genre_top'].value_counts()
tmp.sort_values('#tracks_medium', ascending=False)

Out[47]:

	genre_id	parent	#tracks	top_level	#tracks_medium
title
Rock	12	0	32923	12	7103
Electronic	15	0	34413	15	6314
Experimental	38	0	38154	38	2251
Hip-Hop	21	0	8389	21	2201
Folk	17	0	12706	17	1519
Instrumental	1235	0	14938	1235	1350
Pop	10	0	13845	10	1186
International	2	0	5271	2	1018
Classical	5	0	4106	5	619
Old-Time / Historic	8	0	868	8	510
Jazz	4	0	4126	4	384
Country	9	0	1987	9	178
Soul-RnB	14	0	1499	14	154
Spoken	20	0	1876	20	118
Blues	3	0	1752	3	74
Easy Listening	13	0	730	13	21

5.3 Small

Main characteristic: genre balanced (and echonest features).

Choices:

8 genres with 1000 tracks --> 8,000 tracks
10 genres with 500 tracks --> 5,000 tracks

Todo:

Download more echonest features so that all tracks can have them. Otherwise intersection of tracks with echonest features and one top-level genre is too small.

In [48]:

N_GENRES = 8
N_TRACKS = 1000

top_genres = tmp.sort_values('#tracks_medium', ascending=False)[:N_GENRES].index
fma_small = pd.DataFrame(fma_medium)
fma_small = keep(fma_small['track', 'genre_top'].isin(top_genres), fma_small)

2058 lost, 22942 left

In [49]:

for genre in top_genres:
    subset = fma_small[fma_small['track', 'genre_top'] == genre]
    # Drop all but the N_TRACKS most popular tracks of this genre.
    drop = subset.sort_values('popularity_measure').index[:-N_TRACKS]
    fma_small.drop(drop, inplace=True)
assert len(fma_small) == N_GENRES * N_TRACKS
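
The same balancing can be expressed with `groupby`. The sketch below is an alternative formulation on invented data, not the notebook's loop: it keeps genres with enough tracks by a count threshold (the notebook instead takes the N_GENRES largest genres), then the most popular tracks of each. The flat column names and `N_PER_GENRE` are assumptions for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    'genre_top': ['Rock', 'Rock', 'Rock', 'Folk', 'Folk', 'Pop'],
    'popularity': [0.9, 0.2, 0.5, 0.4, 0.7, 0.1],
})
N_PER_GENRE = 2

# Keep genres with enough tracks, then the N most popular tracks of each.
counts = toy['genre_top'].value_counts()
eligible = toy[toy['genre_top'].isin(counts[counts >= N_PER_GENRE].index)]
balanced = (eligible.sort_values('popularity', ascending=False)
                    .groupby('genre_top').head(N_PER_GENRE))
print(balanced.sort_index())
```

Sorting once and taking the head of each group avoids mutating the frame inside a loop.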

5.4 Subset indication

In [50]:

SUBSETS = ('small', 'medium', 'large')
tracks['set', 'subset'] = pd.Series(dtype=pd.CategoricalDtype(categories=SUBSETS, ordered=True))
tracks.loc[tracks.index, ('set', 'subset')] = 'large'
tracks.loc[fma_medium.index, ('set', 'subset')] = 'medium'
tracks.loc[fma_small.index, ('set', 'subset')] = 'small'
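
Storing the subset as an ordered categorical makes the subsets nested: a comparison selects a subset together with all smaller ones. A minimal sketch on a toy Series (values invented for illustration):

```python
import pandas as pd

SUBSETS = ('small', 'medium', 'large')
dtype = pd.CategoricalDtype(categories=SUBSETS, ordered=True)
subset = pd.Series(['large', 'small', 'medium', 'large'], dtype=dtype)

# With an ordered categorical, comparisons follow the order of the categories,
# so "<= 'medium'" selects the small and the medium tracks.
print((subset <= 'medium').tolist())
```

Each track therefore carries only the label of the smallest subset it belongs to, and the medium set is recovered as everything up to 'medium'.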

5.5 Echonest

In [51]:

echonest = pd.read_csv('raw_echonest.csv', index_col=0, header=[0, 1, 2])
echonest = keep(~echonest['echonest', 'temporal_features'].isnull().any(axis=1), echonest)
echonest = keep(~echonest['echonest', 'audio_features'].isnull().any(axis=1), echonest)
echonest = keep(~echonest['echonest', 'social_features'].isnull().any(axis=1), echonest)

echonest = keep(echonest.index.isin(tracks.index), echonest);
keep(echonest.index.isin(fma_medium.index), echonest);
keep(echonest.index.isin(fma_small.index), echonest);

0 lost, 14511 left
205 lost, 14306 left
239 lost, 14067 left
938 lost, 13129 left
7848 lost, 5281 left
11835 lost, 1294 left

6 Splits: training, validation, test

Take into account:

Each artist may appear in only one split.
Stratification: ideally, all characteristics (#tracks per artist, duration, sampling rate, information, bio) and targets (genres, tags) should be equally distributed.

In [52]:

for genre in genres.index:
    tracks['genre', genres.at[genre, 'title']] = tracks['track', 'genres_all'].map(lambda genres: genre in genres)

SPLITS = ('training', 'test', 'validation')
PERCENTAGES = (0.8, 0.1, 0.1)
tracks['set', 'split'] = pd.Series(dtype=pd.CategoricalDtype(categories=SPLITS))

for subset in SUBSETS:

    tracks_subset = tracks['set', 'subset'] <= subset

    # Consider only top-level genres for small and medium.
    genre_list = list(tracks.loc[tracks_subset, ('track', 'genre_top')].unique())
    if subset == 'large':
        genre_list = list(genres['title']) 

    while genre_list:

        # Choose most constrained genre, i.e. genre with the least unassigned artists.
        tracks_unsplit = tracks['set', 'split'].isnull()
        count = tracks[tracks_subset & tracks_unsplit].set_index(('artist', 'id'), append=True)['genre']
        count = count.groupby(level=1).sum().astype(bool).sum()
        genre = count[genre_list].idxmin()
        genre_list.remove(genre)
        
        # Given genre, select artists.
        tracks_genre = tracks['genre', genre] == 1
        artists = tracks.loc[tracks_genre & tracks_subset & tracks_unsplit, ('artist', 'id')].value_counts()
        #print('-->', genre, len(artists))

        current = {split: np.sum(tracks_genre & tracks_subset & (tracks['set', 'split'] == split)) for split in SPLITS}

        # Assign artists with most tracks first.
        for artist, count in artists.items():
            choice = np.argmin([current[split] / percentage for split, percentage in zip(SPLITS, PERCENTAGES)])
            current[SPLITS[choice]] += count
            #assert tracks.loc[tracks['artist', 'id'] == artist, ('set', 'split')].isnull().all()
            tracks.loc[tracks['artist', 'id'] == artist, ('set', 'split')] = SPLITS[choice]

# Tracks without genre can only serve as unlabeled data for training, e.g. for semi-supervised algorithms.
no_genres = tracks['track', 'genres_all'].map(lambda genres: len(genres) == 0)
no_split = tracks['set', 'split'].isnull()
assert not (no_split & ~no_genres).any()
tracks.loc[no_split, ('set', 'split')] = 'training'

# Not needed any more.
tracks.drop('genre', axis=1, level=0, inplace=True)
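
Once the splits are assigned, the artist constraint is easy to verify. The helper below is a hypothetical check, not part of the notebook, and assumes flat `artist_id` and `split` columns rather than the MultiIndexed ones used here:

```python
import pandas as pd


def check_splits(df, artist_col='artist_id', split_col='split'):
    """Return split proportions after verifying that no artist spans two splits."""
    splits_per_artist = df.groupby(artist_col)[split_col].nunique()
    leaking = splits_per_artist[splits_per_artist > 1]
    assert leaking.empty, 'artists in several splits: {}'.format(list(leaking.index))
    return df[split_col].value_counts(normalize=True)


# Toy data: artists 1 to 3 are in training, artist 4 in test, artist 5 in validation.
toy = pd.DataFrame({
    'artist_id': [1, 1, 1, 1, 2, 2, 3, 3, 4, 5],
    'split': ['training'] * 8 + ['test', 'validation'],
})
proportions = check_splits(toy)
print(proportions)
```

The returned proportions can then be compared against the targets in `PERCENTAGES`.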

7 Store

In [53]:

for name, dataset in (('tracks', tracks), ('genres', genres), ('echonest', echonest)):
    dataset.sort_index(axis=0, inplace=True)
    dataset.sort_index(axis=1, inplace=True)
    # Echonest features are written with 10 decimal places.
    params = dict(float_format='%.10f') if name == 'echonest' else dict()
    dataset.to_csv(name + '.csv', **params)
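
Since `tracks` has two levels of column labels, the CSV file starts with two header rows, and a reader has to be told so. A minimal round-trip sketch on a toy frame; the titles and listen counts are taken from the tracks displayed in this notebook, while the split values are made up:

```python
import io

import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [('set', 'split'), ('track', 'listens'), ('track', 'title')])
toy = pd.DataFrame(
    [['training', 1293, 'Food'],
     ['training', 514, 'Electric Ave'],
     ['test', 1151, 'This World']],
    index=pd.Index([2, 3, 5], name='track_id'),
    columns=columns,
)

# Write to an in-memory buffer instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# header=[0, 1] tells pandas that the first two rows hold the column levels.
loaded = pd.read_csv(buffer, index_col=0, header=[0, 1])
print(loaded['track'])
```

Categorical and datetime dtypes do not survive the round trip through CSV, so they have to be restored after loading.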

In [54]:

# ./creation.py normalize /path/to/fma
# ./creation.py zips /path/to/fma

8 Description

In [55]:

tracks = utils.load('tracks.csv')
tracks.dtypes

Out[55]:

album   comments                      int64
        date_created         datetime64[ns]
        date_released        datetime64[ns]
        engineer                     object
        favorites                     int64
        id                            int64
        information                category
        listens                       int64
        producer                     object
        tags                         object
        title                        object
        tracks                        int64
        type                       category
artist  active_year_begin    datetime64[ns]
        active_year_end      datetime64[ns]
        associated_labels            object
        bio                        category
        comments                      int64
        date_created         datetime64[ns]
        favorites                     int64
        id                            int64
        latitude                    float64
        location                     object
        longitude                   float64
        members                      object
        name                         object
        related_projects             object
        tags                         object
        website                      object
        wikipedia_page               object
set     split                        object
        subset                     category
track   bit_rate                      int64
        comments                      int64
        composer                     object
        date_created         datetime64[ns]
        date_recorded        datetime64[ns]
        duration                      int64
        favorites                     int64
        genre_top                  category
        genres                       object
        genres_all                   object
        information                  object
        interest                      int64
        language_code                object
        license                    category
        listens                       int64
        lyricist                     object
        number                        int64
        publisher                    object
        tags                         object
        title                        object
dtype: object
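
The `object` dtype of `genres`, `genres_all` and `tags` hides Python lists, which a CSV file can only store as text. How `utils.load` restores them is not shown here; a minimal sketch, assuming it relies on `ast.literal_eval` as hinted by the commented-out `converters` argument earlier in the notebook:

```python
import ast
import io

import pandas as pd

# A list column is written to CSV as its string representation.
csv = io.StringIO(
    'track_id,genres,genres_all\n'
    '2,[21],[21]\n'
    '20,"[76, 103]","[17, 10, 76, 103]"\n'
)

# Without converters, the lists come back as plain strings.
raw = pd.read_csv(csv, index_col=0)
assert isinstance(raw.at[20, 'genres'], str)

# literal_eval safely parses each cell back into a Python list.
csv.seek(0)
parsed = pd.read_csv(
    csv,
    index_col=0,
    converters={'genres': ast.literal_eval, 'genres_all': ast.literal_eval},
)
print(parsed.at[20, 'genres_all'])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.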

In [56]:

N = 5
ipd.display(tracks['track'].head(N))
ipd.display(tracks['album'].head(N))
ipd.display(tracks['artist'].head(N))

	bit_rate	comments	composer	date_created	date_recorded	duration	favorites	genre_top	genres	genres_all	information	interest	language_code	license	listens	lyricist	number	publisher	tags	title
track_id
2	256000	0	NaN	2008-11-26 01:48:12	2008-11-26	168	2	Hip-Hop	[21]	[21]	NaN	4656	en	Attribution-NonCommercial-ShareAlike 3.0 Inter...	1293	NaN	3	NaN	[]	Food
3	256000	0	NaN	2008-11-26 01:48:14	2008-11-26	237	1	Hip-Hop	[21]	[21]	NaN	1470	en	Attribution-NonCommercial-ShareAlike 3.0 Inter...	514	NaN	4	NaN	[]	Electric Ave
5	256000	0	NaN	2008-11-26 01:48:20	2008-11-26	206	6	Hip-Hop	[21]	[21]	NaN	1933	en	Attribution-NonCommercial-ShareAlike 3.0 Inter...	1151	NaN	6	NaN	[]	This World
10	192000	0	Kurt Vile	2008-11-25 17:49:06	2008-11-26	161	178	Pop	[10]	[10]	NaN	54881	en	Attribution-NonCommercial-NoDerivatives (aka M...	50135	NaN	1	NaN	[]	Freeway
20	256000	0	NaN	2008-11-26 01:48:56	2008-01-01	311	0	NaN	[76, 103]	[17, 10, 76, 103]	NaN	978	en	Attribution-NonCommercial-NoDerivatives (aka M...	361	NaN	3	NaN	[]	Spiritual Level

	comments	date_created	date_released	engineer	favorites	id	information	listens	producer	tags	title	tracks	type
track_id
2	0	2008-11-26 01:44:45	2009-01-05	NaN	4	1	<p></p>	6073	NaN	[]	AWOL - A Way Of Life	7	Album
3	0	2008-11-26 01:44:45	2009-01-05	NaN	4	1	<p></p>	6073	NaN	[]	AWOL - A Way Of Life	7	Album
5	0	2008-11-26 01:44:45	2009-01-05	NaN	4	1	<p></p>	6073	NaN	[]	AWOL - A Way Of Life	7	Album
10	0	2008-11-26 01:45:08	2008-02-06	NaN	4	6	NaN	47632	NaN	[]	Constant Hitmaker	2	Album
20	0	2008-11-26 01:45:05	2009-01-06	NaN	2	4	<p> "spiritual songs" from Nicky Cook</p>	2710	NaN	[]	Niris	13	Album

	active_year_begin	active_year_end	associated_labels	bio	comments	date_created	favorites	id	latitude	location	longitude	members	name	related_projects	tags	website	wikipedia_page
track_id
2	2006-01-01	NaT	NaN	<p>A Way Of Life, A Collective of Hip-Hop from...	0	2008-11-26 01:42:32	9	1	40.058324	New Jersey	-74.405661	Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...	AWOL	The list of past projects is 2 long but every1...	[awol]	http://www.AzillionRecords.blogspot.com	NaN
3	2006-01-01	NaT	NaN	<p>A Way Of Life, A Collective of Hip-Hop from...	0	2008-11-26 01:42:32	9	1	40.058324	New Jersey	-74.405661	Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...	AWOL	The list of past projects is 2 long but every1...	[awol]	http://www.AzillionRecords.blogspot.com	NaN
5	2006-01-01	NaT	NaN	<p>A Way Of Life, A Collective of Hip-Hop from...	0	2008-11-26 01:42:32	9	1	40.058324	New Jersey	-74.405661	Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...	AWOL	The list of past projects is 2 long but every1...	[awol]	http://www.AzillionRecords.blogspot.com	NaN
10	NaT	NaT	Mexican Summer, Richie Records, Woodsist, Skul...	<p><span style="font-family:Verdana, Geneva, A...	3	2008-11-26 01:42:55	74	6	NaN	NaN	NaN	Kurt Vile, the Violators	Kurt Vile	NaN	[philly, kurt vile]	http://kurtvile.com	NaN
20	1990-01-01	2011-01-01	NaN	<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...	2	2008-11-26 01:42:52	10	4	51.895927	Colchester England	0.891874	Nicky Cook\n	Nicky Cook	NaN	[instrumentals, experimental pop, post punk, e...	NaN	NaN
