The first part of this tutorial series demonstrates a distance-based similarity search approach based on extracted content descriptors and distance metrics.
The objective of music similarity estimation or retrieval is to quantify the notion of similarity between two given tracks. A central part of such an approach is the definition of a similarity measure, which is in turn affected by how the relevant information is extracted. One approach is to analyze contextual data such as user-generated listening behaviour (e.g. play/skip counts, user tags, ratings, etc.). The approach followed in this tutorial is based on the music content itself and largely focuses on the notion of acoustic similarity. Music features are extracted from the audio content. The resulting music descriptors are high-dimensional numeric vectors, and the accumulation of all feature vectors of a collection forms a vector space. The general principle of content-based similarity estimation rests on the assumption that numerical differences are an expression of perceptual dissimilarity. Metrics such as the Manhattan (L1) or Euclidean (L2) distance, or non-metric similarity functions such as the Kullback-Leibler divergence, are used to estimate the numerical similarity of the feature vectors.
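As a minimal illustration of how such metrics compare two feature vectors (the three-dimensional vectors below are made up for this sketch):

```python
import numpy as np

# Two hypothetical 3-dimensional feature vectors (illustrative values only)
a = np.array([0.2, 1.5, 120.0])
b = np.array([0.4, 1.1, 126.0])

manhattan = np.sum(np.abs(a - b))        # L1: sum of absolute differences
euclidean = np.sqrt(np.sum((a - b)**2))  # L2: root of the summed squared differences

print(manhattan, euclidean)
```

Both metrics return 0 for identical vectors and grow with increasing dissimilarity; they differ only in how strongly large per-dimension deviations are emphasized.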
Please follow the instructions on the tutorial's GitHub page to install the following dependencies to run this tutorial:
# visualization
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
# numeric and scientific processing
import numpy as np
import pandas as pd
# misc
import os
import progressbar
from IPython.display import HTML, display
pd.set_option('display.max_colwidth', None)
Before we can train our models we first have to get some data.
DATASET_PATH = "D:/Research/Data/MIR/MagnaTagATune/ISMIR2018"
Load the feature data from the NumPy `.npz` archive:
with np.load("%s/ISMIR2018_tut_Magnagtagatune_rp_features.npz" % DATASET_PATH) as npz:
    features_rp = npz["rp"]
    features_ssd = npz["ssd"]
    clip_id = npz["clip_id"]
Prepare the feature metadata for alignment with the dataset metadata:
feature_metadata = pd.DataFrame({"featurespace_id": np.arange(features_rp.shape[0]),
                                 "clip_id": clip_id})
Load the metadata from the CSV file:
metadata = pd.read_csv("./metadata/ismir2018_tut_part_3_similartiy_metadata.csv", index_col=0)
metadata.head()
clip_id | mp3_path | track_number | title | artist | album | url | segmentStart | segmentEnd | original_url | |
---|---|---|---|---|---|---|---|---|---|---|
19200 | 42150 | D:/Research/Data/MIR/MagnaTagATune/mp3_full/f/professor_armchair-too_much_mustard-10-bethena-117-146.mp3 | 10 | Bethena | Professor Armchair | Too Much Mustard | http://www.magnatune.com/artists/albums/armchair-octopants/ | 117 | 146 | http://he3.magnatune.com/all/10-Bethena-Professor%20Armchair.mp3 |
7745 | 16994 | D:/Research/Data/MIR/MagnaTagATune/mp3_full/c/liquid_zen-seventythree-04-close-262-291.mp3 | 4 | Close | Liquid Zen | Seventythree | http://www.magnatune.com/artists/albums/liquid-seventythree/ | 262 | 291 | http://he3.magnatune.com/all/04-Close%20-%20Liquid%20Zen.mp3 |
25416 | 57455 | D:/Research/Data/MIR/MagnaTagATune/mp3_full/6/doc_rossi-demarzi6_sonatas_for_cetra_o_kitara-23-sonata_vi_allegro-59-88.mp3 | 23 | Sonata VI Allegro | Doc Rossi | Demarzi-6 Sonatas for Cetra o Kitara | http://www.magnatune.com/artists/albums/rossi-demarzi/ | 59 | 88 | http://he3.magnatune.com/all/23-Sonata%20VI%20Allegro-Doc%20Rossi.mp3 |
613 | 1473 | D:/Research/Data/MIR/MagnaTagATune/mp3_full/b/philharmonia_baroque-beethoven_symphonies_no_3_eroica_and_no_8-01-eroica_1st-117-146.mp3 | 1 | Eroica 1st | Philharmonia Baroque | Beethoven Symphonies No 3 Eroica and No 8 | http://www.magnatune.com/artists/albums/pb-eroica/ | 117 | 146 | http://he3.magnatune.com/all/01-Eroica%201st-Philharmonia%20Baroque.mp3 |
15095 | 33011 | D:/Research/Data/MIR/MagnaTagATune/mp3_full/0/barbara_leoni-human_needs-07-ring_around_the_rosey-88-117.mp3 | 7 | Ring around the rosey | Barbara Leoni | Human Needs | http://www.magnatune.com/artists/albums/leoni-human/ | 88 | 117 | http://he3.magnatune.com/all/07-Ring%20around%20the%20rosey-Barbara%20Leoni.mp3 |
Align featuredata with metadata
metadata = metadata.reset_index()
metadata = metadata.merge(feature_metadata, left_on="clip_id", right_on="clip_id", how="inner", left_index=True, right_index=False)
metadata = metadata.set_index("index")
Add Media-Player to Metadata
tmp = metadata.mp3_path.str.split("/", expand=True)
metadata["player"] = '<audio src="http://localhost:9999/' + tmp[6] + '/' + tmp[7] +'" controls>'
Use Jupyter's display functions to enable HTML5 audio in the pandas DataFrame.
Example
HTML(metadata.iloc[:3][["title", "player"]].to_html(escape=False))
title | player | |
---|---|---|
index | ||
19200 | Bethena | |
7745 | Close | |
25416 | Sonata VI Allegro |
feature_data = np.concatenate([features_rp, features_ssd], axis=1)
feature_data.shape
(6380, 1608)
Sort Metadata by Feature-space ID => metadata is aligned to feature-data
metadata.sort_values("featurespace_id", inplace=True)
Select feature-data for this metadata
# subsample feature-space
feature_data = feature_data[metadata.featurespace_id]
# re-enumerate sub-sampled feature-space (alignment)
metadata["featurespace_id"] = np.arange(metadata.shape[0])
The feature vectors are composed of different feature-sets, all with different value ranges. While features such as Acousticness and Danceability are scaled between 0 and 1, the BPM values of the tempo feature range around 120 or higher. We apply Standard Score (zero mean and unit variance) normalization to uniformly scale the value ranges of the features.
$$ z = {x - \mu \over \sigma} $$
# standardize the features (column-wise)
feature_data -= feature_data.mean(axis=0)
feature_data /= feature_data.std(axis=0)
Dividing by a zero standard deviation (constant feature dimensions) raises `RuntimeWarning: divide by zero encountered in true_divide` and leaves non-finite values, which we replace with zeros:
feature_data = np.nan_to_num(feature_data, nan=0.0)
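A minimal sketch of this standardization step on a toy matrix, including the zero-variance case that triggers the warning:

```python
import numpy as np

# Toy feature matrix; the middle column is constant (zero variance)
X = np.array([[1.0, 5.0, 3.0],
              [2.0, 5.0, 7.0],
              [3.0, 5.0, 11.0]])

X -= X.mean(axis=0)
with np.errstate(divide="ignore", invalid="ignore"):
    X /= X.std(axis=0)         # constant column yields 0/0 -> nan
X = np.nan_to_num(X, nan=0.0)  # replace nan with 0

print(X.mean(axis=0))  # every column now has zero mean
print(X.std(axis=0))   # unit variance, except the constant column
```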
This section describes the fundamentals of the content-based audio similarity search approach followed in this tutorial. Audio features are descriptive numbers calculated from the audio spectrum of a track. A good example is the Spectral Centroid, which can be interpreted as the center of gravity of the audio spectrum. It describes the average frequency weighted by its intensity and distinguishes brighter from darker sounds. Such features are usually calculated for several intervals of a track and finally aggregated into a single vector representation. The latter step, which is a requirement for many machine/statistical learning tasks, is accomplished by calculating statistical measures such as mean, standard deviation, etc.
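This is not the feature extractor used in this tutorial, but a frame-wise Spectral Centroid and its aggregation can be sketched from scratch with plain NumPy (the 1000 Hz test tone is chosen so it aligns exactly with an FFT bin):

```python
import numpy as np

def spectral_centroid(y, sr, frame_len=2000, hop=500):
    """Frame-wise spectral centroid: the magnitude-weighted mean frequency."""
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    centroids = []
    for start in range(0, len(y) - frame_len + 1, hop):
        mag = np.abs(np.fft.rfft(y[start:start + frame_len]))
        centroids.append(np.sum(freqs * mag) / np.sum(mag))
    return np.array(centroids)

# 1 second of a pure 1000 Hz tone (bin-aligned, so there is no spectral leakage)
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 1000.0 * t)

c = spectral_centroid(y, sr)
print(c.mean(), c.std())  # mean close to 1000 Hz, near-zero deviation for a steady tone
```

Aggregating the frame-wise values by mean and standard deviation yields exactly the kind of two-dimensional representation shown in the example below.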
In the following example, the Spectral Centroids of 10 different tracks are represented by their mean and standard deviation aggregations. The Spectral Centroid feature(-set) of a single track is thus a two-dimensional feature vector such as:
ID Mean Standard Deviation
0 1517.5993814237531 291.1855836731788
In this example the center frequency is 1518 Hz and it deviates by 291 Hz. These numbers already describe the audio content and can be used to find similar tracks. The common approach to calculate music similarity from audio content is based on vector differences. The assumption is that similar audio feature values correspond to similar audio content. Thus, feature vectors with smaller vector differences correspond to more similar tracks. The following data represents the extracted Spectral Centroids of our 10-track collection:
ID Mean Standard Deviation
0 1517.5993814237531 291.1855836731788
1 1659.1988993873124 327.64811981777865
2 1507.4617047141264 340.8830079395701
3 1597.6019371942953 507.1007933367403
4 1498.8531206911534 288.3780838480238
5 535.5910732230583 89.90893994909047
6 2261.4032345595674 353.5971736260454
7 2331.881852844861 406.33517225264194
8 1868.690426450363 342.7489751514078
9 2204.6324484864085 328.94334883095553
The tracks have unique identifiers and we are using the track with ID 5 to search for similar items. This step requires a similarity metric, which defines how the vector distance has to be calculated as a single numeric value. The most common choices are the Manhattan (L1) and Euclidean (L2) distance measures. The Euclidean Distance is the square root of the sum of squared differences of two vectors. To calculate the Euclidean Distance between track 5 and track 0:
ID Mean Standard Deviation
0 1517.5993814237531 291.1855836731788
5 535.5910732230583 89.90893994909047
we first compute the difference between the values of each vector
982.008308 201.276644
square them to obtain non-negative magnitudes:
964340.317375 40512.287309
and take the sum of these values:
1004852.6046840245
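These steps can be reproduced in a few lines (the two vectors are copied from the table above):

```python
import numpy as np

track_0 = np.array([1517.5993814237531, 291.1855836731788])
track_5 = np.array([535.5910732230583, 89.90893994909047])

diff = track_0 - track_5  # element-wise difference
squared = diff**2         # squared magnitudes
dist = squared.sum()      # squared Euclidean distance, ~1004852.6

print(diff, squared, dist)
```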
By definition, the square root of the sum has to be taken, but this step is usually skipped because it does not alter the ranking and is computationally expensive. By calculating the distance to every item in the collection, we retrieve a list of distance values where smaller distances correspond to more similar audio content and larger distances to more dissimilar content.
ID Distance
0 1004852.6046840245
1 1319014.4646621975
2 1007520.5071585375
3 1301916.1177259558
4 967263.7731724023
5 0.0
6 3047959.100796666
7 3326786.1254441254
8 1841081.968976167
9 2842836.5609704787
To retrieve a ranked list of similar-sounding tracks, the list of vector distances has to be sorted in ascending order.
ID Distance
5 0.0
4 967263.7731724023
0 1004852.6046840245
2 1007520.5071585375
3 1301916.1177259558
1 1319014.4646621975
8 1841081.968976167
9 2842836.5609704787
6 3047959.100796666
7 3326786.1254441254
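The ranking above can be reproduced with `np.argsort`, which returns the indices that would sort the distance array in ascending order (the ten mean/std pairs are copied from the collection table):

```python
import numpy as np

# Spectral Centroid (mean, std) vectors of the 10-track toy collection
collection = np.array([
    [1517.5993814237531, 291.1855836731788],
    [1659.1988993873124, 327.64811981777865],
    [1507.4617047141264, 340.8830079395701],
    [1597.6019371942953, 507.1007933367403],
    [1498.8531206911534, 288.3780838480238],
    [535.5910732230583, 89.90893994909047],
    [2261.4032345595674, 353.5971736260454],
    [2331.881852844861, 406.33517225264194],
    [1868.690426450363, 342.7489751514078],
    [2204.6324484864085, 328.94334883095553],
])

query = collection[5]                           # track with ID 5
dist = np.sum((collection - query)**2, axis=1)  # squared Euclidean distances
ranking = np.argsort(dist)                      # ascending: most similar first

print(ranking)  # the query track (distance 0) comes first
```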
This so-called vector space model is predominant in content-based multimedia retrieval. The most crucial and problematic part is feature crafting: if the extracted numbers do not describe the audio well enough, the vector-based similarity will also fail to provide results that are perceived as similar. The described approach requires the feature vectors of all items of a collection to be available, so the feature vectors must be stored. No matter which retrieval approach (pre-calculated / indexed / on demand) is chosen, all features will be required at some point. Because feature extraction is a computationally expensive task (in terms of processing resources and total time), the extracted features are stored and made accessible using a common data format.
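A common choice for such a data format in the Python ecosystem is NumPy's `.npz` archive, as used when loading the precomputed features at the beginning of this tutorial. A minimal round-trip sketch (the array names and the temporary path are illustrative):

```python
import os
import tempfile
import numpy as np

# Hypothetical feature matrix and clip identifiers
features = np.random.rand(100, 1608).astype(np.float32)
clip_ids = np.arange(100)

path = os.path.join(tempfile.gettempdir(), "features_demo.npz")
np.savez_compressed(path, features=features, clip_id=clip_ids)

with np.load(path) as npz:
    restored = npz["features"]

print(np.array_equal(features, restored))  # True: lossless round trip
os.remove(path)
```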
In the final part of this tutorial we will use the Euclidean Distance to calculate similarities between tracks. As mentioned above, the Euclidean Distance is a metric for the distance between two vectors and thus a function of dissimilarity: vectors with smaller distance values are more similar than those with higher distances.
$$ d(p,q) = \sqrt{\sum_{i=1}^n (q_i-p_i)^2} $$

def eucledian_distance(feature_space, query_vector):
    return np.sqrt(np.sum((feature_space - query_vector)**2, axis=1))
For the rest of the tutorial we will use this song to demonstrate the results of the approach:
# columns to display and the query track used throughout the examples
display_cols = ["artist", "title", "album", "player"]
query_track_idx = 1102
HTML(metadata[display_cols].iloc[[query_track_idx]].to_html(escape=False))
artist | title | album | player | |
---|---|---|---|---|
index | ||||
18465 | Chris Harvey | Pixelize | The White Sail |
The following lines of code implement the approach described above. First, the distances between the query vector and all other vectors of the collection are calculated. Then the distances are sorted in ascending order to get the most similar tracks. Because the metric distance of identical vectors is 0, the top-most entry of the sorted list is always the query track itself.
# calculate the distance between the query-vector and all others
dist = eucledian_distance(feature_data, feature_data[metadata.iloc[query_track_idx].featurespace_id])
# sort the distances ascendingly - use sorted index
sorted_idx = np.argsort(dist)
HTML(metadata.set_index("featurespace_id").loc[sorted_idx[:11]][display_cols].to_html(escape=False))
artist | title | album | player | |
---|---|---|---|---|
featurespace_id | ||||
1102 | Chris Harvey | Pixelize | The White Sail | |
2774 | Jami Sieber | Dancing at the Temple Gate | Lush Mechanique | |
2318 | DJ Cary | Symphony of Force (Saros) | Power Synths | |
315 | DJ Cary | Symphony of Force (Saros) | Power Synths | |
24 | Mijo | Click Here | Fata Morgana | |
1137 | Magnatune Compilation | Tim Rayborn_ Quen a Virgen ben servir | World Fusion | |
2052 | C Layne | How Soon I Forget | The Sun Will Come Out to Blind You | |
2021 | Sun Palace | Your Hands Lie Open | Into Heaven | |
2097 | Curl | Sincerely Sorry | Inner | |
873 | Atomic Opera | Meaningless Word | Alpha and Oranges | |
611 | Burnshee Thornside | Ha-Keem | Blues and misc |
The approach taken to combine the different feature-sets is referred to as early fusion. The problem with the approach described in the previous step is that larger feature-sets dominate the calculated distance values. The aggregated MFCC and Chroma features have 24 dimensions each. Together they have more dimensions than the remaining features, which are mostly single-dimensional. Thus, the distances are unequally dominated by these two feature-sets.
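The dominance effect can be illustrated with a toy sketch using the dimensionalities of the RP (1440) and SSD (168) blocks of this tutorial's feature-space: both blocks deviate by the same amount in every dimension, yet the larger block contributes proportionally more to the squared distance.

```python
import numpy as np

rng = np.random.default_rng(0)
rp_dim, ssd_dim = 1440, 168

# Two vectors differing by the same amount in every dimension
a = rng.random(rp_dim + ssd_dim)
b = a + 0.1

sq = (a - b)**2
rp_part = sq[:rp_dim].sum()   # contribution of the RP block
ssd_part = sq[rp_dim:].sum()  # contribution of the SSD block

print(rp_part / ssd_part)  # ~8.57, i.e. the ratio 1440/168
```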
To avoid such a bias, we scale the feature-space such that feature-sets and single-value features have equal weights and thus equal influence on the resulting distance.
# feature-set lengths and order
featureset_lengths = [1440, # rp
                      168]  # ssd

def scaled_eucledian_distance(feature_space, query_vector):
    distances = (feature_space - query_vector)**2
    # start index of the current feature-set block
    start_idx = 0
    # normalize each block's contribution by its dimensionality
    for sequence_length in featureset_lengths:
        stop_idx = start_idx + sequence_length
        distances[:, start_idx:stop_idx] /= sequence_length
        start_idx = stop_idx
    return np.sqrt(np.sum(distances, axis=1))
Example result
dist = scaled_eucledian_distance(feature_data, feature_data[metadata.iloc[query_track_idx].featurespace_id])
HTML(metadata.set_index("featurespace_id").loc[np.argsort(dist)[:11]][display_cols].to_html(escape=False))
artist | title | album | player | |
---|---|---|---|---|
featurespace_id | ||||
1102 | Chris Harvey | Pixelize | The White Sail | |
2318 | DJ Cary | Symphony of Force (Saros) | Power Synths | |
315 | DJ Cary | Symphony of Force (Saros) | Power Synths | |
2774 | Jami Sieber | Dancing at the Temple Gate | Lush Mechanique | |
24 | Mijo | Click Here | Fata Morgana | |
1137 | Magnatune Compilation | Tim Rayborn_ Quen a Virgen ben servir | World Fusion | |
611 | Burnshee Thornside | Ha-Keem | Blues and misc | |
1849 | Magnatune Compilation | Jag_ Juke Joint Boogie | New Age and Jazz | |
2052 | C Layne | How Soon I Forget | The Sun Will Come Out to Blind You | |
2021 | Sun Palace | Your Hands Lie Open | Into Heaven | |
1941 | Drop Trio | invisible pants | Cezanne |
As explained above, the vanilla Euclidean Distance in an early fusion approach is dominated by large feature-sets. By scaling the feature-space we achieved equal influence for all feature-sets and features. However, equal influence is not always the best choice for music similarity. For example, the year and popularity features we included in our feature vector are not intrinsic music properties; we just added them to cluster recordings of the same epoch together. Currently these features have the same impact on the estimated similarity as timbre, rhythm and harmonics. When using many features it is commonly a good choice to apply different weights to them. Estimating these weights is generally done empirically.
# feature-set weights (same order as featureset_lengths)
featureset_weights = [0.5, # rp
                      0.9] # ssd

def weighted_eucledian_distance(feature_space, query_vector, featureset_weights):
    distances = (feature_space - query_vector)**2
    # start index of the current feature-set block
    start_idx = 0
    # normalize each block by its dimensionality, then apply its weight
    for sequence_length, weight in zip(featureset_lengths, featureset_weights):
        stop_idx = start_idx + sequence_length
        distances[:, start_idx:stop_idx] /= sequence_length
        distances[:, start_idx:stop_idx] *= weight
        start_idx = stop_idx
    return np.sqrt(np.sum(distances, axis=1))
Example result:
dist = weighted_eucledian_distance(feature_data,
                                   feature_data[metadata.iloc[query_track_idx].featurespace_id],
                                   featureset_weights)
HTML(metadata.set_index("featurespace_id").loc[np.argsort(dist)[:11]][display_cols].to_html(escape=False))
artist | title | album | player | |
---|---|---|---|---|
featurespace_id | ||||
1102 | Chris Harvey | Pixelize | The White Sail | |
2318 | DJ Cary | Symphony of Force (Saros) | Power Synths | |
315 | DJ Cary | Symphony of Force (Saros) | Power Synths | |
1137 | Magnatune Compilation | Tim Rayborn_ Quen a Virgen ben servir | World Fusion | |
24 | Mijo | Click Here | Fata Morgana | |
2774 | Jami Sieber | Dancing at the Temple Gate | Lush Mechanique | |
611 | Burnshee Thornside | Ha-Keem | Blues and misc | |
1849 | Magnatune Compilation | Jag_ Juke Joint Boogie | New Age and Jazz | |
1941 | Drop Trio | invisible pants | Cezanne | |
1921 | Chris Juergensen | Prospects | Prospects | |
883 | Chris Juergensen | Papa legba | Prospects |