The first part of this tutorial series demonstrates a distance-based similarity search approach based on extracted content descriptors and distance metrics.
The objective of music similarity estimation or retrieval is to quantify the notion of similarity between two given tracks. A central part of such an approach is the definition of a similarity measure, which is in turn affected by how the relevant information is extracted. One approach is to analyze contextual data such as user-generated listening behaviour (e.g. play/skip counts, user tags, ratings, etc.). The approach followed in this tutorial is based on the music content itself and largely focuses on the notion of acoustic similarity. Music features are extracted from the audio content. The resulting music descriptors are high-dimensional numeric vectors, and the accumulation of all feature vectors of a collection forms a vector space. The general principle of content-based similarity estimation rests on the assumption that numerical differences are an expression of perceptual dissimilarity. Metrics such as the Manhattan (L1) or Euclidean (L2) distance, or non-metric similarity functions such as the Kullback-Leibler divergence, are used to estimate the numerical similarity of the feature vectors.
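As a minimal illustration of how such metrics compare two feature vectors (the three-dimensional vectors below are made up for this sketch):

```python
import numpy as np

# Two hypothetical 3-dimensional feature vectors (illustrative values only)
a = np.array([0.2, 1.5, 120.0])
b = np.array([0.4, 1.1, 126.0])

manhattan = np.sum(np.abs(a - b))        # L1: sum of absolute differences
euclidean = np.sqrt(np.sum((a - b)**2))  # L2: root of the summed squared differences

print(manhattan, euclidean)
```

Both metrics return 0 for identical vectors and grow with increasing dissimilarity; they differ only in how strongly large per-dimension deviations are emphasized.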
Please follow the instructions on the tutorial's GitHub page to install the following dependencies to run this tutorial:
# visualization
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
# numeric and scientific processing
import numpy as np
import pandas as pd
# misc
import os
import progressbar
from IPython.display import HTML, display
pd.set_option('display.max_colwidth', None)
Before we can train our models we first have to get some data.
DATASET_PATH = "D:/Research/Data/MIR/MagnaTagATune/ISMIR2018"
Load the feature data from the NumPy `.npz` archive:
with np.load("%s/ISMIR2018_tut_Magnagtagatune_rp_features.npz" % DATASET_PATH) as npz:
    features_rp = npz["rp"]
    features_ssd = npz["ssd"]
    clip_id = npz["clip_id"]
Prepare the feature metadata for alignment with the dataset metadata:
feature_metadata = pd.DataFrame({"featurespace_id": np.arange(features_rp.shape[0]),
                                 "clip_id": clip_id})
Load the metadata from the CSV file:
metadata = pd.read_csv("./metadata/ismir2018_tut_part_3_similartiy_metadata.csv", index_col=0)
metadata.head()
clip_id | mp3_path | track_number | title | artist | album | url | segmentStart | segmentEnd | original_url | |
---|---|---|---|---|---|---|---|---|---|---|
19200 | 42150 | D:/Research/Data/MIR/MagnaTagATune/mp3_full/f/professor_armchair-too_much_mustard-10-bethena-117-146.mp3 | 10 | Bethena | Professor Armchair | Too Much Mustard | http://www.magnatune.com/artists/albums/armchair-octopants/ | 117 | 146 | http://he3.magnatune.com/all/10-Bethena-Professor%20Armchair.mp3 |
7745 | 16994 | D:/Research/Data/MIR/MagnaTagATune/mp3_full/c/liquid_zen-seventythree-04-close-262-291.mp3 | 4 | Close | Liquid Zen | Seventythree | http://www.magnatune.com/artists/albums/liquid-seventythree/ | 262 | 291 | http://he3.magnatune.com/all/04-Close%20-%20Liquid%20Zen.mp3 |
25416 | 57455 | D:/Research/Data/MIR/MagnaTagATune/mp3_full/6/doc_rossi-demarzi6_sonatas_for_cetra_o_kitara-23-sonata_vi_allegro-59-88.mp3 | 23 | Sonata VI Allegro | Doc Rossi | Demarzi-6 Sonatas for Cetra o Kitara | http://www.magnatune.com/artists/albums/rossi-demarzi/ | 59 | 88 | http://he3.magnatune.com/all/23-Sonata%20VI%20Allegro-Doc%20Rossi.mp3 |
613 | 1473 | D:/Research/Data/MIR/MagnaTagATune/mp3_full/b/philharmonia_baroque-beethoven_symphonies_no_3_eroica_and_no_8-01-eroica_1st-117-146.mp3 | 1 | Eroica 1st | Philharmonia Baroque | Beethoven Symphonies No 3 Eroica and No 8 | http://www.magnatune.com/artists/albums/pb-eroica/ | 117 | 146 | http://he3.magnatune.com/all/01-Eroica%201st-Philharmonia%20Baroque.mp3 |
15095 | 33011 | D:/Research/Data/MIR/MagnaTagATune/mp3_full/0/barbara_leoni-human_needs-07-ring_around_the_rosey-88-117.mp3 | 7 | Ring around the rosey | Barbara Leoni | Human Needs | http://www.magnatune.com/artists/albums/leoni-human/ | 88 | 117 | http://he3.magnatune.com/all/07-Ring%20around%20the%20rosey-Barbara%20Leoni.mp3 |
Align featuredata with metadata
metadata = metadata.reset_index()
metadata = metadata.merge(feature_metadata, left_on="clip_id", right_on="clip_id", how="inner", left_index=True, right_index=False)
metadata = metadata.set_index("index")
Add Media-Player to Metadata
tmp = metadata.mp3_path.str.split("/", expand=True)
metadata["player"] = '<audio src="http://localhost:9999/' + tmp[6] + '/' + tmp[7] +'" controls>'
Use Jupyter's display functions to enable HTML5 audio in the pandas DataFrame.
Example
HTML(metadata.iloc[:3][["title", "player"]].to_html(escape=False))
title | player | |
---|---|---|
index | ||
19200 | Bethena | |
7745 | Close | |
25416 | Sonata VI Allegro |
feature_data = np.concatenate([features_rp, features_ssd], axis=1)
feature_data.shape
(6380, 1608)
Sort Metadata by Feature-space ID => metadata is aligned to feature-data
metadata.sort_values("featurespace_id", inplace=True)
Select feature-data for this metadata
# subsample feature-space
feature_data = feature_data[metadata.featurespace_id]
# re-enumerate sub-sampled feature-space (alignment)
metadata["featurespace_id"] = np.arange(metadata.shape[0])
The feature vectors are composed of different feature-sets, all with different value ranges. While features such as Acousticness and Danceability are scaled between 0 and 1, the BPM values of the tempo feature range around 120 or higher. We apply Standard Score (zero mean and unit variance) normalization to uniformly scale the value ranges of the features.
$$ z = {x - \mu \over \sigma} $$
# standardize the features (column-wise)
feature_data -= feature_data.mean(axis=0)
feature_data /= feature_data.std(axis=0)
Dividing by a zero standard deviation (constant feature dimensions) raises `RuntimeWarning: divide by zero encountered in true_divide` and leaves non-finite values, which we replace with zeros:
feature_data = np.nan_to_num(feature_data, nan=0.0)
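A minimal sketch of this standardization step on a toy matrix, including the zero-variance case that triggers the warning:

```python
import numpy as np

# Toy feature matrix; the middle column is constant (zero variance)
X = np.array([[1.0, 5.0, 3.0],
              [2.0, 5.0, 7.0],
              [3.0, 5.0, 11.0]])

X -= X.mean(axis=0)
with np.errstate(divide="ignore", invalid="ignore"):
    X /= X.std(axis=0)         # constant column yields 0/0 -> nan
X = np.nan_to_num(X, nan=0.0)  # replace nan with 0

print(X.mean(axis=0))  # every column now has zero mean
print(X.std(axis=0))   # unit variance, except the constant column
```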
This section describes the fundamentals of the content-based audio similarity search approach followed in this tutorial. Audio features are descriptive numbers calculated from the audio spectrum of a track. A good example is the Spectral Centroid, which can be interpreted as the center of gravity of the audio spectrum. It describes the average frequency weighted by its intensity and distinguishes brighter from darker sounds. Such features are usually calculated for several intervals of a track and finally aggregated into a single vector representation. The latter step, which is a requirement for many machine/statistical learning tasks, is accomplished by calculating statistical measures such as mean, standard deviation, etc.
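This is not the feature extractor used in this tutorial, but a frame-wise Spectral Centroid and its aggregation can be sketched from scratch with plain NumPy (the 1000 Hz test tone is chosen so it aligns exactly with an FFT bin):

```python
import numpy as np

def spectral_centroid(y, sr, frame_len=2000, hop=500):
    """Frame-wise spectral centroid: the magnitude-weighted mean frequency."""
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    centroids = []
    for start in range(0, len(y) - frame_len + 1, hop):
        mag = np.abs(np.fft.rfft(y[start:start + frame_len]))
        centroids.append(np.sum(freqs * mag) / np.sum(mag))
    return np.array(centroids)

# 1 second of a pure 1000 Hz tone (bin-aligned, so there is no spectral leakage)
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 1000.0 * t)

c = spectral_centroid(y, sr)
print(c.mean(), c.std())  # mean close to 1000 Hz, near-zero deviation for a steady tone
```

Aggregating the frame-wise values by mean and standard deviation yields exactly the kind of two-dimensional representation shown in the example below.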
In the following example, the Spectral Centroids of 10 different tracks are represented by their mean and standard deviation aggregations. The Spectral Centroid feature(-set) of a single track is thus a two-dimensional feature vector such as:
ID Mean Standard Deviation
0 1517.5993814237531 291.1855836731788
In this example the center frequency is 1518 Hz and it deviates by 291 Hz. These numbers already describe the audio content and can be used to find similar tracks. The common approach to calculate music similarity from audio content is based on vector differences. The assumption is that similar audio feature values correspond to similar audio content. Thus, feature vectors with smaller vector differences correspond to more similar tracks. The following data represents the extracted Spectral Centroids of our 10-track collection:
ID Mean Standard Deviation
0 1517.5993814237531 291.1855836731788
1 1659.1988993873124 327.64811981777865
2 1507.4617047141264 340.8830079395701
3 1597.6019371942953 507.1007933367403
4 1498.8531206911534 288.3780838480238
5 535.5910732230583 89.90893994909047
6 2261.4032345595674 353.5971736260454
7 2331.881852844861 406.33517225264194
8 1868.690426450363 342.7489751514078
9 2204.6324484864085 328.94334883095553
The tracks have unique identifiers and we are using the track with ID 5 to search for similar items. This step requires a similarity metric, which defines how the vector distance has to be calculated as a single numeric value. The most common choices are the Manhattan (L1) and Euclidean (L2) distance measures. The Euclidean Distance is the square root of the sum of squared differences of two vectors. To calculate the Euclidean Distance between track 5 and track 0:
ID Mean Standard Deviation
0 1517.5993814237531 291.1855836731788
5 535.5910732230583 89.90893994909047
we first compute the difference between the values of each vector
982.008308 201.276644
square them to obtain non-negative magnitudes:
964340.317375 40512.287309
and take the sum of these values:
1004852.6046840245
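These steps can be reproduced in a few lines (the two vectors are copied from the table above):

```python
import numpy as np

track_0 = np.array([1517.5993814237531, 291.1855836731788])
track_5 = np.array([535.5910732230583, 89.90893994909047])

diff = track_0 - track_5  # element-wise difference
squared = diff**2         # squared magnitudes
dist = squared.sum()      # squared Euclidean distance, ~1004852.6

print(diff, squared, dist)
```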
By definition, the square root of the sum has to be taken, but this step is usually skipped because it does not alter the ranking and is computationally expensive. By calculating the distance to every item in the collection, we retrieve a list of distance values where smaller distances correspond to more similar audio content and larger distances to more dissimilar content.
ID Distance
0 1004852.6046840245
1 1319014.4646621975
2 1007520.5071585375
3 1301916.1177259558
4 967263.7731724023
5 0.0
6 3047959.100796666
7 3326786.1254441254
8 1841081.968976167
9 2842836.5609704787
To retrieve a ranked list of similar-sounding tracks, the list of vector distances has to be sorted in ascending order.
ID Distance
5 0.0
4 967263.7731724023
0 1004852.6046840245
2 1007520.5071585375
3 1301916.1177259558
1 1319014.4646621975
8 1841081.968976167
9 2842836.5609704787
6 3047959.100796666
7 3326786.1254441254
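The ranking above can be reproduced with `np.argsort`, which returns the indices that would sort the distance array in ascending order (the ten mean/std pairs are copied from the collection table):

```python
import numpy as np

# Spectral Centroid (mean, std) vectors of the 10-track toy collection
collection = np.array([
    [1517.5993814237531, 291.1855836731788],
    [1659.1988993873124, 327.64811981777865],
    [1507.4617047141264, 340.8830079395701],
    [1597.6019371942953, 507.1007933367403],
    [1498.8531206911534, 288.3780838480238],
    [535.5910732230583, 89.90893994909047],
    [2261.4032345595674, 353.5971736260454],
    [2331.881852844861, 406.33517225264194],
    [1868.690426450363, 342.7489751514078],
    [2204.6324484864085, 328.94334883095553],
])

query = collection[5]                           # track with ID 5
dist = np.sum((collection - query)**2, axis=1)  # squared Euclidean distances
ranking = np.argsort(dist)                      # ascending: most similar first

print(ranking)  # the query track (distance 0) comes first
```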
This so-called vector space model is predominant in content-based multimedia retrieval. The most crucial and problematic part is feature crafting: if the extracted numbers do not describe the audio well enough, the vector-based similarity will also fail to provide results that are perceived as similar. The described approach requires the feature vectors of all items of a collection to be available, so the feature vectors must be stored. No matter which retrieval approach (pre-calculated / indexed / on demand) is chosen, all features will be required at some point. Because feature extraction is a computationally expensive task (in terms of processing resources and total time), the extracted features are stored and made accessible using a common data format.
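A common choice for such a data format in the Python ecosystem is NumPy's `.npz` archive, as used when loading the precomputed features at the beginning of this tutorial. A minimal round-trip sketch (the array names and the temporary path are illustrative):

```python
import os
import tempfile
import numpy as np

# Hypothetical feature matrix and clip identifiers
features = np.random.rand(100, 1608).astype(np.float32)
clip_ids = np.arange(100)

path = os.path.join(tempfile.gettempdir(), "features_demo.npz")
np.savez_compressed(path, features=features, clip_id=clip_ids)

with np.load(path) as npz:
    restored = npz["features"]

print(np.array_equal(features, restored))  # True: lossless round trip
os.remove(path)
```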
In the final part of this tutorial we will use the Euclidean Distance to calculate similarities between tracks. As mentioned above, the Euclidean Distance is a metric for the distance between two vectors and thus a function of dissimilarity: vectors with smaller distance values are more similar than those with higher distances.
$$ d(p,q) = \sqrt{\sum_{i=1}^n (q_i-p_i)^2} $$

def eucledian_distance(feature_space, query_vector):
    return np.sqrt(np.sum((feature_space - query_vector)**2, axis=1))
For the rest of the tutorial we will use this song to demonstrate the results of the approach:
# columns to display and the query track used throughout the examples
display_cols = ["artist", "title", "album", "player"]
query_track_idx = 1102
HTML(metadata[display_cols].iloc[[query_track_idx]].to_html(escape=False))
artist | title | album | player | |
---|---|---|---|---|
index | ||||
18465 | Chris Harvey | Pixelize | The White Sail |
The following lines of code implement the approach described above. First, the distances between the query vector and all other vectors of the collection are calculated. Then the distances are sorted in ascending order to get the most similar tracks. Because the metric distance of identical vectors is 0, the top-most entry of the sorted list is always the query track itself.
# calculate the distance between the query-vector and all others
dist = eucledian_distance(feature_data, feature_data[metadata.iloc[query_track_idx].featurespace_id])
# sort the distances ascendingly - use sorted index
sorted_idx = np.argsort(dist)
HTML(metadata.set_index("featurespace_id").loc[sorted_idx[:11]][display_cols].to_html(escape=False))
artist | title | album | player | |
---|---|---|---|---|
featurespace_id | ||||
1102 | Chris Harvey | Pixelize | The White Sail | |
2774 | Jami Sieber | Dancing at the Temple Gate | Lush Mechanique | |
2318 | DJ Cary | Symphony of Force (Saros) | Power Synths | |
315 | DJ Cary | Symphony of Force (Saros) | Power Synths | |
24 | Mijo | Click Here | Fata Morgana | |
1137 | Magnatune Compilation | Tim Rayborn_ Quen a Virgen ben servir | World Fusion | |
2052 | C Layne | How Soon I Forget | The Sun Will Come Out to Blind You | |
2021 | Sun Palace | Your Hands Lie Open | Into Heaven | |
2097 | Curl | Sincerely Sorry | Inner | |
873 | Atomic Opera | Meaningless Word | Alpha and Oranges | |
611 | Burnshee Thornside | Ha-Keem | Blues and misc |
The approach taken to combine the different feature-sets is referred to as early fusion. The problem with the approach described in the previous step is that larger feature-sets dominate the calculated distance values. The aggregated MFCC and Chroma features have 24 dimensions each. Together they have more dimensions than the remaining features, which are mostly single-dimensional. Thus, the distances are unequally dominated by these two feature-sets.
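The dominance effect can be illustrated with a toy sketch using the dimensionalities of the RP (1440) and SSD (168) blocks of this tutorial's feature-space: both blocks deviate by the same amount in every dimension, yet the larger block contributes proportionally more to the squared distance.

```python
import numpy as np

rng = np.random.default_rng(0)
rp_dim, ssd_dim = 1440, 168

# Two vectors differing by the same amount in every dimension
a = rng.random(rp_dim + ssd_dim)
b = a + 0.1

sq = (a - b)**2
rp_part = sq[:rp_dim].sum()   # contribution of the RP block
ssd_part = sq[rp_dim:].sum()  # contribution of the SSD block

print(rp_part / ssd_part)  # ~8.57, i.e. the ratio 1440/168
```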
To avoid such a bias, we scale the feature-space such that feature-sets and single-value features have equal weights and thus equal influence on the resulting distance.
# feature-set lengths and order
featureset_lengths = [1440, # rp
                      168]  # ssd

def scaled_eucledian_distance(feature_space, query_vector):
    distances = (feature_space - query_vector)**2
    # start index of the current feature-set block
    start_idx = 0
    # normalize each block's contribution by its dimensionality
    for sequence_length in featureset_lengths:
        stop_idx = start_idx + sequence_length
        distances[:, start_idx:stop_idx] /= sequence_length
        start_idx = stop_idx
    return np.sqrt(np.sum(distances, axis=1))
Example result
dist = scaled_eucledian_distance(feature_data, feature_data[metadata.iloc[query_track_idx].featurespace_id])
HTML(metadata.set_index("featurespace_id").loc[np.argsort(dist)[:11]][display_cols].to_html(escape=False))
artist | title | album | player | |
---|---|---|---|---|
featurespace_id | ||||
1102 | Chris Harvey | Pixelize | The White Sail | |
2318 | DJ Cary | Symphony of Force (Saros) | Power Synths | |
315 | DJ Cary | Symphony of Force (Saros) | Power Synths | |
2774 | Jami Sieber | Dancing at the Temple Gate | Lush Mechanique | |
24 | Mijo | Click Here | Fata Morgana | |
1137 | Magnatune Compilation | Tim Rayborn_ Quen a Virgen ben servir | World Fusion | |
611 | Burnshee Thornside | Ha-Keem | Blues and misc | |
1849 | Magnatune Compilation | Jag_ Juke Joint Boogie | New Age and Jazz | |
2052 | C Layne | How Soon I Forget | The Sun Will Come Out to Blind You | |
2021 | Sun Palace | Your Hands Lie Open | Into Heaven | |
1941 | Drop Trio | invisible pants | Cezanne |
As explained above, the vanilla Euclidean Distance in an early fusion approach is dominated by large feature-sets. By scaling the feature-space we achieved equal influence for all feature-sets and features. However, equal influence is not always the best choice for music similarity. For example, the year and popularity features we included in our feature vector are not intrinsic music properties; we just added them to cluster recordings of the same epoch together. Currently these features have the same impact on the estimated similarity as timbre, rhythm and harmonics. When using many features it is commonly a good choice to apply different weights to them. Estimating these weights is generally done empirically.
# feature-set weights (same order as featureset_lengths)
featureset_weights = [0.5, # rp
                      0.9] # ssd

def weighted_eucledian_distance(feature_space, query_vector, featureset_weights):
    distances = (feature_space - query_vector)**2
    # start index of the current feature-set block
    start_idx = 0
    # normalize each block by its dimensionality, then apply its weight
    for sequence_length, weight in zip(featureset_lengths, featureset_weights):
        stop_idx = start_idx + sequence_length
        distances[:, start_idx:stop_idx] /= sequence_length
        distances[:, start_idx:stop_idx] *= weight
        start_idx = stop_idx
    return np.sqrt(np.sum(distances, axis=1))
Example result:
dist = weighted_eucledian_distance(feature_data,
                                   feature_data[metadata.iloc[query_track_idx].featurespace_id],
                                   featureset_weights)
HTML(metadata.set_index("featurespace_id").loc[np.argsort(dist)[:11]][display_cols].to_html(escape=False))
artist | title | album | player | |
---|---|---|---|---|
featurespace_id | ||||
1102 | Chris Harvey | Pixelize | The White Sail | |
2318 | DJ Cary | Symphony of Force (Saros) | Power Synths | |
315 | DJ Cary | Symphony of Force (Saros) | Power Synths | |
1137 | Magnatune Compilation | Tim Rayborn_ Quen a Virgen ben servir | World Fusion | |
24 | Mijo | Click Here | Fata Morgana | |
2774 | Jami Sieber | Dancing at the Temple Gate | Lush Mechanique | |
611 | Burnshee Thornside | Ha-Keem | Blues and misc | |
1849 | Magnatune Compilation | Jag_ Juke Joint Boogie | New Age and Jazz | |
1941 | Drop Trio | invisible pants | Cezanne | |
1921 | Chris Juergensen | Prospects | Prospects | |
883 | Chris Juergensen | Papa legba | Prospects |