We merge the two columns, instrumental and singing (vocal), into a single label column. Before that, we check for ambiguously labeled entries.
import numpy as np
import pandas as pd
from os.path import join
# DEFINE PATHS and FILE NAMES
METADATA_PATH = 'metadata'
IN_LABEL_FILE = join(METADATA_PATH, 'ismir2018_tut_part_1_instrumental_labels_subset.csv')
OUT_LABEL_FILE = join(METADATA_PATH, 'ismir2018_tut_part_1_instrumental_labels_subset_post.csv')
labels = pd.read_csv(IN_LABEL_FILE, index_col=0) #, sep='\t')
labels.head()
| | instrumental | singing |
|---|---|---|
| 14 | 0.0 | 1.0 |
| 17 | 0.0 | 1.0 |
| 88 | 1.0 | 0.0 |
| 103 | 0.0 | 1.0 |
| 128 | 0.0 | 1.0 |
# check class distribution
labels.sum()
instrumental     443.0
singing         1283.0
dtype: float64
# we already removed the tracks that were *neither* instrumental *nor* vocal
labels.sum(axis=1).min()
1.0
# unfortunately a few songs are labeled *both* instrumental *and* vocal
labels.sum(axis=1).max()
2.0
We remove the ambiguously annotated tracks: a logical XOR keeps only the tracks that are labeled as exactly one of instrumental or vocal.
retain = np.logical_xor(labels['instrumental'], labels['singing'])
retain.head()
14     True
17     True
88     True
103    True
128    True
Name: instrumental, dtype: bool
# keep only the rows that are flagged True in retain
n_orig = len(labels)
n_retain = int(retain.sum())
labels = labels[retain]
print("For instrumental vs. vocal, from originally", n_orig, "input examples, we can only retain", n_retain, "trusted ones in our groundtruth")
For instrumental vs. vocal, from originally 1703 input examples, we can only retain 1680 trusted ones in our groundtruth
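To make the XOR filter concrete, here is a minimal sketch on a made-up five-row table. The track ids and label values are invented for illustration and are not taken from the dataset. XOR is true only when exactly one of the two labels is set, so a row labeled both ways (or not at all) is dropped.

```python
import numpy as np
import pandas as pd

# Toy labels, invented for illustration only
toy = pd.DataFrame(
    {'instrumental': [0.0, 1.0, 1.0, 0.0, 0.0],
     'singing':      [1.0, 0.0, 1.0, 1.0, 0.0]},
    index=[1, 2, 3, 4, 5])

# True only where exactly one of the two columns is set
toy_retain = np.logical_xor(toy['instrumental'], toy['singing'])

# For 0/1 labels this is equivalent to "the row sum equals 1"
assert (toy_retain == (toy.sum(axis=1) == 1)).all()

# Track 3 (both labels) and track 5 (no label) are removed
toy_clean = toy[toy_retain]
print(toy_clean.index.tolist())
```

For 0/1 labels the row-sum comparison would work just as well; XOR simply states the intent ("one or the other, not both") more directly.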
# check class distribution after removal of ambiguous tracks
labels.sum()
instrumental     420.0
singing         1260.0
dtype: float64
labels.head()
| | instrumental | singing |
|---|---|---|
| 14 | 0.0 | 1.0 |
| 17 | 0.0 | 1.0 |
| 88 | 1.0 | 0.0 |
| 103 | 0.0 | 1.0 |
| 128 | 0.0 | 1.0 |
# keep only 1 column as now they are redundant (one is the inverse of the other)
labels = labels[['instrumental']]
labels.head()
| | instrumental |
|---|---|
| 14 | 0.0 |
| 17 | 0.0 |
| 88 | 1.0 |
| 103 | 0.0 |
| 128 | 0.0 |
# double-check number of instrumental tracks
labels.sum()
instrumental    420.0
dtype: float64
# export file under new filename
labels.to_csv(OUT_LABEL_FILE)
print("Wrote " + OUT_LABEL_FILE)
Wrote metadata/ismir2018_tut_part_1_instrumental_labels_subset_post.csv
We retain only the moods that are annotated on at least MIN_MOODS tracks (set to 100 in the code below).
# DEFINE PATHS and FILE NAMES
METADATA_PATH = 'metadata'
IN_LABEL_FILE = join(METADATA_PATH, 'ismir2018_tut_part_1_moods_labels_subset.csv')
OUT_LABEL_FILE = join(METADATA_PATH, 'ismir2018_tut_part_1_moods_labels_subset_post.csv')
MIN_MOODS = 100
metadata = pd.read_csv(IN_LABEL_FILE, index_col=0) #, sep='\t')
metadata.head()
| | airy | calm | dark | deep | different | eerie | happy | light | loud | low | mellow | quiet | sad | scary | soft | strange |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13708 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2697 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 17495 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20431 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7423 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
# how many tracks per mood
metadata.sum()
airy           7
calm          25
dark          37
deep          17
different     23
eerie         14
happy          9
light         12
loud         209
low           21
mellow        16
quiet        177
sad           14
scary          9
soft         200
strange      120
dtype: int64
metadata.sum() >= MIN_MOODS
airy         False
calm         False
dark         False
deep         False
different    False
eerie        False
happy        False
light        False
loud          True
low          False
mellow       False
quiet         True
sad          False
scary        False
soft          True
strange       True
dtype: bool
cols_retain = metadata.columns[(metadata.sum() >= MIN_MOODS)]
cols_retain
Index(['loud', 'quiet', 'soft', 'strange'], dtype='object')
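The column selection can be tried out on a small made-up tag matrix. This is only a sketch: the track ids, mood values and the threshold TOY_MIN are invented for illustration, with TOY_MIN standing in for MIN_MOODS.

```python
import pandas as pd

# Toy tag matrix, invented for illustration only
# (rows: tracks, columns: moods)
toy = pd.DataFrame(
    {'calm': [1, 0, 0, 0],
     'loud': [1, 1, 0, 1],
     'soft': [0, 1, 1, 0]},
    index=[10, 11, 12, 13])

TOY_MIN = 2  # hypothetical threshold, standing in for MIN_MOODS

# Number of tracks annotated with each mood
counts = toy.sum()

# Names of the moods that reach the threshold
keep = counts[counts >= TOY_MIN].index

# Keep only those columns; the rows (tracks) are left untouched
toy_filtered = toy.loc[:, keep]
print(toy_filtered.columns.tolist())
```

Here calm occurs only once and is dropped, while loud (3 tracks) and soft (2 tracks) are kept.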
metadata = metadata[cols_retain]
metadata.head()
| | loud | quiet | soft | strange |
|---|---|---|---|---|
| 13708 | 0 | 0 | 0 | 1 |
| 2697 | 0 | 0 | 0 | 0 |
| 17495 | 1 | 0 | 0 | 0 |
| 20431 | 0 | 0 | 0 | 1 |
| 7423 | 0 | 0 | 1 | 1 |
# info: maximum number of concurrent moods
metadata.sum(axis=1).max()
3
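Dropping columns can leave tracks without any mood: track 2697 in the preview above has zeros in all four remaining columns. Whether such rows should be kept is not decided here. As a hedged sketch, this is one way to count them alongside the maximum number of concurrent moods, using an invented toy table rather than the real metadata:

```python
import pandas as pd

# Toy filtered tag matrix, invented for illustration only
toy = pd.DataFrame(
    {'loud':    [0, 0, 1, 0],
     'soft':    [0, 0, 0, 1],
     'strange': [1, 0, 0, 1]},
    index=[1, 2, 3, 4])

# Number of moods assigned to each track
moods_per_track = toy.sum(axis=1)

# Tracks that lost all of their moods, and the busiest track
n_unlabeled = int((moods_per_track == 0).sum())
max_concurrent = int(moods_per_track.max())

print(n_unlabeled, "track(s) without any mood; at most",
      max_concurrent, "moods per track")
```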
# export file under new filename
metadata.to_csv(OUT_LABEL_FILE)
print("Wrote " + OUT_LABEL_FILE)
Wrote metadata/ismir2018_tut_part_1_moods_labels_subset_post.csv
No postprocessing is needed for the genre labels. For filename compatibility, we simply copied the genres labels file to genres_post.csv.
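The copy step is not shown in the notebook. A minimal sketch of how it could be done with the standard library follows. It runs in a temporary directory so that no real file is touched; the input name genres.csv and the file content are assumptions made up for the demonstration, and only the output name genres_post.csv comes from the text.

```python
import shutil
import tempfile
from os.path import join, exists

# Work in a temporary directory so nothing real is overwritten
tmp_dir = tempfile.mkdtemp()
src = join(tmp_dir, 'genres.csv')        # hypothetical input name
dst = join(tmp_dir, 'genres_post.csv')   # output name as mentioned in the text

# Made-up file content, for demonstration only
with open(src, 'w') as f:
    f.write(',rock,jazz\n1,1,0\n')

# Plain copy of the file contents, no postprocessing
shutil.copyfile(src, dst)
print("Wrote " + dst)
```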