#!/usr/bin/env python
# coding: utf-8

# # In Search of a New Musical Genre, Using Spotify Data
# **Alex Nisnevich**

# ## Introduction

# I want to come up with a new musical genre.
# 
# Let's think of a genre as a set of characteristics that songs share. If we look at the space of all potential characteristics of songs, genres are subspaces of it. Intuitively, the most clearly defined genres should be well separated from other genres, without too much overlap.
# 
# So, how does one go about finding a new set of musical characteristics to create an original genre of music from? Well, we'd have to find a subspace (a "niche") that's unoccupied, and fill it.
# 
# Let's try to find such a niche with data.

# ## I. Data Collection

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')
import pandas as pd
import matplotlib.pyplot as plt


# What better dataset to use than Spotify itself? I cooked up a quick [Ruby script](https://github.com/AlexNisnevich/spotify-genre-features) to extract 50 (hopefully representative) songs for each genre within Spotify and get a feature vector of each song's ["audio features"](https://developer.spotify.com/web-api/get-several-audio-features/), a set of variables ranging from "acousticness" to "valence" (musical positivity).

# In[2]:


songs = pd.read_csv('https://raw.githubusercontent.com/AlexNisnevich/spotify-genre-features/master/features.csv')


# Here's a sample of the kind of musical characteristics that we're dealing with:

# In[3]:


songs.sample(5)


# For the most part, we're interested in genres, not individual songs, so let's pivot by genre (taking the mean of each feature):

# In[4]:


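# Average only the numeric feature columns; text columns such as the track id cannot be averaged
genres = songs.pivot_table(index='genre', values=songs.select_dtypes('number').columns.tolist(), aggfunc='mean')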


# In[5]:


genres.head()


# ## II. Exploration

# Now let's plot some features against one another and see if we find any obvious gaps.
# 
# First, let's try plotting tempo against loudness:

# In[6]:


fig, ax = plt.subplots(figsize=(15, 15))
genres.plot.scatter('tempo', 'loudness', ax=ax, s=0) # To get the right x/y bounds
ax.set_xlabel('tempo')
ax.set_ylabel('loudness')
for k, v in genres.iterrows():
    ax.annotate(k, (v['tempo'], v['loudness']))


# Nothing too exciting here. Most genres are clustered in a tight ball around 110-140 bpm and loudness in the -13 to -5 dB range.

# Let's try using more expressive features. How about valence (musical positivity) and danceability?

# In[7]:


fig, ax = plt.subplots(figsize=(20, 20))
ax.set_xlabel('valence')
ax.set_ylabel('danceability')
for k, v in genres.iterrows():
    ax.annotate(k, (v['valence'], v['danceability']))


# The pattern is a little more interesting now: valence and danceability tend to be correlated (which makes sense!), but, for example, techno is disproportionately danceable given its middling positivity, while rockabilly is disproportionately positive compared to its danceability. So there's a bit of a spectrum there. That makes sense too.
# 
# Still, no obvious gaps ...

# How about valence vs. energy? These two seem to be less closely related to one another (for example, metal is very high-energy and typically low-valence), and thus should result in a more interesting graph.
# 
# And we're right:

# In[8]:


fig, ax = plt.subplots(figsize=(20, 20))
ax.set_xlabel('Valence (musical positiveness conveyed by a track)')
ax.set_ylabel('Energy (perceptual measure of intensity and activity)')
for k, v in genres.iterrows():
    ax.annotate(k, (v['valence'], v['energy']))


# Now _that's_ interesting: the genres are kind of all over the place. And look at that big empty circle in the middle!
# 
# That gap, between MPB and "latin" (or techno) in energy and between trip-hop and funk in valence, seems like just the sort of place for a new genre to arise.
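
# As a side check on the hunch that valence and energy are less closely related than valence and danceability, the correlations can be computed directly. This is only a minimal sketch: `feature_correlations` is a hypothetical helper, and `toy_genres` holds made-up numbers standing in for the real `genres` table.

# In[ ]:


import pandas as pd


def feature_correlations(frame, features):
    """Return the pairwise Pearson correlation matrix of the given columns."""
    return frame[features].corr()


# Made-up per-genre means, for illustration only (not Spotify data)
toy_genres = pd.DataFrame(
    {
        'valence': [0.2, 0.4, 0.6, 0.8],
        'danceability': [0.3, 0.5, 0.6, 0.8],
        'energy': [0.9, 0.3, 0.8, 0.4],
    },
    index=['genre_a', 'genre_b', 'genre_c', 'genre_d'],
)
toy_corr = feature_correlations(toy_genres, ['valence', 'danceability', 'energy'])
toy_corr

# On the real data, the equivalent call would be feature_correlations(genres, ['valence', 'danceability', 'energy']).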

# A natural question arises: is there _any_ music that fits that gap? Is it a gap from a song-by-song perspective or only from a genre-by-genre perspective? Let's examine that by plotting all our songs:

# In[9]:


fig, ax = plt.subplots(figsize=(20, 20))
ax.set_xlabel('Valence (musical positiveness conveyed by a track)')
ax.set_ylabel('Energy (perceptual measure of intensity and activity)')
ax.plot(songs.valence, songs.energy, marker='o', linestyle='', ms=1)
for k, v in genres.iterrows():
    ax.annotate(k, (v['valence'], v['energy']))


# Ah, looks like there are definitely songs occupying this gap.

# ## III. Digging Deeper

# Let's zoom in on the hole we found:

# In[10]:


fig, ax = plt.subplots(figsize=(10, 10))
ax.set_xlabel('Valence')
ax.set_ylabel('Energy')
ax.plot(songs.valence, songs.energy, marker='o', linestyle='', ms=1)
ax.set_xbound([0.4, 0.75])
ax.set_ybound([0.5, 0.75])
for k, v in genres.iterrows():
    ax.annotate(k, (v['valence'], v['energy']))


# Now let's zoom in even further, to find the exact tracks that are closest to our "target" (halfway between MPB and Latin in energy, and halfway between funk and trip-hop in valence):

# In[11]:


x_target = (genres.loc['trip-hop', 'valence'] + genres.loc['funk', 'valence']) / 2
y_target = (genres.loc['latin', 'energy'] + genres.loc['mpb', 'energy']) / 2


# In[12]:


fig, ax = plt.subplots(figsize=(10, 10))
ax.set_xlabel('Valence')
ax.set_ylabel('Energy')
ax.set_xbound([x_target - 0.03, x_target + 0.03])
ax.set_ybound([y_target - 0.03, y_target + 0.03])
ax.annotate('+', (x_target, y_target))
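# Label only the songs inside the plotted window instead of looping over every song
in_window = ((songs['valence'] - x_target).abs() <= 0.03) & ((songs['energy'] - y_target).abs() <= 0.03)
for k, v in songs[in_window].iterrows():
    ax.annotate(v['id'], (v['valence'], v['energy']))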


# And what _are_ the closest songs to our target?

# In[13]:


# Euclidean distance from each song to the target point in the valence/energy plane
songs['distance'] = ((songs['valence'] - x_target) ** 2 + (songs['energy'] - y_target) ** 2) ** 0.5


# In[14]:


songs.sort_values('distance').head(10)


# Looking up the Spotify track IDs gives us the five songs most emblematic of this "new" potential genre, and it's quite the eclectic mix!

# - [Fly with me (『莫非,這就是愛情』片頭曲) by 溫嵐](https://open.spotify.com/track/7n5q5KCdG2BHlfA92q60hA)
# - [La Vista Gorda by Fernando Otero](https://open.spotify.com/track/6IZBT8fnxNMqNGsNteBQlO)
# - [Wooly Bully by Sam The Sham & The Pharaohs](https://open.spotify.com/track/3JTpSzfNBHhK6qWoJVMDTQ)
# - [Music Prayer For Peace by Ernie Watts](https://open.spotify.com/track/3rtCErcwp0iPjvBNgSvhjy)
# - [Me Duele Amarte by Reik](https://open.spotify.com/track/2VLcpwzuZVSZQUCQDQRM8n)
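
# Each link above is a track ID appended to a fixed prefix, so turning IDs into links is easy to sketch. `track_url` is a hypothetical helper, and this assumes the `id` column holds the same kind of track IDs that appear in the links above. Looking up titles and artists is a separate step that isn't shown here.

# In[ ]:


def track_url(track_id):
    """Build an open.spotify.com link from a track ID, following the link pattern in the list above."""
    return 'https://open.spotify.com/track/' + track_id


# Track ID taken from the "Wooly Bully" link above
wooly_bully_url = track_url('3JTpSzfNBHhK6qWoJVMDTQ')

# On the real data, something like songs.sort_values('distance')['id'].head(5).map(track_url) would give all five links.
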

# ## Conclusion / Future Work

# This has of course been rather silly, but still an interesting piece of data exploration.
# 
# Using only Spotify song features, we found a musical niche (within the valence/energy plane) that no single genre seems to fill, though a diverse group of songs occupies it.
# 
# If I were to work on this idea more, here are some other things I'd like to try:
# - Looking at higher-dimensional data (e.g. 3d plots ... dimensionality reduction via PCA ...) to find holes
# - A more principled method of finding holes that doesn't rely on visual inspection
# - Observing the clusters formed by each genre rather than just taking the mean of the songs within it
# - More data!
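
# As a starting point for the "more principled method" idea, one option is a brute-force grid search for the point farthest from every genre. This is a rough sketch under assumptions, not something used in the analysis above: `largest_gap` is a hypothetical helper, and the ring of points is made-up data. It only finds interior holes when the search box is kept inside the cloud of genres; on a wider box, the empty edges would win instead.

# In[ ]:


import numpy as np


def largest_gap(points, lower, upper, steps=41):
    """Grid-search the box [lower, upper] for the location farthest from all points.

    Returns (location, distance from that location to its nearest point).
    """
    points = np.asarray(points, dtype=float)
    axes = [np.linspace(lo, hi, steps) for lo, hi in zip(lower, upper)]
    grid = np.stack(np.meshgrid(*axes, indexing='ij'), axis=-1).reshape(-1, points.shape[1])
    # Distance from every grid candidate to every input point
    dists = np.linalg.norm(grid[:, None, :] - points[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    best = nearest.argmax()
    return grid[best], nearest[best]


# Made-up "genres": eight points on a ring of radius 0.2 around (0.6, 0.6)
angles = np.arange(8) * np.pi / 4
ring = np.column_stack([0.6 + 0.2 * np.cos(angles), 0.6 + 0.2 * np.sin(angles)])
gap_center, gap_radius = largest_gap(ring, ring.min(axis=0), ring.max(axis=0))

# On the real data, the points would be genres[['valence', 'energy']].to_numpy(), with a search box chosen by hand.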