Spotify has quickly become the most popular music streaming service in the world, with over 271 million monthly active users. A close collaborator of Spotify is Genius, a website with 26.5 million monthly users that allows members of its community to upload, annotate, and interpret lyrics. Inspired by the blog of Thompson Analytics, I will use the Spotify API, the Genius API, and the NRC Emotion Lexicon to quantify both the musical and lyrical sentiment of one of the most prominent artists of our time: Kendrick Lamar. A visualization of his discography will accompany this process of data collection. In addition, I will gather data on the top 100 most-streamed songs of all time and use different regression techniques to determine how useful these features are for predicting musical success.
The Spotify API assigns musical features to every track on their platform. These features are defined in the following way:
Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
Danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
Energy: A measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
Speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
I will use these features for visualization purposes in the first part and for analytical purposes in the second.
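Note that most of these features live on a 0–1 scale, while loudness (dB) and tempo (BPM) do not. This matters later, when features are plotted on shared axes. If you ever want loudness and tempo on a comparable scale, a min-max rescale is one option; this is a sketch with made-up values, not something the analysis below relies on:

```python
import pandas as pd

def minmax_scale(series: pd.Series) -> pd.Series:
    # Rescale a column onto [0, 1] so it can sit next to the 0-1 features.
    return (series - series.min()) / (series.max() - series.min())

# Toy values roughly in the ranges Spotify reports.
df = pd.DataFrame({"loudness": [-16.8, -6.7, -4.5], "tempo": [156.9, 139.9, 189.9]})
scaled = df.apply(minmax_scale)
```

After scaling, every column spans exactly [0, 1], at the cost of the values only being meaningful relative to the rest of the dataset.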
pip install spotipy --upgrade
Requirement already up-to-date: spotipy in /opt/conda/lib/python3.7/site-packages (2.11.2) Requirement already satisfied, skipping upgrade: six>=1.10.0 in /opt/conda/lib/python3.7/site-packages (from spotipy) (1.14.0) Requirement already satisfied, skipping upgrade: requests>=2.20.0 in /opt/conda/lib/python3.7/site-packages (from spotipy) (2.23.0) Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->spotipy) (2019.11.28) Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->spotipy) (1.25.7) Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->spotipy) (2.9) Requirement already satisfied, skipping upgrade: chardet<4,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->spotipy) (3.0.4) Note: you may need to restart the kernel to use updated packages.
I will be using Plotly for most visualization tasks since I love how crisp it looks. For the second part of this project I will rely on Scikit-learn to compute different prediction models. In addition, I need to set the credentials obtained from my Spotify API application to access their data.
# import libraries
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import time
from datetime import date
import matplotlib.pyplot as plt
from math import pi
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import numpy as np
from sklearn import (linear_model, metrics, neural_network, pipeline, model_selection, preprocessing)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from scipy import stats
# authenticate and connect to the Spotify API
client_id = 'YOUR_CLIENT_ID'          # replace with your own Spotify API credentials
client_secret = 'YOUR_CLIENT_SECRET'  # (keep these out of public notebooks)
client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
Next, I define a series of functions that will allow me to gather data from the Spotify API and put it into a dataframe:
# Defining functions
def get_artist(name):
    # Returns the artist object with respective data
    results = sp.search(q='artist:' + name, type='artist')
    items = results['artists']['items']
    if len(items) > 0:
        return items[0]
    else:
        return None
def get_artist_albums_ids(artist):
    # Given an artist object, returns a list with each album's id
    # (the API call is made once, rather than once per iteration)
    albums = sp.artist_albums(artist['id'], album_type='album', limit=50)['items']
    albums_ids = []
    for album in albums:
        albums_ids.append(album['id'])
    return albums_ids
def filter_albums_ids(albums_ids):
    # Spotify has many versions of the same album.
    # This function filters a list of album ids and keeps the most popular version of each
    album_names = []
    album_pop = []
    for album_id in albums_ids:
        album = sp.album(album_id)  # fetch each album once
        album_names.append(album['name'])
        album_pop.append(album['popularity'])
    d = {'id': albums_ids, 'name': album_names, 'popularity': album_pop}
    df = pd.DataFrame(data=d)
    df = df.sort_values('popularity', ascending=False).drop_duplicates('name').sort_index()
    df = df.reset_index(drop=True)
    return df['id']
def get_artist_albums_tracks_ids(albums_ids):
    # Given a list of album ids, returns a list of the ids of the albums' tracks
    tracks_ids = []
    for album_id in albums_ids:
        for item in sp.album_tracks(album_id)['items']:
            tracks_ids.append(item['id'])
    return tracks_ids
def get_track_features(id):
    # Given a track id, returns a list with its musical features as described in the introduction
    meta = sp.track(id)
    features = sp.audio_features(id)
    # Meta
    name = meta['name']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']
    release_date = meta['album']['release_date']
    length = meta['duration_ms']
    popularity = meta['popularity']
    # Features
    acousticness = features[0]['acousticness']
    danceability = features[0]['danceability']
    energy = features[0]['energy']
    instrumentalness = features[0]['instrumentalness']
    liveness = features[0]['liveness']
    loudness = features[0]['loudness']
    speechiness = features[0]['speechiness']
    valence = features[0]['valence']
    tempo = features[0]['tempo']
    time_signature = features[0]['time_signature']
    track = [name, album, artist, release_date, length, popularity, danceability, acousticness, energy, instrumentalness, liveness, loudness, speechiness, valence, tempo, time_signature]
    return track
def get_tracks_features(tracks_ids):
    # Given a list of track ids, returns a list with their features
    tracks = []
    for track_id in tracks_ids:
        tracks.append(get_track_features(track_id))
    return tracks
def tracks_features_to_csv(tracks_features, csv_title):
    # Transforms the list of track features into a csv file
    df = pd.DataFrame(tracks_features, columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'valence', 'tempo', 'time_signature'])
    df.to_csv(csv_title + ".csv", sep=',')
def get_discography(artist_name):
    # Returns the discography of an artist given its name (string)
    # Do not use: it is quite unreliable, since a failure at any step breaks the whole chain
    tracks_features = get_tracks_features(get_artist_albums_tracks_ids(filter_albums_ids(get_artist_albums_ids(get_artist(artist_name)))))
    return tracks_features_to_csv(tracks_features, artist_name)
def get_playlist_track_ids(user, playlist_id):
    # Initially I wanted to use a playlist to get an artist's discography,
    # but finding a good playlist proved to be quite difficult
    ids = []
    playlist = sp.user_playlist(user, playlist_id)
    for item in playlist['tracks']['items']:
        track = item['track']
        ids.append(track['id'])
    return ids
Done! Let's see the functions in action.
# first, we find the artist object
kendrick = get_artist("Kendrick Lamar")
# we look for his albums
albums_ids = get_artist_albums_ids(kendrick)
# we filter this list since spotify has many versions of the same album. We keep the most popular one
filtered_albums_ids = filter_albums_ids(albums_ids)
# we get the id of every track in his discography
tracks_ids = get_artist_albums_tracks_ids(filtered_albums_ids)
# we get the features of every track
tracks_features = get_tracks_features(tracks_ids)
# we export the data to csv format
tracks_features_to_csv(tracks_features, "Kendrick Lamar")
The API tends to be quite unreliable: at times the process yields an error, so the previous step may require more than one attempt. In any case, I got the data on the second try. Let's see how it looks.
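One way to make such flaky calls less painful is a small retry wrapper. This is a sketch of my own (`with_retries` is not part of spotipy); it could wrap any of the functions above, e.g. `with_retries(get_track_features, track_id)`. The demo below uses a deliberately flaky stand-in function instead of a real API call:

```python
import time

def with_retries(func, *args, attempts=3, wait=2.0, **kwargs):
    # Call func, retrying on any exception with a fixed pause between tries.
    # Re-raises the last exception if all attempts fail.
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(wait)

# A stand-in for an unreliable API call: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky, attempts=5, wait=0.0)
```

A fixed pause is the simplest policy; exponential backoff would be gentler on rate limits.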
df_kendrick = pd.read_csv("Kendrick Lamar.csv",index_col=[0])
print(df_kendrick['album'].unique())
df_kendrick
['Black Panther The Album Music From And Inspired By' 'DAMN. COLLECTORS EDITION.' 'DAMN.' 'untitled unmastered.' 'To Pimp A Butterfly' 'good kid, m.A.A.d city (Deluxe)' 'good kid, m.A.A.d city' 'Section.80' 'Overly Dedicated']
name | album | artist | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | valence | tempo | time_signature | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Black Panther | Black Panther The Album Music From And Inspire... | Kendrick Lamar | 2018-02-09 | 130613 | 58 | 0.618 | 0.6250 | 0.582 | 0.000004 | 0.2650 | -9.454 | 0.2970 | 0.480 | 90.035 | 4 |
1 | All The Stars (with SZA) | Black Panther The Album Music From And Inspire... | Kendrick Lamar | 2018-02-09 | 232186 | 79 | 0.698 | 0.0605 | 0.633 | 0.000194 | 0.0926 | -4.946 | 0.0597 | 0.552 | 96.924 | 4 |
2 | X (with 2 Chainz & Saudi) | Black Panther The Album Music From And Inspire... | Kendrick Lamar | 2018-02-09 | 267426 | 70 | 0.768 | 0.0201 | 0.471 | 0.000000 | 0.2680 | -8.406 | 0.2590 | 0.405 | 131.023 | 4 |
3 | The Ways (with Swae Lee) | Black Panther The Album Music From And Inspire... | Kendrick Lamar | 2018-02-09 | 238893 | 66 | 0.727 | 0.0626 | 0.720 | 0.000001 | 0.1760 | -5.856 | 0.0488 | 0.589 | 140.080 | 4 |
4 | Opps (with Yugen Blakrok) | Black Panther The Album Music From And Inspire... | Kendrick Lamar | 2018-02-09 | 180893 | 60 | 0.706 | 0.1520 | 0.775 | 0.000033 | 0.4160 | -6.819 | 0.3350 | 0.847 | 127.929 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
118 | Barbed Wire | Overly Dedicated | Kendrick Lamar | 2010-09-14 | 265678 | 44 | 0.613 | 0.0152 | 0.843 | 0.000000 | 0.0917 | -8.343 | 0.1860 | 0.325 | 102.968 | 4 |
119 | Average Joe | Overly Dedicated | Kendrick Lamar | 2010-09-14 | 256048 | 46 | 0.740 | 0.2430 | 0.733 | 0.000000 | 0.1840 | -3.343 | 0.2480 | 0.218 | 91.603 | 4 |
120 | H.O.C | Overly Dedicated | Kendrick Lamar | 2010-09-14 | 316975 | 44 | 0.613 | 0.1050 | 0.591 | 0.000000 | 0.1910 | -8.580 | 0.3910 | 0.371 | 77.124 | 4 |
121 | Cut You Off (To Grow Closer) | Overly Dedicated | Kendrick Lamar | 2010-09-14 | 364103 | 47 | 0.685 | 0.0770 | 0.681 | 0.000000 | 0.1290 | -7.176 | 0.4810 | 0.614 | 82.982 | 4 |
122 | She Needs Me (Remix) | Overly Dedicated | Kendrick Lamar | 2010-09-14 | 195790 | 53 | 0.606 | 0.4170 | 0.835 | 0.000001 | 0.1030 | -6.107 | 0.2650 | 0.313 | 100.145 | 4 |
123 rows × 16 columns
Looking good. However, notice that we still have some repeated observations because of the deluxe edition of good kid, m.A.A.d city and the collector's edition of DAMN.. In each case I will keep the more popular version and drop the other. Moreover, I will drop the Black Panther soundtrack, since I believe it does not stand as a singular effort by Kendrick, but rather as a collective effort that accompanies the movie.
df_kendrick = df_kendrick[df_kendrick['album'] != 'good kid, m.A.A.d city']
df_kendrick = df_kendrick[df_kendrick['album'] != 'Black Panther The Album Music From And Inspired By']
df_kendrick = df_kendrick[df_kendrick['album'] != 'DAMN. COLLECTORS EDITION.']
df_kendrick = df_kendrick.reset_index(drop=True)
df_kendrick.head()
name | album | artist | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | valence | tempo | time_signature | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BLOOD. | DAMN. | Kendrick Lamar | 2017-04-14 | 118066 | 61 | 0.357 | 0.14200 | 0.238 | 0.085900 | 0.5500 | -16.780 | 0.265 | 0.494 | 156.907 | 4 |
1 | DNA. | DAMN. | Kendrick Lamar | 2017-04-14 | 185946 | 78 | 0.638 | 0.00454 | 0.523 | 0.000000 | 0.0842 | -6.664 | 0.357 | 0.422 | 139.913 | 4 |
2 | YAH. | DAMN. | Kendrick Lamar | 2017-04-14 | 160293 | 64 | 0.670 | 0.57600 | 0.700 | 0.000005 | 0.2260 | -7.893 | 0.196 | 0.648 | 69.986 | 4 |
3 | ELEMENT. | DAMN. | Kendrick Lamar | 2017-04-14 | 208733 | 71 | 0.748 | 0.20400 | 0.705 | 0.000000 | 0.2460 | -4.547 | 0.485 | 0.483 | 189.891 | 4 |
4 | FEEL. | DAMN. | Kendrick Lamar | 2017-04-14 | 214826 | 64 | 0.746 | 0.13700 | 0.798 | 0.000000 | 0.1390 | -8.382 | 0.349 | 0.553 | 109.968 | 4 |
I will be using the LyricsGenius package, which provides a simple interface to the song, artist, and lyrics data stored on Genius.
pip install git+https://github.com/johnwmillr/LyricsGenius.git
Collecting git+https://github.com/johnwmillr/LyricsGenius.git Cloning https://github.com/johnwmillr/LyricsGenius.git to /tmp/pip-req-build-_w3rl_eu Running command git clone -q https://github.com/johnwmillr/LyricsGenius.git /tmp/pip-req-build-_w3rl_eu Requirement already satisfied (use --upgrade to upgrade): lyricsgenius==1.8.2 from git+https://github.com/johnwmillr/LyricsGenius.git in /opt/conda/lib/python3.7/site-packages Requirement already satisfied: beautifulsoup4==4.6.0 in /opt/conda/lib/python3.7/site-packages (from lyricsgenius==1.8.2) (4.6.0) Requirement already satisfied: requests>=2.20.0 in /opt/conda/lib/python3.7/site-packages (from lyricsgenius==1.8.2) (2.23.0) Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->lyricsgenius==1.8.2) (3.0.4) Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->lyricsgenius==1.8.2) (1.25.7) Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->lyricsgenius==1.8.2) (2019.11.28) Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->lyricsgenius==1.8.2) (2.9) Building wheels for collected packages: lyricsgenius Building wheel for lyricsgenius (setup.py) ... done Created wheel for lyricsgenius: filename=lyricsgenius-1.8.2-py3-none-any.whl size=15038 sha256=8cd1776e3a526e58217a9324415b1e95849fa37880a194dc8f2d22f4fd7ab40f Stored in directory: /tmp/pip-ephem-wheel-cache-60w3__2c/wheels/12/d5/2b/6b771ebb067bceb8816ec5eef0dd0d36bf069b18f03ac8ca20 Successfully built lyricsgenius Note: you may need to restart the kernel to use updated packages.
As before, I set the credentials obtained from my Genius app.
import lyricsgenius
# Genius API
genius = lyricsgenius.Genius("YOUR_GENIUS_ACCESS_TOKEN")  # replace with your own Genius client access token
genius.remove_section_headers = True
My plan is to iterate through my dataframe looking up every song name and building a list with the lyrics. I decided against applying a function to the dataframe since the Genius API is very unreliable (it likes to throw errors at random: at times it works, at times it just does not!). Although this approach is lengthier, it is much more reliable. Before doing any of this, I need to do some name standardization between my dataframe and the Genius API (titles that include the word "featuring" are problematic). Please forgive the profanity in the following cell.
df_kendrick.loc[df_kendrick['name'] == "Swimming Pools (Drank) - Extended Version", 'name'] = "Swimming Pools (Drank)"
df_kendrick.loc[df_kendrick['name'] == "F*ck Your Ethnicity", 'name'] = "Fuck Your Ethnicity"
df_kendrick.loc[df_kendrick['name'] == "LOYALTY. FEAT. RIHANNA.", 'name'] = "LOYALTY."
df_kendrick.loc[df_kendrick['name'] == "LOVE. FEAT. ZACARI.", 'name'] = "LOVE."
df_kendrick.loc[df_kendrick['name'] == "XXX. FEAT. U2.", 'name'] = "XXX."
We are good to go! Let's see how the loop performs.
lyrics = []
for name in df_kendrick['name']:
    lyrics.append(genius.search_song(name, "Kendrick Lamar"))
Searching for "BLOOD." by Kendrick Lamar... Done. Searching for "DNA." by Kendrick Lamar... Done. Searching for "YAH." by Kendrick Lamar... Done. Searching for "ELEMENT." by Kendrick Lamar... Done. Searching for "FEEL." by Kendrick Lamar... Done. Searching for "LOYALTY." by Kendrick Lamar... Done. Searching for "PRIDE." by Kendrick Lamar... Done. Searching for "HUMBLE." by Kendrick Lamar... Done. Searching for "LUST." by Kendrick Lamar... Done. Searching for "LOVE." by Kendrick Lamar... Done. Searching for "XXX." by Kendrick Lamar... Done. Searching for "FEAR." by Kendrick Lamar... Done. Searching for "GOD." by Kendrick Lamar... Done. Searching for "DUCKWORTH." by Kendrick Lamar... Done. Searching for "untitled 01 | 08.19.2014." by Kendrick Lamar... Done. Searching for "untitled 02 | 06.23.2014." by Kendrick Lamar... Done. Searching for "untitled 03 | 05.28.2013." by Kendrick Lamar... Done. Searching for "untitled 04 | 08.14.2014." by Kendrick Lamar... Done. Searching for "untitled 05 | 09.21.2014." by Kendrick Lamar... Done. Searching for "untitled 06 | 06.30.2014." by Kendrick Lamar... Done. Searching for "untitled 07 | 2014 - 2016" by Kendrick Lamar... Done. Searching for "untitled 08 | 09.06.2014." by Kendrick Lamar... Done. Searching for "Wesley's Theory" by Kendrick Lamar... Done. Searching for "For Free? - Interlude" by Kendrick Lamar... Done. Searching for "King Kunta" by Kendrick Lamar... Done. Searching for "Institutionalized" by Kendrick Lamar... Done. Searching for "These Walls" by Kendrick Lamar... Done. Searching for "u" by Kendrick Lamar... Done. Searching for "Alright" by Kendrick Lamar... Done. Searching for "For Sale? - Interlude" by Kendrick Lamar... Done. Searching for "Momma" by Kendrick Lamar... Done. Searching for "Hood Politics" by Kendrick Lamar... Done. Searching for "How Much A Dollar Cost" by Kendrick Lamar... Done. Searching for "Complexion (A Zulu Love)" by Kendrick Lamar... Done. 
Searching for "The Blacker The Berry" by Kendrick Lamar... Done. Searching for "You Ain't Gotta Lie (Momma Said)" by Kendrick Lamar... Done. Searching for "i" by Kendrick Lamar... Done. Searching for "Mortal Man" by Kendrick Lamar... Done. Searching for "Sherane a.k.a Master Splinter’s Daughter" by Kendrick Lamar... Done. Searching for "Bitch, Don’t Kill My Vibe" by Kendrick Lamar... Done. Searching for "Backseat Freestyle" by Kendrick Lamar... Done. Searching for "The Art of Peer Pressure" by Kendrick Lamar... Done. Searching for "Money Trees" by Kendrick Lamar... Done. Searching for "Poetic Justice" by Kendrick Lamar... Done. Searching for "good kid" by Kendrick Lamar... Done. Searching for "m.A.A.d city" by Kendrick Lamar... Done. Searching for "Swimming Pools (Drank)" by Kendrick Lamar... Done. Searching for "Sing About Me, I'm Dying Of Thirst" by Kendrick Lamar... Done. Searching for "Real" by Kendrick Lamar... Done. Searching for "Compton" by Kendrick Lamar... Done. Searching for "The Recipe - Bonus Track" by Kendrick Lamar... Done. Searching for "Black Boy Fly - Bonus Track" by Kendrick Lamar... Done. Searching for "Now Or Never - Bonus Track" by Kendrick Lamar... Done. Searching for "The Recipe (Black Hippy Remix) - Bonus Track" by Kendrick Lamar... Done. Searching for "Bitch, Don’t Kill My Vibe - Remix" by Kendrick Lamar... Done. Searching for "Fuck Your Ethnicity" by Kendrick Lamar... Done. Searching for "Hol' Up" by Kendrick Lamar... Done. Searching for "A.D.H.D" by Kendrick Lamar... Done. Searching for "No Make-Up (Her Vice) (feat. Colin Munroe)" by Kendrick Lamar... Done. Searching for "Tammy's Song (Her Evils)" by Kendrick Lamar... Done. Searching for "Chapter Six" by Kendrick Lamar... Done. Searching for "Ronald Reagan Era" by Kendrick Lamar... Done. Searching for "Poe Mans Dreams (His Vice) (feat. GLC)" by Kendrick Lamar... Done. Searching for "Chapter Ten" by Kendrick Lamar... Done. Searching for "Keisha's Song (Her Pain) (feat. 
Ashtro Bot)" by Kendrick Lamar... Done. Searching for "Rigamortus" by Kendrick Lamar... Done. Searching for "Kush & Corinthians (feat. BJ The Chicago Kid)" by Kendrick Lamar... Done. Searching for "Blow My High (Members Only)" by Kendrick Lamar... Done. Searching for "Ab-Souls Outro (feat. Ab-Soul)" by Kendrick Lamar... Done. Searching for "HiiiPower" by Kendrick Lamar... Done. Searching for "Growing Apart (To Get Closer)" by Kendrick Lamar... Done. Searching for "Ignorance Is Bliss" by Kendrick Lamar... Done. Searching for "P&P 1.5" by Kendrick Lamar... Done. Searching for "Alien Girl (Today W/ Her)" by Kendrick Lamar... Done. Searching for "Opposites Attract (Tomorrow W/O Her)" by Kendrick Lamar... Done. Searching for "Michael Jordan" by Kendrick Lamar... Done. Searching for "R.O.T.C (Interlude)" by Kendrick Lamar... Done. Searching for "Barbed Wire" by Kendrick Lamar... Done. Searching for "Average Joe" by Kendrick Lamar... Done. Searching for "H.O.C" by Kendrick Lamar... Done. Searching for "Cut You Off (To Grow Closer)" by Kendrick Lamar... Done. Searching for "She Needs Me (Remix)" by Kendrick Lamar... Done.
A quick inspection of the list of lyrics reveals that we got a few None values. Let's fix those:
while None in lyrics:
    none_index = [i for i in range(len(lyrics)) if lyrics[i] is None]
    for i in none_index:
        lyrics[i] = genius.search_song(df_kendrick['name'][i], "Kendrick Lamar")
We are ready to extract the lyrics and append them into the dataframe.
lyrics_text = []
for lyric in lyrics:
    lyrics_text.append(lyric.lyrics.lower())
df_kendrick['lyrics'] = lyrics_text
df_kendrick.head()
name | album | artist | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | valence | tempo | time_signature | lyrics | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BLOOD. | DAMN. | Kendrick Lamar | 2017-04-14 | 118066 | 61 | 0.357 | 0.14200 | 0.238 | 0.085900 | 0.5500 | -16.780 | 0.265 | 0.494 | 156.907 | 4 | is it wickedness?\nis it weakness?\nyou decide... |
1 | DNA. | DAMN. | Kendrick Lamar | 2017-04-14 | 185946 | 78 | 0.638 | 0.00454 | 0.523 | 0.000000 | 0.0842 | -6.664 | 0.357 | 0.422 | 139.913 | 4 | i got, i got, i got, i got—\nloyalty, got roya... |
2 | YAH. | DAMN. | Kendrick Lamar | 2017-04-14 | 160293 | 64 | 0.670 | 0.57600 | 0.700 | 0.000005 | 0.2260 | -7.893 | 0.196 | 0.648 | 69.986 | 4 | new shit, new kung fu kenny\n\ni got so many t... |
3 | ELEMENT. | DAMN. | Kendrick Lamar | 2017-04-14 | 208733 | 71 | 0.748 | 0.20400 | 0.705 | 0.000000 | 0.2460 | -4.547 | 0.485 | 0.483 | 189.891 | 4 | new kung fu kenny\nain't nobody prayin' for me... |
4 | FEEL. | DAMN. | Kendrick Lamar | 2017-04-14 | 214826 | 64 | 0.746 | 0.13700 | 0.798 | 0.000000 | 0.1390 | -8.382 | 0.349 | 0.553 | 109.968 | 4 | ain't nobody prayin' for me\n(ain't nobody pra... |
In order to quantify lyrical sentiment I will use the NRC Emotion Lexicon. This dataset assigns emotions and sentiments to English words. For example, the word abandon is mapped to the emotions of fear and sadness. My strategy is to compute the proportion of words associated with each sentiment. I recognize that this is a very limited approach to lyric analysis, but it feels like a good starting point given the current constraints.
Let's create a function that, given a song's lyrics and an emotion, computes the number of words in the song associated with that emotion.
nrc = pd.read_table("NRC-Emotion-Lexicon-Wordlevel-v0.92.txt", header=None, names=['word','emotion','dummy'])
def count_emotion(song_lyrics, emotion):
    nrc_emotion = nrc[nrc['emotion'] == emotion]
    nrc_emotion = nrc_emotion[nrc_emotion['dummy'] == 1]
    nrc_emotion = nrc_emotion.reset_index(drop=True)
    sum_emotion = 0
    for word in nrc_emotion['word']:
        sum_emotion += song_lyrics.count(word)
    return sum_emotion
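One caveat worth flagging: `str.count` matches substrings, so "art" would also be counted inside "heart". I keep the simpler substring count above, but a whole-word variant using regex word boundaries is a possible refinement, sketched here:

```python
import re

def count_word(text: str, word: str) -> int:
    # Count whole-word occurrences only, so "art" does not match inside "heart".
    return len(re.findall(r"\b" + re.escape(word) + r"\b", text))

lyrics = "my heart, my art, the art of peer pressure"
substring_hits = lyrics.count("art")   # over-counts: also matches inside "heart"
word_hits = count_word(lyrics, "art")  # counts only the standalone word
```

Since the same counting rule is applied to every emotion before taking proportions, the substring bias partially cancels out, which is why I do not consider it fatal for this analysis.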
Next, we create a function that counts the words associated with each emotion in a song and converts the counts into proportions.
def emotion_prop(song_lyrics):
    emotion_proportions = []
    for emotion in nrc['emotion'].unique():
        emotion_proportions.append(count_emotion(song_lyrics, emotion))
    array = np.array(emotion_proportions)
    return array / array.sum()
Finally, we append these proportions for each song in the dataframe:
for i in range(len(nrc['emotion'].unique())):
    df_kendrick[nrc['emotion'].unique()[i] + "_index"] = df_kendrick.apply(lambda row: emotion_prop(row['lyrics'])[i], axis=1)
df_kendrick.head()
As desired. We are ready to begin visualizing Kendrick's discography!
First I wanted to see which lyrical and musical features dominate, on average, in each Kendrick album. To do so, I decided to produce an interactive radar plot with all albums overlaid on top of each other. This plot might look messy at first, but one can select (or deselect) each album by clicking on its legend entry or trace. In this way, we get a clear yet compact visualization.
lyrical_variables = nrc['emotion'].unique()+"_index"
musical_variables = ["danceability", "acousticness", "energy", "instrumentalness", "liveness","speechiness", "valence"]
def radar_discography_plot(df, x, musical, export=False):
    artist = df['artist'][0]
    df = df[::-1]
    df = df.groupby('album', as_index=False, sort=False).mean()
    df = df.drop(['length', 'popularity', 'loudness', 'tempo', 'time_signature'], axis=1)
    df_long = pd.melt(df, id_vars='album', value_vars=df.columns.values[1:])
    if musical == True:
        fig = px.line_polar(df_long, r="value", theta="variable", color="album", line_close=True, range_r=[0, 1],
                            color_discrete_sequence=px.colors.qualitative.Bold)
        fig.update_layout(title={'text': "Musical Features of " + artist + "'s Discography", 'y': 0.98, 'x': x, 'xanchor': 'center', 'yanchor': 'top'})
    else:
        fig = px.line_polar(df_long, r="value", theta="variable", color="album", line_close=True, range_r=[0, 0.3],
                            color_discrete_sequence=px.colors.qualitative.Bold)
        fig.update_layout(title={'text': "Lyrical Features of " + artist + "'s Discography", 'y': 0.98, 'x': x, 'xanchor': 'center', 'yanchor': 'top'})
    fig.show()
    if export == True:
        fig.write_html(artist + ".html")
First let's visualize the musical features of each of his albums.
radar_discography_plot(df_kendrick.drop(lyrical_variables,axis=1).copy(),0.45, musical = True)
radar_discography_plot(df_kendrick.drop(musical_variables,axis=1).copy(),0.45, musical = False)
Still, while this visualization is effective at comparing which features dominate within a particular album, it is less effective at comparing a single feature across his discography. For that purpose it is better to create a bar chart:
def bar_discography_plot(df, x, musical, export=False):
    artist = df['artist'][0]
    df = df[::-1]
    df = df.groupby('album', as_index=False, sort=False).mean()
    df = df.drop(['length', 'popularity', 'loudness', 'tempo', 'time_signature'], axis=1)
    df_long = pd.melt(df, id_vars='album', value_vars=df.columns.values[1:])
    if musical == True:
        fig = px.bar(df_long, x="variable", y="value", color='album', barmode='group', height=400, range_y=[0, 1])
        fig.update_layout(title={'text': "Musical Features of " + artist + "'s Discography", 'y': 0.95, 'x': x, 'xanchor': 'center', 'yanchor': 'top'})
    else:
        fig = px.bar(df_long, x="variable", y="value", color='album', barmode='group', height=400, range_y=[0, 0.3])
        fig.update_layout(title={'text': "Lyrical Features of " + artist + "'s Discography", 'y': 0.95, 'x': x, 'xanchor': 'center', 'yanchor': 'top'})
    fig.show()
    if export == True:
        fig.write_html(artist + ".html")
bar_discography_plot(df_kendrick.drop(lyrical_variables,axis=1).copy(),x=0.45,musical=True)
bar_discography_plot(df_kendrick.drop(musical_variables,axis=1).copy(),0.45, musical = False)
Overall, I would argue that these visualizations adequately represent Kendrick's music. For example, untitled unmastered. stands out for its live instrumentation and somewhat subdued tone. On the other hand, Kendrick is known for his intertwined storytelling: To Pimp a Butterfly references good kid, m.A.A.d city, while DAMN. references both of its predecessors. It is to be expected, then, that his lyrics have fairly similar features across his discography. One final point of interest is the predominance of both negative and positive lyrics. While seemingly contradictory, duality has been a staple of Kendrick's music. Consider u and i from To Pimp a Butterfly: members of the Genius community have noted the duality between the two tracks: "u acts as a complete contrast to its lead single i, an anthem of peace, positivity, and prosperity starting with self-love".
I hope this first part of the project incentivizes the reader to explore the discography of their favorite artist!
While it was fun to look at Kendrick's discography, such data will not allow me to answer the main question that drives this research: what are the determinants of popularity in music? This is because the Spotify API does not expose the number of times a song has been streamed. The closest feature would be a track's popularity; however, there are two issues: the algorithm behind this variable is not known, and it appears to be a better indicator of current popularity (i.e. what is trending) than of all-time success. Thus I decided to use a dataset from Wikipedia that lists the top 100 most-streamed songs of all time. I will once again use similar methods to obtain both musical and lyrical features for each song.
df_wiki = pd.read_html("https://en.wikipedia.org/wiki/List_of_most-streamed_songs_on_Spotify#100_most-streamed_songs")[0]
df_wiki = df_wiki.drop(100)
df_wiki = df_wiki.drop('Rank',axis=1)
df_wiki.columns = ['name', 'artist','album','streams','release_date']
df_wiki
 | name | artist | album | streams (millions) | release_date
---|---|---|---|---|---
0 | "Shape of You" | Ed Sheeran | ÷ | 2476 | 6 January 2017 |
1 | "Rockstar" | Post Malone featuring 21 Savage | Beerbongs & Bentleys | 1882 | 15 September 2017 |
2 | "One Dance" | Drake featuring Wizkid and Kyla | Views | 1840 | 5 April 2016 |
3 | "Closer" | The Chainsmokers featuring Halsey | Collage | 1757 | 29 July 2016 |
4 | "Thinking Out Loud" | Ed Sheeran | × | 1521 | 20 June 2014 |
... | ... | ... | ... | ... | ... |
95 | "Moonlight" | XXXTentacion | ? | 951 | 14 August 2018 |
96 | "Work" | Rihanna featuring Drake | Anti | 949 | 27 January 2016 |
97 | "Lovely" | Billie Eilish and Khalid | 13 Reasons Why: Season 2 (A Netflix Original S... | 952 | 19 April 2018 |
98 | "There's Nothing Holdin' Me Back" | Shawn Mendes | Illuminate | 945 | 20 April 2017 |
99 | "Me Rehúso" | Danny Ocean | 54+1 | 938 | 16 September 2016 |
100 rows × 5 columns
First, I standardize the data so that we can look up each track's musical features on Spotify. I write a few helper functions for that task.
def fix_name(string):
    # strip the quotation marks around the track titles
    return string.replace('"', '')

def fix_featuring(string):
    # keep only the main artist, dropping ' featuring ...'
    sep = ' featuring'
    return string.split(sep, 1)[0]

def fix_and(string):
    # keep only the first artist, dropping ' and ...'
    sep = ' and'
    return string.split(sep, 1)[0]

def fix_apostrophe(string):
    # remove apostrophes, which trip up the Spotify search query
    return string.replace("'", "")
We are ready to standardize the data.
df_wiki['name'] = df_wiki['name'].apply(fix_name)
df_wiki['name'] = df_wiki['name'].apply(fix_apostrophe)
df_wiki['artist'] = df_wiki['artist'].apply(fix_featuring)
df_wiki['artist'] = df_wiki['artist'].apply(fix_and)
df_wiki.at[78, 'name'] = "I Don't Wanna Live Forever"
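As a quick sanity check, here is how the cleaning helpers behave on sample inputs (a minimal illustration; the redefinitions below simply mirror the functions above, and the last example shows the caveat that splitting on ' and' can over-truncate, which is one reason manual fixes like the one above are needed):

```python
# Mirror of the cleaning helpers defined above, for a self-contained check.
def fix_name(string):
    return string.replace('"', '')

def fix_featuring(string):
    return string.split(' featuring', 1)[0]

def fix_and(string):
    return string.split(' and', 1)[0]

print(fix_name('"Shape of You"'))                        # Shape of You
print(fix_featuring('Post Malone featuring 21 Savage'))  # Post Malone
print(fix_and('Billie Eilish and Khalid'))               # Billie Eilish
# Caveat: the split also truncates artist names that genuinely contain ' and':
print(fix_and('Florence and the Machine'))               # Florence
```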
Now, we obtain the Spotify ID for every track.
ids = []
for i in range(len(df_wiki['artist'])):
    try:
        artist = df_wiki['artist'][i]
        track = df_wiki['name'][i]
        result = sp.search(q='artist:' + artist + ' track:' + track, type='track')
        ids.append(result['tracks']['items'][0]['id'])
    except IndexError:
        # no search results for this row; print its index so we can fix it by hand
        print(i)
Finally, we create the dataframe.
features = get_tracks_features(ids)
tracks_features_to_csv(features, "top100")
df_top = pd.read_csv("top100.csv",index_col=[0])
df_top['streams'] = df_wiki['streams']
df_top.head()
 | name | album | artist | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | valence | tempo | time_signature | streams
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Shape of You | ÷ (Deluxe) | Ed Sheeran | 2017-03-03 | 233712 | 86 | 0.825 | 0.58100 | 0.652 | 0.00000 | 0.0931 | -3.183 | 0.0802 | 0.931 | 95.977 | 4 | 2476 |
1 | rockstar (feat. 21 Savage) | beerbongs & bentleys | Post Malone | 2018-04-27 | 218146 | 88 | 0.585 | 0.12400 | 0.520 | 0.00007 | 0.1310 | -6.136 | 0.0712 | 0.129 | 159.801 | 4 | 1882 |
2 | One Dance | Views | Drake | 2016-05-06 | 173986 | 82 | 0.792 | 0.00776 | 0.625 | 0.00188 | 0.3290 | -5.609 | 0.0536 | 0.370 | 103.967 | 4 | 1840 |
3 | Closer | Closer | The Chainsmokers | 2016-07-29 | 244960 | 85 | 0.748 | 0.41400 | 0.524 | 0.00000 | 0.1110 | -5.599 | 0.0338 | 0.661 | 95.010 | 4 | 1757 |
4 | Thinking out Loud | x (Deluxe Edition) | Ed Sheeran | 2014-06-21 | 281560 | 83 | 0.781 | 0.47400 | 0.445 | 0.00000 | 0.1840 | -6.061 | 0.0295 | 0.591 | 78.998 | 4 | 1521 |
We have the musical features along with the number of streams. Now, let's add the lyrics! Before doing so, however, it is better to standardize the naming of songs between the current dataframe and the Genius API, since the wrapper for the latter is unreliable. This way, we minimize the chance of error.
def fix_featuring2(string):
    # drop ' (feat. ...)' from Spotify track names
    sep = ' (feat'
    return string.split(sep, 1)[0]

def fix_with(string):
    # drop ' (with ...)' from Spotify track names
    sep = ' (with'
    return string.split(sep, 1)[0]
df_top['name'] = df_top['name'].apply(fix_featuring2)
df_top['name'] = df_top['name'].apply(fix_with)
df_top.at[6, 'name'] = "Sunflower"
df_top.at[6, 'artist'] = "Post Malone"
df_top.at[20, 'artist'] = "Justin Bieber"
df_top.at[35, 'name'] = "Bohemian Rhapsody"
df_top.at[35, 'artist'] = "Queen"
df_top.at[49, 'artist'] = "Justin Bieber"
df_top.at[54, 'name'] = "Can't Stop the Feeling"
df_top.at[54, 'artist'] = "Justin Timberlake"
df_top.at[78, 'name'] = "I Don't Wanna Live Forever"
df_top.at[86, 'artist'] = "J Balvin"
# some additional fixes
df_top.at[65, 'release_date'] = '2015-03-10'
df_top.at[20, 'release_date'] = '2015-10-22'
lyrics = []
for i in range(len(df_top['name'])):
    lyrics.append(genius.search_song(df_top['name'][i], df_top['artist'][i]))
Searching for "Shape of You" by Ed Sheeran... Done. Searching for "rockstar" by Post Malone... Done. Searching for "One Dance" by Drake... Done. Searching for "Closer" by The Chainsmokers... Done. Searching for "Thinking out Loud" by Ed Sheeran... Done. Searching for "God's Plan" by Drake... Done. Searching for "Sunflower" by Post Malone... Done. Searching for "Dance Monkey" by Tones And I... Done. Searching for "Havana" by Camila Cabello... Done. Searching for "Perfect" by Ed Sheeran... Done. Searching for "Say You Won't Let Go" by James Arthur... Done. Searching for "Love Yourself" by Justin Bieber... Done. Searching for "Señorita" by Shawn Mendes... Done. Searching for "Photograph" by Ed Sheeran... Done. Searching for "Lean On" by Major Lazer... Done. Searching for "Despacito - Remix" by Luis Fonsi... Done. Searching for "Believer" by Imagine Dragons... Done. Searching for "Starboy" by The Weeknd... Done. Searching for "New Rules" by Dua Lipa... Done. Searching for "bad guy" by Billie Eilish... Done. Searching for "Sorry" by Justin Bieber... Done. Searching for "Don't Let Me Down" by The Chainsmokers... Done. Searching for "Something Just Like This" by The Chainsmokers... Done. Searching for "Thunder" by Imagine Dragons... Done. Searching for "SAD!" by XXXTENTACION... Done. Searching for "I Took A Pill In Ibiza - Seeb Remix" by Mike Posner... Done. Searching for "XO Tour Llif3" by Lil Uzi Vert... Done. Searching for "HUMBLE." by Kendrick Lamar... Done. Searching for "Let Me Love You" by DJ Snake... Done. Searching for "Faded" by Alan Walker... Done. Searching for "Better Now" by Post Malone... Done. Searching for "Lucid Dreams" by Juice WRLD... Done. Searching for "Stressed Out" by Twenty One Pilots... Done. Searching for "Congratulations" by Post Malone... Done. Searching for "All of Me" by John Legend... Done. Searching for "Bohemian Rhapsody" by Queen... Done. Searching for "Someone You Loved" by Lewis Capaldi... Done. Searching for "Happier" by Marshmello... 
Done. Searching for "Treat You Better" by Shawn Mendes... Done. Searching for "Take Me to Church" by Hozier... Done. Searching for "Unforgettable" by French Montana... Done. Searching for "Cheap Thrills" by Sia... Done. Searching for "Uptown Funk" by Mark Ronson... Done. Searching for "Stay With Me" by Sam Smith... Done. Searching for "7 rings" by Ariana Grande... Done. Searching for "Let Her Go" by Passenger... Done. Searching for "Shallow" by Lady Gaga... Done. Searching for "Despacito - Remix" by Luis Fonsi... Done. Searching for "Cold Water" by Major Lazer... Done. Searching for "What Do You Mean?" by Justin Bieber... Done. Searching for "Jocelyn Flores" by XXXTENTACION... Done. Searching for "7 Years" by Lukas Graham... Done. Searching for "Girls Like You" by Maroon 5... Done. Searching for "thank u, next" by Ariana Grande... Done. Searching for "Can't Stop the Feeling" by Justin Timberlake... Done. Searching for "Too Good At Goodbyes" by Sam Smith... Done. Searching for "Stitches" by Shawn Mendes... Done. Searching for "That's What I Like" by Bruno Mars... Done. Searching for "Cheerleader - Felix Jaehn Remix Radio Edit" by OMI... Done. Searching for "Wake Me Up" by Avicii... Done. Searching for "I Like It" by Cardi B... Done. Searching for "Psycho" by Post Malone... Done. Searching for "This Is What You Came For" by Calvin Harris... Done. Searching for "Heathens" by Twenty One Pilots... Done. Searching for "Can't Feel My Face" by The Weeknd... Done. Searching for "See You Again" by Wiz Khalifa... Done. Searching for "Hello" by Adele... Done. Searching for "Radioactive" by Imagine Dragons... Done. Searching for "I Fall Apart" by Post Malone... Done. Searching for "Work from Home" by Fifth Harmony... Done. Searching for "Attention" by Charlie Puth... Done. Searching for "Counting Stars" by OneRepublic... Done. Searching for "In My Feelings" by Drake... Done. Searching for "Without Me" by Halsey... Done. Searching for "I Don't Care" by Ed Sheeran... Done. 
Searching for "Can't Hold Us - feat. Ray Dalton" by Macklemore & Ryan Lewis... Done. Searching for "One Kiss" by Calvin Harris... Done. Searching for "SICKO MODE" by Travis Scott... Done. Searching for "I Don't Wanna Live Forever" by Taylor Swift... Done. Searching for "Chandelier" by Sia... Done. Searching for "Rockabye" by Clean Bandit... Done. Searching for "Taki Taki" by DJ Snake... Done. Searching for "The Hills" by The Weeknd... Done. Searching for "Eastside" by benny blanco... Done. Searching for "Sugar" by Maroon 5... Done. Searching for "We Don't Talk Anymore" by Charlie Puth... Done. Searching for "Mi Gente" by J Balvin... Done. Searching for "IDGAF" by Dua Lipa... Done. Searching for "I'm the One" by DJ Khaled... Done. Searching for "I Like Me Better" by Lauv... Done. Searching for "Demons" by Imagine Dragons... Done. Searching for "It Ain’t Me" by Kygo... Done. Searching for "Riptide" by Vance Joy... Done. Searching for "Ride" by Twenty One Pilots... Done. Searching for "Old Town Road - Remix" by Lil Nas X... Done. Searching for "Moonlight" by XXXTENTACION... Done. Searching for "Work" by Rihanna... Done. Searching for "lovely" by Billie Eilish... Done. Searching for "There's Nothing Holdin' Me Back" by Shawn Mendes... Done. Searching for "Me Rehúso" by Danny Ocean... Done.
There are still some errors despite the standardization. We solve these by detecting the null values and replacing them with the correct values.
while None in lyrics:
    none_index = [i for i in range(len(lyrics)) if lyrics[i] is None]
    for i in none_index:
        lyrics[i] = genius.search_song(df_top['name'][i], df_top['artist'][i])
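Note that the `while None in lyrics` loop never terminates if some song simply cannot be found on Genius. A bounded-retry variant avoids that risk; this is a sketch, where `flaky_fetch` is a hypothetical stand-in for `genius.search_song`:

```python
def fill_missing(items, fetch, max_attempts=3):
    """Retry fetching None entries in place, up to max_attempts passes.

    `fetch` takes an index and returns a result or None. Any entry still
    None after the final pass is left as None instead of looping forever.
    """
    for _ in range(max_attempts):
        missing = [i for i, item in enumerate(items) if item is None]
        if not missing:
            break
        for i in missing:
            items[i] = fetch(i)
    return items

# Toy fetcher: index 1 succeeds on its second attempt, index 3 never succeeds.
calls = {}
def flaky_fetch(i):
    calls[i] = calls.get(i, 0) + 1
    if i == 3:
        return None
    if i == 1 and calls[i] < 2:
        return None
    return "lyrics-%d" % i

result = fill_missing(["a", None, "c", None], flaky_fetch)
# result[1] gets filled on the retry pass; result[3] stays None instead of hanging
```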
Next, we append the lyrics to the dataframe and compute their emotional features.
lyrics_text = []
for lyric in lyrics:
    lyrics_text.append(lyric.lyrics.lower())
df_top['lyrics'] = lyrics_text
Finally, we use the NRC Lexicon to obtain the lyrical features of each song:
for i in range(len(nrc['emotion'].unique())):
    df_top[nrc['emotion'].unique()[i] + "_index"] = df_top.apply(lambda row: emotion_prop(row['lyrics'])[i], axis=1)
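For intuition, `emotion_prop` (defined in the first part of this project, not shown here) computes, for each NRC emotion, roughly the share of lyric tokens appearing on that emotion's word list. Below is a self-contained toy sketch of that kind of computation; the two-emotion lexicon and the exact normalization are assumptions for illustration, not the NRC data:

```python
# Toy illustration of per-emotion word counting; this tiny lexicon is
# made up for demonstration and is NOT the NRC Emotion Lexicon.
TOY_LEXICON = {
    "joy": {"love", "happy", "sunshine"},
    "sadness": {"cry", "alone", "rain"},
}

def emotion_proportions(lyrics, lexicon=TOY_LEXICON):
    """Return the share of tokens matching each emotion's word list."""
    tokens = lyrics.lower().split()
    if not tokens:
        return {emotion: 0.0 for emotion in lexicon}
    return {
        emotion: sum(token in words for token in tokens) / len(tokens)
        for emotion, words in lexicon.items()
    }

props = emotion_proportions("Happy love happy rain")
# 3 of the 4 tokens are joy words and 1 of the 4 is a sadness word
```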
Next, we convert the release date into a proper date type and compute the number of days since release as a measure of each song's longevity. First, we create a few helper functions.
def string_to_date(string):
    return date(*map(int, string.split('-')))

df_top['release_date'] = df_top['release_date'].apply(string_to_date)

def days_since_release(release):
    # the parameter is named 'release' to avoid shadowing the date class
    return (date.today() - release).days

df_top['days_since_release'] = df_top['release_date'].apply(days_since_release)
Finally, we apply some transformations to a few variables in order to reduce the condition number of the linear regression.
df_top['streams'] = df_top['streams'].apply(lambda x: float(x))
df_top['loudness'] = (df_top['loudness']-df_top['loudness'].min())/(df_top['loudness'].max()-df_top['loudness'].min())
df_top['length'] = (df_top['length']-df_top['length'].min())/(df_top['length'].max()-df_top['length'].min())
df_top['popularity'] = (df_top['popularity']-df_top['popularity'].min())/(df_top['popularity'].max()-df_top['popularity'].min())
df_top['days_since_release'] = (df_top['days_since_release']-df_top['days_since_release'].min())/(df_top['days_since_release'].max()-df_top['days_since_release'].min())
df_top['tempo'] = (df_top['tempo']-df_top['tempo'].min())/(df_top['tempo'].max()-df_top['tempo'].min())
df_top['streams_scaled'] = (df_top['streams']-df_top['streams'].min())/(df_top['streams'].max()-df_top['streams'].min())
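The six scalings above repeat the same min-max formula, so a small helper would keep them consistent. This is an equivalent refactor with the same arithmetic, not a change in method:

```python
import pandas as pd

def min_max(series):
    """Scale a pandas Series linearly onto [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())

# equivalent to the per-column formulas above, e.g.:
# for col in ['loudness', 'length', 'popularity', 'days_since_release', 'tempo']:
#     df_top[col] = min_max(df_top[col])
demo = min_max(pd.Series([2.0, 4.0, 6.0]))
# demo holds [0.0, 0.5, 1.0]
```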
# we drop non-relevant variables, and anticipation_index due to multicollinearity
X = df_top.drop(["streams_scaled", "name", "album", "artist", "release_date",
                 "popularity", "time_signature", "streams", "lyrics",
                 "anticipation_index"], axis=1).copy()
X2 = sm.add_constant(X)
y = df_top['streams_scaled']
lr_model = linear_model.LinearRegression()
lr_model.fit(X, y)
print(sm.OLS(y, X2).fit().summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:         streams_scaled   R-squared:                       0.239
Model:                            OLS   Adj. R-squared:                  0.047
Method:                 Least Squares   F-statistic:                     1.244
Date:                Mon, 20 Apr 2020   Prob (F-statistic):              0.243
Time:                        18:04:44   Log-Likelihood:                 60.175
No. Observations:                 100   AIC:                            -78.35
Df Residuals:                      79   BIC:                            -23.64
Df Model:                          20
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 -0.1094      0.531     -0.206      0.837      -1.166       0.947
length                -0.0853      0.128     -0.664      0.508      -0.341       0.170
danceability           0.0269      0.152      0.176      0.860      -0.276       0.330
acousticness          -0.0050      0.085     -0.059      0.953      -0.174       0.164
energy                -0.4499      0.189     -2.382      0.020      -0.826      -0.074
instrumentalness       1.4024      0.826      1.697      0.094      -0.242       3.047
liveness               0.3604      0.172      2.099      0.039       0.019       0.702
loudness               0.2562      0.126      2.040      0.045       0.006       0.506
speechiness            0.1976      0.221      0.893      0.375      -0.243       0.638
valence                0.0854      0.102      0.840      0.403      -0.117       0.288
tempo                  0.0330      0.076      0.433      0.666      -0.119       0.185
anger_index            0.3789      0.855      0.443      0.659      -1.322       2.080
disgust_index         -0.0465      0.728     -0.064      0.949      -1.495       1.402
fear_index            -0.4628      0.705     -0.656      0.513      -1.866       0.940
joy_index              0.9893      0.770      1.284      0.203      -0.544       2.523
negative_index        -0.1660      0.618     -0.269      0.789      -1.396       1.064
positive_index         0.3695      0.589      0.627      0.533      -0.804       1.543
sadness_index          1.1241      0.686      1.639      0.105      -0.241       2.489
surprise_index        -0.0660      1.087     -0.061      0.952      -2.230       2.098
trust_index            0.3652      0.691      0.529      0.599      -1.010       1.741
days_since_release     0.1396      0.192      0.729      0.468      -0.242       0.521
==============================================================================
Omnibus:                       55.827   Durbin-Watson:                   0.543
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              227.537
Skew:                           1.857   Prob(JB):                     3.90e-50
Kurtosis:                       9.389   Cond. No.                         215.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
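Since anticipation_index was dropped for multicollinearity, a systematic check of the remaining regressors is the variance inflation factor. Here is a from-scratch sketch with numpy; the toy data is illustrative only, while in practice one would pass the columns of `X`:

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of a 2-D array.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on the remaining columns plus an intercept; values far above 1
    signal multicollinearity.
    """
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r_squared = 1 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return factors

# Toy demonstration: x2 is nearly a multiple of x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
# vifs[0] and vifs[1] are large; vifs[2] is close to 1
```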
We are ready to go! Let's visualize the relationship between the musical and lyrical features of a song and its popularity.
columns_music = (df_top.columns)[6:14]
fig = make_subplots(rows=4, cols=2,shared_yaxes=True,
vertical_spacing = 0.05,
horizontal_spacing = 0.025,
y_title = "Scaled streams" )
for i in range(4):
    for j in range(2):
        fig_trend = px.scatter(df_top, x=columns_music[2*i+j], y="streams_scaled",
                               trendline="ols", hover_name="name")
        fig.add_trace(fig_trend.data[0], row=i+1, col=j+1)
        fig.add_trace(fig_trend.data[1], row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=df_top[columns_music[2*i+j]], y=lr_model.predict(X),
                                 mode="markers", marker=dict(color='Orange'),
                                 hovertemplate="<b>Full OLS model prediction</b><br><br>" +
                                 "streams: %{y}<br>" + columns_music[2*i+j] + ": %{x}<br>"),
                      row=i+1, col=j+1)
        results = px.get_trendline_results(fig_trend)
        fig.update_xaxes(title_text=columns_music[2*i+j].capitalize(), row=i+1, col=j+1)
fig.update_layout(height=2000, width=900, title={'text': "Musical Features", 'x': 0.5}, showlegend=False)
fig.update_traces(marker_size=4)
fig.show()
Let's try to analyze lyrical features next!
columns_emotion = (df_top.columns)[18:]
fig = make_subplots(rows=5, cols=2,shared_yaxes=True,
vertical_spacing = 0.05,
horizontal_spacing = 0.025,
y_title = "Scaled streams" )
for i in range(5):
    for j in range(2):
        fig_trend = px.scatter(df_top, x=columns_emotion[2*i+j], y="streams_scaled",
                               trendline="ols", hover_name="name")
        fig.add_trace(fig_trend.data[0], row=i+1, col=j+1)
        fig.add_trace(fig_trend.data[1], row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=df_top[columns_emotion[2*i+j]], y=lr_model.predict(X),
                                 mode="markers", marker=dict(color='Orange'),
                                 hovertemplate="<b>Full OLS model prediction</b><br><br>" +
                                 "streams: %{y}<br>" + columns_emotion[2*i+j] + ": %{x}<br>"),
                      row=i+1, col=j+1)
        results = px.get_trendline_results(fig_trend)
        fig.update_xaxes(title_text=columns_emotion[2*i+j].capitalize().replace("_", " "), row=i+1, col=j+1)
fig.update_layout(height=2000, width=1000, title={'text': "Lyrical Features", 'x': 0.5}, showlegend=False)
fig.update_traces(marker_size=4)
fig.show()
The model's in-sample fit looks reasonable. Still, given the many high p-values in the OLS summary, I am worried that some variables are not bringing anything to the table. I will try to refine the model in the next section.
Let's implement a Lasso regression and compare its MSE to our previous linear model.
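For reference, scikit-learn's `Lasso` (used below) minimizes, in the library's own parameterization,

```latex
\min_{w}\; \frac{1}{2\,n_{\text{samples}}}\,\lVert y - Xw\rVert_2^2 \;+\; \alpha\,\lVert w\rVert_1
```

where the penalty strength \(\alpha\) defaults to 1.0, a detail that will matter shortly.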
lasso_model = linear_model.Lasso()
lasso_model.fit(X, y)
print("Linear Regression MSE:", metrics.mean_squared_error(y, lr_model.predict(X)))
print("Lasso Regression MSE:", metrics.mean_squared_error(y, lasso_model.predict(X)))
Linear Regression MSE: 0.017573399462309323 Lasso Regression MSE: 0.023106569591163435
Interestingly, our linear model outperforms the Lasso in-sample. This is unsurprising: OLS minimizes in-sample MSE by construction, while the Lasso's penalty constrains the fit. Let's visualize their predictions as before.
fig = make_subplots(rows=4, cols=2,shared_yaxes=True,
vertical_spacing = 0.05,
horizontal_spacing = 0.025,
y_title = "Scaled streams" )
for i in range(4):
    for j in range(2):
        fig_trend = px.scatter(df_top, x=columns_music[2*i+j], y="streams_scaled",
                               trendline="ols", hover_name="name")
        fig.add_trace(fig_trend.data[0], row=i+1, col=j+1)
        fig.add_trace(fig_trend.data[1], row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=df_top[columns_music[2*i+j]], y=lr_model.predict(X),
                                 mode="markers", marker=dict(color='Orange'),
                                 hovertemplate="<b>Full OLS model prediction</b><br><br>" +
                                 "streams: %{y}<br>" + columns_music[2*i+j] + ": %{x}<br>"),
                      row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=df_top[columns_music[2*i+j]], y=lasso_model.predict(X),
                                 marker=dict(color='Red'),
                                 hovertemplate="<b>Lasso model prediction</b><br><br>" +
                                 "streams: %{y}<br>" + columns_music[2*i+j] + ": %{x}<br>"),
                      row=i+1, col=j+1)
        results = px.get_trendline_results(fig_trend)
        fig.update_xaxes(title_text=columns_music[2*i+j].capitalize().replace("_", " "), row=i+1, col=j+1)
fig.update_layout(height=2000, width=1000, title={'text': "Musical Features", 'x': 0.5}, showlegend=False)
fig.update_traces(marker_size=4)
fig.show()
fig = make_subplots(rows=5, cols=2,shared_yaxes=True,
vertical_spacing = 0.05,
horizontal_spacing = 0.025,
y_title = "Scaled streams" )
for i in range(5):
    for j in range(2):
        fig_trend = px.scatter(df_top, x=columns_emotion[2*i+j], y="streams_scaled",
                               trendline="ols", hover_name="name")
        fig.add_trace(fig_trend.data[0], row=i+1, col=j+1)
        fig.add_trace(fig_trend.data[1], row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=df_top[columns_emotion[2*i+j]], y=lr_model.predict(X),
                                 mode="markers", marker=dict(color='Orange'),
                                 hovertemplate="<b>Full OLS model prediction</b><br><br>" +
                                 "streams: %{y}<br>" + columns_emotion[2*i+j] + ": %{x}<br>"),
                      row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=df_top[columns_emotion[2*i+j]], y=lasso_model.predict(X),
                                 marker=dict(color='Red'),
                                 hovertemplate="<b>Lasso model prediction</b><br><br>" +
                                 "streams: %{y}<br>" + columns_emotion[2*i+j] + ": %{x}<br>"),
                      row=i+1, col=j+1)
        results = px.get_trendline_results(fig_trend)
        fig.update_xaxes(title_text=columns_emotion[2*i+j].capitalize().replace("_", " "), row=i+1, col=j+1)
fig.update_layout(height=2000, width=1000, title={'text': "Lyrical Features", 'x': 0.5}, showlegend=False)
fig.update_traces(marker_size=4)
fig.show()
Something fishy is going on with the Lasso regression... its predictions form a horizontal line! Let's extract the coefficients of both models and compare them.
lasso_model = linear_model.Lasso()
lasso_model.fit(X, y)
lasso_coefs = pd.Series(dict(zip(list(X), lasso_model.coef_)))
lr_coefs = pd.Series(dict(zip(list(X), lr_model.coef_)))
coefs = pd.DataFrame(dict(lasso=lasso_coefs, linreg=lr_coefs))
print(coefs)
                    lasso    linreg
length                0.0 -0.085312
danceability          0.0  0.026892
acousticness          0.0 -0.005017
energy               -0.0 -0.449886
instrumentalness      0.0  1.402413
liveness              0.0  0.360372
loudness              0.0  0.256238
speechiness           0.0  0.197584
valence               0.0  0.085369
tempo                -0.0  0.032987
anger_index          -0.0  0.378939
disgust_index        -0.0 -0.046521
fear_index           -0.0 -0.462835
joy_index             0.0  0.989285
negative_index       -0.0 -0.165996
positive_index        0.0  0.369506
sadness_index        -0.0  1.124084
surprise_index        0.0 -0.065989
trust_index           0.0  0.365189
days_since_release   -0.0  0.139555
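Every coefficient in the lasso column is exactly zero, and the reason is the penalty scale: scikit-learn's `Lasso` defaults to `alpha=1.0`, which is enormous relative to a target min-max scaled onto [0, 1]. In the single-predictor case, the lasso solution is an explicit soft-threshold, which makes the collapse easy to see. A minimal sketch follows; tuning the penalty, e.g. with `linear_model.LassoCV`, would be the standard remedy:

```python
def soft_threshold(rho, alpha):
    """Univariate lasso coefficient: shrink rho toward zero by alpha.

    rho plays the role of the (scaled) covariance between the predictor
    and the target; the coefficient is exactly 0 whenever |rho| <= alpha.
    """
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

# With streams scaled to [0, 1], the relevant rho values sit well below 1,
# so the default alpha = 1.0 zeroes every coefficient:
print(soft_threshold(0.05, 1.0))  # 0.0
print(soft_threshold(2.0, 1.0))   # 1.0
```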
Our suspicions are confirmed: every Lasso coefficient is exactly 0. This means our Lasso regression reduces to the simplest of estimators: the sample mean of the dependent variable. Let's split the data into training and testing sets and compare MSEs to check whether our linear model still beats the Lasso out of sample.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25)

def fit_and_report_mses(mod, X_train, X_test, y_train, y_test):
    mod.fit(X_train, y_train)
    return dict(
        mse_train=metrics.mean_squared_error(y_train, mod.predict(X_train)),
        mse_test=metrics.mean_squared_error(y_test, mod.predict(X_test))
    )

print("Linear Model:", fit_and_report_mses(linear_model.LinearRegression(), X_train, X_test, y_train, y_test))
print("Lasso Regression:", fit_and_report_mses(linear_model.Lasso(), X_train, X_test, y_train, y_test))
Linear Model: {'mse_train': 0.011462616788246201, 'mse_test': 0.044234680031768454} Lasso Regression: {'mse_train': 0.014931725291319514, 'mse_test': 0.04784550215519792}
Even though the linear model performs better at the training stage, the Lasso (and hence the naive mean estimator) beats our linear model on the test set. That is a worrying result, since it means our variables have very limited predictive capability. I will try one last model: a neural network, as implemented in the lecture notes. Let's see if it can beat both other models. First, I fit it on the whole dataset.
# do not forget to scale the inputs!
nn_scaled_model = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neural_network.MLPRegressor((30, 20))
)
nn_scaled_model.fit(X, y)
print("Linear Regression MSE:", metrics.mean_squared_error(y, lr_model.predict(X)))
print("Lasso Regression MSE:", metrics.mean_squared_error(y, lasso_model.predict(X)))
print("NN Regression MSE:", metrics.mean_squared_error(y, nn_scaled_model.predict(X)))
Linear Regression MSE: 0.017573399462309323 Lasso Regression MSE: 0.023106569591163435 NN Regression MSE: 0.007307496202566639
Cool! This model seems to outperform both previous models by a wide margin, though since this MSE is measured on the data it was trained on, the gap may simply reflect overfitting. Let's visualize the predictions.
fig = make_subplots(rows=4, cols=2,shared_yaxes=True,
vertical_spacing = 0.05,
horizontal_spacing = 0.025,
y_title = "Scaled streams" )
for i in range(4):
    for j in range(2):
        fig_trend = px.scatter(df_top, x=columns_music[2*i+j], y="streams_scaled",
                               trendline="ols", hover_name="name")
        fig.add_trace(fig_trend.data[0], row=i+1, col=j+1)
        fig.add_trace(fig_trend.data[1], row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=df_top[columns_music[2*i+j]], y=lr_model.predict(X),
                                 mode="markers", marker=dict(color='Orange'),
                                 hovertemplate="<b>Full OLS model prediction</b><br><br>" +
                                 "streams: %{y}<br>" + columns_music[2*i+j] + ": %{x}<br>"),
                      row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=df_top[columns_music[2*i+j]], y=lasso_model.predict(X),
                                 marker=dict(color='Red'),
                                 hovertemplate="<b>Lasso model prediction</b><br><br>" +
                                 "streams: %{y}<br>" + columns_music[2*i+j] + ": %{x}<br>"),
                      row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=df_top[columns_music[2*i+j]], y=nn_scaled_model.predict(X),
                                 mode="markers", marker=dict(color='Purple'),
                                 hovertemplate="<b>Neural Network model prediction</b><br><br>" +
                                 "streams: %{y}<br>" + columns_music[2*i+j] + ": %{x}<br>"),
                      row=i+1, col=j+1)
        results = px.get_trendline_results(fig_trend)
        fig.update_xaxes(title_text=columns_music[2*i+j].capitalize().replace("_", " "), row=i+1, col=j+1)
fig.update_layout(height=2000, width=1000, title={'text': "Musical Features", 'x': 0.5}, showlegend=False)
fig.update_traces(marker_size=4)
fig.show()
fig = make_subplots(rows=5, cols=2,shared_yaxes=True,
vertical_spacing = 0.05,
horizontal_spacing = 0.025,
y_title = "Scaled streams" )
for i in range(5):
    for j in range(2):
        fig_trend = px.scatter(df_top, x=columns_emotion[2*i+j], y="streams_scaled",
                               trendline="ols", hover_name="name")
        fig.add_trace(fig_trend.data[0], row=i+1, col=j+1)
        fig.add_trace(fig_trend.data[1], row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=df_top[columns_emotion[2*i+j]], y=lr_model.predict(X),
                                 mode="markers", marker=dict(color='Orange'),
                                 hovertemplate="<b>Full OLS model prediction</b><br><br>" +
                                 "streams: %{y}<br>" + columns_emotion[2*i+j] + ": %{x}<br>"),
                      row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=df_top[columns_emotion[2*i+j]], y=lasso_model.predict(X),
                                 marker=dict(color='Red'),
                                 hovertemplate="<b>Lasso model prediction</b><br><br>" +
                                 "streams: %{y}<br>" + columns_emotion[2*i+j] + ": %{x}<br>"),
                      row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=df_top[columns_emotion[2*i+j]], y=nn_scaled_model.predict(X),
                                 mode="markers", marker=dict(color='Purple'),
                                 hovertemplate="<b>Neural Network model prediction</b><br><br>" +
                                 "streams: %{y}<br>" + columns_emotion[2*i+j] + ": %{x}<br>"),
                      row=i+1, col=j+1)
        results = px.get_trendline_results(fig_trend)
        fig.update_xaxes(title_text=columns_emotion[2*i+j].capitalize().replace("_", " "), row=i+1, col=j+1)
fig.update_layout(height=2000, width=1000, title={'text': "Lyrical Features", 'x': 0.5}, showlegend=False)
fig.update_traces(marker_size=4)
fig.show()
Finally, let's split the dataset into training and testing parts and hope that our NN model beats the naive estimator.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25)

def fit_and_report_mses(mod, X_train, X_test, y_train, y_test):
    mod.fit(X_train, y_train)
    return dict(
        mse_train=metrics.mean_squared_error(y_train, mod.predict(X_train)),
        mse_test=metrics.mean_squared_error(y_test, mod.predict(X_test))
    )

print("Linear Regression:", fit_and_report_mses(linear_model.LinearRegression(), X_train, X_test, y_train, y_test))
print("Lasso Regression:", fit_and_report_mses(linear_model.Lasso(), X_train, X_test, y_train, y_test))
print("NN Model:", fit_and_report_mses(nn_scaled_model, X_train, X_test, y_train, y_test))
Linear Regression: {'mse_train': 0.02048448685532459, 'mse_test': 0.013027925914433454} Lasso Regression: {'mse_train': 0.0272749774766269, 'mse_test': 0.010809195586301959} NN Model: {'mse_train': 0.0035198720881701186, 'mse_test': 0.05458352120520873}
Sadly, none of our models were able to beat the naive estimator in the testing phase. The NN model in particular overfits badly, with a training MSE an order of magnitude below its test MSE.
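One caveat on this comparison: with only 100 songs, a single 75/25 split makes the test MSE itself quite noisy, and k-fold cross-validation would average the error over several splits. Here is a minimal sketch of the fold generation; in practice, `model_selection.KFold` or `model_selection.cross_val_score` from scikit-learn do this (and more) directly:

```python
def kfold_indices(n, k):
    """Split range(n) into k disjoint test folds of near-equal size.

    Yields (train_indices, test_indices) pairs. A real run should shuffle
    the rows first, as sklearn's KFold(shuffle=True) does.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
# each of the 5 folds holds out 2 observations; together they cover all 10
```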
I would like to continue refining these models, as there is most likely an issue of overspecification. However, the project is already quite long, so I feel this is a good place to end it. Visually, it seems to me that musical features are not very good indicators of popularity, while lyrical features have stronger predictive capabilities: people tend to like positive and happy music, at least when it comes to lyrical content. However, the triumph of the naive estimator is a rather disappointing result, so further research on this topic should focus on which regressors to include in the model and which to drop.
Besides these conclusions, my main objective with this project was to lay the groundwork for future data analysis of music: that is, to create a somewhat structured way to gather this data and quantify its features. Still, given that the Spotify API does not give us access to the number of times a song has been streamed, alternative ways to measure this variable should be explored. Otherwise, conducting further research will prove difficult.