Answering Business Questions using SQL


Motivation

This project aims to demonstrate my SQL skills acquired due to SQL online courses that are a part of the Dataquest's Data Science in Python track.

Description

In this project, I will be working with a modified version of a database called Chinook. The Chinook database contains information about a fictional digital music shop that structurally may resemble a mini-iTunes store.

The Chinook database contains information about the artists, songs, and albums from the music shop, as well as information on the shop's employees, customers, and the customers' purchases. This information is contained in 11 tables. A schema diagram for the Chinook database below gives an overview of the available columns and the structure of the data:

I will use the following code to connect our Jupyter Notebook to our database file:

In [1]:
%%capture
%load_ext sql
%sql sqlite:///chinook.db
In [2]:
import pandas as pd
import numpy as np
import sqlite3

# Add plotly library for graph generation
import plotly.express as px

To check if a database is connected, I will run a simple query that shows all the tables' names in our database:

In [3]:
%%sql
SELECT
    name,
    type
FROM sqlite_master
WHERE type IN ("table","view");
 * sqlite:///chinook.db
Done.
Out[3]:
name type
album table
artist table
customer table
employee table
genre table
invoice table
invoice_line table
media_type table
playlist table
playlist_track table
track table

Let us find out the most popular genres of tracks purchased and listened in the USA. To answer this question, I will have to merge and aggregate data from multiple tables.

Firstly, I create a subquery joining invoice and customer tables and call it american_inv. This subquery will result in a subset of invoices made by American customers. Secondly, I merge invoice_line with track and genre tables. The resulting invoice_genre will include all the invoices along with a genre of the purchased tracks. Finally, I join american_inv and invoice_genre to get the Americans' invoices grouped by different genres. The final result of the manipulations described above is a table indicating the top 10 genres among the American customers.

In the context of the music industry, this information may be vital to make informed business decisions about new deals with music labels, marketing campaigns, strategic development, etc.

In [4]:
query = """
WITH american_inv AS 
    (
        SELECT i.*
        FROM invoice i
        INNER JOIN customer c ON c.customer_id = i.customer_id
        WHERE c.country = "USA"
    ),
    invoice_genre AS 
    (
        SELECT 
            il.*, 
            g.name genre
        FROM invoice_line il
        INNER JOIN track t ON il.track_id = t.track_id
        INNER JOIN genre g ON t.genre_id = g.genre_id
    ),
    genre_tracks AS
    (
        SELECT 
            i_g.genre,
            COUNT(i_g.genre) n_of_tracks
        FROM american_inv a_i
        INNER JOIN invoice_genre i_g ON a_i.invoice_id = i_g.invoice_id
        GROUP BY i_g.genre
        ORDER BY 2 DESC
    )
       
SELECT 
    genre,
    n_of_tracks purchased_tracks,
    ROUND(n_of_tracks / CAST((SELECT SUM(n_of_tracks) FROM genre_tracks) AS FLOAT), 2) market_share
FROM genre_tracks;
"""

# Store query's result set as pandas DataFrame
conn = sqlite3.connect("chinook.db")
df = pd.read_sql_query(query, conn)
df_top_10 = df[:10]

df_top_10
Out[4]:
genre purchased_tracks market_share
0 Rock 561 0.53
1 Alternative & Punk 130 0.12
2 Metal 124 0.12
3 R&B/Soul 53 0.05
4 Blues 36 0.03
5 Alternative 35 0.03
6 Pop 22 0.02
7 Latin 22 0.02
8 Hip Hop/Rap 20 0.02
9 Jazz 14 0.01
In [5]:
# Create graph to visualise findings
fig = px.bar(df_top_10, x='market_share', y='genre',text='purchased_tracks',
             labels={'market_share':'Market share',
                     'purchased_tracks':'Number of purchased tracks',
                     'genre':'Genre'},
             title = 'Top 10 Genres in the USA'
            )
fig['layout']['yaxis']['autorange'] = "reversed"
fig.show()

fig = px.pie(df, values='purchased_tracks', names='genre',
             title='Top Genres in the USA',
             labels={'market_share':'Market share',
                     'purchased_tracks':'Number of purchased tracks',
                     'genre':'Genre'})
fig.show()