This project demonstrates the SQL skills I acquired in the SQL courses that are part of Dataquest's Data Science in Python track.
In this project, I will be working with a modified version of a database called Chinook. The Chinook database contains information about a fictional digital music shop, structured much like a miniature iTunes store.
The Chinook database contains information about the artists, songs, and albums from the music shop, as well as information on the shop's employees, customers, and the customers' purchases. This information is contained in 11 tables. A schema diagram for the Chinook database below gives an overview of the available columns and the structure of the data:
I will use the following code to connect the Jupyter Notebook to the database file:
%%capture
%load_ext sql
%sql sqlite:///chinook.db
import pandas as pd
import numpy as np
import sqlite3
# Add plotly library for graph generation
import plotly.express as px
To check that the database is connected, I will run a simple query that lists the names of all the tables in the database:
%%sql
SELECT
name,
type
FROM sqlite_master
WHERE type IN ('table', 'view');
* sqlite:///chinook.db Done.
name | type |
---|---|
album | table |
artist | table |
customer | table |
employee | table |
genre | table |
invoice | table |
invoice_line | table |
media_type | table |
playlist | table |
playlist_track | table |
track | table |
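The same catalogue check can also be run outside the notebook with the `sqlite3` module directly. A minimal sketch against a throwaway in-memory database (the single `album` table here is illustrative, not the full Chinook schema):

```python
import sqlite3

# Build a throwaway in-memory database with one table to query against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE album (album_id INTEGER PRIMARY KEY, title TEXT)")

# The same catalogue query as above: sqlite_master lists every table and view.
rows = conn.execute(
    "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view');"
).fetchall()
print(rows)  # [('album', 'table')]
conn.close()
```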
Let us find out the most popular genres among tracks purchased in the USA. To answer this question, I will have to merge and aggregate data from multiple tables.
First, I create a subquery joining the `invoice` and `customer` tables and call it `american_inv`; it yields the subset of invoices made by American customers. Second, I merge `invoice_line` with the `track` and `genre` tables; the resulting `invoice_genre` includes every invoice line along with the genre of the purchased track. Finally, I join `american_inv` and `invoice_genre` to get the American customers' purchases grouped by genre.
The final result of the manipulations described above is a table indicating the top 10 genres among the American customers.
In the context of the music industry, this information may be vital to make informed business decisions about new deals with music labels, marketing campaigns, strategic development, etc.
query = """
WITH american_inv AS
(
SELECT i.*
FROM invoice i
INNER JOIN customer c ON c.customer_id = i.customer_id
WHERE c.country = 'USA'
),
invoice_genre AS
(
SELECT
il.*,
g.name genre
FROM invoice_line il
INNER JOIN track t ON il.track_id = t.track_id
INNER JOIN genre g ON t.genre_id = g.genre_id
),
genre_tracks AS
(
SELECT
i_g.genre,
COUNT(i_g.genre) n_of_tracks
FROM american_inv a_i
INNER JOIN invoice_genre i_g ON a_i.invoice_id = i_g.invoice_id
GROUP BY i_g.genre
ORDER BY 2 DESC
)
SELECT
genre,
n_of_tracks purchased_tracks,
ROUND(n_of_tracks / CAST((SELECT SUM(n_of_tracks) FROM genre_tracks) AS FLOAT), 2) market_share
FROM genre_tracks;
"""
# Store query's result set as pandas DataFrame
conn = sqlite3.connect("chinook.db")
df = pd.read_sql_query(query, conn)
df_top_10 = df[:10]
df_top_10
genre | purchased_tracks | market_share | |
---|---|---|---|
0 | Rock | 561 | 0.53 |
1 | Alternative & Punk | 130 | 0.12 |
2 | Metal | 124 | 0.12 |
3 | R&B/Soul | 53 | 0.05 |
4 | Blues | 36 | 0.03 |
5 | Alternative | 35 | 0.03 |
6 | Pop | 22 | 0.02 |
7 | Latin | 22 | 0.02 |
8 | Hip Hop/Rap | 20 | 0.02 |
9 | Jazz | 14 | 0.01 |
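The `market_share` column is simply each genre's purchase count divided by the grand total across all genres. The same arithmetic on toy numbers (illustrative, not the real Chinook counts):

```python
# Toy purchase counts per genre (illustrative, not the real Chinook figures).
counts = {"Rock": 5, "Metal": 3, "Jazz": 2}

total = sum(counts.values())  # 10 tracks overall

# Same arithmetic as the query's market_share column: count / total, rounded.
shares = {genre: round(n / total, 2) for genre, n in counts.items()}
print(shares)  # {'Rock': 0.5, 'Metal': 0.3, 'Jazz': 0.2}
```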
# Create graph to visualise findings
fig = px.bar(df_top_10, x='market_share', y='genre',text='purchased_tracks',
labels={'market_share':'Market share',
'purchased_tracks':'Number of purchased tracks',
'genre':'Genre'},
title = 'Top 10 Genres in the USA'
)
fig['layout']['yaxis']['autorange'] = "reversed"
fig.show()
fig = px.pie(df, values='purchased_tracks', names='genre',
title='Top Genres in the USA',
labels={'market_share':'Market share',
'purchased_tracks':'Number of purchased tracks',
'genre':'Genre'})
fig.show()
Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase. Let us analyze the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than the others.
To solve the task, I will divide it into two parts. First, I calculate the total receipts per customer by joining the `customer` and `invoice` tables and grouping by `customer_id`. Second, I join the resulting `customer_total` with the `employee` table. The final table will include total sales per employee and the number of months each has worked since their `hire_date`, as of 2018-01-01.
query = """
WITH customer_total AS
(
SELECT
c.customer_id,
c.support_rep_id agent_id,
SUM(i.total) total
FROM customer c
INNER JOIN invoice i ON c.customer_id = i.customer_id
GROUP BY 1
),
employee_total AS
(
SELECT
e.first_name ||' '|| e.last_name AS employee,
strftime('%d.%m.%Y', DATETIME(e.hire_date)) AS hire_date,
CASE
WHEN SUM(c_t.total) > 0 THEN SUM(c_t.total)
ELSE 0
END AS total_sales,
CAST((julianday(DATE('2018-01-01')) - julianday(e.hire_date))/30 AS INTEGER)||' months' AS months_employed
FROM employee e
LEFT JOIN customer_total c_t ON c_t.agent_id = e.employee_id
GROUP BY 1
ORDER BY 3 DESC, e.hire_date DESC
)
SELECT * FROM employee_total;
"""
# Store query's result set as pandas DataFrame
conn = sqlite3.connect("chinook.db")
df = pd.read_sql_query(query, conn)
df
employee | hire_date | total_sales | months_employed | |
---|---|---|---|---|
0 | Jane Peacock | 01.04.2017 | 1731.51 | 9 months |
1 | Margaret Park | 03.05.2017 | 1584.00 | 8 months |
2 | Steve Johnson | 17.10.2017 | 1393.92 | 2 months |
3 | Laura Callahan | 04.03.2017 | 0.00 | 10 months |
4 | Robert King | 02.01.2017 | 0.00 | 12 months |
5 | Michael Mitchell | 17.10.2016 | 0.00 | 14 months |
6 | Andrew Adams | 14.08.2016 | 0.00 | 16 months |
7 | Nancy Edwards | 01.05.2016 | 0.00 | 20 months |
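The `months_employed` figure can be cross-checked in plain Python: the query takes the whole days between `hire_date` and the reference date and floor-divides by 30. A sketch using two hire dates from the result table above:

```python
from datetime import date

REFERENCE = date(2018, 1, 1)

def months_employed(hire: date, ref: date = REFERENCE) -> int:
    # Mirrors the query: whole days between the dates, floor-divided by 30.
    return (ref - hire).days // 30

# Jane Peacock, hired 2017-04-01, shows 9 months in the table above.
print(months_employed(date(2017, 4, 1)))    # 9
# Steve Johnson, hired 2017-10-17, shows 2 months.
print(months_employed(date(2017, 10, 17)))  # 2
```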
Only three employees are actively engaged in sales. However, the database contains no direct or indirect information clarifying the employment status of the remaining employees, so it would be prudent to refrain from drawing conclusions or comparing employee performance.
# Select top 3 employees
df_top_3 = df.iloc[:3,:]
# Create graph to visualise findings
fig = px.bar(df_top_3, x='employee', y='total_sales',text='months_employed',
labels={'employee':'Employee',
'months_employed':'Months Employed',
'total_sales':'Total Sales per Employee'},
title = 'Performance of Sales Support Agents'
)
fig.show()
In this section, I will analyze the sales data for customers from each country. I will use the country value from the `customer` table and ignore the billing-address country in the `invoice` table.
In particular, for each country I will calculate the:

* total number of customers
* total value of sales
* average sales per customer
* average order value

Because there are several countries with only one customer, I will group them into an "Other" category.
query = """
WITH customer_total AS
(
SELECT
CASE
WHEN COUNT(DISTINCT i.customer_id) < 2 THEN 'Other'
ELSE c.country
END AS country,
ROUND(SUM(i.total)) total_sales,
COUNT(DISTINCT i.customer_id) n_customers,
COUNT(DISTINCT i.invoice_id) n_orders
FROM customer c
INNER JOIN invoice i ON c.customer_id = i.customer_id
GROUP BY country
)
SELECT
country,
SUM(n_customers) total_customers,
total_sales,
ROUND(total_sales / SUM(n_customers),2) sales_per_customer,
ROUND(total_sales / n_orders,2) average_order_value
FROM customer_total
GROUP BY country
ORDER BY
CASE
WHEN country = 'Other' THEN 1
ELSE 0
END,
3 DESC;
"""
# Read query's result set into a pandas DataFrame
conn = sqlite3.connect("chinook.db")
df = pd.read_sql_query(query, conn)
# Create graph to visualise findings
fig = px.scatter(df, y='total_customers', x='total_sales', size='sales_per_customer', color='country',
labels={'total_customers':'Total Number of Customers',
'country':'Country',
'total_sales':'Total Sales',
'sales_per_customer':'Sales per Customer'},
title = 'Sales by Country'
)
fig.show()
The graph shows that the USA is the biggest and, unsurprisingly, the most lucrative market. Interestingly, the Czech Republic has the highest sales per customer of any country.
However, the number of customers in the dataset is too small to draw trustworthy conclusions and inferences.
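The "Other" bucketing from the query can be mirrored in pandas: countries with fewer than two distinct customers collapse into a single category. A toy sketch with made-up countries:

```python
import pandas as pd

# Toy customer table: one row per customer (countries are made up).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "country": ["USA", "USA", "Canada", "Norway", "Chile"],
})

# Count distinct customers per country, as in the query's CASE expression.
per_country = customers.groupby("country")["customer_id"].nunique()

# Countries with fewer than 2 customers collapse into "Other".
customers["country_group"] = customers["country"].where(
    customers["country"].map(per_country) >= 2, "Other"
)
print(sorted(customers["country_group"].unique()))  # ['Other', 'USA']
```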
The Chinook store is set up in a way that allows customers to make purchases in one of two ways: buying a whole album, or buying a collection of one or more individual tracks.
Let us find out what percentage of purchases are individual tracks versus whole albums. To achieve this goal, I will first group all tracks by `album_id` in the `track` table to count the number of tracks in each album, saving the result in a subquery called `album_track_n`. Next, I group the invoice lines from the `invoice_line` table by `invoice_id` and `album_id` to count the number of tracks purchased from each album on each invoice. Finally, I compare the two tables by `album_id`.
NB! Albums that consist of only one or two tracks are ignored, since their purchase would otherwise be indistinguishable from an individual track purchase.
query = """
WITH album_track_n AS
(
SELECT
album_id,
COUNT(track_id) n_tracks
FROM track
GROUP BY album_id
HAVING n_tracks > 2
),
album_order AS
(
SELECT
i_l.invoice_id,
t.album_id album_id_ordered,
COUNT(i_l.track_id) n_tracks_ordered
FROM invoice_line i_l
JOIN track t ON i_l.track_id = t.track_id
GROUP BY 1, 2
)
SELECT
CASE
WHEN a_o.n_tracks_ordered = a_t.n_tracks THEN 'YES'
ELSE 'NO'
END AS full_album_purchased,
COUNT(DISTINCT a_o.invoice_id) n_invoices,
ROUND(CAST(COUNT(DISTINCT a_o.invoice_id) AS FLOAT) / ( SELECT COUNT(*) FROM invoice)* 100, 1) share
FROM album_order a_o
INNER JOIN album_track_n a_t ON a_o.album_id_ordered = a_t.album_id
GROUP BY full_album_purchased;
"""
# Read query's result set into a pandas DataFrame
conn = sqlite3.connect("chinook.db")
df = pd.read_sql_query(query, conn)
# Create graph to visualise findings
fig = px.pie(df, values='share', names='full_album_purchased', title='Album vs Individual Track Purchase')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(uniformtext_minsize=15,uniformtext_mode='hide')
fig.show()
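The full-album test in the query reduces to comparing two counts per (invoice, album) pair: the number of tracks on the album versus the number of tracks from that album on the invoice. A toy sketch on made-up data:

```python
# Tracks per album (albums with one or two tracks already filtered out).
album_n_tracks = {10: 12, 20: 8}

# (invoice_id, album_id) -> number of tracks from that album on the invoice.
ordered = {(1, 10): 12, (2, 10): 3, (3, 20): 8}

# An invoice bought the whole album iff the two counts match.
full_album = {
    inv: ("YES" if n == album_n_tracks[alb] else "NO")
    for (inv, alb), n in ordered.items()
}
print(full_album)  # {1: 'YES', 2: 'NO', 3: 'YES'}
```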
Most users do not buy entire albums: whole-album purchases account for roughly 18% of invoices, while more than 81% of invoices contain individually selected tracks.
In this project, I have analyzed a modified version of the Chinook database using SQL. As the data in the database is fictional, all the data manipulations and inferences have been made for training purposes only.