This project demonstrates the SQL skills I acquired in the SQL courses that are part of Dataquest's Data Science in Python track.
In this project, I will be working with a modified version of a database called Chinook. The Chinook database contains information about a fictional digital music shop, structured much like a miniature iTunes store.
The Chinook database contains information about the artists, songs, and albums from the music shop, as well as information on the shop's employees, customers, and the customers' purchases. This information is contained in 11 tables. A schema diagram for the Chinook database below gives an overview of the available columns and the structure of the data:
I will use the following code to connect the Jupyter Notebook to the database file:
%%capture
%load_ext sql
%sql sqlite:///chinook.db
import pandas as pd
import numpy as np
import sqlite3
# Add plotly library for graph generation
import plotly.express as px
To check that the database is connected, I will run a simple query that lists the names of all the tables in the database:
%%sql
SELECT
name,
type
FROM sqlite_master
WHERE type IN ('table', 'view');
* sqlite:///chinook.db Done.
name | type |
---|---|
album | table |
artist | table |
customer | table |
employee | table |
genre | table |
invoice | table |
invoice_line | table |
media_type | table |
playlist | table |
playlist_track | table |
track | table |
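The same catalogue check can also be run outside the notebook with the `sqlite3` module directly. A minimal sketch against a throwaway in-memory database (the single `album` table here is illustrative, not the full Chinook schema):

```python
import sqlite3

# Build a throwaway in-memory database with one table to query against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE album (album_id INTEGER PRIMARY KEY, title TEXT)")

# The same catalogue query as above: sqlite_master lists every table and view.
rows = conn.execute(
    "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view');"
).fetchall()
print(rows)  # [('album', 'table')]
conn.close()
```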
Let us find out the most popular genres among tracks purchased in the USA. To answer this question, I will have to merge and aggregate data from multiple tables.
First, I create a subquery joining the `invoice` and `customer` tables and call it `american_inv`; it yields the subset of invoices made by American customers. Second, I merge `invoice_line` with the `track` and `genre` tables; the resulting `invoice_genre` includes every invoice line along with the genre of the purchased track. Finally, I join `american_inv` and `invoice_genre` to get the American customers' purchases grouped by genre.
The final result of the manipulations described above is a table indicating the top 10 genres among the American customers.
In the context of the music industry, this information may be vital to make informed business decisions about new deals with music labels, marketing campaigns, strategic development, etc.
query = """
WITH american_inv AS
(
SELECT i.*
FROM invoice i
INNER JOIN customer c ON c.customer_id = i.customer_id
WHERE c.country = 'USA'
),
invoice_genre AS
(
SELECT
il.*,
g.name genre
FROM invoice_line il
INNER JOIN track t ON il.track_id = t.track_id
INNER JOIN genre g ON t.genre_id = g.genre_id
),
genre_tracks AS
(
SELECT
i_g.genre,
COUNT(i_g.genre) n_of_tracks
FROM american_inv a_i
INNER JOIN invoice_genre i_g ON a_i.invoice_id = i_g.invoice_id
GROUP BY i_g.genre
ORDER BY 2 DESC
)
SELECT
genre,
n_of_tracks purchased_tracks,
ROUND(n_of_tracks / CAST((SELECT SUM(n_of_tracks) FROM genre_tracks) AS FLOAT), 2) market_share
FROM genre_tracks;
"""
# Store query's result set as pandas DataFrame
conn = sqlite3.connect("chinook.db")
df = pd.read_sql_query(query, conn)
df_top_10 = df[:10]
df_top_10
genre | purchased_tracks | market_share | |
---|---|---|---|
0 | Rock | 561 | 0.53 |
1 | Alternative & Punk | 130 | 0.12 |
2 | Metal | 124 | 0.12 |
3 | R&B/Soul | 53 | 0.05 |
4 | Blues | 36 | 0.03 |
5 | Alternative | 35 | 0.03 |
6 | Pop | 22 | 0.02 |
7 | Latin | 22 | 0.02 |
8 | Hip Hop/Rap | 20 | 0.02 |
9 | Jazz | 14 | 0.01 |
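The `market_share` column is simply each genre's purchase count divided by the grand total across all genres. The same arithmetic on toy numbers (illustrative, not the real Chinook counts):

```python
# Toy purchase counts per genre (illustrative, not the real Chinook figures).
counts = {"Rock": 5, "Metal": 3, "Jazz": 2}

total = sum(counts.values())  # 10 tracks overall

# Same arithmetic as the query's market_share column: count / total, rounded.
shares = {genre: round(n / total, 2) for genre, n in counts.items()}
print(shares)  # {'Rock': 0.5, 'Metal': 0.3, 'Jazz': 0.2}
```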
# Create graph to visualise findings
fig = px.bar(df_top_10, x='market_share', y='genre',text='purchased_tracks',
labels={'market_share':'Market share',
'purchased_tracks':'Number of purchased tracks',
'genre':'Genre'},
title = 'Top 10 Genres in the USA'
)
fig['layout']['yaxis']['autorange'] = "reversed"
fig.show()
fig = px.pie(df, values='purchased_tracks', names='genre',
title='Top Genres in the USA',
labels={'market_share':'Market share',
'purchased_tracks':'Number of purchased tracks',
'genre':'Genre'})
fig.show()
Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase. Let us analyze the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than the others.
To solve the task, I will divide it into two parts. First, I calculate the total receipts per customer by joining the `customer` and `invoice` tables and grouping by `customer_id`. Second, I join the resulting `customer_total` with the `employee` table. The final table will include total sales per employee and the number of months each has worked since their `hire_date`, as of 2018-01-01.
query = """
WITH customer_total AS
(
SELECT
c.customer_id,
c.support_rep_id agent_id,
SUM(i.total) total
FROM customer c
INNER JOIN invoice i ON c.customer_id = i.customer_id
GROUP BY 1
),
employee_total AS
(
SELECT
e.first_name ||' '|| e.last_name AS employee,
strftime('%d.%m.%Y', DATETIME(e.hire_date)) AS hire_date,
CASE
WHEN SUM(c_t.total) > 0 THEN SUM(c_t.total)
ELSE 0
END AS total_sales,
CAST((julianday(DATE('2018-01-01')) - julianday(e.hire_date))/30 AS INTEGER)||' months' AS months_employed
FROM employee e
LEFT JOIN customer_total c_t ON c_t.agent_id = e.employee_id
GROUP BY 1
ORDER BY 3 DESC, e.hire_date DESC
)
SELECT * FROM employee_total;
"""
# Store query's result set as pandas DataFrame
conn = sqlite3.connect("chinook.db")
df = pd.read_sql_query(query, conn)
df
employee | hire_date | total_sales | months_employed | |
---|---|---|---|---|
0 | Jane Peacock | 01.04.2017 | 1731.51 | 9 months |
1 | Margaret Park | 03.05.2017 | 1584.00 | 8 months |
2 | Steve Johnson | 17.10.2017 | 1393.92 | 2 months |
3 | Laura Callahan | 04.03.2017 | 0.00 | 10 months |
4 | Robert King | 02.01.2017 | 0.00 | 12 months |
5 | Michael Mitchell | 17.10.2016 | 0.00 | 14 months |
6 | Andrew Adams | 14.08.2016 | 0.00 | 16 months |
7 | Nancy Edwards | 01.05.2016 | 0.00 | 20 months |
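The `months_employed` figure can be cross-checked in plain Python: the query takes the whole days between `hire_date` and the reference date and floor-divides by 30. A sketch using two hire dates from the result table above:

```python
from datetime import date

REFERENCE = date(2018, 1, 1)

def months_employed(hire: date, ref: date = REFERENCE) -> int:
    # Mirrors the query: whole days between the dates, floor-divided by 30.
    return (ref - hire).days // 30

# Jane Peacock, hired 2017-04-01, shows 9 months in the table above.
print(months_employed(date(2017, 4, 1)))    # 9
# Steve Johnson, hired 2017-10-17, shows 2 months.
print(months_employed(date(2017, 10, 17)))  # 2
```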
Only three employees are actively engaged in sales. However, the database contains no direct or indirect information clarifying the employment status of the remaining employees, so it would be prudent to refrain from drawing conclusions or comparing employee performance.
# Select top 3 employees
df_top_3 = df.iloc[:3,:]
# Create graph to visualise findings
fig = px.bar(df_top_3, x='employee', y='total_sales',text='months_employed',
labels={'employee':'Employee',
'months_employed':'Months Employed',
'total_sales':'Total Sales per Employee'},
title = 'Performance of Sales Support Agents'
)
fig.show()
In this section, I will analyze the sales data for customers from each country. I will use the country value from the `customer` table and ignore the billing-address country in the `invoice` table.
In particular, for each country I will calculate the:

* total number of customers
* total value of sales
* average sales per customer
* average order value

Because there are several countries with only one customer, I will group them into an "Other" category.
query = """
WITH customer_total AS
(
SELECT
CASE
WHEN COUNT(DISTINCT i.customer_id) < 2 THEN 'Other'
ELSE c.country
END AS country,
ROUND(SUM(i.total)) total_sales,
COUNT(DISTINCT i.customer_id) n_customers,
COUNT(DISTINCT i.invoice_id) n_orders
FROM customer c
INNER JOIN invoice i ON c.customer_id = i.customer_id
GROUP BY country
)
SELECT
country,
SUM(n_customers) total_customers,
total_sales,
ROUND(total_sales / SUM(n_customers),2) sales_per_customer,
ROUND(total_sales / n_orders,2) average_order_value
FROM customer_total
GROUP BY country
ORDER BY
CASE
WHEN country = 'Other' THEN 1
ELSE 0
END,
3 DESC;
"""
# Read query's result set into a pandas DataFrame
conn = sqlite3.connect("chinook.db")
df = pd.read_sql_query(query, conn)
# Create graph to visualise findings
fig = px.scatter(df, y='total_customers', x='total_sales', size='sales_per_customer', color='country',
labels={'total_customers':'Total Number of Customers',
'country':'Country',
'total_sales':'Total Sales',
'sales_per_customer':'Sales per Customer'},
title = 'Sales by Country'
)
fig.show()
The graph shows that the USA is the biggest and, unsurprisingly, the most lucrative market. Interestingly, the Czech Republic has the highest sales per customer of any country.
However, the number of customers in the dataset is too small to draw trustworthy conclusions and inferences.
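The "Other" bucketing from the query can be mirrored in pandas: countries with fewer than two distinct customers collapse into a single category. A toy sketch with made-up countries:

```python
import pandas as pd

# Toy customer table: one row per customer (countries are made up).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "country": ["USA", "USA", "Canada", "Norway", "Chile"],
})

# Count distinct customers per country, as in the query's CASE expression.
per_country = customers.groupby("country")["customer_id"].nunique()

# Countries with fewer than 2 customers collapse into "Other".
customers["country_group"] = customers["country"].where(
    customers["country"].map(per_country) >= 2, "Other"
)
print(sorted(customers["country_group"].unique()))  # ['Other', 'USA']
```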
The Chinook store is set up in a way that allows customers to make purchases in one of two ways: buying a whole album, or buying a collection of one or more individual tracks.
Let us find out what percentage of purchases are individual tracks versus whole albums. To achieve this goal, I will first group all tracks by `album_id` in the `track` table to count the number of tracks in each album, saving the result in a subquery called `album_track_n`. Next, I group the invoice lines from the `invoice_line` table by `invoice_id` and `album_id` to count the number of tracks purchased from each album on each invoice. Finally, I compare the two tables by `album_id`.
NB! Albums that consist of only one or two tracks are ignored, since their purchase would otherwise be indistinguishable from an individual track purchase.
query = """
WITH album_track_n AS
(
SELECT
album_id,
COUNT(track_id) n_tracks
FROM track
GROUP BY album_id
HAVING n_tracks > 2
),
album_order AS
(
SELECT
i_l.invoice_id,
t.album_id album_id_ordered,
COUNT(i_l.track_id) n_tracks_ordered
FROM invoice_line i_l
JOIN track t ON i_l.track_id = t.track_id
GROUP BY 1, 2
)
SELECT
CASE
WHEN a_o.n_tracks_ordered = a_t.n_tracks THEN 'YES'
ELSE 'NO'
END AS full_album_purchased,
COUNT(DISTINCT a_o.invoice_id) n_invoices,
ROUND(CAST(COUNT(DISTINCT a_o.invoice_id) AS FLOAT) / ( SELECT COUNT(*) FROM invoice)* 100, 1) share
FROM album_order a_o
INNER JOIN album_track_n a_t ON a_o.album_id_ordered = a_t.album_id
GROUP BY full_album_purchased;
"""
# Read query's result set into a pandas DataFrame
conn = sqlite3.connect("chinook.db")
df = pd.read_sql_query(query, conn)
# Create graph to visualise findings
fig = px.pie(df, values='share', names='full_album_purchased', title='Album vs Individual Track Purchase')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(uniformtext_minsize=15,uniformtext_mode='hide')
fig.show()
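The full-album test in the query reduces to comparing two counts per (invoice, album) pair: the number of tracks on the album versus the number of tracks from that album on the invoice. A toy sketch on made-up data:

```python
# Tracks per album (albums with one or two tracks already filtered out).
album_n_tracks = {10: 12, 20: 8}

# (invoice_id, album_id) -> number of tracks from that album on the invoice.
ordered = {(1, 10): 12, (2, 10): 3, (3, 20): 8}

# An invoice bought the whole album iff the two counts match.
full_album = {
    inv: ("YES" if n == album_n_tracks[alb] else "NO")
    for (inv, alb), n in ordered.items()
}
print(full_album)  # {1: 'YES', 2: 'NO', 3: 'YES'}
```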
Most users do not buy entire albums: whole-album purchases account for roughly 18% of invoices, while more than 81% of invoices contain individually selected tracks.
In this project, I have analyzed a modified version of the Chinook database using SQL. As the data in the database is fictional, all the data manipulations and inferences have been made for training purposes only.