Introduction and Database Schema
Enabling SQL and Connect to the Database
Analyzing Sales Agent Performance
Analyzing Purchases by Country
Analyzing Purchases of Individual Tracks and Full Albums
Finding the Artist Featured in Most Playlists
Analyzing Tracks Purchased and Not Purchased
Analyzing Tracks Available and Sales Popularity
The Chinook record store has presented us with access to their database in order to enable us as analysts to use their data to answer a set of business questions. We will SQL queries to perform our analysis. The following is the database schema:
Required Jupyter commands to enable SQL and connect to chinook.db
%%capture
%load_ext sql
%sql sqlite:///chinook.db
Simple SQL queries to preview table data
%%sql
SELECT
name,
type
FROM sqlite_master
WHERE type IN ("table","view")
Done.
name | type |
---|---|
album | table |
artist | table |
customer | table |
employee | table |
genre | table |
invoice | table |
invoice_line | table |
media_type | table |
playlist | table |
playlist_track | table |
track | table |
%%sql
SELECT *
FROM employee
LIMIT 3;
Done.
employee_id | last_name | first_name | title | reports_to | birthdate | hire_date | address | city | state | country | postal_code | phone | fax | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Adams | Andrew | General Manager | None | 1962-02-18 00:00:00 | 2016-08-14 00:00:00 | 11120 Jasper Ave NW | Edmonton | AB | Canada | T5K 2N1 | +1 (780) 428-9482 | +1 (780) 428-3457 | andrew@chinookcorp.com |
2 | Edwards | Nancy | Sales Manager | 1 | 1958-12-08 00:00:00 | 2016-05-01 00:00:00 | 825 8 Ave SW | Calgary | AB | Canada | T2P 2T3 | +1 (403) 262-3443 | +1 (403) 262-3322 | nancy@chinookcorp.com |
3 | Peacock | Jane | Sales Support Agent | 2 | 1973-08-29 00:00:00 | 2017-04-01 00:00:00 | 1111 6 Ave SW | Calgary | AB | Canada | T2P 5M5 | +1 (403) 262-3443 | +1 (403) 262-6712 | jane@chinookcorp.com |
%%sql
SELECT *
FROM customer
LIMIT 3;
Done.
customer_id | first_name | last_name | company | address | city | state | country | postal_code | phone | fax | support_rep_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Luís | Gonçalves | Embraer - Empresa Brasileira de Aeronáutica S.A. | Av. Brigadeiro Faria Lima, 2170 | São José dos Campos | SP | Brazil | 12227-000 | +55 (12) 3923-5555 | +55 (12) 3923-5566 | luisg@embraer.com.br | 3 |
2 | Leonie | Köhler | None | Theodor-Heuss-Straße 34 | Stuttgart | None | Germany | 70174 | +49 0711 2842222 | None | leonekohler@surfeu.de | 5 |
3 | François | Tremblay | None | 1498 rue Bélanger | Montréal | QC | Canada | H2G 1A7 | +1 (514) 721-4711 | None | ftremblay@gmail.com | 3 |
%%sql
SELECT *
FROM invoice
LIMIT 3;
Done.
invoice_id | customer_id | invoice_date | billing_address | billing_city | billing_state | billing_country | billing_postal_code | total |
---|---|---|---|---|---|---|---|---|
1 | 18 | 2017-01-03 00:00:00 | 627 Broadway | New York | NY | USA | 10012-2612 | 15.84 |
2 | 30 | 2017-01-03 00:00:00 | 230 Elgin Street | Ottawa | ON | Canada | K2P 1L7 | 9.9 |
3 | 40 | 2017-01-05 00:00:00 | 8, Rue Hanovre | Paris | None | France | 75002 | 1.98 |
%%sql
SELECT *
FROM invoice_line
LIMIT 3;
Done.
invoice_line_id | invoice_id | track_id | unit_price | quantity |
---|---|---|---|---|
1 | 1 | 1158 | 0.99 | 1 |
2 | 1 | 1159 | 0.99 | 1 |
3 | 1 | 1160 | 0.99 | 1 |
%%sql
SELECT *
FROM track
LIMIT 3;
Done.
track_id | name | album_id | media_type_id | genre_id | composer | milliseconds | bytes | unit_price |
---|---|---|---|---|---|---|---|---|
1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 | Angus Young, Malcolm Young, Brian Johnson | 343719 | 11170334 | 0.99 |
2 | Balls to the Wall | 2 | 2 | 1 | None | 342562 | 5510424 | 0.99 |
3 | Fast As a Shark | 3 | 2 | 1 | F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman | 230619 | 3990994 | 0.99 |
%%sql
SELECT *
FROM playlist_track
LIMIT 3;
Done.
playlist_id | track_id |
---|---|
1 | 3402 |
1 | 3389 |
1 | 3390 |
%%sql
SELECT *
FROM playlist
LIMIT 3;
Done.
playlist_id | name |
---|---|
1 | Music |
2 | Movies |
3 | TV Shows |
%%sql
SELECT *
FROM album
LIMIT 3;
Done.
album_id | title | artist_id |
---|---|---|
1 | For Those About To Rock We Salute You | 1 |
2 | Balls to the Wall | 2 |
3 | Restless and Wild | 2 |
%%sql
SELECT *
FROM artist
LIMIT 3;
Done.
artist_id | name |
---|---|
1 | AC/DC |
2 | Accept |
3 | Aerosmith |
%%sql
SELECT *
FROM media_type
LIMIT 3;
Done.
media_type_id | name |
---|---|
1 | MPEG audio file |
2 | Protected AAC audio file |
3 | Protected MPEG-4 video file |
%%sql
SELECT *
FROM genre
LIMIT 3;
Done.
genre_id | name |
---|---|
1 | Rock |
2 | Jazz |
3 | Metal |
Chinook music store started working with a new record label that has 4 new albums available. Chinook is able to add 3 out of 4 to the store.
Business question: which 3 albums to add to the store?
These are the artists and the genres of the 4 available albums from a new record label:
Artist Name | Genre |
---|---|
Regal | Hip-Hop |
Red Tone | Punk |
Meteor and the Girl | Pop |
Slim Jim Bites | Blues |
Since the record label is based in the USA, we will only consider USA-based billing addresses.
With SQL below, create a filtered invoice table view (USA billing addresses only)
%%sql
CREATE VIEW invoice_usa AS
SELECT *
FROM invoice
WHERE billing_country = 'USA';
Done.
[]
Create a list of USA only invoice ids
%%sql
CREATE VIEW invoice_ids_usa AS
SELECT DISTINCT(invoice_id)
FROM invoice_usa;
Done.
[]
Below is a SQL query that will sort all genres by the number of tracks sold to customers with billing addresses in the USA.
%%sql
WITH
usa_total_tracks_sold AS
(
SELECT SUM(il.quantity) AS total_all_tracks
FROM invoice_line AS il
WHERE il.invoice_id IN (SELECT *
FROM invoice_ids_usa)
),
usa_tracks_genre_qty AS
(
SELECT
g.name AS genre,
SUM(il.quantity) AS tracks_sold,
ROUND((CAST(SUM(il.quantity) * 100 AS Float) / utts.total_all_tracks), 1) AS tracks_sold_pct
FROM invoice_line AS il, usa_total_tracks_sold AS utts
INNER JOIN track AS t ON il.track_id = t.track_id
INNER JOIN genre AS g ON t.genre_id = g.genre_id
WHERE il.invoice_id IN (SELECT *
FROM invoice_ids_usa)
GROUP BY genre
ORDER BY tracks_sold DESC
)
SELECT *
FROM usa_tracks_genre_qty;
Done.
genre | tracks_sold | tracks_sold_pct |
---|---|---|
Rock | 561 | 53.4 |
Alternative & Punk | 130 | 12.4 |
Metal | 124 | 11.8 |
R&B/Soul | 53 | 5.0 |
Blues | 36 | 3.4 |
Alternative | 35 | 3.3 |
Pop | 22 | 2.1 |
Latin | 22 | 2.1 |
Hip Hop/Rap | 20 | 1.9 |
Jazz | 14 | 1.3 |
Easy Listening | 13 | 1.2 |
Reggae | 6 | 0.6 |
Electronica/Dance | 5 | 0.5 |
Classical | 4 | 0.4 |
Heavy Metal | 3 | 0.3 |
Soundtrack | 2 | 0.2 |
TV Shows | 1 | 0.1 |
Based on these results, I would recommend the following:
Artist Name | Genre | Genre Share in the USA | Recommendation |
---|---|---|---|
Red Tone | Punk | 12.4% | Add to store |
Slim Jim Bites | Blues | 3.4% | Add to store |
Meteor and the Girl | Pop | 2.1% | Add to store |
Regal | Hip-Hop | 1.9% | Hold |
Business question: analyze sales support agent performance using customer purchases. We need to consider the following:
We should check with our HR that hire date corresponds to position start date, because it is common for employees to change positions. This assumption is quite safe for lower level positions, but it's a good idea to ask anyway. Let's assume the answer is 'Yes'.
SQLite provides a function called JULIANDAY()
that can be used to calculate the number of days elapsed from hire date to today. Having this calculation allows us to calculate the averages sales rate per day which is a better performance indicator than total sales.
%%sql
SELECT e.first_name || ' ' || e.last_name AS sales_support_agent_name,
e.hire_date,
ROUND((SELECT JULIANDAY('now')) - JULIANDAY(e.hire_date)) AS days_on_the_job,
ROUND(SUM(i.total)) AS total_sales,
ROUND(SUM(i.total) / ((SELECT JULIANDAY('now')) - JULIANDAY(e.hire_date)), 2) AS avg_sales_per_day
FROM employee AS e
LEFT JOIN customer AS c ON e.employee_id = c.support_rep_id
INNER JOIN invoice AS i ON c.customer_id = i.customer_id
GROUP BY sales_support_agent_name, e.hire_date;
Done.
sales_support_agent_name | hire_date | days_on_the_job | total_sales | avg_sales_per_day |
---|---|---|---|---|
Jane Peacock | 2017-04-01 00:00:00 | 1377.0 | 1732.0 | 1.26 |
Margaret Park | 2017-05-03 00:00:00 | 1345.0 | 1584.0 | 1.18 |
Steve Johnson | 2017-10-17 00:00:00 | 1178.0 | 1394.0 | 1.18 |
Calculate Steve's total sales as percent of Jane's
%%sql
SELECT 1394/1732.0 AS steve_total_sales_as_pct_of_jane
Done.
steve_total_sales_as_pct_of_jane |
---|
0.8048498845265589 |
Calculate Margaret's average sales as percent of Jane's
%%sql
SELECT 1.18/1.26 AS margaret_avg_sales_as_pct_of_jane
Done.
margaret_avg_sales_as_pct_of_jane |
---|
0.9365079365079364 |
If we were to use total sales, it would appear that Steve is the worst performer at 80% of Jane's sales, but using the average sales rate, we see that all agents are performing much more evenly, with Margaret and Steve at 94% of Jane's sales.
Business question: analyze purchases by country. Specifically, we are asked to provide the following:
Countries with only one customer will be grouped into the "Other" country group
Create a view with SQL listing the following:
%%sql
CREATE VIEW all_countries AS
SELECT country,
CASE
WHEN COUNT(DISTINCT(c.customer_id)) = 1 THEN "Other"
ELSE country
END AS country_grp,
COUNT(DISTINCT(c.customer_id)) AS number_of_customers,
COUNT(DISTINCT(i.invoice_id)) AS number_of_orders,
SUM(i.total) AS total_value_of_sales,
CASE
WHEN COUNT(DISTINCT(c.customer_id)) = 1 THEN 0
ELSE 1
END AS sort
FROM customer AS c
INNER JOIN invoice AS i ON c.customer_id = i.customer_id
GROUP BY country;
Done.
[]
Preview of the view created above
%%sql
SELECT *
FROM all_countries
LIMIT 5;
Done.
country | country_grp | number_of_customers | number_of_orders | total_value_of_sales | sort |
---|---|---|---|---|---|
Argentina | Other | 1 | 5 | 39.6 | 0 |
Australia | Other | 1 | 10 | 81.18 | 0 |
Austria | Other | 1 | 9 | 69.3 | 0 |
Belgium | Other | 1 | 7 | 60.38999999999999 | 0 |
Brazil | Brazil | 5 | 61 | 427.68000000000006 | 1 |
SQL query to show the following:
Group by country with "Other" country group at the bottom
%%sql
SELECT country_grp AS country,
SUM(number_of_customers) AS total_number_of_customers,
ROUND(SUM(total_value_of_sales), 2) AS total_value_of_sales,
ROUND((SUM(total_value_of_sales) / SUM(number_of_customers)), 2) AS average_value_of_sales_per_customer,
ROUND((SUM(total_value_of_sales) / SUM(number_of_orders)), 2) AS average_order_value
FROM all_countries
GROUP BY country_grp
ORDER BY sort DESC, total_value_of_sales DESC;
Done.
country | total_number_of_customers | total_value_of_sales | average_value_of_sales_per_customer | average_order_value |
---|---|---|---|---|
USA | 13 | 1040.49 | 80.04 | 7.94 |
Canada | 8 | 535.59 | 66.95 | 7.05 |
Brazil | 5 | 427.68 | 85.54 | 7.01 |
France | 5 | 389.07 | 77.81 | 7.78 |
Germany | 4 | 334.62 | 83.66 | 8.16 |
Czech Republic | 2 | 273.24 | 136.62 | 9.11 |
United Kingdom | 3 | 245.52 | 81.84 | 8.77 |
Portugal | 2 | 185.13 | 92.57 | 6.38 |
India | 2 | 183.15 | 91.58 | 8.72 |
Other | 15 | 1094.94 | 73.0 | 7.45 |
USA accounts for most customers and dollar sales. 15 countries had 1 customer each.
Chinook wants to understand the split between purchases of full albums and purchases of individual tracks
With SQL, create a view listing the track count for each album id available for sale.
%%sql
CREATE VIEW album_track_count AS
SELECT album_id,
COUNT(track_id) AS album_track_count
FROM track
GROUP BY album_id;
Done.
[]
Preview of the view created above
%%sql
SELECT *
FROM album_track_count
LIMIT 5;
Done.
album_id | album_track_count |
---|---|
1 | 10 |
2 | 1 |
3 | 3 |
4 | 8 |
5 | 15 |
SQL query to list the following:
By comparing 2) and 3) we can determine whether each album on the invoice was a full album purchase or a partial album purchase.
%%sql
CREATE VIEW invoice_album_full_partial_classification AS
SELECT il.invoice_id,
t.album_id,
COUNT(il.track_id) AS album_tracks_purchased,
atc.album_track_count,
CASE
WHEN COUNT(il.track_id) = atc.album_track_count THEN "Full Album Purchase"
ELSE "Partial Album Purchase"
END AS classification
FROM invoice_line AS il
INNER JOIN track AS t ON il.track_id = t.track_id
INNER JOIN album_track_count AS atc ON t.album_id = atc.album_id
GROUP BY il.invoice_id, t.album_id;
Done.
[]
Preview of the view created above
%%sql
SELECT *
FROM invoice_album_full_partial_classification
LIMIT 5;
Done.
invoice_id | album_id | album_tracks_purchased | album_track_count | classification |
---|---|---|---|---|
1 | 91 | 16 | 16 | Full Album Purchase |
2 | 20 | 1 | 11 | Partial Album Purchase |
2 | 34 | 1 | 17 | Partial Album Purchase |
2 | 39 | 1 | 21 | Partial Album Purchase |
2 | 66 | 1 | 10 | Partial Album Purchase |
SQL query to summarize full album purchases and partial album purchases by number of tracks sold
%%sql
SELECT classification AS purchased_as_part_of,
SUM(album_tracks_purchased) AS tracks_purchased,
ROUND((CAST(SUM(album_tracks_purchased) AS Float)) / (SELECT SUM(album_tracks_purchased)
FROM invoice_album_full_partial_classification), 2)
AS pct_of_total_tracks
FROM invoice_album_full_partial_classification
GROUP BY classification;
Done.
purchased_as_part_of | tracks_purchased | pct_of_total_tracks |
---|---|---|
Full Album Purchase | 1569 | 0.33 |
Partial Album Purchase | 3188 | 0.67 |
Full album purchases account for a third of all tracks sold. Not having full albums available could potentially cause customer dissatisfaction and boost competitors' sales. Therefore, my recommendation is to continue to buy full albums from record companies.
Business question: Which artist is used in the most playlists?
SQL query to list the name of the artist featured in most playlists
%%sql
SELECT ar.name,
COUNT(DISTINCT(pt.playlist_id)) AS number_of_playlists
FROM artist AS ar
LEFT JOIN album AS al ON ar.artist_id = al.artist_id
LEFT JOIN track AS t ON al.album_id = t.album_id
LEFT JOIN playlist_track AS pt ON t.track_id = pt.track_id
GROUP BY ar.name
ORDER BY number_of_playlists DESC
LIMIT 5;
Done.
name | number_of_playlists |
---|---|
Eugene Ormandy | 7 |
The King's Singers | 6 |
English Concert & Trevor Pinnock | 6 |
Berliner Philharmoniker & Herbert Von Karajan | 6 |
Academy of St. Martin in the Fields & Sir Neville Marriner | 6 |
The artist featured in most playlists is Eugene Ormandy (7 playlists).
Business question: how many tracks have been purchased vs not purchased?
With SQL create a view to classify each track available in the store as "Purchased" at least once or "Not Purchased". This is determined by comparing all tracks in the "track" table to all tracks in the "invoice_line" table.
%%sql
CREATE VIEW track_purchased_yes_no_classification AS
SELECT track_id,
CASE
WHEN track_id IN (SELECT DISTINCT(track_id)
FROM invoice_line) THEN "Purchased"
ELSE "Not Purchased"
END AS purchased_yes_no
FROM track;
Done.
[]
Preview of the view created above
%%sql
SELECT *
FROM track_purchased_yes_no_classification
WHERE track_id > 2012
LIMIT 5;
Done.
track_id | purchased_yes_no |
---|---|
2013 | Purchased |
2014 | Purchased |
2015 | Not Purchased |
2016 | Not Purchased |
2017 | Not Purchased |
SQL query to summarize tracks purchased and not purchased
%%sql
SELECT purchased_yes_no,
COUNT(track_id) AS track_count,
ROUND(CAST(COUNT(track_id) AS Float) / (SELECT COUNT(*)
FROM track), 3) * 100
AS pct_of_all_tracks
FROM track_purchased_yes_no_classification
GROUP BY purchased_yes_no;
Done.
purchased_yes_no | track_count | pct_of_all_tracks |
---|---|---|
Not Purchased | 1697 | 48.4 |
Purchased | 1806 | 51.6 |
51.6% of all tracks available were purchased at least once. 48.4% of tracks were never purchased.
Business question: is the range of tracks in the store reflective of their sales popularity?
First, verify that the maximum invoice quantity of any track is 1, then we don't have to use the "invoice" table to answer this question
%%sql
SELECT MAX(quantity)
FROM invoice_line;
Done.
MAX(quantity) |
---|
1 |
We can see that not a single invoice had a track purchase quantity greater than 1.
SQL query to list the following:
Then we can compare 2) and 4) to see
%%sql
SELECT g.name,
COUNT(DISTINCT(t.track_id)) AS tracks_available,
ROUND(CAST(COUNT(DISTINCT(t.track_id)) AS Float) / (SELECT (COUNT(*))
FROM track) * 100, 2)
AS pct_of_total_tracks_available,
ROUND(SUM(il.unit_price)) AS dollar_sales,
ROUND(CAST(SUM(il.unit_price) AS Float) / (SELECT SUM(unit_price)
FROM invoice_line) * 100, 2)
AS pct_of_total_dollar_sales
FROM track AS t
INNER JOIN genre AS g ON t.genre_id = g.genre_id
LEFT JOIN invoice_line AS il ON t.track_id = il.track_id
GROUP BY g.genre_id
ORDER BY pct_of_total_dollar_sales DESC;
Done.
name | tracks_available | pct_of_total_tracks_available | dollar_sales | pct_of_total_dollar_sales |
---|---|---|---|---|
Rock | 1297 | 37.03 | 2609.0 | 55.39 |
Metal | 374 | 10.68 | 613.0 | 13.01 |
Alternative & Punk | 332 | 9.48 | 487.0 | 10.34 |
Latin | 579 | 16.53 | 165.0 | 3.51 |
R&B/Soul | 61 | 1.74 | 157.0 | 3.34 |
Blues | 81 | 2.31 | 123.0 | 2.61 |
Jazz | 130 | 3.71 | 120.0 | 2.54 |
Alternative | 40 | 1.14 | 116.0 | 2.46 |
Easy Listening | 24 | 0.69 | 73.0 | 1.56 |
Pop | 48 | 1.37 | 62.0 | 1.32 |
Electronica/Dance | 30 | 0.86 | 54.0 | 1.16 |
Classical | 74 | 2.11 | 47.0 | 0.99 |
Reggae | 58 | 1.66 | 35.0 | 0.74 |
Hip Hop/Rap | 35 | 1.0 | 33.0 | 0.69 |
Heavy Metal | 28 | 0.8 | 8.0 | 0.17 |
Soundtrack | 43 | 1.23 | 5.0 | 0.11 |
TV Shows | 93 | 2.65 | 2.0 | 0.04 |
Drama | 64 | 1.83 | 1.0 | 0.02 |
Rock And Roll | 12 | 0.34 | None | None |
Bossa Nova | 15 | 0.43 | None | None |
World | 28 | 0.8 | None | None |
Science Fiction | 13 | 0.37 | None | None |
Sci Fi & Fantasy | 26 | 0.74 | None | None |
Comedy | 17 | 0.49 | None | None |
Opera | 1 | 0.03 | None | None |
Some genres do not get their fair share of total tracks judging by dollar sales. We can recommend that the store adds tracks in the following genres:
On the flip side, certain genres have a greater share of total tracks than share of total sales. We can recommend that the store cuts or does not add additional tracks in the following genres:
7 genres do not have any sales:
Business question: do protected vs non-protected media types have an effect on popularity?
Use SQL to view the list of possible media types
%%sql
SELECT DISTINCT(name)
FROM media_type;
Done.
name |
---|
MPEG audio file |
Protected AAC audio file |
Protected MPEG-4 video file |
Purchased AAC audio file |
AAC audio file |
2 file types are protected, and 3 are not protected.
With SQL, create a view to classify all tracks available as protected or not protected, and show dollar sales for each track.
%%sql
CREATE VIEW track_sales_with_media_type AS
SELECT t.track_id,
SUM(il.unit_price) AS dollar_sales,
CASE
WHEN m.name LIKE "Protected%" THEN "Yes"
ELSE "No"
END AS protected_yes_no
FROM track AS t
INNER JOIN media_type AS m ON t.media_type_id = m.media_type_id
LEFT JOIN invoice_line AS il ON t.track_id = il.track_id
GROUP BY t.track_id;
Done.
[]
Preview of the view created above.
%%sql
SELECT *
FROM track_sales_with_media_type
LIMIT 5;
Done.
track_id | dollar_sales | protected_yes_no |
---|---|---|
1 | 7.920000000000001 | No |
2 | 2.9699999999999998 | Yes |
3 | 0.99 | Yes |
4 | 3.96 | Yes |
5 | 0.99 | Yes |
With SQL, summarize tracks sold by file protection mode (protected or not protected media type)
%%sql
SELECT protected_yes_no,
ROUND(SUM(dollar_sales), 2) AS total_sales,
ROUND(SUM(dollar_sales) / (SELECT SUM(dollar_sales)
FROM track_sales_with_media_type) * 100, 1)
AS share_of_sales
FROM track_sales_with_media_type
GROUP BY protected_yes_no;
Done.
protected_yes_no | total_sales | share_of_sales |
---|---|---|
No | 4271.85 | 90.7 |
Yes | 437.58 | 9.3 |
Protected files account for only 9.3% of total sales.
These are our recommendations based on the analysis performed: