SQL Project 1: Practise of analyzing Business question

In this practise, we will go through the Chinook store - a house of high quality songs and media video holder - in term as a member of Data Analyst team, and answer some question about the current business situation of Chinook.

The database we're using for this project is chinook.db, a large databse contain a lot of infomation about Chinook's business.

In [2]:
%%capture
%load_ext sql
%sql sqlite:///chinook.db
%sql
In [3]:
%sql
 * sqlite:///chinook.db
Out[3]:
'Connected: @chinook.db'

First, let's review the relative schema on each table to get a real taste about the database we will work with: Image

In [4]:
%%sql
SELECT
    name,
    type
FROM sqlite_master
WHERE type IN ("table","view");
 * sqlite:///chinook.db
Done.
Out[4]:
name type
album table
artist table
customer table
employee table
genre table
invoice table
invoice_line table
media_type table
playlist table
playlist_track table
track table
employee_hiredate view
invoice_with_time view

Because there are some odd object (the two strange modified view by other operations) => We will delete these view to get the original database.

In [5]:
%%sql
DROP VIEW IF EXISTS customer_get_over_90_dollar;
DROP VIEW IF EXISTS customer_USA;
 * sqlite:///chinook.db
Done.
Done.
Out[5]:
[]

Let's do a query to check the database again:

In [6]:
%%sql
SELECT
    name,
    type
FROM sqlite_master
WHERE type IN ("table","view");
 * sqlite:///chinook.db
Done.
Out[6]:
name type
album table
artist table
customer table
employee table
genre table
invoice table
invoice_line table
media_type table
playlist table
playlist_track table
track table
employee_hiredate view
invoice_with_time view

The database have back to its original statement, let's do some explore data to get familiar with database:

In [7]:
%%sql
SELECT *
FROM track
LIMIT 5;
 * sqlite:///chinook.db
Done.
Out[7]:
track_id name album_id media_type_id genre_id composer milliseconds bytes unit_price
1 For Those About To Rock (We Salute You) 1 1 1 Angus Young, Malcolm Young, Brian Johnson 343719 11170334 0.99
2 Balls to the Wall 2 2 1 None 342562 5510424 0.99
3 Fast As a Shark 3 2 1 F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman 230619 3990994 0.99
4 Restless and Wild 3 2 1 F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman 252051 4331779 0.99
5 Princess of the Dawn 3 2 1 Deaffy & R.A. Smith-Diesel 375418 6290521 0.99
In [8]:
%%sql
SELECT *
FROM playlist;
 * sqlite:///chinook.db
Done.
Out[8]:
playlist_id name
1 Music
2 Movies
3 TV Shows
4 Audiobooks
5 90’s Music
6 Audiobooks
7 Movies
8 Music
9 Music Videos
10 TV Shows
11 Brazilian Music
12 Classical
13 Classical 101 - Deep Cuts
14 Classical 101 - Next Steps
15 Classical 101 - The Basics
16 Grunge
17 Heavy Metal Classic
18 On-The-Go 1
In [9]:
%%sql
SELECT *
FROM customer
LIMIT 5;
 * sqlite:///chinook.db
Done.
Out[9]:
customer_id first_name last_name company address city state country postal_code phone fax email support_rep_id
1 Luís Gonçalves Embraer - Empresa Brasileira de Aeronáutica S.A. Av. Brigadeiro Faria Lima, 2170 São José dos Campos SP Brazil 12227-000 +55 (12) 3923-5555 +55 (12) 3923-5566 [email protected] 3
2 Leonie Köhler None Theodor-Heuss-Straße 34 Stuttgart None Germany 70174 +49 0711 2842222 None [email protected] 5
3 François Tremblay None 1498 rue Bélanger Montréal QC Canada H2G 1A7 +1 (514) 721-4711 None [email protected] 3
4 Bjørn Hansen None Ullevålsveien 14 Oslo None Norway 0171 +47 22 44 22 22 None [email protected] 4
5 František Wichterlová JetBrains s.r.o. Klanova 9/506 Prague None Czech Republic 14700 +420 2 4172 5555 +420 2 4172 5555 [email protected] 4

Task 1: Find which genres sell the most tracks in the USA

Scenario: The Chinook record store has just signed a deal with a new record label, and we've been tasked with selecting the first three albums that will be added to the store, from a list of four (All these artist have not yet in the store):

Artist Name Genre
Regal Hip-Hop
Red Tone Punk
Meteor and the Girls Pop
Slim Jim Bites Blues

The record label specializes in artists from the USA, and they have given Chinook some money to advertise the new albums in the USA, so we're interested in finding out which genres sell the best in the USA.

In [10]:
%%sql
WITH customer_info_USA AS
(SELECT 
    c.customer_id as customer_id,
    il.invoice_id as invoice_id, il.track_id as track_id,
    count(il.quantity) as number_order
    
FROM invoice i
LEFT JOIN customer c ON i.customer_id = c.customer_id
LEFT JOIN invoice_line il ON i.invoice_id = il.invoice_id
WHERE c.country = 'USA'
GROUP BY c.customer_id),

most_buy AS 
(SELECT 
    customer_id, track_id,
    max(number_order) as most_order
    
FROM customer_info_USA
GROUP BY customer_id)

SELECT 
    g.name as genre_name,
    most_order,  
    round((cast(most_order as float)/cast(
        (SELECT count(il.quantity) as total_international
        FROM invoice i
    LEFT JOIN customer c ON i.customer_id = c.customer_id
    LEFT JOIN invoice_line il ON i.invoice_id = il.invoice_id
    GROUP BY c.country
    HAVING c.country = 'USA') as float))*100,2) as sold_ratio
FROM most_buy mb
INNER JOIN track t ON mb.track_id = t.track_id
INNER JOIN genre g ON t.genre_id = g.genre_id
GROUP BY g.name
ORDER BY most_order DESC
 * sqlite:///chinook.db
Done.
Out[10]:
genre_name most_order sold_ratio
Blues 93 8.85
R&B/Soul 87 8.28
Alternative & Punk 77 7.33
Rock 75 7.14
Metal 67 6.37

Follow the most popular to the less popular, we have priority: `Blues` > `R&B / Soul` > `Alternative & Punk` >`Rock`> `Metal`

With this result, the first three (or less) album need to be add in our store will be:

Artist Name Genre
Slim Jim Bites Blues
Red Tone Punk

Task 2: Analyzing Employee Sales Performance

Scenario: Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase. We want to performing an analyze the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than the others.

First, let's review again about the invoice table, and see what their effective to sale agent's KPI.

In [11]:
%%sql
SELECT *
FROM invoice
LIMIT 5;
 * sqlite:///chinook.db
Done.
Out[11]:
invoice_id customer_id invoice_date billing_address billing_city billing_state billing_country billing_postal_code total
1 18 2017-01-03 00:00:00 627 Broadway New York NY USA 10012-2612 15.84
2 30 2017-01-03 00:00:00 230 Elgin Street Ottawa ON Canada K2P 1L7 9.9
3 40 2017-01-05 00:00:00 8, Rue Hanovre Paris None France 75002 1.98
4 18 2017-01-06 00:00:00 627 Broadway New York NY USA 10012-2612 7.92
5 27 2017-01-07 00:00:00 1033 N Park Ave Tucson AZ USA 85719 16.83

We will start by find the number of customer each Sales Support Agent is assigned, follow with the total cost that each Sales Support Agent gain for our store.

In [12]:
%%sql
WITH employee_info AS
(SELECT *
FROM employee
WHERE title = 'Sales Support Agent'),

total_customer AS
(SELECT count(c.customer_id) as tot_cus
FROM employee_info ei
INNER JOIN customer c ON ei.employee_id = c.support_rep_id
GROUP BY ei.employee_id)

SELECT 
    ei.first_name||' '||ei.last_name as employee_name,
    ei.title as position,
    (SELECT tot_cus FROM total_customer) as number_of_cus,
    sum(iwt.total) as total_cost_gain
FROM invoice_with_time iwt
LEFT JOIN customer c ON iwt.customer_id = c.customer_id
INNER JOIN employee_info ei ON c.support_rep_id = ei.employee_id
GROUP BY c.support_rep_id
ORDER BY employee_name ASC;
 * sqlite:///chinook.db
Done.
Out[12]:
employee_name position number_of_cus total_cost_gain
Jane Peacock Sales Support Agent 21 1731.510000000004
Margaret Park Sales Support Agent 21 1584.0000000000032
Steve Johnson Sales Support Agent 21 1393.9200000000028

To get more taste about the current performance of three Sales Support Agent, we will add to the invoice table two new columns: month as invoice date by month, and year, track the invoice by year.

In [13]:
# %%sql
# DROP VIEW IF EXISTS invoice_with_time;

# CREATE VIEW invoice_with_time AS
# SELECT 
#     *,
#     strftime('%m', invoice_date) as month,
#     strftime('%Y', invoice_date) as year
# FROM invoice
In [14]:
# %%sql
# DROP VIEW IF EXISTS employee_hiredate;

# CREATE VIEW employee_hiredate AS
# SELECT 
#     *,
#     strftime('%Y', hire_date) as year
# FROM employee
In [15]:
%%sql
WITH employee_info AS
(SELECT *
FROM employee
WHERE title = 'Sales Support Agent'),

customer_cat_17 AS
(SELECT 
    c.support_rep_id,
    sum(iwt.total) as total_17
FROM invoice_with_time iwt
LEFT JOIN customer c ON iwt.customer_id = c.customer_id
WHERE year = '2017'
GROUP BY c.support_rep_id),

customer_cat_18 AS
(SELECT 
    c.support_rep_id,
    sum(iwt.total) as total_18
FROM invoice_with_time iwt
LEFT JOIN customer c ON iwt.customer_id = c.customer_id
WHERE year = '2018'
GROUP BY c.support_rep_id),

customer_cat_19 AS
(SELECT 
    c.support_rep_id,
    sum(iwt.total) as total_19
FROM invoice_with_time iwt
LEFT JOIN customer c ON iwt.customer_id = c.customer_id
WHERE year = '2019'
GROUP BY c.support_rep_id),

customer_cat_20 AS
(SELECT 
    c.support_rep_id,
    sum(iwt.total) as total_20
FROM invoice_with_time iwt
LEFT JOIN customer c ON iwt.customer_id = c.customer_id
WHERE year = '2020'
GROUP BY c.support_rep_id)

SELECT 
    ei.first_name||' '||ei.last_name as employee_name,
    ei.hire_date as start_date,
    ei.title as position, 
    c17.total_17 as total_cost_gain_17,
    c18.total_18 as total_cost_gain_18,
    c19.total_19 as total_cost_gain_19,
    c20.total_20 as total_cost_gain_20
FROM employee_info ei
INNER JOIN customer_cat_17 c17 ON ei.employee_id = c17.support_rep_id
INNER JOIN customer_cat_18 c18 ON ei.employee_id = c18.support_rep_id
INNER JOIN customer_cat_19 c19 ON ei.employee_id = c19.support_rep_id
INNER JOIN customer_cat_20 c20 ON ei.employee_id = c20.support_rep_id
GROUP BY c20.support_rep_id
 * sqlite:///chinook.db
Done.
Out[15]:
employee_name start_date position total_cost_gain_17 total_cost_gain_18 total_cost_gain_19 total_cost_gain_20
Jane Peacock 2017-04-01 00:00:00 Sales Support Agent 532.6200000000001 413.82000000000005 383.13000000000005 401.93999999999994
Margaret Park 2017-05-03 00:00:00 Sales Support Agent 416.78999999999996 400.95000000000005 400.94999999999993 365.31
Steve Johnson 2017-10-17 00:00:00 Sales Support Agent 252.44999999999993 332.64 437.58000000000015 371.25000000000006
In [16]:
%%sql
WITH employee_info_1 AS
(SELECT *
FROM employee_hiredate
WHERE title = 'Sales Support Agent'),

num_cus_17 AS
(SELECT ei1.employee_id, count(c.customer_id) as number_customer_17
FROM employee_info_1 ei1
INNER JOIN customer c ON ei1.employee_id = c.support_rep_id
WHERE year = '2017'
GROUP BY ei1.employee_id),

num_cus_18 AS
(SELECT ei1.employee_id, count(c.customer_id) as number_customer_18
FROM employee_info_1 ei1 
INNER JOIN customer c ON ei1.employee_id = c.support_rep_id
WHERE year = '2018'
GROUP BY ei1.employee_id),

num_cus_19 AS
(SELECT ei1.employee_id, count(c.customer_id) as number_customer_19
FROM employee_info_1 ei1 
INNER JOIN customer c ON ei1.employee_id = c.support_rep_id
WHERE year = '2019'
GROUP BY ei1.employee_id),

num_cus_20 AS
(SELECT ei1.employee_id, count(c.customer_id) as number_customer_20
FROM employee_info_1 ei1 
INNER JOIN customer c ON ei1.employee_id = c.support_rep_id
WHERE year = '2020'
GROUP BY ei1.employee_id
)

SELECT 
    ei1.first_name||' '||ei1.last_name as employee_name,
    ei1.hire_date as start_date,
    ei1.title as position,
    (SELECT number_customer_17 FROM num_cus_17) as total_customer_17,
    (SELECT number_customer_18 FROM num_cus_18) as total_customer_18,
    (SELECT number_customer_19 FROM num_cus_19) as total_customer_19,
    (SELECT number_customer_20 FROM num_cus_20) as total_customer_20
FROM employee_info_1 ei1
GROUP BY ei1.employee_id
 * sqlite:///chinook.db
Done.
Out[16]:
employee_name start_date position total_customer_17 total_customer_18 total_customer_19 total_customer_20
Jane Peacock 2017-04-01 00:00:00 Sales Support Agent 21 None None None
Margaret Park 2017-05-03 00:00:00 Sales Support Agent 21 None None None
Steve Johnson 2017-10-17 00:00:00 Sales Support Agent 21 None None None

Follow the result, we can see the performance of each Sales Support Agent isn't on line by the time:

  • Three sales support agent was assigned for the equal amount of customer from the first, there is no relative with the join date in company is lastest or earliest: 21 customer
  • The best sales support agent was Jane Peacock because him/her gained cost back for store is highest from the first (532.62 USD), even next years him/her performance has a little down but it's not a matter with the overall performance.
  • The worst sales support agent was Steve Johnson because with 21 customers he/she have to support, in 2017 him/her back gained cost just 252.45 USD, even though he/she was efforting to increase the gained cost but in overall, this is the worst sales support employee.

Task 3: Analyzing Sales by Country

Scenario: The next task is monitoring the sales condition for each customer by each different countries, we will use only the country data in customer table (because for the other table they could be missing any value)

The expected output result:

  • total number of customers
  • total value of sales
  • average value of sales per customer
  • average order value

Because there are a number of countries with only one customer, we will force it to be Other group, like the demo-result below. The reuslt will be more complex information and will be shown below the demo result.

In [17]:
%%sql
WITH semi_result_1 AS
(SELECT 
    country, count(customer_id) as total_customer,
    CASE
        WHEN count(customer_id) = 1 THEN 'Other'
        ELSE country
    END AS country_grp
FROM customer
GROUP BY country)

SELECT country_grp, total_customer
FROM (
        SELECT *,
            CASE
                WHEN country_grp = 'Other' THEN 1
                ELSE 0
            END AS sort
        FROM semi_result_1)
GROUP BY country_grp
ORDER BY sort;
 * sqlite:///chinook.db
Done.
Out[17]:
country_grp total_customer
Brazil 5
Canada 8
Czech Republic 2
France 5
Germany 4
India 2
Portugal 2
USA 13
United Kingdom 3
Other 1
In [18]:
%%sql
WITH total_cost AS
(SELECT 
    c.country,
    sum(i.total) as total_value    
FROM invoice i
INNER JOIN customer c ON i.customer_id = c.customer_id
GROUP BY country),

customer_info AS
(SELECT country, count(customer_id) as total_customer
FROM customer
GROUP BY country),

semi_combine AS
(SELECT 
    ci.country,
    ci.total_customer,
    ts.total_value,
    ts.total_value/ci.total_customer as ave_per_customer,
    CASE
        WHEN total_customer = 1 THEN 'Other'
        ELSE ci.country
    END AS country_grp
FROM customer_info ci
INNER JOIN total_cost ts ON ci.country = ts.country
GROUP BY ci.country)

SELECT country, country_grp, total_customer, total_value, ave_per_customer
FROM (
        SELECT *,
            CASE 
                WHEN country_grp = 'Other' THEN 1
                ELSE 0
            END AS sort
        FROM semi_combine)
ORDER BY sort ASC, total_value DESC
 * sqlite:///chinook.db
Done.
Out[18]:
country country_grp total_customer total_value ave_per_customer
USA USA 13 1040.4899999999998 80.0376923076923
Canada Canada 8 535.5900000000001 66.94875000000002
Brazil Brazil 5 427.68000000000006 85.53600000000002
France France 5 389.0699999999999 77.81399999999998
Germany Germany 4 334.62 83.655
Czech Republic Czech Republic 2 273.24000000000007 136.62000000000003
United Kingdom United Kingdom 3 245.52 81.84
Portugal Portugal 2 185.13000000000002 92.56500000000001
India India 2 183.14999999999998 91.57499999999999
Ireland Other 1 114.83999999999997 114.83999999999997
Spain Other 1 98.01 98.01
Chile Other 1 97.02000000000001 97.02000000000001
Australia Other 1 81.18 81.18
Finland Other 1 79.2 79.2
Hungary Other 1 78.21 78.21
Poland Other 1 76.22999999999999 76.22999999999999
Sweden Other 1 75.24 75.24
Norway Other 1 72.27000000000001 72.27000000000001
Austria Other 1 69.3 69.3
Netherlands Other 1 65.34 65.34
Belgium Other 1 60.38999999999999 60.38999999999999
Italy Other 1 50.49 50.49
Argentina Other 1 39.6 39.6
Denmark Other 1 37.61999999999999 37.61999999999999

Look at the result, we can see that:

  • Our source of cost is from USA with total 13 customers and gained 1040 USD. Because we got 13 customers, so that, each approximately 80 USD from each customer/ each order could be a house of gold for us.
  • The 2nd is Canada with 8 customers, and each customer cost 67 USD
  • We have Other group, because in this group is all countries with just 1 customers but their cost per order isn't small, it's quite good but we'd like to inivte more customers on these country to get some more revenue back.

Task 4: Album vs Individual Tracks

Scenario: The Chinook store is setup in a way that allows customer to make purchases in one of the two ways:

  • purchase a whole album
  • purchase a collection of one or more individual tracks.

The store does not let customers purchase a whole album, and then add individual tracks to that same purchase (unless they do that by choosing each track manually). When customers purchase albums they are charged the same price as if they had purchased each of those tracks separately.

Management are currently considering changing their purchasing strategy to save money. The strategy they are considering is to purchase only the most popular tracks from each album from record companies, instead of purchasing every track from an album.

We have been asked to find out what percentage of purchases are individual tracks vs whole albums, so that management can use this data to understand the effect this decision might have on overall revenue.

In order to answer the question, we're going to have to identify whether each invoice has all the tracks from an album. We can do this by getting the list of tracks from an invoice and comparing it to the list of tracks from an album. We can find the album to compare the purchase to by looking up the album that one of the purchased tracks belongs to. It doesn't matter which track we pick, since if it's an album purchase, that album will be the same for all tracks.

First, we will look through some records in track table to see what field we got:

In [19]:
%%sql
SELECT *
FROM track
LIMIT 5;
 * sqlite:///chinook.db
Done.
Out[19]:
track_id name album_id media_type_id genre_id composer milliseconds bytes unit_price
1 For Those About To Rock (We Salute You) 1 1 1 Angus Young, Malcolm Young, Brian Johnson 343719 11170334 0.99
2 Balls to the Wall 2 2 1 None 342562 5510424 0.99
3 Fast As a Shark 3 2 1 F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman 230619 3990994 0.99
4 Restless and Wild 3 2 1 F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman 252051 4331779 0.99
5 Princess of the Dawn 3 2 1 Deaffy & R.A. Smith-Diesel 375418 6290521 0.99
In [20]:
%%sql
SELECT *
FROM album
LIMIT 5;
 * sqlite:///chinook.db
Done.
Out[20]:
album_id title artist_id
1 For Those About To Rock We Salute You 1
2 Balls to the Wall 2
3 Restless and Wild 2
4 Let There Be Rock 1
5 Big Ones 3

In this task, we temporary ignore about artist_id, and we can see that:

  • Some of track_id is represented for the wholde album, ex: track_id : 1 is represent for album For Those About...
  • Some of track_id likely separated song in album, ex: track_id :3, 4 ,5 is represent for album Restless and Wild

We'll get back to the invoice_line table again, to determine what to do first:

In [21]:
%%sql
SELECT *
FROM invoice_line
LIMIT 5;
 * sqlite:///chinook.db
Done.
Out[21]:
invoice_line_id invoice_id track_id unit_price quantity
1 1 1158 0.99 1
2 1 1159 0.99 1
3 1 1160 0.99 1
4 1 1161 0.99 1
5 1 1162 0.99 1

We will do the following:

    1. Count the possible track on each album (original track number on each album, where it's not effect by invoice)
    1. Count the number of track is occured for each invoice occured

We're interesting in whether each invoice is order full album or get more number of track more than a whole album can contain, so instead of get each track's name from each invoice and compare it to each album for each invoice contain, we will directly compare the number of track occured on each invoice to the possible track in a whole album.

We're all know that customers can't be order a whole album AND some of external track outside AT THE SAME TIME; even the edge case is customers may decide to manually select every track from an album, and then add a few individual tracks from other albums to their purchase <= this edge case could be happend BUT follow the separated analysis of company, this case is so rarely happend, we can consider that this edge case can be ignore => We can perform this analysis by our step above.

In [22]:
%%sql
WITH check_list AS
(SELECT 
    a.album_id,
    count(t.track_id) as item_in_album
FROM album a
INNER JOIN track t ON a.album_id = t.album_id
GROUP BY a.album_id),

from_invoice AS
(SELECT 
    il.invoice_id,
    count(t.track_id) as item_in_invoice,
    a.album_id as album_code   
FROM invoice_line il
LEFT JOIN track t ON il.track_id = t.track_id
LEFT JOIN album a ON t.album_id = a.album_id
GROUP BY il.invoice_id),

evaluate_1 AS
(SELECT 
     ci.album_id,
     ci.item_in_album
FROM check_list ci
INNER JOIN from_invoice fi ON ci.album_id = fi.album_code
GROUP BY ci.album_id),

evaluate_2 AS
(SELECT
     fi.album_code,
     fi.invoice_id,
     fi.item_in_invoice
FROM from_invoice fi
INNER JOIN check_list ci ON fi.album_code = ci.album_id
GROUP BY fi.album_code),

semi_result AS
(SELECT 
    e2.invoice_id,
    CASE
        WHEN e2.item_in_invoice = e1.item_in_album THEN 'Yes'
        ELSE 'No'
    END AS full_album_or_not
FROM evaluate_1 e1
INNER JOIN evaluate_2 e2 ON e1.album_id = e2.album_code
GROUP BY e2.album_code),

result_1 AS 
(SELECT 
    COUNT(full_album_or_not) as full_album
FROM semi_result
WHERE full_album_or_not = 'Yes'
)

SELECT 
    (SELECT * FROM result_1) as full_album, 
    (SELECT 
    COUNT(full_album_or_not) as full_album
FROM semi_result
WHERE full_album_or_not = 'No') as other,
    CAST((SELECT * FROM result_1) as float)/CAST(COUNT(full_album_or_not) as float)*100 as percentage_f,
    100 - (CAST((SELECT * FROM result_1) as float)/CAST(COUNT(full_album_or_not) as float)*100) as percentage_o
FROM semi_result
 * sqlite:///chinook.db
Done.
Out[22]:
full_album other percentage_f percentage_o
32 136 19.047619047619047 80.95238095238095

In the result, we can see that ALMOST invoice is order for less or more track song than a whole album (81%) and 20% the rest of purchased the full album. That's mean in 20% purchased the full album we can get the state cost, and for the rest, we can get so much more or a little less cost back => A suggest that the company can keep the purchased method like the current, and focus on the other things to improve, not by change the purchase method