In this project, we will use SQL to answer business questions on the Chinook company database. Chinook is an online music store that sells music in a variety of formats. This database contains information about Chinook's sales, employees, and customers.
We will answer the following questions:
%%capture %load_ext sql %sql sqlite:///chinook.db
%%sql SELECT name, type FROM sqlite_master WHERE type in ('table', 'VIEW');
Chinook's database contains 11 tables, as listed above.
%%sql SELECT * FROM invoice_line LIMIT 5;
The Chinook record store has just signed a deal with a new record label, and you've been tasked with selecting the first three albums that will be added to the store, from a list of four. All four albums are by artists that don't have any tracks in the store right now - we have the artist names, and the genre of music they produce. The record label specializes in artists from the USA, and they have given Chinook some money to advertise the new albums in the USA, so we're interested in finding out which genres sell the best in the USA.
%%sql WITH usa AS ( SELECT * FROM customer c INNER JOIN invoice i ON i.customer_id = c.customer_id INNER JOIN invoice_line il ON il.invoice_id = i.invoice_id WHERE country = 'USA' ) SELECT g.name AS genre, COUNT(usa.invoice_line_id) AS total_tracks, CAST(COUNT(usa.invoice_line_id) as float) / (SELECT COUNT(*) FROM usa) * 100 percentage FROM usa INNER JOIN track t ON t.track_id = usa.track_id INNER JOIN genre g ON g.genre_id = t.genre_id GROUP BY 1 ORDER BY 2 DESC;
|Alternative & Punk||130||12.369172216936251|
The table above shows Chinook's best selling genres in the US. The genres of the potential new artists are Hip Hop, Punk, Pop, and Blues. Since this data shows that Punk, Blues, and Pop are the 3 best selling of these 4 genres, Chinook may want to purchase the albums from Red Tone (Punk), Meteor and the Girls (Pop), and Slim Jim Bites (Blues). It is worth noting that there was very little difference in the popularity of Pop vs. Hip Hop (.15%), and each of these genres made up about 2% of total sales. Together, these 4 genres made up only about 20% of total US sales. However, Rock is by far the best selling genre, accounting for 53% of sales alone, so if a Rock album became available in the future Chinook may want to consider purchasing it.
%%sql WITH total_per_agent AS ( SELECT c.customer_id, c.support_rep_id, SUM(i.total) agent_total FROM customer c LEFT JOIN invoice i ON i.customer_id = c.customer_id GROUP BY 2 ) SELECT e.first_name || ' ' || e.last_name AS employee_name, ROUND(t.agent_total, 2) AS total_sales, e.title AS employee_role, e.hire_date AS hire_date FROM employee e LEFT JOIN total_per_agent t ON t.support_rep_id = e.employee_id ORDER BY 2 DESC;
|Jane Peacock||1731.51||Sales Support Agent||2017-04-01 00:00:00|
|Margaret Park||1584.0||Sales Support Agent||2017-05-03 00:00:00|
|Steve Johnson||1393.92||Sales Support Agent||2017-10-17 00:00:00|
|Andrew Adams||None||General Manager||2016-08-14 00:00:00|
|Nancy Edwards||None||Sales Manager||2016-05-01 00:00:00|
|Michael Mitchell||None||IT Manager||2016-10-17 00:00:00|
|Robert King||None||IT Staff||2017-01-02 00:00:00|
|Laura Callahan||None||IT Staff||2017-03-04 00:00:00|
Of Chinook's 8 employees, 3 are Sales Support Agents. The difference in sales between the employees may be a result of their hire date - the employee with the earliest hire date has the most sales, and the employee with the most recent hire date has the least sales, with a total difference of about 20%.
%%sql WITH country_or_other AS ( SELECT CASE WHEN (SELECT COUNT(*) FROM customer WHERE country = c.country) = 1 THEN 'Other' ELSE c.country END AS country, c.customer_id, il.* FROM invoice_line il INNER JOIN invoice i ON i.invoice_id = il.invoice_id INNER JOIN customer c on c.customer_id = i.customer_id ) SELECT country, num_cust, total_sales, avg_total_sales_per_cust, avg_order_val FROM (SELECT country, COUNT(DISTINCT customer_id) num_cust, ROUND(SUM(unit_price), 2) total_sales, ROUND(CAST(SUM(unit_price) as float) / CAST (COUNT(DISTINCT customer_id) as float), 2) avg_total_sales_per_cust, ROUND(CAST(SUM(unit_price) as float) / CAST(COUNT(DISTINCT invoice_id) as float), 2) avg_order_val, CASE WHEN country = 'Other' THEN 1 ELSE 0 END AS sort FROM country_or_other GROUP BY 1 ORDER BY sort, 5 DESC);
The largest portion of Chinook's customers and total sales is concentrated in the North America region, specifically the USA and Canada. Since this market accounts for a large volume of sales, it may be important to consider their purchasing patterns when looking at new artists or albums.
However, other specific countries may offer good business opportunites. The Czech Republic, Portugal, and India have the highest average lifetime values per customer. The Czech Republic, UK, India, and Germany have the highest average order value. These observations are based on limited data, as each of these countries represents only 2-4 total customers.
%%sql WITH invoice_info AS ( SELECT il.invoice_id, t.album_id, COUNT(il.track_id) num_tracks FROM invoice_line il LEFT JOIN track t on t.track_id = il.track_id GROUP BY invoice_id, album_id ), album_info AS ( SELECT album_id, COUNT(track_id) num_tracks FROM track GROUP BY 1 ) SELECT CASE WHEN (SELECT num_tracks FROM album_info WHERE album_id = iv.album_id) = iv.num_tracks AND al.num_tracks > 2 THEN 'Album' ELSE 'Not album' END AS album_purchase, COUNT(DISTINCT invoice_id) num_invoices, ROUND(CAST(COUNT(DISTINCT invoice_id) as float) / (SELECT COUNT(DISTINCT invoice_id) FROM invoice_info) * 100, 2) percent_invoices FROM invoice_info iv LEFT JOIN album_info al on al.album_id = iv.album_id GROUP BY album_purchase LIMIT 20;
Purchases of full albums account for about 20% of total sales. Based on this, I would recommend to continue selling full albums, as this accounts for 1/5 of total purchases. Removing whole albums might signficantly cut revene as a result.
%%sql SELECT a.name artist_name, COUNT(a.name) num_tracks, g.name genre FROM artist a LEFT JOIN album al on al.artist_id = a.artist_id LEFT JOIN track t on t.album_id = al.album_id LEFT JOIN playlist_track pt on pt.track_id = t.track_id LEFT JOIN genre g on g.genre_id = t.genre_id GROUP BY artist_name ORDER BY 2 DESC LIMIT 10;
|Faith No More||145||Alternative & Punk|
%%sql SELECT a.name artist_name, COUNT(DISTINCT pt.playlist_id) num_playlists, g.name genre, COUNT(pt.track_id) num_tracks FROM artist a LEFT JOIN album al on al.artist_id = a.artist_id LEFT JOIN track t on t.album_id = al.album_id LEFT JOIN playlist_track pt on pt.track_id = t.track_id LEFT JOIN genre g on g.genre_id = t.genre_id GROUP BY artist_name ORDER BY 2 DESC LIMIT 10;
|Academy of St. Martin in the Fields & Sir Neville Marriner||6||Classical||10|
|Berliner Philharmoniker & Herbert Von Karajan||6||Classical||13|
|English Concert & Trevor Pinnock||6||Classical||9|
|The King's Singers||6||Classical||10|
|Academy of St. Martin in the Fields, Sir Neville Marriner & Thurston Dart||5||Classical||5|
|Adrian Leaper & Doreen de Feis||5||Classical||5|
|Alberto Turco & Nova Schola Gregoriana||5||Classical||5|
|Antal Doráti & London Symphony Orchestra||5||Classical||5|
|Barry Wordsworth & BBC Concert Orchestra||5||Classical||5|
%%sql SELECT pl.playlist_id, pl.name playlist_name, ar.name artist, COUNT(pt.track_id) num_tracks FROM playlist pl LEFT JOIN playlist_track pt on pt.playlist_id = pl.playlist_id LEFT JOIN track t on t.track_id = pt.track_id LEFT JOIN album a on a.album_id = t.album_id LEFT JOIN artist ar on ar.artist_id = a.artist_id WHERE ar.name = 'Iron Maiden' GROUP BY pt.playlist_id;
|5||90’s Music||Iron Maiden||84|
|17||Heavy Metal Classic||Iron Maiden||6|
%%sql SELECT pl.playlist_id, pl.name, COUNT(pt.track_id) FROM playlist pl LEFT JOIN playlist_track pt on pt.playlist_id = pl.playlist_id GROUP BY pl.playlist_id
|13||Classical 101 - Deep Cuts||25|
|14||Classical 101 - Next Steps||25|
|15||Classical 101 - The Basics||25|
|17||Heavy Metal Classic||26|
We can measure the most popular artists in playlists in different ways. First, we can look at which artist's tracks appear most frequently. We can also consider the number of playlists an artist appears in.
The most frequently appearing artist in playlists is Iron Maiden, whose tracks appeared 516 times. They are followed by U2, Metallica, Led Zeppelin, and Deep Purple. Of the top 10 most popular playlist artists, 4 are Rock and 2 are Metal; the other 6 are from a range of different genres. However, if we look at all the playlists in the Chinook database, we see that there are only 14 playlists with tracks added to them; their sizes range from 1 track to over 3,000 tracks. As a result, those large numbers of tracks may only be coming from several specific playlists. For example, Iron Maiden appears only on 4 playlists, including 213 tracks on two of those playlists. This may respresent big Iron Maiden fans who are adding all available Iron Maiden tracks to their playlist.
If we measure artist popularity in playlists by the number of playlists an artist appears on, we get very different results. The top 10 artists who appeared in the largest number of playlists were all classical music, appearing on 5-7 playlists, with 5-15 tracks each. Four playlists have "Classical" in their title. There may be demand for general classical music playlists, as opposed to fans of a specific classical artist.
%%sql WITH no_purchased AS ( SELECT track_id, COUNT(track_id) no_tracks FROM invoice_line GROUP BY 1 ) SELECT (SELECT COUNT(*) FROM no_purchased) tracks_purchased, (SELECT COUNT(*) FROM track) - (SELECT COUNT(*) FROM no_purchased) as tracks_not_purchased, ROUND(CAST((SELECT COUNT(*) FROM no_purchased) as float) / CAST((SELECT COUNT(*) FROM track) as float) * 100, 2) percent_purchased, ROUND(CAST((SELECT COUNT(*) FROM track) - (SELECT COUNT(*) FROM no_purchased) as float) / CAST((SELECT COUNT(*) FROM track) as float) * 100, 2) percent_not_purchased FROM track t LEFT JOIN no_purchased np on np.track_id = t.track_id GROUP BY 1;
%%sql SELECT COUNT(*) FROM track
Out of 3503 tracks available in Chinook's store, 1806 have been purchased, and 1697 have not been. Almost half of the tracks available have yet to be purchased by any customer. Chinook may want to remove some of these tracks from its inventory.
%%sql SELECT t.name track_name, ar.name artist, COUNT(il.track_id) number_sold, g.name genre FROM track t LEFT JOIN invoice_line il ON il.track_id = t.track_id LEFT JOIN album al on al.album_id = t.album_id LEFT JOIN artist ar on ar.artist_id = al.artist_id LEFT JOIN genre g on g.genre_id = t.genre_id GROUP BY 1 ORDER BY 3 DESC LIMIT 10
|Are You Experienced?||Jimi Hendrix||14||Rock|
|Highway Chile||Jimi Hendrix||14||Rock|
|Hey Joe||Jimi Hendrix||13||Rock|
|Put The Finger On You||AC/DC||13||Rock|
|Third Stone From The Sun||Jimi Hendrix||13||Rock|
|Love Or Confusion||Jimi Hendrix||12||Rock|
|Radio/Video||System Of A Down||12||Metal|
The best selling track is War Pigs by Cake, with 33 having been sold. The rest of the ten best selling tracks sold between 12 and 14 copies. 5 of the 10 best selling tracks are by Jimi Hendrix. 7 out of 10 are Rock, 2 are Metal, and 1 is Alternative. Next, we will look at the best selling artists and albums.
%%sql CREATE VIEW track_info AS SELECT il.track_id, t.name track, ar.name artist, al.title album, g.name genre, t.media_type_id FROM invoice_line il LEFT JOIN track t on t.track_id = il.track_id LEFT JOIN album al on al.album_id = t.album_id LEFT JOIN artist ar on ar.artist_id = al.artist_id LEFT JOIN genre g on g.genre_id = t.genre_id
(sqlite3.OperationalError) table track_info already exists [SQL: CREATE VIEW track_info AS SELECT il.track_id, t.name track, ar.name artist, al.title album, g.name genre, t.media_type_id FROM invoice_line il LEFT JOIN track t on t.track_id = il.track_id LEFT JOIN album al on al.album_id = t.album_id LEFT JOIN artist ar on ar.artist_id = al.artist_id LEFT JOIN genre g on g.genre_id = t.genre_id] (Background on this error at: http://sqlalche.me/e/e3q8)
%%sql SELECT artist, COUNT(track_id) tracks_sold, ROUND(CAST(COUNT(track_id) as float) / (SELECT COUNT(track_id) FROM track_info) * 100, 2) percentage_total_sales, genre FROM track_info GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
|Red Hot Chili Peppers||130||2.73||Rock|
|Guns N' Roses||124||2.61||Metal|
|The Rolling Stones||117||2.46||Rock|
The table above shows the top 10 best selling artists at Chinook. 3 of these artists also have top-10 selling tracks at Chinook: Jimi Hendrix, Nirvana, and AC/DC. Each of these top-10 artists represents 2-4% of total track purchases at the store. 8 out of 10 of these artists are from the Rock genre; the remaining 2 are Metal. Chinook may want to add more tracks by these top artists to the store, as a number of customers are fans of these artists and would likely by more of their music.
%%sql SELECT album, artist, COUNT(track_id) tracks_sold, ROUND(CAST(COUNT(track_id) as float) / (SELECT COUNT(track_id) FROM track_info) * 100, 2) percentage_total_sales, genre FROM track_info GROUP BY 1 ORDER BY 3 DESC LIMIT 10
|Are You Experienced?||Jimi Hendrix||187||3.93||Rock|
|Mezmerize||System Of A Down||93||1.96||Metal|
|Get Born||JET||90||1.89||Alternative & Punk|
|The Doors||The Doors||83||1.74||Rock|
|Greatest Hits I||Queen||80||1.68||Rock|
|The Police Greatest Hits||The Police||80||1.68||Rock|
|From The Muddy Banks Of The Wishkah [live]||Nirvana||78||1.64||Rock|
|My Generation - The Very Best Of The Who||The Who||76||1.6||Rock|
The table above shows the top selling albums, measured by total number of tracks sold from the album.
Once again, Jimi Hendrix is at the top of the list. He is Chinook's overall best selling artist, so opportunities to sell more of his music will likely be good for the business.
Rock, followed by Metal, once again dominate this list. As the top songs, albums, and artist are Rock musicians, followed by Metal musicians, Chinook may want to consider offering more music from these genres.
%%sql SELECT g.name genre, COUNT(DISTINCT t.track_id) tracks_available, ROUND(CAST(COUNT(DISTINCT t.track_id) as float) / CAST((SELECT COUNT(*) FROM track) as float) * 100, 2) percentage_tracks_avail, COUNT(DISTINCT il.track_id) track_has_sold, ROUND(CAST(COUNT(DISTINCT il.track_id) as float) / CAST(COUNT(DISTINCT t.track_id) as float) * 100, 2) as percent_has_sold, COUNT(il.track_id) number_sold, ROUND(CAST(COUNT(il.track_id) as float) / CAST((SELECT COUNT(track_id) FROM invoice_line) as float) * 100, 2) percentage_total_sales FROM track t LEFT JOIN genre g ON g.genre_id = t.genre_id LEFT JOIN invoice_line il ON il.track_id = t.track_id GROUP BY g.name ORDER BY 6 DESC
|Alternative & Punk||332||9.48||176||53.01||492||10.34|
|Rock And Roll||12||0.34||0||0.0||0||0.0|
|Sci Fi & Fantasy||26||0.74||0||0.0||0||0.0|
In general, the range of tracks available for sale was reflective of that genre's popularity.
For example, the 3 top selling genres had a larger variety of tracks available than other genres
One noteable exception to this is the Latin genre, which represented only 3% of total sales but had 579 tracks available (16% of total tracks available for purchase). Moreover, only 20% of Latin tracks available at Chinook have ever been sold. As a result, Chinook may want to consider dropping some of these tracks from its collection.
There are 7 genres that have not sold any tracks. 4 additional genres account for less than .5% of sales each. Chinook may want to consider eliminating these offerings as well.
There are 4 genres where Chinook may want to experiment with expanding its offerings: Easy Listening, Electronic/Dance, R&B/Soul, and Alternative. Each of these genres currently accounts for a small percentage of total offerings (.69-1.74%) and sales (1-3% each). However, these genres are selling a disproportionately high number and percentage of their tracks as compared to other offerings in the store. Chinook could experiment with a small expansion of offerings and/or advertising in these genres.
%%sql WITH media_info AS ( SELECT COUNT(DISTINCT track_id) num_tracks, t.media_type_id, mt.name media_type, t.track_id FROM track t LEFT JOIN media_type mt on mt.media_type_id = t.media_type_id GROUP BY 2 ) SELECT mi.media_type media_type, COUNT(ti.track_id) tracks_sold, ROUND(CAST(COUNT(ti.track_id) as float) / (SELECT COUNT(*) FROM track_info) * 100, 2) percentage_total_sold, ROUND(CAST(mi.num_tracks as float) / (SELECT COUNT(*) FROM track) * 100, 2) percentage_available_tracks, CASE WHEN mi.media_type LIKE '%Protected%' THEN 'Protected' ELSE 'Non-Protected' END AS protected FROM track_info ti LEFT JOIN media_info mi on ti.media_type_id = mi.media_type_id LEFT JOIN track t on t.track_id = ti.track_id GROUP BY 1 ORDER BY 2 DESC
|MPEG audio file||4259||89.53||86.61||Non-Protected|
|Protected AAC audio file||439||9.23||6.77||Protected|
|Purchased AAC audio file||35||0.74||0.2||Non-Protected|
|AAC audio file||21||0.44||0.31||Non-Protected|
|Protected MPEG-4 video file||3||0.06||6.11||Protected|
Out of the 5 media types available, MPEG is by far the most popular, representing 89% of tracks sold and 86% of tracks available for purchase. This file format is non-protected. About 90% of tracks sold were non-protected, compared with about 87% of tracks available for purchase. Looking at this data, it is unclear WHY this format is by far the most popular. It could be because MPEG is the most common file type for sale at Chinook or because customers specifically like MPEG files. Since the other non-protected file formats made up only a tiny portion of Chinook's sales, customers may not care if a file is protected or not when purchasing music.
In this project, we analyzed sales data from the Chinook music store's database, using SQL queries.
I gained a lot of experience writing and tweaking SQL queries. I felt my fluency with SQL improving as a moved through this project, especially as I was able to ask and answer some of my own questions on this data. I learned how to construct more complex queries peice by peice, as an iterative process. I am now more confident in understanding exactly what a query will produce, and better able to tell if it is producing what I intended vs. an unreasonable response. I am beginning to understand how and when to use WITH, VIEW and CASE better, including to loop through a dataset looking for matches to a condition. Next, I'd like to continue to gain more fluency with SQL, especially with writing more complex queries and WITH, VIEW and CASE. I'd also like to gain confidence using SQL with a larger dataset.