Introduction to SQL¶

Lecture Notes and in-class exercises¶

✨ Structured Query Language (SQL)¶

How do you pronounce SQL?¶

Although some people read SQL as es-q-el, it is most often read SQL as sequel.

Conceptual Overview¶

SQL stands for Structured Query Language. The three most important concepts are embedded in the name itself.

Structured: The data is structured. You work with fixed fields.
Query: You request information using a text query.
Language: Guess what, SQL is another programming language!

If Python was your first programming language, you can now brag about your second programming language. 😎

❓ Why use SQL? Can't we use Python (Pandas)? 🐼¶

Using Pandas can be easier in many cases unless you're familiar with SQL. Many data science professionals will agree that Pandas is a more human-friendly method (discussion here). You can perform the most common operations in SQL with Pandas (comparison with SQL here).

Kevin

Huh? When are you supposed to use SQL then? 🤨 You should use SQL when you have a large source of data on a remote location. Pandas works great... until you run out of memory (RAM). When you read data from a source, Pandas loads the entire data into memory by default. Pandas will fail if your data is too large.

Instead of loading all data into memory first, you can query a database system and retrieve only the information you need. This is almost always faster and more efficient.

⚡️ Database Management Systems (DBMS)¶

A database is a collection of data. A database management system is a software used to let users communicate with a database. There are many types of DBMS software. We will only cover relational database management systems (RDBMS). Many popular RDBMS exist, such as:

PostgreSQL
Oracle Database
MySQL
MariaDB (fork of MySQL)
SQLite

We will use SQLite, since it is the easiest one to set up.

❓ Why use databases? Can't we use CSV files?¶

This discussion is somewhat similar to why we use SQL instead of Pandas. Relational database management systems have existed since the 1970s. They are highly optimized even when your data size is humongous. Of course, the performance is dependent on the query you write even if you use an RDBMS.

Boardroom Meeting

📗 CRUD Operations Exercise using a graphical tool¶

CRUD acronym

Go to https://sqliteonline.com/ and try out these SQL queries.

CREATE: Create employees table.

CREATE TABLE IF NOT EXISTS employees (
  emp_id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  department TEXT,
  salary REAL
);

UPDATE: Add employees information.

INSERT INTO employees (name, department, salary) VALUES ('Mia', 'HR', 140000.0);
INSERT INTO employees (name, department, salary) VALUES ('Michael', 'Marketing', 200000.0);
INSERT INTO employees (name, department, salary) VALUES ('Sean', 'HR', 250000.0);
INSERT INTO employees (name, department, salary) VALUES ('Dylan', 'Marketing', 205000.0);

READ: Select emp_id and name fields from employees table.

SELECT emp_id, name
FROM employees;

READ: Select emp_id, name, and department fields of HR employees from employees table.

SELECT emp_id, name, department
FROM employees
WHERE department = "HR";

READ: Select emp_id, name, and salary fields of Marketing employees sorted by salary (highest salary first) from employees table.

SELECT emp_id, name, salary
FROM employees
WHERE department = "Marketing"
ORDER BY salary DESC;

READ: Select all rows and columns from employees table.

SELECT * FROM employees;

DELETE: Delete all HR employees information

DELETE FROM employees WHERE department = "HR";

📘 CRUD Operations using Jupyter notebook¶

▶️ First, run the code cell below to import modules used for 🧭 Check Your Work sections and the autograder.

In [1]:

import unittest
import base64
tc = unittest.TestCase()

👇 Tasks¶

✔️ Import the following Python packages.
1. pandas: Use alias pd.
2. numpy: Use alias np.
3. sqlite3: No alias

In [2]:

# YOUR CODE BEGINS
import pandas as pd
import numpy as np
import sqlite3
# YOUR CODE ENDS

🧭 Check your work¶

In [3]:

import sys
tc.assertTrue('pd' in globals(), 'Check whether you have correctly imported Pandas with an alias.')
tc.assertTrue('np' in globals(), 'Check whether you have correctly imported NumPy with an alias.')
tc.assertTrue('sqlite3' in globals(), 'Check whether you have correctly imported the sqlite3 package.')

🎯 Create a database file and a table¶

▶️ Run the code below to create a new SQLite database file named airbnb.db and create a table named listings.

In [4]:

# Connect to airbnb.db database file
# If the file does not exist, create a new file
conn = sqlite3.connect('airbnb.db')

# Create a cursor
# A cursor enables users of a DBMS to traverse through the result set
# without having to retrieve all results at once
c = conn.cursor()

# Drop (delete) listings table if it already exists
c.execute('DROP TABLE IF EXISTS listings')
conn.commit()

# Define a query
# Triple quotes (''') denote multiline strings
create_table_query = '''
CREATE TABLE IF NOT EXISTS listings (
    neighbourhood TEXT,
    room_type TEXT,
    bathrooms REAL,
    bedrooms REAL,
    price REAL,
    number_of_reviews INTEGER,
    review_score REAL,
    is_superhost INT
)
'''

c.execute(create_table_query)
conn.commit()

conn.close()

🎯 Populate a database table from a CSV file¶

▶️ Run the code below to create a DataFrame named df_listings.

In [5]:

# DO NOT CHANGE THE CODE IN THIS CELL
df_listings = pd.read_csv('https://github.com/bdi475/datasets/raw/main/chicago-airbnb-listings-small.csv')

display(df_listings.head(5))

	neighbourhood	room_type	bathrooms	bedrooms	price	number_of_reviews	review_score	is_superhost
0	Hyde Park	Private room	1.0	1.0	65.0	181	100.0	1
1	South Lawndale	Entire home/apt	1.0	3.0	117.0	395	96.0	1
2	West Town	Entire home/apt	1.0	3.0	70.0	389	93.0	1
3	Uptown	Private room	1.5	1.0	110.0	250	100.0	1
4	Near North Side	Private room	1.0	1.0	75.0	500	94.0	1

▶️ Run the code below to populate the listings table. All data in df_listings will be inserted to the table.

In [6]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')
c = conn.cursor()

tables = list(pd.read_sql_query('SELECT * FROM sqlite_master WHERE type="table";', con=conn)['tbl_name'])
if 'listings' in tables:
    c.execute(f'DELETE FROM listings')
    conn.commit()
    
df_listings.to_sql(name='listings', index=False, con=conn, if_exists='append')

conn.close()

🧭 Check your work¶

In [7]:

# DO NOT CHANGE THE CODE IN THIS CELL
import glob
import os
user_db_files = glob.glob('*.db')

# Check if all files exist in the current directory
tc.assertTrue('airbnb.db' in user_db_files, f'Check if airbnb.db exists in {os.getcwd()}')

conn_checker = sqlite3.connect('airbnb.db')
table_to_check = 'listings'

# Check if table exists
user_tables = list(pd.read_sql_query('SELECT * FROM sqlite_master WHERE type="table";', con=conn_checker)['tbl_name'])
tc.assertTrue(table_to_check in user_tables, f'{table_to_check} does not exist in your airbnb.db file!')

conn_checker.close()

🎯 Exercise 1: Select all rows and columns¶

👇 Tasks¶

✔️ Write a query that selects all columns for all rows.
✔️ Store your query in a new variable named query_select_all.

In [8]:

# YOUR CODE BEGINS
query_select_all = '''
SELECT *
FROM listings
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb.db')

df_result = pd.read_sql_query(query_select_all, con=conn)
display(df_result)

conn.close()

	neighbourhood	room_type	bathrooms	bedrooms	price	number_of_reviews	review_score	is_superhost
0	Hyde Park	Private room	1.0	1.0	65.0	181	100.0	1
1	South Lawndale	Entire home/apt	1.0	3.0	117.0	395	96.0	1
2	West Town	Entire home/apt	1.0	3.0	70.0	389	93.0	1
3	Uptown	Private room	1.5	1.0	110.0	250	100.0	1
4	Near North Side	Private room	1.0	1.0	75.0	500	94.0	1
...	...	...	...	...	...	...	...	...
821	Near North Side	Private room	1.0	1.0	97.0	140	91.0	0
822	Edgewater	Entire home/apt	1.0	1.0	104.0	101	92.0	0
823	Lake View	Entire home/apt	1.0	1.0	106.0	132	99.0	1
824	Near North Side	Private room	1.0	1.0	114.0	329	91.0	0
825	Near North Side	Private room	1.0	1.0	127.0	173	92.0	0

826 rows × 8 columns

🧭 Check your work¶

In [9]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')

decoded_query = base64.b64decode(b'ClNFTEVDVCAqIEZST00gbGlzdGluZ3MK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)

tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')

conn.close()

🎯 Exercise 2: Select only three columns for all rows¶

👇 Tasks¶

✔️ Write a query that only the following 3 columns from the listings table:
1. bathrooms
2. bedrooms
3. price
✔️ Store your query in a new variable named query_select_three_columns.

In [10]:

# YOUR CODE BEGINS
query_select_three_columns = '''
SELECT bathrooms, bedrooms, price
FROM listings
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb.db')

df_result = pd.read_sql_query(query_select_three_columns, con=conn)
display(df_result)

conn.close()

	bathrooms	bedrooms	price
0	1.0	1.0	65.0
1	1.0	3.0	117.0
2	1.0	3.0	70.0
3	1.5	1.0	110.0
4	1.0	1.0	75.0
...	...	...	...
821	1.0	1.0	97.0
822	1.0	1.0	104.0
823	1.0	1.0	106.0
824	1.0	1.0	114.0
825	1.0	1.0	127.0

826 rows × 3 columns

🧭 Check your work¶

In [11]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')

decoded_query = base64.b64decode(b'ClNFTEVDVCBiYXRocm9vbXMsIGJlZHJvb21zLCBwcmljZSBGUk9NIGxpc3RpbmdzCg==').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)

tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')

conn.close()

🎯 Exercise 3: Select all columns of expensive listings¶

👇 Tasks¶

✔️ Write a query that selects all columns of listings with a daily price greater than $400 in the listings table:
✔️ Store your query in a new variable named query_expensive_listings.

In [12]:

# YOUR CODE BEGINS
query_expensive_listings = '''
SELECT *
FROM listings
WHERE price > 400
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb.db')

df_result = pd.read_sql_query(query_expensive_listings, con=conn)
display(df_result)

conn.close()

	neighbourhood	room_type	bathrooms	bedrooms	price	number_of_reviews	review_score	is_superhost
0	Uptown	Entire home/apt	2.5	5.0	875.0	124	97.0	0
1	Near North Side	Entire home/apt	1.0	2.0	500.0	117	94.0	1

🧭 Check your work¶

In [13]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')

decoded_query = base64.b64decode(b'ClNFTEVDVCAqCkZST00gTElTVElOR1MKV0hFUkUgcHJpY2UgPiA0MDAK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)

tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')

conn.close()

🎯 Exercise 4: Select all columns of West Town listings¶

👇 Tasks¶

✔️ Write a query that selects all columns of listings in the "West Town" neighbourhood.
- WHERE neighbourhood = 'West Town'
✔️ Store your query in a new variable named query_west_town_listings.

In [14]:

# YOUR CODE BEGINS
query_west_town_listings = '''
SELECT *
FROM listings
WHERE neighbourhood = 'West Town'
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb.db')

df_result = pd.read_sql_query(query_west_town_listings, con=conn)
display(df_result)

conn.close()

	neighbourhood	room_type	bathrooms	bedrooms	price	number_of_reviews	review_score	is_superhost
0	West Town	Entire home/apt	1.0	3.0	70.0	389	93.0	1
1	West Town	Entire home/apt	3.0	3.0	168.0	218	97.0	1
2	West Town	Entire home/apt	1.0	2.0	75.0	189	91.0	0
3	West Town	Private room	2.0	1.0	57.0	127	89.0	0
4	West Town	Private room	2.0	1.0	78.0	143	90.0	0
...	...	...	...	...	...	...	...	...
133	West Town	Private room	1.0	1.0	55.0	126	100.0	1
134	West Town	Entire home/apt	1.0	1.0	95.0	125	98.0	1
135	West Town	Entire home/apt	1.0	2.0	68.0	112	93.0	0
136	West Town	Entire home/apt	1.0	2.0	58.0	140	99.0	1
137	West Town	Entire home/apt	1.0	3.0	64.0	134	98.0	1

138 rows × 8 columns

🧭 Check your work¶

In [15]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')

decoded_query = base64.b64decode(b'ClNFTEVDVCAqCkZST00gTElTVElOR1MKV0hFUkUgbmVpZ2hib3VyaG9vZCA9ICdXZXN0IFRvd24nCg==').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)

tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')

conn.close()

🎯 Exercise 5: Sort all listings by price¶

👇 Tasks¶

✔️ Write a query that selects all listings and sorts the result by price in ascending order.
✔️ Store your query in a new variable named query_sort_by_price.

In [16]:

# YOUR CODE BEGINS
query_sort_by_price = '''
SELECT *
FROM listings
ORDER BY price
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb.db')

df_result = pd.read_sql_query(query_sort_by_price, con=conn)
display(df_result)

conn.close()

	neighbourhood	room_type	bathrooms	bedrooms	price	number_of_reviews	review_score	is_superhost
0	West Ridge	Shared room	2.0	1.0	13.0	153	95.0	0
1	North Lawndale	Shared room	1.0	1.0	14.0	118	91.0	0
2	West Ridge	Shared room	2.0	1.0	16.0	158	92.0	0
3	West Ridge	Shared room	2.0	1.0	16.0	124	95.0	0
4	Lower West Side	Private room	2.0	1.0	19.0	112	94.0	1
...	...	...	...	...	...	...	...	...
821	Near North Side	Entire home/apt	2.0	2.0	344.0	128	94.0	1
822	West Town	Entire home/apt	3.0	3.0	379.0	210	98.0	0
823	Lake View	Entire home/apt	2.0	5.0	400.0	113	99.0	1
824	Near North Side	Entire home/apt	1.0	2.0	500.0	117	94.0	1
825	Uptown	Entire home/apt	2.5	5.0	875.0	124	97.0	0

826 rows × 8 columns

🧭 Check your work¶

In [17]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')

decoded_query = base64.b64decode(b'ClNFTEVDVCAqCkZST00gTElTVElOR1MKT1JERVIgQlkgcHJpY2UK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)

tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')

conn.close()

🎯 Exercise 6: Select only the price column and sort in descending order¶

👇 Tasks¶

✔️ Write a query that selects only the price column and sorts the result in descending order.
✔️ Store your query in a new variable named query_sort_by_price_desc.

In [18]:

# YOUR CODE BEGINS
query_sort_by_price_desc = '''
SELECT price
FROM listings
ORDER BY price DESC
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb.db')

df_result = pd.read_sql_query(query_sort_by_price_desc, con=conn)
display(df_result)

conn.close()

	price
0	875.0
1	500.0
2	400.0
3	379.0
4	344.0
...	...
821	19.0
822	16.0
823	16.0
824	14.0
825	13.0

826 rows × 1 columns

🧭 Check your work¶

In [19]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')

decoded_query = base64.b64decode(b'ClNFTEVDVCBwcmljZQpGUk9NIExJU1RJTkdTCk9SREVSIEJZIHByaWNlIERFU0MK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)

tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
pd.testing.assert_series_equal(df_result['price'].reset_index(drop=True),
                               df_check['price'].reset_index(drop=True))

conn.close()

🎯 Exercise 7: A more complicated query¶

👇 Tasks¶

✔️ Write a query that:
- Selects room_type, bathrooms, bedrooms, price, and number_of_reviews columns (in the same order).
- Where price is less than 100 AND the number of reviews is greather than 500.
- Sorts by number_of_reviews in descending order.
✔️ Store your query in a new variable named query_final.

In [20]:

# YOUR CODE BEGINS
query_final = '''
SELECT room_type, bathrooms, bedrooms, price, number_of_reviews
FROM listings
WHERE (price < 100) AND (number_of_reviews > 500)
ORDER BY number_of_reviews DESC
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb.db')

df_result = pd.read_sql_query(query_final, con=conn)
display(df_result)

conn.close()

	room_type	bathrooms	bedrooms	price	number_of_reviews
0	Private room	1.0	1.0	80.0	658
1	Entire home/apt	1.0	1.0	39.0	642
2	Entire home/apt	1.0	1.0	85.0	625
3	Hotel room	1.0	1.0	74.0	607
4	Entire home/apt	1.0	1.0	75.0	543
5	Entire home/apt	1.0	2.0	77.0	515
6	Entire home/apt	1.0	1.0	65.0	504

🧭 Check your work¶

In [21]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')

decoded_query = base64.b64decode(b'ClNFTEVDVCByb29tX3R5cGUsIGJhdGhyb29tcywgYmVkcm9vbXMsIHByaWNlLCBudW1iZXJfb2ZfcmV2aWV3cwpGUk9NIExJU1RJTkdTCldIRVJFIChwcmljZSA8IDEwMCkgQU5EIChudW1iZXJfb2ZfcmV2aWV3cyA+IDUwMCkKT1JERVIgQlkgbnVtYmVyX29mX3Jldmlld3MgREVTQwo=').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)

tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
pd.testing.assert_frame_equal(df_result.reset_index(drop=True),
                               df_check.reset_index(drop=True))

conn.close()

✨ Aggregating Data¶

Like Pandas, you can aggregate data across multiple columns or tables.

📌 Create a second database file and a table¶

▶️ Run the code below to create a new SQLite database file named airbnb.db and create a table named listings.

In [22]:

# Connect to airbnb.db database file
# If the file does not exist, create a new file
conn = sqlite3.connect('airbnb-boston-medium.db')

# Create a cursor
# A cursor enables users of a DBMS to traverse through the result set
# without having to retrieve all results at once
c = conn.cursor()

# Drop (delete) listings table if it already exists
c.execute('DROP TABLE IF EXISTS listings')
conn.commit()

# Define a query
# Triple quotes (''') denote multiline strings
create_table_query = '''
CREATE TABLE IF NOT EXISTS listings (
    name TEXT,
    neighbourhood TEXT,
    room_type TEXT,
    bedrooms REAL,
    bathrooms REAL,
    price REAL,
    minimum_nights INT,
    availability_365 INT,
    number_of_reviews INTEGER,
    review_score REAL,
    is_superhost INT
)
'''

c.execute(create_table_query)
conn.commit()

conn.close()

🧭 Check your work¶

In [23]:

# DO NOT CHANGE THE CODE BELOW
import glob
import os
user_db_files = glob.glob('*.db')

# Check if airbnb.db file exists in current working directory
tc.assertTrue('airbnb-boston-medium.db' in user_db_files, f'Check if airbnb-boston-medium.db exists in {os.getcwd()}')

📌 Populate a database table from a CSV file¶

▶️ Run the code below to create a DataFrame named df_listings.

In [24]:

df_listings = pd.read_csv('https://github.com/bdi475/datasets/raw/main/boston-airbnb-listings-small.csv')
df_listings_backup = df_listings.copy()

display(df_listings.head(5))

	name	neighbourhood	room_type	bedrooms	bathrooms	minimum_nights	price	availability_365	number_of_reviews	review_score	is_superhost
0	HARBORSIDE-Walk to subway	East Boston	Entire home/apt	1.0	1.0	28	150.0	123	17	99.0	0
1	$49 Special Private! Minutes to center!	Roxbury	Entire home/apt	1.0	1.0	3	145.0	0	107	95.0	1
2	$99 Special!! Home Away! Condo	Roxbury	Entire home/apt	1.0	1.0	3	169.0	0	115	96.0	1
3	Bright 1bed facing Golden Dome	Downtown	Entire home/apt	1.0	1.0	91	81.0	281	32	96.0	1
4	The Dorset Redline \| 3BR 1BA \| Walk to Redline...	Dorchester	Entire home/apt	3.0	1.0	32	129.0	103	52	86.0	0

▶️ Run the code below to populate the listings table. All data in df_listings will be inserted to the table.

In [25]:

conn = sqlite3.connect('airbnb-boston-medium.db')
c = conn.cursor()

tables = list(pd.read_sql_query('SELECT * FROM sqlite_master WHERE type="table";', con=conn)['tbl_name'])

if 'listings' in tables:
    c.execute(f'DELETE FROM listings')
    conn.commit()
    
df_listings.to_sql(name='listings', index=False, con=conn, if_exists='append')

conn.close()

🧭 Check your work¶

In [26]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn_checker = sqlite3.connect('airbnb-boston-medium.db')
table_to_check = 'listings'

# Check if table exists
user_tables = list(pd.read_sql_query('SELECT * FROM sqlite_master WHERE type="table";', con=conn_checker)['tbl_name'])
tc.assertTrue(table_to_check in user_tables, f'{table_to_check} does not exist in your airbnb-boston-medium.db file!')

conn_checker.close()

▶️ Run the code below to display df_listings.

In [27]:

display(df_listings)

	name	neighbourhood	room_type	bedrooms	bathrooms	minimum_nights	price	availability_365	number_of_reviews	review_score	is_superhost
0	HARBORSIDE-Walk to subway	East Boston	Entire home/apt	1.0	1.0	28	150.0	123	17	99.0	0
1	$49 Special Private! Minutes to center!	Roxbury	Entire home/apt	1.0	1.0	3	145.0	0	107	95.0	1
2	$99 Special!! Home Away! Condo	Roxbury	Entire home/apt	1.0	1.0	3	169.0	0	115	96.0	1
3	Bright 1bed facing Golden Dome	Downtown	Entire home/apt	1.0	1.0	91	81.0	281	32	96.0	1
4	The Dorset Redline \| 3BR 1BA \| Walk to Redline...	Dorchester	Entire home/apt	3.0	1.0	32	129.0	103	52	86.0	0
...	...	...	...	...	...	...	...	...	...	...	...
1334	Quiet and Sunny Top Floor Studio! South End LOVE!	South End	Entire home/apt	1.0	1.0	1	92.0	328	14	77.0	0
1335	FOUND Boston Common - Standard Full Room	Bay Village	Private room	1.0	1.0	1	100.0	253	53	90.0	0
1336	FOUND Boston Common - Queen Room	Downtown	Private room	1.0	1.0	1	114.0	254	49	89.0	0
1337	FOUND Boston Common - Double Queen Room	Downtown	Private room	1.0	1.0	1	143.0	252	43	93.0	0
1338	FOUND Boston Common - Single Room	Downtown	Private room	1.0	1.0	1	87.0	254	26	88.0	0

1339 rows × 11 columns

🎯 Exercise 8: Select all columns and rows from the `listings` table (SQL)¶

👇 Tasks¶

✔️ Write a query that selects all columns for all rows.
✔️ Store your query to a new variable named query_select_all.

In [28]:

# YOUR CODE BEGINS
query_select_all = '''
SELECT *
FROM listings
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb-boston-medium.db')
df_result = pd.read_sql_query(query_select_all, con=conn)
display(df_result)
conn.close()

	name	neighbourhood	room_type	bedrooms	bathrooms	price	minimum_nights	availability_365	number_of_reviews	review_score	is_superhost
0	HARBORSIDE-Walk to subway	East Boston	Entire home/apt	1.0	1.0	150.0	28	123	17	99.0	0
1	$49 Special Private! Minutes to center!	Roxbury	Entire home/apt	1.0	1.0	145.0	3	0	107	95.0	1
2	$99 Special!! Home Away! Condo	Roxbury	Entire home/apt	1.0	1.0	169.0	3	0	115	96.0	1
3	Bright 1bed facing Golden Dome	Downtown	Entire home/apt	1.0	1.0	81.0	91	281	32	96.0	1
4	The Dorset Redline \| 3BR 1BA \| Walk to Redline...	Dorchester	Entire home/apt	3.0	1.0	129.0	32	103	52	86.0	0
...	...	...	...	...	...	...	...	...	...	...	...
1334	Quiet and Sunny Top Floor Studio! South End LOVE!	South End	Entire home/apt	1.0	1.0	92.0	1	328	14	77.0	0
1335	FOUND Boston Common - Standard Full Room	Bay Village	Private room	1.0	1.0	100.0	1	253	53	90.0	0
1336	FOUND Boston Common - Queen Room	Downtown	Private room	1.0	1.0	114.0	1	254	49	89.0	0
1337	FOUND Boston Common - Double Queen Room	Downtown	Private room	1.0	1.0	143.0	1	252	43	93.0	0
1338	FOUND Boston Common - Single Room	Downtown	Private room	1.0	1.0	87.0	1	254	26	88.0	0

1339 rows × 11 columns

🧭 Check your work¶

In [29]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb-boston-medium.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCAqCkZST00gbGlzdGluZ3MK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
conn.close()

🎯 Exercise 9A: Find expensive listings (Pandas)¶

👇 Tasks¶

✔️ Using df_listings, find all rows with a price of over $1,000 (df_listings['price'] > 1000).
✔️ Store the result to a new variable named df_expensive_listings.

🔑 Expected Output¶

Your index column may contain different values.

	name	neighbourhood	room_type	bedrooms	bathrooms	minimum_nights	price	availability_365	number_of_reviews	review_score	is_superhost
502	Large 4BDR near Harvard with Parking	Allston	Entire home/apt	4	2.5	1	1250	0	136	98	0
669	The Historic House in the North End/Waterfront	North End	Entire home/apt	3	2	91	1052	0	104	98	1

In [30]:

# YOUR CODE BEGINS
df_expensive_listings = df_listings[df_listings['price'] > 1000]
# YOUR CODE ENDS

display(df_expensive_listings)

	name	neighbourhood	room_type	bedrooms	bathrooms	minimum_nights	price	availability_365	number_of_reviews	review_score	is_superhost
502	Large 4BDR near Harvard with Parking	Allston	Entire home/apt	4.0	2.5	1	1250.0	0	136	98.0	0
669	The Historic House in the North End/Waterfront	North End	Entire home/apt	3.0	2.0	91	1052.0	0	104	98.0	1

🧭 Check your work¶

In [31]:

# DO NOT CHANGE THE CODE IN THIS CELL
df_check = df_listings_backup.query(f"{'PrIcE'.lower()} > {25 * 40}")

pd.testing.assert_frame_equal(df_expensive_listings.sort_values(df_expensive_listings.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

🎯 Exercise 9B: Find expensive listings (SQL)¶

👇 Tasks¶

✔️ Using the listings table, select rows with a price of over $1,000 (price > 1000).
✔️ Select all columns.
✔️ Store your query to a new variable named query_expensive_listings.

In [32]:

# YOUR CODE BEGINS
query_expensive_listings = '''
SELECT *
FROM listings
WHERE price > 1000
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb-boston-medium.db')
df_result = pd.read_sql_query(query_expensive_listings, con=conn)
display(df_result)
conn.close()

	name	neighbourhood	room_type	bedrooms	bathrooms	price	minimum_nights	availability_365	number_of_reviews	review_score	is_superhost
0	Large 4BDR near Harvard with Parking	Allston	Entire home/apt	4.0	2.5	1250.0	1	0	136	98.0	0
1	The Historic House in the North End/Waterfront	North End	Entire home/apt	3.0	2.0	1052.0	91	0	104	98.0	1

🧭 Check your work¶

In [33]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb-boston-medium.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCAqCkZST00gbGlzdGluZ3MKV0hFUkUgcHJpY2UgPiAxMDAwCg==').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
pd.testing.assert_frame_equal(df_result.sort_values(df_result.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))
conn.close()

🎯 Exercise 10A: Find large listings (Pandas)¶

👇 Tasks¶

✔️ Using df_listings, find all rows where the listing has:
- 5 or more bedrooms
- AND 3 or more bathrooms
✔️ Select only the following 5 columns (in the same order):
- name, bedrooms, bathrooms, price, review_score
✔️ Store the result to a new variable named df_large_listings.

🔑 Expected Output¶

Your index column may contain different values.

	name	bedrooms	bathrooms	price	review_score
151	Beautiful Philadelphia house near Harvard U. w/ pk	5	3	600	82
250	Spacious 4 BR	2.5 BA Single-Family Home.	5	3	373	97
921	★Large Retreat 5BR w/3BA★ Close to Everything	5	3	738	95

In [34]:

# YOUR CODE BEGINS
df_large_listings = df_listings[(df_listings['bedrooms'] >= 5) & (df_listings['bathrooms'] >= 3)]
df_large_listings = df_large_listings[['name', 'bedrooms', 'bathrooms', 'price', 'review_score']]
# YOUR CODE ENDS

display(df_large_listings)

	name	bedrooms	bathrooms	price	review_score
151	Beautiful Philadelphia house near Harvard U. w...	5.0	3.0	600.0	82.0
250	Spacious 4 BR \| 2.5 BA Single-Family Home.	5.0	3.0	373.0	97.0
921	★Large Retreat 5BR w/3BA★ Close to Everything	5.0	3.0	738.0	95.0

🧭 Check your work¶

In [35]:

# DO NOT CHANGE THE CODE IN THIS CELL
__r = 'oo'.join(['R', 'Ms']).lower()
df_check = df_listings.query(f'bed{__r} > {2 ** 2} & bath{__r} > {2 ** 1}') \
    [['_'.join(['review', 'score']), 'price', 'bath' + __r, 'bed' + __r, 'name'][::-1]]

pd.testing.assert_frame_equal(df_large_listings.sort_values(df_large_listings.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

🎯 Exercise 10B: Find large listings (SQL)¶

👇 Tasks¶

✔️ Using the listings table, find all rows where the listing has:
- 5 or more bedrooms
- AND 3 or more bathrooms
✔️ Select only the following 5 columns (in the same order):
- name, bedrooms, bathrooms, price, review_score
✔️ Store your query to a new variable named query_large_listings.

In [36]:

# YOUR CODE BEGINS
query_large_listings = '''
SELECT name, bedrooms, bathrooms, price, review_score
FROM listings
WHERE (bedrooms >= 5) AND (bathrooms >= 3)
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb-boston-medium.db')
df_result = pd.read_sql_query(query_large_listings, con=conn)
display(df_result)
conn.close()

	name	bedrooms	bathrooms	price	review_score
0	Beautiful Philadelphia house near Harvard U. w...	5.0	3.0	600.0	82.0
1	Spacious 4 BR \| 2.5 BA Single-Family Home.	5.0	3.0	373.0	97.0
2	★Large Retreat 5BR w/3BA★ Close to Everything	5.0	3.0	738.0	95.0

🧭 Check your work¶

In [37]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb-boston-medium.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCBuYW1lLCBiZWRyb29tcywgYmF0aHJvb21zLCBwcmljZSwgcmV2aWV3X3Njb3JlCkZST00gbGlzdGluZ3MKV0hFUkUgKGJlZHJvb21zID49IDUpIEFORCAoYmF0aHJvb21zID49IDMpCg=='
).decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.columns.tolist(), df_check.columns.tolist(), 'Incorrect set of columns or order')
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
pd.testing.assert_frame_equal(df_result.sort_values(df_result.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))
conn.close()

🎯 Exercise 11A: Number of listings (Pandas)¶

👇 Tasks¶

✔️ Find the number of rows in df_listings.
✔️ Store the result to a new variable named num_listings.
✔️ num_listings should be an integer type.

In [38]:

# YOUR CODE BEGINS
num_listings = df_listings.shape[0]
# YOUR CODE ENDS

print(f'There are {num_listings} AirBnB listings in the dataset.')

There are 1339 AirBnB listings in the dataset.

🧭 Check your work¶

In [39]:

# DO NOT CHANGE THE CODE IN THIS CELL
tc.assertEqual(num_listings, len(df_listings_backup.index), f'Incorrect number of listings - should be {len(df_listings_backup.index)}')

🎯 Exercise 11B: Number of listings (SQL)¶

👇 Tasks¶

✔️ Write a query that counts the number of rows in the listings table.
✔️ Store your query in a new variable named query_num_listings.

In [40]:

# YOUR CODE BEGINS
query_num_listings = '''
SELECT COUNT(*)
FROM listings
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb-boston-medium.db')
df_result = pd.read_sql_query(query_num_listings, con=conn)
display(df_result)
conn.close()

	COUNT(*)
0	1339

🧭 Check your work¶

In [41]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb-boston-medium.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCBDT1VOVCgqKQpGUk9NIGxpc3RpbmdzCg==').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.iloc[0, 0], df_check.iloc[0, 0], 'Incorrect number of listings')
conn.close()

🎯 Exercise 12A: Get the number of listings by neighbourhood (Pandas)¶

👇 Tasks¶

✔️ Using df_listings, find the number of listings by neighbourhood.
✔️ Store the result to a new variable named df_by_neighbourhood.
✔️ df_by_neighbourhood should have the following two columns.
1. neighbourhood: Name of the neighbourhood
2. num_listings: Number of listings in a neighbourhood
✔️ print(df_by_neighbourhood.columns.tolist()) should print ['neighbourhood', 'num_listings'].
✔️ Sort df_by_neighbourhood by num_listings in descending order.

🔑 Expected Output¶

Your index column may contain different values.

	neighbourhood	num_listings
7	Dorchester	215
12	Jamaica Plain	129
19	Roxbury	107
9	East Boston	90
22	South End	86

In [42]:

# YOUR CODE BEGINS
df_by_neighbourhood = df_listings.groupby('neighbourhood', as_index=False).agg({
    'name': 'count'
})

df_by_neighbourhood.rename(columns={'name': 'num_listings'}, inplace=True)
df_by_neighbourhood.sort_values('num_listings', ascending=False, inplace=True)
# YOUR CODE ENDS

df_by_neighbourhood.head(5)

Out[42]:

	neighbourhood	num_listings
7	Dorchester	215
12	Jamaica Plain	129
19	Roxbury	107
9	East Boston	90
22	South End	86

🧭 Check your work¶

In [43]:

# DO NOT CHANGE THE CODE IN THIS CELL
df_check = df_listings_backup.pivot_table(index=['neigh' + 'BoUr'.lower() + 'HooD'.lower()],
                                          values=['NaMe'.lower()],
                                          aggfunc=('CoU' + 'Nt').lower()) \
                                .rename(columns={'NaME'.lower(): '_'.join(['NuM', 'LiSTinGs']).lower()}) \
                                .reset_index().sort_values('_'.join(['NuM', 'LiSTinGs']).lower(), ascending=bool(0))

pd.testing.assert_series_equal(df_by_neighbourhood['num_listings'].reset_index(drop=True),
                               df_check['num_listings'].reset_index(drop=True))

🚀 Solution¶

df_by_neighbourhood = df_listings.groupby('neighbourhood', as_index=False).agg({
    'name': 'count'
})

df_by_neighbourhood.rename(columns={'name': 'num_listings'}, inplace=True)
df_by_neighbourhood.sort_values('num_listings', ascending=False, inplace=True)

🎯 Exercise 12B: Get the number of listings by neighbourhood (SQL)¶

👇 Tasks¶

✔️ Write a query that counts the number of listings by neighbourhood from the listings table.
✔️ The result of the query should have the following two columns.
1. neighbourhood: Name of the neighbourhood
2. num_listings: Number of listings in a neighbourhood
✔️ Store your query in a new variable named query_num_listings_by_neighbourhood.

In [44]:

# YOUR CODE BEGINS
query_num_listings_by_neighbourhood = '''
SELECT neighbourhood, COUNT(*) AS num_listings
FROM listings
GROUP BY neighbourhood
ORDER BY num_listings DESC
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb-boston-medium.db')
df_result = pd.read_sql_query(query_num_listings_by_neighbourhood, con=conn)
display(df_result)
conn.close()

	neighbourhood	num_listings
0	Dorchester	215
1	Jamaica Plain	129
2	Roxbury	107
3	East Boston	90
4	South End	86
5	Downtown	82
6	Allston	82
7	South Boston	67
8	Brighton	66
9	Beacon Hill	54
10	Back Bay	50
11	Charlestown	47
12	Fenway	42
13	Roslindale	38
14	North End	33
15	Mattapan	30
16	Hyde Park	25
17	West Roxbury	24
18	Mission Hill	23
19	Bay Village	22
20	West End	12
21	Chinatown	7
22	South Boston Waterfront	6
23	Longwood Medical Area	1
24	Leather District	1

🧭 Check your work¶

In [45]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb-boston-medium.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCBuZWlnaGJvdXJob29kLCBDT1VOVCgqKSBBUyBudW1fbGlzdGluZ3MKRlJPTSBsaXN0aW5ncwpHUk9VUCBCWSBuZWlnaGJvdXJob29kCk9SREVSIEJZIG51bV9saXN0aW5ncyBERVNDCg==').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
pd.testing.assert_series_equal(df_result['num_listings'].reset_index(drop=True),
                               df_result['num_listings'].reset_index(drop=True))
conn.close()

🚀 Solution¶

SELECT neighbourhood, COUNT(*) AS num_listings
FROM listings
GROUP BY neighbourhood
ORDER BY num_listings DESC

🎯 Exercise 13A: Average price by room type (Pandas)¶

👇 Tasks¶

✔️ Using df_listings, find the average price by room type.
✔️ Store the result to a new variable named df_price_by_room_type.
✔️ df_price_by_room_type should have the following two columns.
1. room_type: Room type
2. price: Average price of the room type
✔️ print(df_price_by_room_type.columns.tolist()) should print ['room_type', 'price'].
✔️ Sort df_price_by_room_type by price in ascending order.

🔑 Expected Output¶

Your index column may contain different values.

	room_type	price
3	Shared room	34.75
2	Private room	82.2278
0	Entire home/apt	201.135
1	Hotel room	206.273

In [46]:

# YOUR CODE BEGINS
df_price_by_room_type = df_listings.groupby('room_type', as_index=False).agg({
    'price': 'mean'
})

df_price_by_room_type.sort_values('price', inplace=True)

# YOUR CODE ENDS

df_price_by_room_type

Out[46]:

	room_type	price
3	Shared room	34.750000
2	Private room	82.227848
0	Entire home/apt	201.134890
1	Hotel room	206.272727

🧭 Check your work¶

In [47]:

# DO NOT CHANGE THE CODE IN THIS CELL
df_check = df_listings_backup.pivot_table(index=['RoOm_'.lower() + 'TyPe'.lower()],
                                          values=['price'],
                                          aggfunc=np.mean) \
                                .reset_index() \
                                .sort_values('price', ascending=bool(not False))

pd.testing.assert_frame_equal(df_price_by_room_type.sort_values(df_price_by_room_type.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

🎯 Exercise 13B: Average price by room type (SQL)¶

👇 Tasks¶

✔️ Write a query that finds the average price by room type from the listings table.
✔️ Store your query to a new variable named query_price_by_room_type.
✔️ Your result should have the following two columns.
1. room_type: Room type
2. price: Average price of the room type
✔️ Sort your result by price in ascending order.

In [48]:

# YOUR CODE BEGINS
query_price_by_room_type = '''
SELECT room_type, AVG(price) as price
FROM listings
GROUP BY room_type
ORDER BY price
'''
# YOUR CODE ENDS

conn = sqlite3.connect('airbnb-boston-medium.db')
df_result = pd.read_sql_query(query_price_by_room_type, con=conn)
display(df_result)
conn.close()

	room_type	price
0	Shared room	34.750000
1	Private room	82.227848
2	Entire home/apt	201.134890
3	Hotel room	206.272727

🧭 Check your work¶

In [49]:

# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb-boston-medium.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCByb29tX3R5cGUsIEFWRyhwcmljZSkgYXMgcHJpY2UKRlJPTSBsaXN0aW5ncwpHUk9VUCBCWSByb29tX3R5cGUKT1JERVIgQlkgcHJpY2UK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
pd.testing.assert_frame_equal(df_result.sort_values(df_result.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

conn.close()

🚀 Solution¶

SELECT room_type, AVG(price) as price
FROM listings
GROUP BY room_type
ORDER BY price

Introduction to SQL¶

Lecture Notes and in-class exercises¶

✨ Structured Query Language (SQL)¶

How do you pronounce SQL?¶

Conceptual Overview¶

❓ Why use SQL? Can't we use Python (Pandas)? 🐼¶

⚡️ Database Management Systems (DBMS)¶

❓ Why use databases? Can't we use CSV files?¶

📗 CRUD Operations Exercise using a graphical tool¶

📘 CRUD Operations using Jupyter notebook¶

👇 Tasks¶

🧭 Check your work¶

🎯 Create a database file and a table¶

🎯 Populate a database table from a CSV file¶

🧭 Check your work¶

🎯 Exercise 1: Select all rows and columns¶

👇 Tasks¶

🧭 Check your work¶

🎯 Exercise 2: Select only three columns for all rows¶

👇 Tasks¶

🧭 Check your work¶

🎯 Exercise 3: Select all columns of expensive listings¶

👇 Tasks¶

🧭 Check your work¶

🎯 Exercise 4: Select all columns of West Town listings¶

👇 Tasks¶

🧭 Check your work¶

🎯 Exercise 5: Sort all listings by price¶

👇 Tasks¶

🧭 Check your work¶

🎯 Exercise 6: Select only the price column and sort in descending order¶

👇 Tasks¶

🧭 Check your work¶

🎯 Exercise 7: A more complicated query¶

👇 Tasks¶

🧭 Check your work¶

✨ Aggregating Data¶

📌 Create a second database file and a table¶

🧭 Check your work¶

📌 Populate a database table from a CSV file¶

🧭 Check your work¶

🎯 Exercise 8: Select all columns and rows from the listings table (SQL)¶

👇 Tasks¶

🧭 Check your work¶

🎯 Exercise 9A: Find expensive listings (Pandas)¶

👇 Tasks¶

🔑 Expected Output¶

🧭 Check your work¶

🎯 Exercise 9B: Find expensive listings (SQL)¶

👇 Tasks¶

🧭 Check your work¶

🎯 Exercise 10A: Find large listings (Pandas)¶

👇 Tasks¶

🔑 Expected Output¶

🧭 Check your work¶

🎯 Exercise 10B: Find large listings (SQL)¶

👇 Tasks¶

🧭 Check your work¶

🎯 Exercise 11A: Number of listings (Pandas)¶

👇 Tasks¶

🧭 Check your work¶

🎯 Exercise 11B: Number of listings (SQL)¶

👇 Tasks¶

🧭 Check your work¶

🎯 Exercise 12A: Get the number of listings by neighbourhood (Pandas)¶

👇 Tasks¶

🔑 Expected Output¶

🧭 Check your work¶

🚀 Solution¶

🎯 Exercise 12B: Get the number of listings by neighbourhood (SQL)¶

👇 Tasks¶

🧭 Check your work¶

🚀 Solution¶

🎯 Exercise 13A: Average price by room type (Pandas)¶

👇 Tasks¶

🔑 Expected Output¶

🧭 Check your work¶

🎯 Exercise 13B: Average price by room type (SQL)¶

👇 Tasks¶

🧭 Check your work¶

🎯 Exercise 8: Select all columns and rows from the `listings` table (SQL)¶