Although some people read SQL as es-q-el, it is most often read SQL as sequel.
SQL stands for Structured Query Language. The three most important concepts are embedded in the name itself.
If Python was your first programming language, you can now brag about your second programming language. 😎
Using Pandas can be easier in many cases unless you're familiar with SQL. Many data science professionals will agree that Pandas is a more human-friendly method (discussion here). You can perform the most common operations in SQL with Pandas (comparison with SQL here).
Huh? When are you supposed to use SQL then? 🤨 You should use SQL when you have a large source of data on a remote location. Pandas works great... until you run out of memory (RAM). When you read data from a source, Pandas loads the entire data into memory by default. Pandas will fail if your data is too large.
Instead of loading all data into memory first, you can query a database system and retrieve only the information you need. This is almost always faster and more efficient.
A database is a collection of data. A database management system is a software used to let users communicate with a database. There are many types of DBMS software. We will only cover relational database management systems (RDBMS). Many popular RDBMS exist, such as:
We will use SQLite, since it is the easiest one to set up.
This discussion is somewhat similar to why we use SQL instead of Pandas. Relational database management systems have existed since the 1970s. They are highly optimized even when your data size is humongous. Of course, the performance is dependent on the query you write even if you use an RDBMS.
Go to https://sqliteonline.com/ and try out these SQL queries.
CREATE: Create employees
table.
CREATE TABLE IF NOT EXISTS employees (
emp_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
department TEXT,
salary REAL
);
UPDATE: Add employees information.
INSERT INTO employees (name, department, salary) VALUES ('Mia', 'HR', 140000.0);
INSERT INTO employees (name, department, salary) VALUES ('Michael', 'Marketing', 200000.0);
INSERT INTO employees (name, department, salary) VALUES ('Sean', 'HR', 250000.0);
INSERT INTO employees (name, department, salary) VALUES ('Dylan', 'Marketing', 205000.0);
READ: Select emp_id
and name
fields from employees
table.
SELECT emp_id, name
FROM employees;
READ: Select emp_id
, name
, and department
fields of HR employees from employees
table.
SELECT emp_id, name, department
FROM employees
WHERE department = "HR";
READ: Select emp_id
, name
, and salary
fields of Marketing employees sorted by salary
(highest salary first) from employees
table.
SELECT emp_id, name, salary
FROM employees
WHERE department = "Marketing"
ORDER BY salary DESC;
READ: Select all rows and columns from employees
table.
SELECT * FROM employees;
DELETE: Delete all HR employees information
DELETE FROM employees WHERE department = "HR";
▶️ First, run the code cell below to import modules used for 🧭 Check Your Work sections and the autograder.
import unittest
import base64
tc = unittest.TestCase()
pandas
: Use alias pd
.numpy
: Use alias np
.sqlite3
: No alias# YOUR CODE BEGINS
import pandas as pd
import numpy as np
import sqlite3
# YOUR CODE ENDS
import sys
tc.assertTrue('pd' in globals(), 'Check whether you have correctly imported Pandas with an alias.')
tc.assertTrue('np' in globals(), 'Check whether you have correctly imported NumPy with an alias.')
tc.assertTrue('sqlite3' in globals(), 'Check whether you have correctly imported the sqlite3 package.')
▶️ Run the code below to create a new SQLite database file named airbnb.db
and create a table named listings
.
# Connect to airbnb.db database file
# If the file does not exist, create a new file
conn = sqlite3.connect('airbnb.db')
# Create a cursor
# A cursor enables users of a DBMS to traverse through the result set
# without having to retrieve all results at once
c = conn.cursor()
# Drop (delete) listings table if it already exists
c.execute('DROP TABLE IF EXISTS listings')
conn.commit()
# Define a query
# Triple quotes (''') denote multiline strings
create_table_query = '''
CREATE TABLE IF NOT EXISTS listings (
neighbourhood TEXT,
room_type TEXT,
bathrooms REAL,
bedrooms REAL,
price REAL,
number_of_reviews INTEGER,
review_score REAL,
is_superhost INT
)
'''
c.execute(create_table_query)
conn.commit()
conn.close()
▶️ Run the code below to create a DataFrame named df_listings
.
# DO NOT CHANGE THE CODE IN THIS CELL
df_listings = pd.read_csv('https://github.com/bdi475/datasets/raw/main/chicago-airbnb-listings-small.csv')
display(df_listings.head(5))
neighbourhood | room_type | bathrooms | bedrooms | price | number_of_reviews | review_score | is_superhost | |
---|---|---|---|---|---|---|---|---|
0 | Hyde Park | Private room | 1.0 | 1.0 | 65.0 | 181 | 100.0 | 1 |
1 | South Lawndale | Entire home/apt | 1.0 | 3.0 | 117.0 | 395 | 96.0 | 1 |
2 | West Town | Entire home/apt | 1.0 | 3.0 | 70.0 | 389 | 93.0 | 1 |
3 | Uptown | Private room | 1.5 | 1.0 | 110.0 | 250 | 100.0 | 1 |
4 | Near North Side | Private room | 1.0 | 1.0 | 75.0 | 500 | 94.0 | 1 |
▶️ Run the code below to populate the listings
table. All data in df_listings
will be inserted to the table.
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')
c = conn.cursor()
tables = list(pd.read_sql_query('SELECT * FROM sqlite_master WHERE type="table";', con=conn)['tbl_name'])
if 'listings' in tables:
c.execute(f'DELETE FROM listings')
conn.commit()
df_listings.to_sql(name='listings', index=False, con=conn, if_exists='append')
conn.close()
# DO NOT CHANGE THE CODE IN THIS CELL
import glob
import os
user_db_files = glob.glob('*.db')
# Check if all files exist in the current directory
tc.assertTrue('airbnb.db' in user_db_files, f'Check if airbnb.db exists in {os.getcwd()}')
conn_checker = sqlite3.connect('airbnb.db')
table_to_check = 'listings'
# Check if table exists
user_tables = list(pd.read_sql_query('SELECT * FROM sqlite_master WHERE type="table";', con=conn_checker)['tbl_name'])
tc.assertTrue(table_to_check in user_tables, f'{table_to_check} does not exist in your airbnb.db file!')
conn_checker.close()
# YOUR CODE BEGINS
query_select_all = '''
SELECT *
FROM listings
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb.db')
df_result = pd.read_sql_query(query_select_all, con=conn)
display(df_result)
conn.close()
neighbourhood | room_type | bathrooms | bedrooms | price | number_of_reviews | review_score | is_superhost | |
---|---|---|---|---|---|---|---|---|
0 | Hyde Park | Private room | 1.0 | 1.0 | 65.0 | 181 | 100.0 | 1 |
1 | South Lawndale | Entire home/apt | 1.0 | 3.0 | 117.0 | 395 | 96.0 | 1 |
2 | West Town | Entire home/apt | 1.0 | 3.0 | 70.0 | 389 | 93.0 | 1 |
3 | Uptown | Private room | 1.5 | 1.0 | 110.0 | 250 | 100.0 | 1 |
4 | Near North Side | Private room | 1.0 | 1.0 | 75.0 | 500 | 94.0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
821 | Near North Side | Private room | 1.0 | 1.0 | 97.0 | 140 | 91.0 | 0 |
822 | Edgewater | Entire home/apt | 1.0 | 1.0 | 104.0 | 101 | 92.0 | 0 |
823 | Lake View | Entire home/apt | 1.0 | 1.0 | 106.0 | 132 | 99.0 | 1 |
824 | Near North Side | Private room | 1.0 | 1.0 | 114.0 | 329 | 91.0 | 0 |
825 | Near North Side | Private room | 1.0 | 1.0 | 127.0 | 173 | 92.0 | 0 |
826 rows × 8 columns
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCAqIEZST00gbGlzdGluZ3MK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
conn.close()
# YOUR CODE BEGINS
query_select_three_columns = '''
SELECT bathrooms, bedrooms, price
FROM listings
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb.db')
df_result = pd.read_sql_query(query_select_three_columns, con=conn)
display(df_result)
conn.close()
bathrooms | bedrooms | price | |
---|---|---|---|
0 | 1.0 | 1.0 | 65.0 |
1 | 1.0 | 3.0 | 117.0 |
2 | 1.0 | 3.0 | 70.0 |
3 | 1.5 | 1.0 | 110.0 |
4 | 1.0 | 1.0 | 75.0 |
... | ... | ... | ... |
821 | 1.0 | 1.0 | 97.0 |
822 | 1.0 | 1.0 | 104.0 |
823 | 1.0 | 1.0 | 106.0 |
824 | 1.0 | 1.0 | 114.0 |
825 | 1.0 | 1.0 | 127.0 |
826 rows × 3 columns
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCBiYXRocm9vbXMsIGJlZHJvb21zLCBwcmljZSBGUk9NIGxpc3RpbmdzCg==').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
conn.close()
# YOUR CODE BEGINS
query_expensive_listings = '''
SELECT *
FROM listings
WHERE price > 400
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb.db')
df_result = pd.read_sql_query(query_expensive_listings, con=conn)
display(df_result)
conn.close()
neighbourhood | room_type | bathrooms | bedrooms | price | number_of_reviews | review_score | is_superhost | |
---|---|---|---|---|---|---|---|---|
0 | Uptown | Entire home/apt | 2.5 | 5.0 | 875.0 | 124 | 97.0 | 0 |
1 | Near North Side | Entire home/apt | 1.0 | 2.0 | 500.0 | 117 | 94.0 | 1 |
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCAqCkZST00gTElTVElOR1MKV0hFUkUgcHJpY2UgPiA0MDAK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
conn.close()
# YOUR CODE BEGINS
query_west_town_listings = '''
SELECT *
FROM listings
WHERE neighbourhood = 'West Town'
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb.db')
df_result = pd.read_sql_query(query_west_town_listings, con=conn)
display(df_result)
conn.close()
neighbourhood | room_type | bathrooms | bedrooms | price | number_of_reviews | review_score | is_superhost | |
---|---|---|---|---|---|---|---|---|
0 | West Town | Entire home/apt | 1.0 | 3.0 | 70.0 | 389 | 93.0 | 1 |
1 | West Town | Entire home/apt | 3.0 | 3.0 | 168.0 | 218 | 97.0 | 1 |
2 | West Town | Entire home/apt | 1.0 | 2.0 | 75.0 | 189 | 91.0 | 0 |
3 | West Town | Private room | 2.0 | 1.0 | 57.0 | 127 | 89.0 | 0 |
4 | West Town | Private room | 2.0 | 1.0 | 78.0 | 143 | 90.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
133 | West Town | Private room | 1.0 | 1.0 | 55.0 | 126 | 100.0 | 1 |
134 | West Town | Entire home/apt | 1.0 | 1.0 | 95.0 | 125 | 98.0 | 1 |
135 | West Town | Entire home/apt | 1.0 | 2.0 | 68.0 | 112 | 93.0 | 0 |
136 | West Town | Entire home/apt | 1.0 | 2.0 | 58.0 | 140 | 99.0 | 1 |
137 | West Town | Entire home/apt | 1.0 | 3.0 | 64.0 | 134 | 98.0 | 1 |
138 rows × 8 columns
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCAqCkZST00gTElTVElOR1MKV0hFUkUgbmVpZ2hib3VyaG9vZCA9ICdXZXN0IFRvd24nCg==').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
conn.close()
# YOUR CODE BEGINS
query_sort_by_price = '''
SELECT *
FROM listings
ORDER BY price
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb.db')
df_result = pd.read_sql_query(query_sort_by_price, con=conn)
display(df_result)
conn.close()
neighbourhood | room_type | bathrooms | bedrooms | price | number_of_reviews | review_score | is_superhost | |
---|---|---|---|---|---|---|---|---|
0 | West Ridge | Shared room | 2.0 | 1.0 | 13.0 | 153 | 95.0 | 0 |
1 | North Lawndale | Shared room | 1.0 | 1.0 | 14.0 | 118 | 91.0 | 0 |
2 | West Ridge | Shared room | 2.0 | 1.0 | 16.0 | 158 | 92.0 | 0 |
3 | West Ridge | Shared room | 2.0 | 1.0 | 16.0 | 124 | 95.0 | 0 |
4 | Lower West Side | Private room | 2.0 | 1.0 | 19.0 | 112 | 94.0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
821 | Near North Side | Entire home/apt | 2.0 | 2.0 | 344.0 | 128 | 94.0 | 1 |
822 | West Town | Entire home/apt | 3.0 | 3.0 | 379.0 | 210 | 98.0 | 0 |
823 | Lake View | Entire home/apt | 2.0 | 5.0 | 400.0 | 113 | 99.0 | 1 |
824 | Near North Side | Entire home/apt | 1.0 | 2.0 | 500.0 | 117 | 94.0 | 1 |
825 | Uptown | Entire home/apt | 2.5 | 5.0 | 875.0 | 124 | 97.0 | 0 |
826 rows × 8 columns
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCAqCkZST00gTElTVElOR1MKT1JERVIgQlkgcHJpY2UK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
conn.close()
# YOUR CODE BEGINS
query_sort_by_price_desc = '''
SELECT price
FROM listings
ORDER BY price DESC
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb.db')
df_result = pd.read_sql_query(query_sort_by_price_desc, con=conn)
display(df_result)
conn.close()
price | |
---|---|
0 | 875.0 |
1 | 500.0 |
2 | 400.0 |
3 | 379.0 |
4 | 344.0 |
... | ... |
821 | 19.0 |
822 | 16.0 |
823 | 16.0 |
824 | 14.0 |
825 | 13.0 |
826 rows × 1 columns
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCBwcmljZQpGUk9NIExJU1RJTkdTCk9SREVSIEJZIHByaWNlIERFU0MK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
pd.testing.assert_series_equal(df_result['price'].reset_index(drop=True),
df_check['price'].reset_index(drop=True))
conn.close()
room_type
, bathrooms
, bedrooms
, price
, and number_of_reviews
columns (in the same order).price
is less than 100 AND the number of reviews is greather than 500
.number_of_reviews
in descending order.query_final
.# YOUR CODE BEGINS
query_final = '''
SELECT room_type, bathrooms, bedrooms, price, number_of_reviews
FROM listings
WHERE (price < 100) AND (number_of_reviews > 500)
ORDER BY number_of_reviews DESC
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb.db')
df_result = pd.read_sql_query(query_final, con=conn)
display(df_result)
conn.close()
room_type | bathrooms | bedrooms | price | number_of_reviews | |
---|---|---|---|---|---|
0 | Private room | 1.0 | 1.0 | 80.0 | 658 |
1 | Entire home/apt | 1.0 | 1.0 | 39.0 | 642 |
2 | Entire home/apt | 1.0 | 1.0 | 85.0 | 625 |
3 | Hotel room | 1.0 | 1.0 | 74.0 | 607 |
4 | Entire home/apt | 1.0 | 1.0 | 75.0 | 543 |
5 | Entire home/apt | 1.0 | 2.0 | 77.0 | 515 |
6 | Entire home/apt | 1.0 | 1.0 | 65.0 | 504 |
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCByb29tX3R5cGUsIGJhdGhyb29tcywgYmVkcm9vbXMsIHByaWNlLCBudW1iZXJfb2ZfcmV2aWV3cwpGUk9NIExJU1RJTkdTCldIRVJFIChwcmljZSA8IDEwMCkgQU5EIChudW1iZXJfb2ZfcmV2aWV3cyA+IDUwMCkKT1JERVIgQlkgbnVtYmVyX29mX3Jldmlld3MgREVTQwo=').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
pd.testing.assert_frame_equal(df_result.reset_index(drop=True),
df_check.reset_index(drop=True))
conn.close()
Like Pandas, you can aggregate data across multiple columns or tables.
▶️ Run the code below to create a new SQLite database file named airbnb.db
and create a table named listings
.
# Connect to airbnb.db database file
# If the file does not exist, create a new file
conn = sqlite3.connect('airbnb-boston-medium.db')
# Create a cursor
# A cursor enables users of a DBMS to traverse through the result set
# without having to retrieve all results at once
c = conn.cursor()
# Drop (delete) listings table if it already exists
c.execute('DROP TABLE IF EXISTS listings')
conn.commit()
# Define a query
# Triple quotes (''') denote multiline strings
create_table_query = '''
CREATE TABLE IF NOT EXISTS listings (
name TEXT,
neighbourhood TEXT,
room_type TEXT,
bedrooms REAL,
bathrooms REAL,
price REAL,
minimum_nights INT,
availability_365 INT,
number_of_reviews INTEGER,
review_score REAL,
is_superhost INT
)
'''
c.execute(create_table_query)
conn.commit()
conn.close()
# DO NOT CHANGE THE CODE BELOW
import glob
import os
user_db_files = glob.glob('*.db')
# Check if airbnb.db file exists in current working directory
tc.assertTrue('airbnb-boston-medium.db' in user_db_files, f'Check if airbnb-boston-medium.db exists in {os.getcwd()}')
▶️ Run the code below to create a DataFrame named df_listings
.
df_listings = pd.read_csv('https://github.com/bdi475/datasets/raw/main/boston-airbnb-listings-small.csv')
df_listings_backup = df_listings.copy()
display(df_listings.head(5))
name | neighbourhood | room_type | bedrooms | bathrooms | minimum_nights | price | availability_365 | number_of_reviews | review_score | is_superhost | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | HARBORSIDE-Walk to subway | East Boston | Entire home/apt | 1.0 | 1.0 | 28 | 150.0 | 123 | 17 | 99.0 | 0 |
1 | **$49 Special ** Private! Minutes to center! | Roxbury | Entire home/apt | 1.0 | 1.0 | 3 | 145.0 | 0 | 107 | 95.0 | 1 |
2 | $99 Special!! Home Away! Condo | Roxbury | Entire home/apt | 1.0 | 1.0 | 3 | 169.0 | 0 | 115 | 96.0 | 1 |
3 | Bright 1bed facing Golden Dome | Downtown | Entire home/apt | 1.0 | 1.0 | 91 | 81.0 | 281 | 32 | 96.0 | 1 |
4 | The Dorset Redline | 3BR 1BA | Walk to Redline... | Dorchester | Entire home/apt | 3.0 | 1.0 | 32 | 129.0 | 103 | 52 | 86.0 | 0 |
▶️ Run the code below to populate the listings
table. All data in df_listings
will be inserted to the table.
conn = sqlite3.connect('airbnb-boston-medium.db')
c = conn.cursor()
tables = list(pd.read_sql_query('SELECT * FROM sqlite_master WHERE type="table";', con=conn)['tbl_name'])
if 'listings' in tables:
c.execute(f'DELETE FROM listings')
conn.commit()
df_listings.to_sql(name='listings', index=False, con=conn, if_exists='append')
conn.close()
# DO NOT CHANGE THE CODE IN THIS CELL
conn_checker = sqlite3.connect('airbnb-boston-medium.db')
table_to_check = 'listings'
# Check if table exists
user_tables = list(pd.read_sql_query('SELECT * FROM sqlite_master WHERE type="table";', con=conn_checker)['tbl_name'])
tc.assertTrue(table_to_check in user_tables, f'{table_to_check} does not exist in your airbnb-boston-medium.db file!')
conn_checker.close()
▶️ Run the code below to display df_listings
.
display(df_listings)
name | neighbourhood | room_type | bedrooms | bathrooms | minimum_nights | price | availability_365 | number_of_reviews | review_score | is_superhost | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | HARBORSIDE-Walk to subway | East Boston | Entire home/apt | 1.0 | 1.0 | 28 | 150.0 | 123 | 17 | 99.0 | 0 |
1 | **$49 Special ** Private! Minutes to center! | Roxbury | Entire home/apt | 1.0 | 1.0 | 3 | 145.0 | 0 | 107 | 95.0 | 1 |
2 | $99 Special!! Home Away! Condo | Roxbury | Entire home/apt | 1.0 | 1.0 | 3 | 169.0 | 0 | 115 | 96.0 | 1 |
3 | Bright 1bed facing Golden Dome | Downtown | Entire home/apt | 1.0 | 1.0 | 91 | 81.0 | 281 | 32 | 96.0 | 1 |
4 | The Dorset Redline | 3BR 1BA | Walk to Redline... | Dorchester | Entire home/apt | 3.0 | 1.0 | 32 | 129.0 | 103 | 52 | 86.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1334 | Quiet and Sunny Top Floor Studio! South End LOVE! | South End | Entire home/apt | 1.0 | 1.0 | 1 | 92.0 | 328 | 14 | 77.0 | 0 |
1335 | FOUND Boston Common - Standard Full Room | Bay Village | Private room | 1.0 | 1.0 | 1 | 100.0 | 253 | 53 | 90.0 | 0 |
1336 | FOUND Boston Common - Queen Room | Downtown | Private room | 1.0 | 1.0 | 1 | 114.0 | 254 | 49 | 89.0 | 0 |
1337 | FOUND Boston Common - Double Queen Room | Downtown | Private room | 1.0 | 1.0 | 1 | 143.0 | 252 | 43 | 93.0 | 0 |
1338 | FOUND Boston Common - Single Room | Downtown | Private room | 1.0 | 1.0 | 1 | 87.0 | 254 | 26 | 88.0 | 0 |
1339 rows × 11 columns
# YOUR CODE BEGINS
query_select_all = '''
SELECT *
FROM listings
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb-boston-medium.db')
df_result = pd.read_sql_query(query_select_all, con=conn)
display(df_result)
conn.close()
name | neighbourhood | room_type | bedrooms | bathrooms | price | minimum_nights | availability_365 | number_of_reviews | review_score | is_superhost | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | HARBORSIDE-Walk to subway | East Boston | Entire home/apt | 1.0 | 1.0 | 150.0 | 28 | 123 | 17 | 99.0 | 0 |
1 | **$49 Special ** Private! Minutes to center! | Roxbury | Entire home/apt | 1.0 | 1.0 | 145.0 | 3 | 0 | 107 | 95.0 | 1 |
2 | $99 Special!! Home Away! Condo | Roxbury | Entire home/apt | 1.0 | 1.0 | 169.0 | 3 | 0 | 115 | 96.0 | 1 |
3 | Bright 1bed facing Golden Dome | Downtown | Entire home/apt | 1.0 | 1.0 | 81.0 | 91 | 281 | 32 | 96.0 | 1 |
4 | The Dorset Redline | 3BR 1BA | Walk to Redline... | Dorchester | Entire home/apt | 3.0 | 1.0 | 129.0 | 32 | 103 | 52 | 86.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1334 | Quiet and Sunny Top Floor Studio! South End LOVE! | South End | Entire home/apt | 1.0 | 1.0 | 92.0 | 1 | 328 | 14 | 77.0 | 0 |
1335 | FOUND Boston Common - Standard Full Room | Bay Village | Private room | 1.0 | 1.0 | 100.0 | 1 | 253 | 53 | 90.0 | 0 |
1336 | FOUND Boston Common - Queen Room | Downtown | Private room | 1.0 | 1.0 | 114.0 | 1 | 254 | 49 | 89.0 | 0 |
1337 | FOUND Boston Common - Double Queen Room | Downtown | Private room | 1.0 | 1.0 | 143.0 | 1 | 252 | 43 | 93.0 | 0 |
1338 | FOUND Boston Common - Single Room | Downtown | Private room | 1.0 | 1.0 | 87.0 | 1 | 254 | 26 | 88.0 | 0 |
1339 rows × 11 columns
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb-boston-medium.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCAqCkZST00gbGlzdGluZ3MK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
conn.close()
df_listings
, find all rows with a price of over $1,000 (df_listings['price'] > 1000
).df_expensive_listings
.Your index column may contain different values.
name | neighbourhood | room_type | bedrooms | bathrooms | minimum_nights | price | availability_365 | number_of_reviews | review_score | is_superhost | |
---|---|---|---|---|---|---|---|---|---|---|---|
502 | Large 4BDR near Harvard with Parking | Allston | Entire home/apt | 4 | 2.5 | 1 | 1250 | 0 | 136 | 98 | 0 |
669 | The Historic House in the North End/Waterfront | North End | Entire home/apt | 3 | 2 | 91 | 1052 | 0 | 104 | 98 | 1 |
# YOUR CODE BEGINS
df_expensive_listings = df_listings[df_listings['price'] > 1000]
# YOUR CODE ENDS
display(df_expensive_listings)
name | neighbourhood | room_type | bedrooms | bathrooms | minimum_nights | price | availability_365 | number_of_reviews | review_score | is_superhost | |
---|---|---|---|---|---|---|---|---|---|---|---|
502 | Large 4BDR near Harvard with Parking | Allston | Entire home/apt | 4.0 | 2.5 | 1 | 1250.0 | 0 | 136 | 98.0 | 0 |
669 | The Historic House in the North End/Waterfront | North End | Entire home/apt | 3.0 | 2.0 | 91 | 1052.0 | 0 | 104 | 98.0 | 1 |
# DO NOT CHANGE THE CODE IN THIS CELL
df_check = df_listings_backup.query(f"{'PrIcE'.lower()} > {25 * 40}")
pd.testing.assert_frame_equal(df_expensive_listings.sort_values(df_expensive_listings.columns.tolist()).reset_index(drop=True),
df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))
# YOUR CODE BEGINS
query_expensive_listings = '''
SELECT *
FROM listings
WHERE price > 1000
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb-boston-medium.db')
df_result = pd.read_sql_query(query_expensive_listings, con=conn)
display(df_result)
conn.close()
name | neighbourhood | room_type | bedrooms | bathrooms | price | minimum_nights | availability_365 | number_of_reviews | review_score | is_superhost | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Large 4BDR near Harvard with Parking | Allston | Entire home/apt | 4.0 | 2.5 | 1250.0 | 1 | 0 | 136 | 98.0 | 0 |
1 | The Historic House in the North End/Waterfront | North End | Entire home/apt | 3.0 | 2.0 | 1052.0 | 91 | 0 | 104 | 98.0 | 1 |
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb-boston-medium.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCAqCkZST00gbGlzdGluZ3MKV0hFUkUgcHJpY2UgPiAxMDAwCg==').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
pd.testing.assert_frame_equal(df_result.sort_values(df_result.columns.tolist()).reset_index(drop=True),
df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))
conn.close()
df_listings
, find all rows where the listing has:name
, bedrooms
, bathrooms
, price
, review_score
df_large_listings
.Your index column may contain different values.
name | bedrooms | bathrooms | price | review_score | ||
---|---|---|---|---|---|---|
151 | Beautiful Philadelphia house near Harvard U. w/ pk | 5 | 3 | 600 | 82 | |
250 | Spacious 4 BR | 2.5 BA Single-Family Home. | 5 | 3 | 373 | 97 |
921 | ★Large Retreat 5BR w/3BA★ Close to Everything | 5 | 3 | 738 | 95 |
# YOUR CODE BEGINS
df_large_listings = df_listings[(df_listings['bedrooms'] >= 5) & (df_listings['bathrooms'] >= 3)]
df_large_listings = df_large_listings[['name', 'bedrooms', 'bathrooms', 'price', 'review_score']]
# YOUR CODE ENDS
display(df_large_listings)
name | bedrooms | bathrooms | price | review_score | |
---|---|---|---|---|---|
151 | Beautiful Philadelphia house near Harvard U. w... | 5.0 | 3.0 | 600.0 | 82.0 |
250 | Spacious 4 BR | 2.5 BA Single-Family Home. | 5.0 | 3.0 | 373.0 | 97.0 |
921 | ★Large Retreat 5BR w/3BA★ Close to Everything | 5.0 | 3.0 | 738.0 | 95.0 |
# DO NOT CHANGE THE CODE IN THIS CELL
__r = 'oo'.join(['R', 'Ms']).lower()
df_check = df_listings.query(f'bed{__r} > {2 ** 2} & bath{__r} > {2 ** 1}') \
[['_'.join(['review', 'score']), 'price', 'bath' + __r, 'bed' + __r, 'name'][::-1]]
pd.testing.assert_frame_equal(df_large_listings.sort_values(df_large_listings.columns.tolist()).reset_index(drop=True),
df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))
listings
table, find all rows where the listing has:name
, bedrooms
, bathrooms
, price
, review_score
query_large_listings
.# YOUR CODE BEGINS
query_large_listings = '''
SELECT name, bedrooms, bathrooms, price, review_score
FROM listings
WHERE (bedrooms >= 5) AND (bathrooms >= 3)
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb-boston-medium.db')
df_result = pd.read_sql_query(query_large_listings, con=conn)
display(df_result)
conn.close()
name | bedrooms | bathrooms | price | review_score | |
---|---|---|---|---|---|
0 | Beautiful Philadelphia house near Harvard U. w... | 5.0 | 3.0 | 600.0 | 82.0 |
1 | Spacious 4 BR | 2.5 BA Single-Family Home. | 5.0 | 3.0 | 373.0 | 97.0 |
2 | ★Large Retreat 5BR w/3BA★ Close to Everything | 5.0 | 3.0 | 738.0 | 95.0 |
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb-boston-medium.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCBuYW1lLCBiZWRyb29tcywgYmF0aHJvb21zLCBwcmljZSwgcmV2aWV3X3Njb3JlCkZST00gbGlzdGluZ3MKV0hFUkUgKGJlZHJvb21zID49IDUpIEFORCAoYmF0aHJvb21zID49IDMpCg=='
).decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.columns.tolist(), df_check.columns.tolist(), 'Incorrect set of columns or order')
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
pd.testing.assert_frame_equal(df_result.sort_values(df_result.columns.tolist()).reset_index(drop=True),
df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))
conn.close()
# YOUR CODE BEGINS
num_listings = df_listings.shape[0]
# YOUR CODE ENDS
print(f'There are {num_listings} AirBnB listings in the dataset.')
There are 1339 AirBnB listings in the dataset.
# DO NOT CHANGE THE CODE IN THIS CELL
tc.assertEqual(num_listings, len(df_listings_backup.index), f'Incorrect number of listings - should be {len(df_listings_backup.index)}')
# YOUR CODE BEGINS
query_num_listings = '''
SELECT COUNT(*)
FROM listings
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb-boston-medium.db')
df_result = pd.read_sql_query(query_num_listings, con=conn)
display(df_result)
conn.close()
COUNT(*) | |
---|---|
0 | 1339 |
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb-boston-medium.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCBDT1VOVCgqKQpGUk9NIGxpc3RpbmdzCg==').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.iloc[0, 0], df_check.iloc[0, 0], 'Incorrect number of listings')
conn.close()
df_listings
, find the number of listings by neighbourhood.df_by_neighbourhood
.df_by_neighbourhood
should have the following two columns.neighbourhood
: Name of the neighbourhoodnum_listings
: Number of listings in a neighbourhoodprint(df_by_neighbourhood.columns.tolist())
should print ['neighbourhood', 'num_listings']
.df_by_neighbourhood
by num_listings
in descending order.Your index column may contain different values.
neighbourhood | num_listings | |
---|---|---|
7 | Dorchester | 215 |
12 | Jamaica Plain | 129 |
19 | Roxbury | 107 |
9 | East Boston | 90 |
22 | South End | 86 |
# YOUR CODE BEGINS
df_by_neighbourhood = df_listings.groupby('neighbourhood', as_index=False).agg({
'name': 'count'
})
df_by_neighbourhood.rename(columns={'name': 'num_listings'}, inplace=True)
df_by_neighbourhood.sort_values('num_listings', ascending=False, inplace=True)
# YOUR CODE ENDS
df_by_neighbourhood.head(5)
neighbourhood | num_listings | |
---|---|---|
7 | Dorchester | 215 |
12 | Jamaica Plain | 129 |
19 | Roxbury | 107 |
9 | East Boston | 90 |
22 | South End | 86 |
# DO NOT CHANGE THE CODE IN THIS CELL
df_check = df_listings_backup.pivot_table(index=['neigh' + 'BoUr'.lower() + 'HooD'.lower()],
values=['NaMe'.lower()],
aggfunc=('CoU' + 'Nt').lower()) \
.rename(columns={'NaME'.lower(): '_'.join(['NuM', 'LiSTinGs']).lower()}) \
.reset_index().sort_values('_'.join(['NuM', 'LiSTinGs']).lower(), ascending=bool(0))
pd.testing.assert_series_equal(df_by_neighbourhood['num_listings'].reset_index(drop=True),
df_check['num_listings'].reset_index(drop=True))
df_by_neighbourhood = df_listings.groupby('neighbourhood', as_index=False).agg({
'name': 'count'
})
df_by_neighbourhood.rename(columns={'name': 'num_listings'}, inplace=True)
df_by_neighbourhood.sort_values('num_listings', ascending=False, inplace=True)
listings
table.neighbourhood
: Name of the neighbourhoodnum_listings
: Number of listings in a neighbourhoodquery_num_listings_by_neighbourhood
.# YOUR CODE BEGINS
query_num_listings_by_neighbourhood = '''
SELECT neighbourhood, COUNT(*) AS num_listings
FROM listings
GROUP BY neighbourhood
ORDER BY num_listings DESC
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb-boston-medium.db')
df_result = pd.read_sql_query(query_num_listings_by_neighbourhood, con=conn)
display(df_result)
conn.close()
neighbourhood | num_listings | |
---|---|---|
0 | Dorchester | 215 |
1 | Jamaica Plain | 129 |
2 | Roxbury | 107 |
3 | East Boston | 90 |
4 | South End | 86 |
5 | Downtown | 82 |
6 | Allston | 82 |
7 | South Boston | 67 |
8 | Brighton | 66 |
9 | Beacon Hill | 54 |
10 | Back Bay | 50 |
11 | Charlestown | 47 |
12 | Fenway | 42 |
13 | Roslindale | 38 |
14 | North End | 33 |
15 | Mattapan | 30 |
16 | Hyde Park | 25 |
17 | West Roxbury | 24 |
18 | Mission Hill | 23 |
19 | Bay Village | 22 |
20 | West End | 12 |
21 | Chinatown | 7 |
22 | South Boston Waterfront | 6 |
23 | Longwood Medical Area | 1 |
24 | Leather District | 1 |
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb-boston-medium.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCBuZWlnaGJvdXJob29kLCBDT1VOVCgqKSBBUyBudW1fbGlzdGluZ3MKRlJPTSBsaXN0aW5ncwpHUk9VUCBCWSBuZWlnaGJvdXJob29kCk9SREVSIEJZIG51bV9saXN0aW5ncyBERVNDCg==').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
pd.testing.assert_series_equal(df_result['num_listings'].reset_index(drop=True),
df_result['num_listings'].reset_index(drop=True))
conn.close()
SELECT neighbourhood, COUNT(*) AS num_listings
FROM listings
GROUP BY neighbourhood
ORDER BY num_listings DESC
df_listings
, find the average price by room type.df_price_by_room_type
.df_price_by_room_type
should have the following two columns.room_type
: Room typeprice
: Average price of the room typeprint(df_price_by_room_type.columns.tolist())
should print ['room_type', 'price']
.df_price_by_room_type
by price
in ascending order.Your index column may contain different values.
room_type | price | |
---|---|---|
3 | Shared room | 34.75 |
2 | Private room | 82.2278 |
0 | Entire home/apt | 201.135 |
1 | Hotel room | 206.273 |
# YOUR CODE BEGINS
df_price_by_room_type = df_listings.groupby('room_type', as_index=False).agg({
'price': 'mean'
})
df_price_by_room_type.sort_values('price', inplace=True)
# YOUR CODE ENDS
df_price_by_room_type
room_type | price | |
---|---|---|
3 | Shared room | 34.750000 |
2 | Private room | 82.227848 |
0 | Entire home/apt | 201.134890 |
1 | Hotel room | 206.272727 |
# DO NOT CHANGE THE CODE IN THIS CELL
df_check = df_listings_backup.pivot_table(index=['RoOm_'.lower() + 'TyPe'.lower()],
values=['price'],
aggfunc=np.mean) \
.reset_index() \
.sort_values('price', ascending=bool(not False))
pd.testing.assert_frame_equal(df_price_by_room_type.sort_values(df_price_by_room_type.columns.tolist()).reset_index(drop=True),
df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))
listings
table.query_price_by_room_type
.room_type
: Room typeprice
: Average price of the room typeprice
in ascending order.# YOUR CODE BEGINS
query_price_by_room_type = '''
SELECT room_type, AVG(price) as price
FROM listings
GROUP BY room_type
ORDER BY price
'''
# YOUR CODE ENDS
conn = sqlite3.connect('airbnb-boston-medium.db')
df_result = pd.read_sql_query(query_price_by_room_type, con=conn)
display(df_result)
conn.close()
room_type | price | |
---|---|---|
0 | Shared room | 34.750000 |
1 | Private room | 82.227848 |
2 | Entire home/apt | 201.134890 |
3 | Hotel room | 206.272727 |
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('airbnb-boston-medium.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCByb29tX3R5cGUsIEFWRyhwcmljZSkgYXMgcHJpY2UKRlJPTSBsaXN0aW5ncwpHUk9VUCBCWSByb29tX3R5cGUKT1JERVIgQlkgcHJpY2UK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
tc.assertEqual(df_result.shape, df_check.shape, 'Number of rows and/or columns is different')
pd.testing.assert_frame_equal(df_result.sort_values(df_result.columns.tolist()).reset_index(drop=True),
df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))
conn.close()
SELECT room_type, AVG(price) as price
FROM listings
GROUP BY room_type
ORDER BY price