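You'll analyze real California house prices to understand wealth distribution. You will calculate the mean and median, the quartiles, and the percentile of a given price, then draw a text histogram.
This will show you why the median home price is often more meaningful than the average!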
First, you'll need to access our shared datasets folder and add it to your Drive.
You should now see "datasets" in your Google Drive!
In your Colab notebook, run this code to access your Google Drive:
# Connect to Google Drive
from google.colab import drive
drive.mount("/content/gdrive")
When prompted:
# Open and read the file (the with block closes it automatically)
with open("/content/gdrive/MyDrive/datasets/house_prices.txt", "r") as f:
    file_content = f.read()
# Convert file lines to numbers
prices = []
for line in file_content.strip().split('\n'):
    prices.append(float(line))
print(f"Loaded {len(prices)} house prices")
print(f"First 5 prices: {prices[:5]}")
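Real data files sometimes contain blank lines or stray spaces that would make float() fail. It is not known whether house_prices.txt has any, so treat the following as a minimal sketch only: the sample text and the helper name parse_prices are made up for illustration.

```python
def parse_prices(text):
    """Turn newline-separated text into a list of floats, skipping blank lines."""
    parsed = []
    for line in text.splitlines():
        cleaned = line.strip()
        if cleaned:  # ignore empty or whitespace-only lines
            parsed.append(float(cleaned))
    return parsed

# Hypothetical file contents, not taken from the real dataset
sample_text = "452600.0\n358500.0\n\n  352100.0  \n"
sample_parsed = parse_prices(sample_text)
print(f"Loaded {len(sample_parsed)} house prices")
```

If your file loads without errors using the loop above, you do not need this extra check.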
# Calculate the mean
# TODO: Sum all prices and divide by count
# Calculate the median
# TODO: Sort the prices first
# TODO: Find the middle value (remember even vs odd count!)
# Print your results
print(f"Mean house price: ${mean_price:,.2f}")
print(f"Median house price: ${median_price:,.2f}")
What do you notice about the difference?
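To build intuition before looking at the full dataset, it can help to try both statistics on a tiny list. The sketch below uses five invented prices (not values from house_prices.txt), one of which is a very expensive outlier. The helper name median_of is hypothetical.

```python
# Hypothetical sample prices, invented for illustration only
sample_prices = [200000, 250000, 300000, 350000, 5000000]

# Mean: sum of all values divided by how many there are
sample_mean = sum(sample_prices) / len(sample_prices)

def median_of(values):
    """Return the middle value, averaging the two middle values for an even count."""
    ordered = sorted(values)
    count = len(ordered)
    middle = count // 2
    if count % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

sample_median = median_of(sample_prices)

print(f"Mean: ${sample_mean:,.2f}")
print(f"Median: ${sample_median:,.2f}")
```

In this made-up list, the single outlier pulls the mean far above what four of the five houses cost, while the median stays at the middle house.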
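Quartiles divide your sorted data into four equal parts: Q1 is the 25th percentile, Q2 is the 50th percentile (the median), and Q3 is the 75th percentile.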
# Sort the data first!
sorted_prices = sorted(prices)
n = len(sorted_prices)
# Calculate quartile positions
# TODO: Q1 is at position n//4
# TODO: Q2 is at position n//2
# TODO: Q3 is at position 3*n//4
print(f"Q1 (25th percentile): ${q1:,.2f}")
print(f"Q2 (50th percentile): ${q2:,.2f}")
print(f"Q3 (75th percentile): ${q3:,.2f}")
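As a rough illustration of the position rule in the TODO comments, here is a sketch on eight invented prices (not from the dataset). The names sorted_sample, sample_q1, sample_q2, and sample_q3 are chosen only for this example.

```python
# Hypothetical sample prices, invented for illustration only
sample_prices = [240000, 90000, 300000, 150000, 120000, 270000, 180000, 210000]

sorted_sample = sorted(sample_prices)
sample_n = len(sorted_sample)

# Pick the values at the quartile positions described above
sample_q1 = sorted_sample[sample_n // 4]      # index 2
sample_q2 = sorted_sample[sample_n // 2]      # index 4
sample_q3 = sorted_sample[3 * sample_n // 4]  # index 6

print(f"Q1: ${sample_q1:,.2f}")
print(f"Q2: ${sample_q2:,.2f}")
print(f"Q3: ${sample_q3:,.2f}")
```

Note that this position rule is a simple approximation. With an even count, the value at position n//2 is the upper of the two middle values, so it can differ slightly from a median that averages them.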
Write a function that tells you what percentile a given house price is at:
def find_percentile(house_price, all_prices):
    sorted_prices = sorted(all_prices)
    # Count how many prices are below this price
    # TODO: Loop through sorted_prices and count
    # Calculate the percentile
    # TODO: (count_below / total_count) * 100
    return percentile
# Test your function
my_house = 450000
result = find_percentile(my_house, prices)
print(f"A ${my_house:,} house is at the {result:.1f} percentile")
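One possible shape for the counting approach, shown on five invented prices so you can check the arithmetic by hand. This is a sketch under the assumption that "below" means strictly less than. The function name percentile_of and the sample values are hypothetical.

```python
def percentile_of(price, all_prices):
    """Return the percentage of prices that are strictly below the given price."""
    count_below = 0
    for other in sorted(all_prices):
        if other < price:
            count_below += 1
    return (count_below / len(all_prices)) * 100

# Hypothetical sample prices, invented for illustration only
sample_prices = [100000, 200000, 300000, 400000, 500000]
sample_result = percentile_of(350000, sample_prices)
print(f"A $350,000 house is at the {sample_result:.1f} percentile")
```

Here three of the five sample prices are below $350,000, so the result is 3 / 5 * 100.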
Visualize the distribution as a text histogram built with print statements, where each █
represents 20 houses.
# Define price ranges (bins)
bin_0_50k = 0
bin_50_100k = 0
bin_100_150k = 0
# Etc.
# Loop through the prices and, for each one,
# use an if/elif/else statement to count the house
# in the correct bin
# This function, which makes a bar of the correct length,
# has been written for you.
def make_bar(count):
    bar = ""
    # Add one block character for every 20 houses in the bin
    for i in range(count // 20):
        bar += "█"
    return bar
# Print the histogram
print("California House Price Distribution:")
print(f"$0-50k: {make_bar(bin_0_50k)}")
print(f"$50-100k: {make_bar(bin_50_100k)}")
print(f"$100-150k: {make_bar(bin_100_150k)}")
# TODO: Print the other bins
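The binning loop could look something like the sketch below, shown on six invented prices and only three bins. The bin names, the sample values, and the helper sample_bar (which takes the number of houses per block as a parameter so a tiny list still produces visible bars) are assumptions for illustration, not part of the assignment's starter code.

```python
# Hypothetical sample prices, invented for illustration only
sample_prices = [30000, 45000, 60000, 75000, 90000, 120000]

sample_bin_0_50k = 0
sample_bin_50_100k = 0
sample_bin_100k_up = 0

# Count each price in exactly one bin
for price in sample_prices:
    if price < 50000:
        sample_bin_0_50k += 1
    elif price < 100000:
        sample_bin_50_100k += 1
    else:
        sample_bin_100k_up += 1

def sample_bar(count, houses_per_block=1):
    """Build a bar with one block character per houses_per_block houses."""
    return "█" * (count // houses_per_block)

print(f"$0-50k:    {sample_bar(sample_bin_0_50k)}")
print(f"$50-100k:  {sample_bar(sample_bin_50_100k)}")
print(f"$100k+:    {sample_bar(sample_bin_100k_up)}")
```

Because the comparisons run in order, a price of exactly 50000 lands in the $50-100k bin. Decide which bin the boundary values belong to and keep that choice consistent across all your bins.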