Lab 3: Optmization, Regression, and Inference

Welcome to Lab 3!

This lab will go over:

  • How to find the best number for a job (optimization)
  • How to find the best-fitting line for a scatter plot (regression using optimization)
  • How to compute confidence intervals for properties of regression models (regression inference)

First we'll learn about optimal ice cream truck placement. Then we'll use regression to get an accurate estimate of the age of the universe from pictures of exploding stars. Feel free to ask a neighbor or lab assistant for help!

This is the final lab for the first section of this workshop: the core workshop topics. We hope you have enjoyed the content so far. On Thursday and Friday, we will cover the second section of the workshop: optional curriculum integration extension activities. We hope to see you there!

What this lab will cover:

What you need to do:

  • Read the content, complete the questions
  • Load the autograder tests, log in
  • Autograde your solutions for questions 1-2
  • Submit the assignment
In [ ]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *
from lab_utils import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt'fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

# These lines load the autograder tests. 
from client.api.notebook import Notebook
ok = Notebook('lab03.ok')
_ = ok.auth(inline=False)

1. Optimization

Data 8 is branching out into the ice cream market. We're catering to San Francisco hipsters, so we operate a truck that sells our fresh organic Sriracha-Kale ice cream. Today we have driven our truck to Ocean Beach, a long, narrow beach on the western coast of the city.

Upon arriving, we find that our potential customers are spread out along the beach. We decide we want to park our truck in the location that's closest on average to all the customers. That way, customers will be more likely to come to our truck.

(This may not be a great way to choose our truck's location. Maybe you can think of a better way to decide on a location. Computer-based optimization lets you choose whatever criteria you want, within reason!)

We canvas the beach and record the location of each beachgoer in a table called customers. The beach is oriented roughly North/South, and it's narrow, so we ignore how close each beachgoer is to the water, and record only how far north each person is from the southern end of the beach.

Suppose there are 2 people on the beach, at 600 meters and 950 meters from the Southern end, respectively. If we park our truck at 750 meters, the average distance from our truck to customers is:

$$\frac{|600 - 750| + |950 - 750|}{2},$$

Here's a straightforward way to write that in Python:

In [ ]:
(abs(600 - 750) + abs(950 - 750)) / 2

We've packaged this into a function called average_distance_from_customers so you can do this calculation with different customers and truck locations. It takes 2 arguments:

  1. An array of customer locations
  2. A truck location

It returns the average distance of those customers from a truck at that location. The next cell defines the function and recomputes our distance.

In [ ]:
# Given an array of customer locations (numbers) and a potential location
# for our truck (a single number), computes the average distance between
# that truck location and the customers.

def average_distance_from_customers(customer_locations, truck_location):
    return np.mean(np.abs(np.array(customer_locations) - truck_location))

average_distance_from_customers(np.array([600, 950]), 750)

1.1 Parking at the Average

It seems reasonable to park our truck at the average (or "mean") location of the customers. Let's investigate how well that works.

Suppose we have 3 customers, and they're located 800, 1000, and 1800 meters from the Southern end of the beach, respectively. The mean of those locations is np.mean([800, 1000, 1800]), or 1200 meters.

If we park our ice cream truck there, we can compute the average distance from customers like this:

In [ ]:
q1_customer_locations = np.array([800, 1000, 1800])
q1_truck_location = np.mean(q1_customer_locations)
q1_avg_distance = average_distance_from_customers(q1_customer_locations, q1_truck_location)

Question 1

If we park our truck at 1100 meters instead, what's the average distance from our truck to customers? Compute it below by calling the function average_distance_from_customers in an appropriate way.

In [ ]:
# For convenience, we've named the truck location.
q1_truck_location = 1100

# Fill in q1_avg_distance. Use code to compute it.
q1_avg_distance = ...

Test your solution!

In [ ]:
_ = ok.grade('q1')

Somehow, parking at 1100 meters is better than parking at the mean (1200 meters)! So the mean can't be the best location.

1.2 Finding the Best Location

Now we'll look at the full dataset and use our computer to find the best truck location. Run the next cell to load it. The cell also shows a histogram of the customer locations.

In [ ]:
# Just run this cell.
customers = Table.read_table("data/customers.csv")
customers.hist(bins=np.arange(0, 2001, 100))

Given these customer locations, we want to find a single location for our truck. If we park our truck at that location, we want it to result in the smallest average distance from our truck to customers.

First, let's write code to evaluate how well any single location meets our goal. Below we've defined a function called average_distance. It takes a single number as its argument (a truck location) and returns the average distance from that location to the customers in the customers table. (Since we've already written the average_distance_from_customers function to do this for any set of customers, it's pretty easy.)

In [ ]:
def average_distance(truck_location):
    return average_distance_from_customers(customers.column(0), truck_location)


Since average_distance tells us how well we've achieved our objective, it's called an objective function. So now let's find the distance that produces the smallest value of this objective function. (If you came up with some better criterion than the average distance to customers, then you would write a different objective function here. Everything else below would be pretty much the same.)

Here's one way to find the best location:

  1. Check a bunch of locations
  2. Compute the average distance (the value of our objective function average_distance) for each one
  3. Pick the location with the smallest average distance

This is called a "brute-force" technique because it uses a lot of computation. But it gets the job done.

To do it manually, it helps to visualize the objective function values for each potential truck location. The cell below generates such a graph.

In [ ]:
# First we'll figure out the locations to try.  We try
# every meter on the beach, plus half a kilometer in
# either direction for good measure.
lowest_location = int(min(customers.column(0))) - 500
highest_location = int(max(customers.column(0))) + 501
locations = np.arange(lowest_location, highest_location, 1)
potential_locations = Table().with_column("Truck distance from south end (m)", locations)

# An array of average distances from customers, one for
# each potential truck location in potential_locations.
average_distances = potential_locations.apply(average_distance, "Truck distance from south end (m)")

# Now we make a scatter plot of the average distances with
# potential truck locations on the horizontal axis and the
# average distances from customers on the vertical axis.
locations_with_distances = potential_locations.with_column("Average distance to customers (m)", average_distances)
locations_with_distances.scatter(0, 1)

Question 2

Looking at the graph, what would you say is the best location for our truck, rounded to the nearest 100? Fill in your guess in the cell below.

In [ ]:
# Fill in your guess here:
q2_best_location_guess = ...

Test your solution!

In [ ]:
_ = ok.grade('q2')

The built-in function minimize does basically the same thing you just did. It takes as its argument a function, the objective function. It returns the input that produces the smallest output value of the objective function.

Here's how we'd use minimize to find the best location for our ice cream truck:

In [ ]:
best_location = minimize(average_distance)

Was your guess close?

2. Regression

A least-squares regression line for a dataset is the line that produces the smallest possible squared prediction error for those data. Now that we know how to minimize things, we can do least-squares regression without resorting to formulae!

We're going to use linear regression to estimate the age of the universe.

In the early 20th century, the most popular cosmological theory suggested that the universe had always existed at a fixed size. Today, the Big Bang theory prevails: Our universe started out very small and is still expanding.

A consequence of this is Hubble's Law, which says that the expansion of the universe creates the appearance that every galaxy near us is moving away from us at a constant rate. If we extrapolate that motion backwards to the time when everything would have been on top of everything else, that time is (roughly) the beginning of the universe!

(Note: We are not physicists, and this is just a very high-level abstract of the idea.)

2.1 Cosmology

First, we need to know the distance-from-Earth and speed-away-from-Earth of many stars. Using pictures taken by very accurate telescopes and a lot of physics, astronomers have been able to estimate both. It turns out that nearby supernovae, stars that have recently died and exploded, are among the best sources of this data. This picture taken by the Hubble telescope shows an entire galaxy, with one bright supernova at the bottom left.

Our astronomical data for today will come from the Supernova Cosmology Project at Lawrence Berkeley Lab. The original dataset is here, with (brief) documentation here. Each row in the table corresponds to a supernova near Earth that was observed by astronomers. From pictures like the one above, the astronomers deduced how far away each supernova was from Earth and how fast it was moving away from Earth. Their deductions were good, but not perfect.

Run the cell below to load the data into a table called close_novas and make a scatter plot with a best-fit line.

In [ ]:
# Just run this cell.
close_novas = Table.read_table("data/close_novas.csv")

close_novas.scatter(1, 0, fit_line=True)

Examining the line, it looks like the slope is around 150/.01. Converting to the right units, we get 15000 million years, or 15 billion. That sounds about right!

2.2 Fitting the Line Yourself

We used fit_line=True, which is very convenient, in our call to scatter in the previous cell. Let's peek under the hood here. Inside, scatter uses an optimization procedure to compute the line it draws. The least-squares regression line for our supernova data is

  • the line
  • with the smallest average (over all the supernovae we observe)
  • error,
  • squared,
  • where the error is the difference between the prediction based on the line and the supernova's actual distance from Earth.

The plot below shows one line we could try to use to fit this dataset (not the least-squares line), along with the errors made by that line for a few of the supernovas. Squaring the length of each red bar, then averaging those squared lengths, tells us how badly the line fits.

In [ ]:
# Just run this cell.

# Plot all the novas first.
close_novas.scatter("Speed (parsecs/year)", "Distance (million parsecs)")
plt.suptitle("A potential fit line (plus some errors)")

# Over that plot, plot the proposed line, which is a little
# bit too high.
slope = 16000
right_line_end = 0.02
plt.plot([0.0, right_line_end], [0, slope*right_line_end], "b-", linewidth=1)

# Plot some of the errors:
novas_sample = close_novas.take([0, 1, 2, 3, 4, 41])
plt.scatter(novas_sample.column(1), novas_sample.column(0), c="r", zorder=2, s=70)
for i in range(novas_sample.num_rows):
    x = novas_sample.column("Speed (parsecs/year)").item(i)
    y = novas_sample.column("Distance (million parsecs)").item(i)
    line_y = slope*x
    plt.plot([x, x], [y, line_y], "r-")

What we want is to choose a line that minimizes the average squared error. To simplify things, we'll assume that the vertical intercept of our lines is 0 (since the physical model implies that's true). So we only have to choose one thing, the slope of the line. We'll do it in steps.

First we need a way to compute errors. The cell below defines a function to do that.

In [ ]:
def predicted_distances(line_slope, speeds):
    return line_slope * speeds

def errors(line_slope):
    predictions = predicted_distances(line_slope, close_novas.column("Speed (parsecs/year)"))
    return predictions - close_novas.column("Distance (million parsecs)") 

For example, we can examine the prediction errors for a line with slope 16,000 million years.

In [ ]:
predictions_16000 = predicted_distances(16000, close_novas.column("Speed (parsecs/year)"))
errors_16000 = errors(16000)

plt.scatter(close_novas.column(1), close_novas.column(0), color="black", label="Actual distance")
plt.scatter(close_novas.column(1), predictions_16000, color="blue", label="Predicted distance")
plt.scatter(close_novas.column(1), errors_16000, color="red", label="Prediction error")
plt.ylabel("Distance (million parsecs)")
plt.xlabel("Speed (parsecs/year)")
_ = plt.legend(loc='center left', bbox_to_anchor=(.7, .5))

The predictions are mostly going above the true distances. If you look at the errors in red, almost all are positive. That means our line is a little bit too steep. We'll find a better one.

Now that we can compute errors, we just need to do the "average squared" part. mean_squared_error, defined below, does that. Given the slope of a line, it computes the mean squared error when we use that line to predict distances for our nova dataset.

In [ ]:
def mean_squared_error(slope):
    return np.mean(errors(slope) ** 2)

This is our objective function! We can graph it to see what slope is best:

In [ ]:
slopes = Table().with_column("slope", np.arange(0, 40000, 10))
mses = slopes.with_column("mean squared error", slopes.apply(mean_squared_error, "slope"))
mses.scatter(0, 1)

And, rather than eyeballing the lowest point on the graph, we can compute the best slope by calling minimize on our objective function:

In [ ]:
best_line_slope = minimize(mean_squared_error)

# This just shows our answer as a nice string, in billions of years.
"{:,} billion years".format(round(best_line_slope/1000, 4))

That slope, as we've seen, is an estimate of the age of the universe. The current best estimate of the age of the universe is 13.799 billion years. Did we get close?

3. Inference

Our close_novas dataset doesn't include all possible close supernovas. Suppose it was only a random sample from the larger population of all close supernovas. How much might our estimate have been affected by sampling error?

You might have seen classical techniques for answering this kind of question. Most statistical packages provide "standard errors" or confidence intervals for the slopes of regression lines.

Instead, we'll use the bootstrap. We just have to think of our whole procedure -- finding the slope that minimizes the squared error -- as an estimator. It's analogous to the mean or median. It's somewhat more complicated to compute, and it involves a table of 2 columns rather than just one array of data. But it's still just a number we can compute from our data, or from samples of our data.

We'll run that estimator on many resamples from our data, get a different slope each time, and see what those look like.

In [ ]:
# Here we've wrapped up the code we wrote before to find the
# least-squares slope for a nova dataset into a function.
def least_squares_slope(novas):
    def mse(potential_slope):
        predictions = predicted_distances(potential_slope, novas.column("Speed (parsecs/year)"))
        prediction_errors = predictions - novas.column("Distance (million parsecs)")
        return np.mean(prediction_errors ** 2)
    return minimize(mse)

def compute_bootstrap_slope():
    resample = close_novas.sample(with_replacement=True)
    return least_squares_slope(resample)

slope_estimates = repeat(compute_bootstrap_slope, 500)
age_estimates = slope_estimates / 1000

3.1 Interpreting Bootstrap Estimates

Now we just have 500 numbers, each one a "potential age estimate" we might have computed if we'd taken another random sample of novas. It's hard to look at 500 numbers, so we need a visualization or summary. Here are three ways to understand these numbers.


Let's just make a histogram of these age estimates:

In [ ]:
Table().with_column("Age estimate", age_estimates).hist("Age estimate", unit="billion years")

Line Plot

We can also view all the regression lines we would have generated.

In [ ]:
# Just run this cell.  Read it if you'd like to see how to generate
# plots like these.  It's a little more complicated than the plotting
# we've done; we have to use the lower-level plotting library
# Matplotlib, which is imported as the module named plt.
def plot_nova_lines(num_lines, min_x, max_x):
    # Plot all the actual data.
    close_novas.scatter(1, 0)
    xs = np.array([min_x, max_x])
    some_slopes = Table().with_columns(["index", np.arange(len(slope_estimates)), "slope", slope_estimates]).sample(num_lines)
    def plot_line(index, slope):
        predictions = xs*slope
        plt.plot(xs, predictions, lw=1, alpha=4/num_lines, label="line " + str(index))
    some_slopes.apply(plot_line, ["index", "slope"])
    plt.xlim([min_x, max_x])
    prediction_lines_message = "%d randomly-selected prediction lines" % num_lines if num_lines < len(slopes) else "all %d prediction lines" % num_lines
    plt.title("Predicted and actual nova distances\n(%s)" % (prediction_lines_message))
    if num_lines < 10:
        plt.legend(loc="center left", bbox_to_anchor=[1.0, 0.5])
plot_nova_lines(2, 0, .025)
plot_nova_lines(5, 0, .025)
plot_nova_lines(500, 0, .025)

The last plot has all 500 lines overlaid against each other. You can see that they overlap very tightly, but not perfectly.

Confidence Interval

Using the bootstrap slopes, we can generate a 95% confidence interval for the true age of the universe by finding the 2.5th and 97.5th percentiles of this histogram. In code, that looks like this:

In [ ]:
lower_bound = np.percentile(age_estimates, 2.5)
upper_bound = np.percentile(age_estimates, 97.5)
print("95% confidence interval: [{:0.3f}, {:0.3f}] billion years".format(lower_bound, upper_bound))

Notice that the confidence interval doesn't actually overlap physicists' current best guess at the age of the universe, which is 13.799 billion years.

It's not a matter of having too little data. If we didn't have as much data, our confidence interval would just be wider to compensate.

Instead, it's because the assumptions behind the method are a little wrong. Physicists have developed other, more sophisticated models that are more accurate.

A confidence interval only makes a claim about what you'd see if you had observed all the data. Since the underlying model is a little bit wrong, the method wouldn't give us a perfect answer even if we'd observed all the novas in the universe.

Submit the assignment

You have now completed lab 3! You can run the first cell below to regrade questions 1-4, for which autograder tests were provided.

In [ ]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

Once you have checked your solutions, please run the below cell to submit your lab to the OKpy autograder site. Once you run the cell, you will see a URL for the OKpy autograder site. You can click on this URL to verify that your lab was properly submitted.

In [ ]:
_ = ok.submit()