This notebook originally appeared as a post on the blog Pythonic Perambulations. The content is BSD licensed.
Update: as I was putting the finishing touches on this notebook, I noticed this post, the first in a series on Seattle Bike Blog which analyzes much of the same data used here. Apparently great minds think alike! (Incidentally, to prove that I'm not just cribbing that content, check the GitHub commit log: I wrote the bulk of this post several days before the SBB series was posted. Version control priority FTW!)
Update #2: I added error bars to the estimates in the final section (should have done that from the beginning, I know...).
Cycling in Seattle seems to be taking off. This can be seen qualitatively in the increased visibility of advocacy groups like Seattle Neighborhood Greenways and Cascade Bicycle Club, the excellent reporting of sites like the Seattle Bike Blog, and the investment by the city in high-profile traffic safety projects such as protected bike lanes, road diets/rechannelizations, and the Seattle Bicycle Master Plan.
But, qualitative arguments aside, there is also an increasing array of quantitative data available, primarily from the bicycle counters installed at key locations around the city. The first was the Fremont Bridge Bicycle Counter, installed in October 2012, which gives daily updates on the number of bicycles crossing the bridge: currently upwards of 5000-6000 per day on sunny commute days.
Bicycle advocates have been pointing out the upward trend of the counter, and I must admit I've been as excited as anyone else to see this surge in the popularity of cycling (most days, I bicycle 22 miles round trip, crossing both the Spokane St. and Fremont bridges each way).
But anyone who looks closely at the data must admit: there is a large weekly and monthly swing in the bicycle counts, and people seem most willing to ride on dry, sunny summer days. Given the warm streak we've had in Seattle this spring, I wondered: are we really seeing an increase in cycling, or can it just be attributed to good weather?
Here I've set out to try to answer this question. Along the way, we'll try to deduce just how much weather conditions affect Seattleites' transportation choices.
If anyone is landing on this page via the normal bicycle advocacy channels, I should warn you that this won't look like a typical Seattle-bike-advocate blog post. I currently work as a data scientist at the University of Washington eScience Institute, where I'm incredibly fortunate to have the flexibility to spend a couple hours each week on side-projects like this. Most of my blog posts are pretty technical in nature: I tend to focus on statistical methods and visualization using the Python programming language.
This post is composed in an IPython notebook, a fully executable document that combines text, data, code, and visualizations all in one place. The nice thing is that anyone with a bit of desire and initiative can install the (free) IPython software on their computer, open this document, re-run and check my results, and perhaps modify my assumptions to see what happens. In a way, this post is as much about how to work with data as it is about what we learn from the data.
In other words, this is an entirely reproducible analysis. Every piece of data and software used here is open and freely available to anyone who wants to use it. It's an example of the direction I think data journalism should go as it increasingly emulates data-driven scientific research.
That said, there's a lot of technical stuff below. If you're not familiar with Python or other data analysis frameworks, don't be afraid to skip over the code and look at the plots, which I'll do my best to explain.
This post will use two datasets, which you can easily access with an internet connection. You can find the exact data I used in the GitHub repository, or access it from the original sources below.
First, I'll be using the Fremont Bridge Hourly Bicycle Counts. To download this data, go to the Fremont Bridge page and export the hourly counts as a CSV file (I accessed this on June 6th, 2014).
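If you'd rather script the download, something like the following should work. Note that the dataset ID and URL pattern here are my assumption about Seattle's open data portal export endpoint, and may have changed since this was written:
# Hypothetical scripted alternative to the manual CSV export.
# Check data.seattle.gov for the current dataset ID if this fails.
import urllib.request
url = 'https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD'
urllib.request.urlretrieve(url, 'FremontHourly.csv')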
Second, I'll be using weather data available at the National Climatic Data Center. We'll use weather data from the SeaTac Airport weather station. To get this data, go to the Climate Data Search page and do the following (I accessed this on June 6th, 2014):
Choose "Daily Summaries"
Choose 2012/10/1 to the present date
Search for "Station", and type in "USW00024233" (ID for SeaTac Airport weather station)
Click the icon on the map and "Add to Cart"
go to "Shopping Cart"
next page: click "select all"
click "continue"
enter email address and submit order
When the dataset is ready, you will get an email with a download link. The wait was about an hour when I did it.
The first thing we're going to do is load and examine the data from the Fremont bike counter. We'll use the pandas package, a free and open source set of data analysis tools for the Python language.
# some necessary imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Load the data file, and create a column with total north/south traffic
hourly = pd.read_csv("FremontHourly.csv", index_col='Date', parse_dates=True)
hourly.columns = ['northbound', 'southbound']
hourly['total'] = hourly['northbound'] + hourly['southbound']
# Resample the data into daily and weekly totals
daily = hourly.resample('d').sum()
weekly = daily.resample('w').sum()
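As a quick check that the file parsed as expected (dates in the index, three numeric columns), we can print the first few rows; the exact values will depend on when you downloaded the data:
# peek at the first few rows of the daily totals
print(daily.head())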
Now let's take a peek at our data and see what it looks like:
weekly[['northbound', 'southbound', 'total']].plot()
plt.ylabel('Weekly riders');
The red line shows the total number of weekly crossings, which is the sum of the northbound and southbound crossings.
At first glance, some spikes stand out in April and May 2014: over 32,000 riders crossed the bridge during one week in May! The overall trend might be a bit clearer if we use a moving window average: basically, for each day we'll take the average of the 30-day period around it:
# 30-day centered moving average of the daily totals
daily['total'].rolling(30, center=True).mean().plot();
This is the increased ridership that folks have been talking about. There is some seasonal variation, but the trend seems clear: 2014 has seen a lot of cyclists crossing the bridge.
But there is clearly still some seasonal variation. Below, we'll try to model this variation based on our intuition about what factors might come into play in people's decision about whether to ride.
For simplicity, I'm going to stick with a linear model here. It would be possible to go deeper and use a more sophisticated model (I'd eventually like to try Random Forests), but a linear model should give us a good approximation of what's happening.
The largest component of the variation we see is a seasonal swing. I'm going to hypothesize that this swing is at least partially due to the changing hours of daylight. We'll compute the number of daylight hours for each date and use this to de-trend the data.
Fortunately, my PhD is in Astronomy, so I once upon a time learned how to compute this:
# Define a function which returns the hours of daylight
# for a given date, based on its day of the year
def hours_of_daylight(date, axis=23.44, latitude=47.61):
    """Compute the hours of daylight for the given date"""
    diff = date - pd.Timestamp('2000-12-21')
    day = diff.total_seconds() / 24. / 3600
    day %= 365.25
    m = (1. - np.tan(np.radians(latitude))
         * np.tan(np.radians(axis) * np.cos(day * np.pi / 182.625)))
    m = max(0, min(m, 2))  # clip to the valid range [0, 2]
    return 24. * np.degrees(np.arccos(1 - m)) / 180.
# add this to our weekly data
weekly['daylight'] = list(map(hours_of_daylight, weekly.index))
daily['daylight'] = list(map(hours_of_daylight, daily.index))
# Plot the daylight curve
weekly['daylight'].plot()
plt.ylabel('hours of daylight (Seattle)');
This looks reasonable: just over 8 hours of daylight in December, and just under 16 hours in June.
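We can also spot-check the function directly at the solstices (a small sanity check on top of the code above):
# evaluate the daylight function at the 2013 solstices
print(hours_of_daylight(pd.Timestamp('2013-06-21')))  # just under 16 hours
print(hours_of_daylight(pd.Timestamp('2013-12-21')))  # just over 8 hours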
To get a feel for the trend, let's plot the daylight hours versus the weekly bicycle traffic:
plt.scatter(weekly['daylight'], weekly['total'])
plt.xlabel('daylight hours')
plt.ylabel('weekly bicycle traffic');
We see a clear trend, though it's also apparent from the wide vertical scatter that other effects are at play.
Let's apply a linear fit to this data. Basically, we'll draw a best-fit line to the points using some convenient tools in the scikit-learn package, which I've been active in developing:
from sklearn.linear_model import LinearRegression
X = weekly[['daylight']].values
y = weekly['total']
clf = LinearRegression(fit_intercept=True).fit(X, y)
weekly['daylight_trend'] = clf.predict(X)
weekly['daylight_corrected_total'] = weekly['total'] - weekly['daylight_trend'] + weekly['daylight_trend'].mean()
xfit = np.linspace(7, 17)
yfit = clf.predict(xfit[:, None])
plt.scatter(weekly['daylight'], weekly['total'])
plt.plot(xfit, yfit, '-k')
plt.title("Bicycle traffic through the year")
plt.xlabel('daylight hours')
plt.ylabel('weekly bicycle traffic');
Once such a linear model is fit, we can look at the model coefficients to see, on average, how the change in one variable affects the change in another:
print(clf.coef_[0])
2056.44964989
This tells us that according to this model, each extra hour of daylight leads to about 2000 more riders per week across the bridge! Of course, in Seattle the length of the day also correlates highly with temperature and precipitation; we'll try to untangle those effects later.
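To put that coefficient in perspective, here's a rough back-of-envelope check, using the approximate daylight extremes for Seattle noted above (about 8.25 hours in December and about 16 in June):
# implied seasonal swing in weekly ridership due to daylight alone
print(clf.coef_[0] * (16 - 8.25))  # roughly 16,000 riders per week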
Now that we have fit this trend, let's subtract it off and replace it by the mean:
trend = clf.predict(weekly[['daylight']].values)
plt.scatter(weekly['daylight'], weekly['total'] - trend + np.mean(trend))
plt.plot(xfit, np.mean(trend) + 0 * yfit, '-k')
plt.title("weekly traffic (detrended)")
plt.xlabel('daylight hours')
plt.ylabel('adjusted weekly count');
This is what I mean by "de-trended" data. We've basically removed the component of the data which correlates with the number of hours in a day, so that what is left is in some way agnostic to this quantity. The "adjusted weekly count" plotted here can be thought of as the number of cyclists we'd expect to see if the hours of daylight were not a factor.
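If you find yourself repeating this operation, it can be packaged as a small helper; this is just a sketch of the same arithmetic used inline above:
# subtract a fitted trend from data, preserving the overall mean
def detrend(values, trend):
    return values - trend + np.mean(trend)

# equivalent to the daylight adjustment above:
# weekly['daylight_corrected_total'] = detrend(weekly['total'], weekly['daylight_trend'])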
Let's visualize this another way. Instead of plotting the number of riders vs daylight hours, we'll again plot the number of riders vs the day of the year, along with the trend:
weekly[['total', 'daylight_trend']].plot()
plt.ylabel("total weekly riders");
We can similarly view the adjusted total number of riders over time by subtracting this green line from the blue line:
weekly['daylight_corrected_total'].plot()
rms = np.std(weekly['daylight_corrected_total'])
plt.ylabel("adjusted total weekly riders")
print("root-mean-square about trend: {0:.0f} riders".format(rms))
root-mean-square about trend: 3100 riders
With the data de-trended, we get a better idea of how bicycling in Seattle has changed over time, corrected for the seasonal variation.
Above we've been looking at weekly data. This is because daily data shows a clear swing as a function of the day of the week, which we'll show here.
days = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
daily['dayofweek'] = daily.index.dayofweek
grouped = daily.groupby('dayofweek')['total'].mean()
grouped.index = days
grouped.plot()
plt.title("Average Traffic By Day")
plt.ylabel("Average Daily Crossings");
As you might expect in a city of bicycle commuters, there is roughly 2.5 times the amount of traffic on weekdays as there is on weekends. Bicycles are not just for entertainment! In Seattle, at least, they are a real means of commuting for thousands of people per day, and the data show this clearly.
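We can make that weekday-to-weekend ratio explicit with a quick check using the grouped means computed above:
# ratio of mean weekday traffic to mean weekend traffic
weekday = grouped[['Mon', 'Tues', 'Wed', 'Thurs', 'Fri']].mean()
weekend = grouped[['Sat', 'Sun']].mean()
print("weekday/weekend ratio: {0:.1f}".format(weekday / weekend))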
Let's de-trend the daily bike counts based on the day of the week. We'll add an indicator variable for each day of the week, and use each of these within the fit (this is an example of what's sometimes known as "one-hot" encoding).
# Add one-hot indicators of weekday
for i in range(7):
    daily[days[i]] = (daily.index.dayofweek == i).astype(float)
# de-trend on days of the week and daylight together
X = daily[days + ['daylight']]
y = daily['total']
clf = LinearRegression().fit(X, y)
daily['dayofweek_trend'] = clf.predict(X)
daily[['total', 'dayofweek_trend']].plot();
This shows all the daily bicycle counts (in blue) along with the best-fit trend based on the day of the week and the number of daylight hours per day. It's clearer if we plot the de-trended data:
daily['dayofweek_corrected'] = (daily['total'] - daily['dayofweek_trend'] + daily['dayofweek_trend'].mean())
print("rms = {0:.0f}".format(np.std(daily['dayofweek_corrected'])))
daily['dayofweek_corrected'].plot();
rms = 698
Now we're getting somewhere! What we're seeing here is the number of bicycle crossings per day, corrected for the daily and annual trends. In other words, this is what we might expect the data to look like if the day of the week and the hours of light per day did not matter.
Let's continue on this line of reasoning, and add some more information to the model.
Next we'll bring in the NCDC climate data described above. The data includes the daily maximum and minimum temperatures, as well as the amount of recorded rainfall, at SeaTac Airport, about 15 miles to the south. Let's take a look at this data:
# Read the weather file
weather = pd.read_csv('SeaTacWeather.csv', index_col='DATE', parse_dates=True, usecols=[2, 3, 6, 7])
# temperatures are in 1/10 deg C; convert to F
weather['TMIN'] = 0.18 * weather['TMIN'] + 32
weather['TMAX'] = 0.18 * weather['TMAX'] + 32
# precip is in 1/10 mm; convert to inches
weather['PRCP'] /= 254
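# Quick sanity check on the unit conversions above (hypothetical values,
# added for illustration): 150 tenths of a degree C is 15.0 C, i.e. 59.0 F,
# and 254 tenths of a mm is exactly one inch.
print(0.18 * 150 + 32)  # 59.0
print(254 / 254.)       # 1.0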
weather['TMIN'].resample('w').min().plot()
weather['TMAX'].resample('w').max().plot()
plt.ylabel('Weekly Temperature Extremes (F)');
plt.title("Temperature Extremes in Seattle");