This notebook originally appeared as a post on the blog Pythonic Perambulations. The content is BSD licensed.
This week Pronto CycleShare, Seattle's bicycle share system, turned one year old. To celebrate this, Pronto made available a large cache of data from the first year of operation and announced the Pronto Cycle Share Data Challenge, which offers prizes for different categories of analysis.
There are a lot of tools out there that you could use to analyze data like this, but my tool of choice is (obviously) Python. In this post, I want to show how you can get started analyzing this data and joining it with other available data sources using the PyData stack, namely NumPy, Pandas, Matplotlib, and Seaborn. Here I'll take a look at some of the basic questions you can answer with this data. Later I hope to find the time to dig deeper and ask some more interesting and creative questions – stay tuned!
For those who aren't familiar, this post is composed in the form of a Jupyter Notebook, which is an open document format that combines text, code, data, and graphics and is viewable through the web browser – if you have not used it before I encourage you to try it out! You can download the notebook containing this post here, open it with Jupyter, and start asking your own questions of the data.
We'll start by downloading the data (available on Pronto's Website) which you can do by uncommenting the following shell commands (the exclamation mark here is a special IPython syntax to run a shell command). The total download is about 70MB, and the unzipped files are around 900MB.
# !curl -O https://s3.amazonaws.com/pronto-data/open_data_year_one.zip
# !unzip open_data_year_one.zip
Next we need some standard Python package imports:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns; sns.set()
And now we load the trip data with Pandas:
trips = pd.read_csv('2015_trip_data.csv',
                    parse_dates=['starttime', 'stoptime'],
                    infer_datetime_format=True)
trips.head()
|   | trip_id | starttime | stoptime | bikeid | tripduration | from_station_name | to_station_name | from_station_id | to_station_id | usertype | gender | birthyear |
|---|---------|-----------|----------|--------|--------------|-------------------|-----------------|-----------------|---------------|----------|--------|-----------|
| 0 | 431 | 2014-10-13 10:31:00 | 2014-10-13 10:48:00 | SEA00298 | 985.935 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Male | 1960 |
| 1 | 432 | 2014-10-13 10:32:00 | 2014-10-13 10:48:00 | SEA00195 | 926.375 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Male | 1970 |
| 2 | 433 | 2014-10-13 10:33:00 | 2014-10-13 10:48:00 | SEA00486 | 883.831 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Female | 1988 |
| 3 | 434 | 2014-10-13 10:34:00 | 2014-10-13 10:48:00 | SEA00333 | 865.937 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Female | 1977 |
| 4 | 435 | 2014-10-13 10:34:00 | 2014-10-13 10:49:00 | SEA00202 | 923.923 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Male | 1971 |
Each row of this trip dataset is a single ride by a single person, and the data contains over 140,000 rows!
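We can verify the size directly (the exact count depends on which data release you downloaded):

# Total number of recorded trips
len(trips)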
Let's start by looking at the trend in the number of daily trips over the course of the year:
# Extract the date and hour of day from each trip's start time
ind = pd.DatetimeIndex(trips.starttime)
trips['date'] = ind.date.astype('datetime64')
trips['hour'] = ind.hour

# Count trips by date and user type
by_date = trips.pivot_table('trip_id', aggfunc='count',
                            index='date',
                            columns='usertype')
fig, ax = plt.subplots(2, figsize=(16, 8))
fig.subplots_adjust(hspace=0.4)
by_date.iloc[:, 0].plot(ax=ax[0], title='Annual Members');
by_date.iloc[:, 1].plot(ax=ax[1], title='Day-Pass Users');
This plot shows the daily trend, separated by annual members (top) and day-pass users (bottom). Two features stand out: annual-member usage shows a strong week-scale oscillation, and the overall pattern of use changes noticeably around January 2015.
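To see the seasonal trend apart from this weekly oscillation, one option is to smooth the daily counts with a centered rolling mean; a minimal sketch (the 14-day window is an arbitrary choice):

# Smooth daily counts with a two-week centered rolling mean
# to suppress the weekly oscillation and show the seasonal trend
by_date.rolling(14, center=True).mean().plot(figsize=(16, 4))
plt.ylabel('smoothed trips per day');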
Let's zoom in on this weekly trend by averaging all rides by day of week. Because of the change in pattern around January 2015, we'll split the data both by year and by day of week:
by_weekday = by_date.groupby([by_date.index.year,
                              by_date.index.dayofweek]).mean()
by_weekday.columns.name = None # remove label for plot
fig, ax = plt.subplots(1, 2, figsize=(16, 6), sharey=True)
by_weekday.loc[2014].plot(title='Average Use by Day of Week (2014)', ax=ax[0]);
by_weekday.loc[2015].plot(title='Average Use by Day of Week (2015)', ax=ax[1]);
for axi in ax:
    axi.set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
We see a complementary pattern overall: annual members tend to ride Monday through Friday (i.e., as part of a commute), while day-pass users tend to ride on the weekend. This pattern didn't fully develop until the start of 2015, however, especially for annual members: it seems that for the first couple of months, users had not yet adapted their commute habits to make use of Pronto!
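We can quantify this split by computing the fraction of each user type's trips that fall on a weekend, using the date column added above:

# Fraction of trips falling on Saturday or Sunday, for each user type
weekend = trips.date.dt.dayofweek >= 5
weekend.groupby(trips.usertype).mean()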
It's also quite interesting to view the average hourly trips by weekday and weekend. This takes a bit of manipulation:
# Count trips by date and by hour
by_hour = trips.pivot_table('trip_id', aggfunc='count',
                            index=['date', 'hour'],
                            columns='usertype').fillna(0).reset_index('hour')

# Average these counts by weekday/weekend
by_hour['weekend'] = (by_hour.index.dayofweek >= 5)
by_hour = by_hour.groupby(['weekend', 'hour']).mean()
by_hour.index = by_hour.index.set_levels([['weekday', 'weekend'],
                                          ["{0}:00".format(i) for i in range(24)]])
by_hour.columns.name = None
Now we can plot the results to see the hourly trends:
fig, ax = plt.subplots(1, 2, figsize=(16, 6), sharey=True)
by_hour.loc['weekday'].plot(title='Average Hourly Use (Mon-Fri)', ax=ax[0])
by_hour.loc['weekend'].plot(title='Average Hourly Use (Sat-Sun)', ax=ax[1])
ax[0].set_ylabel('Average Trips per Hour');
We see a clear difference between a "commute" pattern, which sharply peaks in the morning and evening (e.g. annual members during weekdays) and a "recreational" pattern, which has a broad peak in the early afternoon (e.g. annual members on weekends, and short-term users all the time). Interestingly, the average behavior of annual pass holders on weekends seems to be almost identical to the average behavior of day-pass users on weekdays!
For those who have read my previous posts, you might recognize this as very similar to the patterns I found in the Fremont Bridge bicycle counts.
Next let's take a look at the durations of trips. Pronto rides are designed to be up to 30 minutes; any single ride longer than this incurs a usage fee of a couple of dollars for the first additional half hour, and about ten dollars per hour thereafter.
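To make that pricing concrete, here is a hypothetical fee estimator using those rough numbers (the dollar amounts are assumptions for illustration, not Pronto's official rate schedule):

def estimated_fee(minutes):
    """Rough usage fee in dollars: assumes ~$2 for the first extra
    half hour and ~$10 per hour thereafter (not official rates)."""
    if minutes <= 30:
        return 0.0
    elif minutes <= 60:
        return 2.0
    return 2.0 + 10.0 * (minutes - 60) / 60

estimated_fee(25), estimated_fee(45), estimated_fee(90)  # (0.0, 2.0, 7.0)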
Let's look at the distribution of trip durations for Annual members and short-term pass holders:
trips['minutes'] = trips.tripduration / 60
trips.groupby('usertype')['minutes'].hist(bins=np.arange(61), alpha=0.5, density=True);
plt.xlabel('Duration (minutes)')
plt.ylabel('relative frequency')
plt.title('Trip Durations')
plt.text(34, 0.09, "Free Trips\n\nAdditional Fee", ha='right',
         size=18, rotation=90, alpha=0.5, color='red')
plt.legend(['Annual Members', 'Short-term Pass'])
plt.axvline(30, linestyle='--', color='red', alpha=0.3);
Here I have added a red dashed line separating the free rides (left) from the rides that incur a usage fee (right). It seems that annual members are much more savvy to the system rules: only a small tail of their trip distribution lies beyond 30 minutes. Around one in four day-pass rides, on the other hand, exceeds the half-hour limit and incurs additional fees. My hunch is that these day-pass users aren't fully aware of the pricing structure ("I paid for the day, right?") and likely walk away unhappy with the experience.
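That one-in-four figure is easy to check directly from the trip durations:

# Fraction of trips exceeding the 30-minute free window, by user type
(trips.minutes > 30).groupby(trips.usertype).mean()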
It's also interesting to look at the distance of the trips. Distances between stations are not included in Pronto's data release, so we need to find them via another source. Let's start by loading the station data – and because some of the trips start and end at Pronto's shop, we'll add this as another "station":
stations = pd.read_csv('2015_station_data.csv')
pronto_shop = dict(id=54, name="Pronto shop",
                   terminal="Pronto shop",
                   lat=47.6173156, long=-122.3414776,
                   dockcount=100, online='10/13/2014')
stations = pd.concat([stations, pd.DataFrame([pronto_shop])],
                     ignore_index=True)
Now we need to find bicycling distances between pairs of lat/lon coordinates. Fortunately, Google Maps has a Distance Matrix API that we can use for free.
Reading the fine print, free use is limited to 2500 distances per day, and 100 distances per 10 seconds. With 55 stations we have $55 \times 54 / 2 = 1485$ unique nonzero distances, so we can query all of them in about ten minutes on a single day, for free (if we do it carefully).
To do this, we'll query one (partial) row at a time, waiting 10+ seconds between queries (Note: we might also use the googlemaps Python package instead, but it requires obtaining an API key).
from time import sleep

def query_distances(stations=stations):
    """Query the Google API for bicycling distances"""
    latlon_list = ['{0},{1}'.format(lat, long)
                   for (lat, long) in zip(stations.lat, stations.long)]

    def create_url(i):
        URL = ('https://maps.googleapis.com/maps/api/distancematrix/json?'
               'origins={origins}&destinations={destinations}&mode=bicycling')
        return URL.format(origins=latlon_list[i],
                          destinations='|'.join(latlon_list[i + 1:]))

    for i in range(len(latlon_list) - 1):
        url = create_url(i)
        filename = "distances_{0}.json".format(stations.terminal.iloc[i])
        print(i, filename)
        !curl "{url}" -o {filename}
        sleep(11)  # only one query per 10 seconds!
import json

def build_distance_matrix(stations=stations):
    """Build a matrix from the Google API results"""
    dist = np.zeros((len(stations), len(stations)), dtype=float)
    for i, term in enumerate(stations.terminal[:-1]):
        filename = 'queried_distances/distances_{0}.json'.format(term)
        row = json.load(open(filename))
        dist[i, i + 1:] = [el['distance']['value']
                           for el in row['rows'][0]['elements']]
    dist += dist.T
    distances = pd.DataFrame(dist, index=stations.terminal,
                             columns=stations.terminal)
    distances.to_csv('station_distances.csv')
    return distances
# Only call this the first time
import os

if not os.path.exists('station_distances.csv'):
    # Note: within the free limit, this can run at most once per day!
    query_distances()

    # Move all the queried files into a directory
    # so we don't accidentally overwrite them
    if not os.path.exists('queried_distances'):
        os.makedirs('queried_distances')
    !mv distances_*.json queried_distances

    # Build distance matrix and save to CSV
    distances = build_distance_matrix()
Here's what the first 5x5 section of the distance matrix (in meters) looks like:
distances = pd.read_csv('station_distances.csv', index_col='terminal')
distances.iloc[:5, :5]
| terminal | BT-01 | BT-03 | BT-04 | BT-05 | CBD-13 |
|----------|-------|-------|-------|-------|--------|
| BT-01 | 0 | 422 | 1067 | 867 | 1342 |
| BT-03 | 422 | 0 | 838 | 445 | 920 |
| BT-04 | 1067 | 838 | 0 | 1094 | 1121 |
| BT-05 | 867 | 445 | 1094 | 0 | 475 |
| CBD-13 | 1342 | 920 | 1121 | 475 | 0 |
Let's convert these distances to miles and join them to our trips data:
stacked = distances.stack() / 1609.34 # convert meters to miles
stacked.name = 'distance'
trips = trips.join(stacked, on=['from_station_id', 'to_station_id'])
Now we can plot the distribution of trip distances:
fig, ax = plt.subplots(figsize=(12, 4))
trips.groupby('usertype')['distance'].hist(bins=np.linspace(0, 6.99, 50),
                                           alpha=0.5, ax=ax);
plt.xlabel('Distance between start & end (miles)')
plt.ylabel('number of rides')
plt.title('Minimum Distance of Trip')
plt.legend(['Annual Members', 'Short-term Pass']);
Keep in mind that this shows the shortest possible distance between stations, and thus is a lower bound on the actual distance ridden on each trip. Many trips (especially for day pass users) begin and end within a few blocks. Beyond this, trips peak at around 1 mile, though some extreme users are pushing their trips out to four or more miles.
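One way to quantify how many trips "begin and end within a few blocks" is to count rides that start and end at the same station, which have zero station-to-station distance (a rough proxy, since short one-way hops also land near zero):

# Fraction of rides that start and end at the same station
round_trip = trips.from_station_id == trips.to_station_id
round_trip.groupby(trips.usertype).mean()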
Given these distances, we can also compute a lower bound on the estimated riding speed. Let's do this, and then take a look at the distribution of speeds for Annual and Short-term users:
trips['speed'] = trips.distance * 60 / trips.minutes
trips.groupby('usertype')['speed'].hist(bins=np.linspace(0, 15, 50), alpha=0.5, density=True);
plt.xlabel('lower bound riding speed (MPH)')
plt.ylabel('relative frequency')
plt.title('Rider Speed Lower Bound (MPH)')
plt.legend(['Annual Members', 'Short-term Pass']);
Interestingly, the distributions are quite different, with annual riders showing on average a higher inferred speed. You might be tempted to conclude here that annual members ride faster than day-pass users, but the data alone aren't sufficient to support this conclusion. This data could also be explained if annual users tend to go from point A to point B by the most direct route, while day pass users tend to meander around and get to their destination indirectly. I suspect that the reality is some mix of these two effects.
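A quick numerical summary of that difference is the median lower-bound speed for each group:

# Median lower-bound speed (MPH) by user type
trips.groupby('usertype')['speed'].median()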
It is also informative to take a look at the relationship between distance and speed:
g = sns.FacetGrid(trips, col="usertype", hue='usertype', height=6)
g.map(plt.scatter, "distance", "speed", s=4, alpha=0.2)

# Add lines and labels
x = np.linspace(0, 10)
g.axes[0, 0].set_ylabel('Lower Bound Speed')
for ax in g.axes.flat:
    ax.set_xlabel('Lower Bound Distance')
    ax.plot(x, 2 * x, '--r', alpha=0.3)
    ax.text(9.8, 16.5, "Free Trips\n\nAdditional Fee", ha='right',
            size=18, rotation=39, alpha=0.5, color='red')
    ax.axis([0, 10, 0, 25])