The MBTA’s GTFS Feed consists of a series of text files in a standard format that represent the system’s static schedule (along with other geographic information). This data ultimately underlies the schedule data available through the MBTA's V3 API, and is referenced by data included in our GTFS-realtime feeds (not covered in this guide).
Throughout this guide, you will gain familiarity with the data tables included in the feed, organized by their corresponding API resources. For the purposes of this guide, we will be using the GTFS Feed active on October 14, 2020.
In the following sections, you will cover the these concepts:
At the end of each section, there will be a question to test your understanding. If you'd like to see an example solution, click the ellipsis under "Reveal Solution"
To open a table of contents, click this button in the sidebar.
You can simply follow along with this guide and run the sample code by selecting code cells and pressing the "Run" button in the toolbar or pressing Shift + Return
. To learn more about the Jupyter Notebook and its ecoystem, see its documentation.
A SQLite database with tables corresponding to each table in the MBTA GTFS Documenation has been loaded for your convenience. Data exploration will be performed in SQL throughout this guide.
## Select this cell and press Shift + Return
%load_ext sql
%sql sqlite:///feed.db
You can now execute SQL queries by adding %%sql
to the top of code cells.
## Select this cell and press Shift + Return
%sql SELECT * FROM feed_info;
Note: While SQL is convenient to use for exploring tabular data like the GTFS Feed, it's important to note that it is not required. In fact, SQL is not actually used anywhere for creating or querying our GTFS feed in our various systems. You should feel free to learn the underlying concepts and explore the data using any tool with which you feel comfortable.
Additionally, you can execute arbitrary Python code if you would rather explore the files that way. The pandas
library has been imported for your convenience.
## Select this cell and press Shift + Return
import pandas as pd
print(2 + 2)
The raw .txt
file are available to download or inspect in the feed
directory. For example, see /feed/feed_info.txt. Feel free to explore these files in Excel, Google Sheets, or your favorite spreadsheet software.
A set of dates on which trips run for one or more routes
GTFS Tables: calendar
, calendar_dates
, calendar_attributes
MBTA V3 API Resource: Service
When a rider is trying to figure out if and when a particular route is running trips, they probably have to ask themselves a few questions:
As an example, let's look at the 89 bus using MBTA.com's Schedule Finder, or the PDF schedule. On MBTA.com, note the four different schedules to choose from (pictured below):
As you can see in the Schedule Finder and PDF schedules, the major difference between these services is in the number and frequency of scheduled trips, though in some cases the pattern, or combination of stops, can also vary. If you've ever heard someone refer to the "Fall weekday schedule" or a "Holiday schedule," they are referring to this concept of a service.
So how do we know on which dates these services are active and running trips, and how is this represented in our GTFS data? Through a combination of the data in the calendar
, calendar_attributes
, and calendar_dates
tables.
First, let's take a look at the first few rows of the calendar
table.
%%sql
SELECT *
FROM calendar
LIMIT 3;
The service_id
is just a unique identifier for a service.
For each column representing a day of week (monday
, tuesday
, etc.), a value of 1
indicates that the service does run trips on that day of week, and a value of 0
indicates that the service does not run trips on that day of week.
The start_date
and end_date
fields indicate the date range for which the service is active.
So looking at service BUS420-10-Wdy-02
, we can say it runs trips on weekdays only from October 2, 2020 to December 18, 2020.
So if we wanted to find services running trips only on weekdays, we could run this query:
%%sql
SELECT *
FROM calendar
WHERE monday = 1
AND tuesday = 1
AND wednesday = 1
AND thursday = 1
AND friday = 1
AND saturday = 0
AND sunday = 0
ORDER BY service_id
LIMIT 3;
And if we want to find services ending before December 1, 2020, we could run this query:
%%sql
SELECT *
FROM calendar
WHERE end_date < "20201201"
ORDER BY service_id
LIMIT 3;
We include additional data about services in the calendar_attributes
table. Let's take a look at the first few rows.
%%sql
SELECT *
FROM calendar_attributes
LIMIT 3;
As you can see, this table includes a service_id
column, which corresponds to a row in the calendar
table, so we can combine these two tables to view human readable names and descriptions for each service.
%%sql
SELECT calendar.service_id, calendar_attributes.service_schedule_name, calendar_attributes.service_description
FROM calendar
JOIN calendar_attributes on calendar.service_id = calendar_attributes.service_id
LIMIT 3;
Rows in the calendar_dates
table define exceptions to the default service patterns defined in the calendar
table. Let's take a look at the first few rows.
%%sql
SELECT *
FROM calendar_dates
LIMIT 10;
Like calendar_attributes
, the service_id
in calendar_dates
corresponds to a row in the calendar
table.
An exception_type
of 1
indicates that we have added the service and its trips on the specified date
.
An exception_type
of 2
indicates that we have removed the service and its trips on the specified date
.
As seen above, on October 16, 2020, we added service RTL420-5-Wdy-01-OrgSkpNrhStnNrd
and removed service RTL420-5-Wdy-01
.
What is the service_id
, service_description
, and end_date
for the Saturday service which has the earliest end date?
%%sql
/* Your code here */
Reveal solution
%%sql
SELECT c.service_id, ca.service_description, c.end_date
FROM calendar as c
JOIN calendar_attributes as ca
ON c.service_id = ca.service_id
WHERE monday = 0
AND tuesday = 0
AND wednesday = 0
AND thursday = 0
AND friday = 0
AND saturday = 1
AND sunday = 0
ORDER BY end_date
LIMIT 1;
Which service has the most days added to the feed as exceptions?
%%sql
/* Your code here */
Reveal solution
%%sql
SELECT service_id, COUNT(*) as added_days
FROM calendar_dates
WHERE exception_type = 1
GROUP BY service_id
ORDER BY added_days DESC
LIMIT 1;
So now that we can identify which services are running trips on a given date, how can we identify which trips for which routes are running?
%%sql
SELECT *
FROM routes
LIMIT 5;
route_id
is a unique identifier for individual routes, which is referenced in other tables, much like service_id
is referenced in calendar_attributes
and calendar_dates
.
route_desc
categorizes a route's level of service. Possible values for the MBTA implementation include:
Commuter Rail
Rapid Transit
Local Bus
Key Bus
Supplemental Bus
Community Bus
Commuter Bus
Ferry
Rail Replacement Bus
Let's look at the entry for the 89 bus.
%%sql
SELECT *
FROM routes
WHERE route_id = "89";
How many routes are categorized as Local Bus
?
%%sql
/* Your code here */
Reveal solution
%%sql
SELECT COUNT(*)
FROM routes
WHERE route_desc = "Local Bus";
Each individual route has one entry in the routes
table. But you may be wondering, how do we represent the different patterns a route may take? We know a trip on a route might vary by a direction (e.g. Inbound vs Outbound) or in the pattern of stops the trip might take, like the 89 bus.
How do we know which patterns exist for a given route, and which pattern a given trip follows? The simplest way is to identify a given trip's route_pattern
.
Subsets of a route, representing different possible patterns of where trips may serve
GTFS Tables: route_patterns, directions
MBTA V3 API Resource: RoutePattern
Let's take a look at the route_patterns
for the 89 bus.
%%sql
SELECT *
FROM route_patterns
WHERE route_id = "89";
Here, you can see four distinct patterns for the Orange Line:
89-2-0
for the Sullivan Sq to Davis Station pattern89-_-0
for the Sullivan Sq to Clarendon Hill pattern89-2-1
for the Davis Square Station to Sullivan Sq pattern89-_-1
for the Clarendon Hill to Sullivan Sq patternYou can also see that each pattern has a direction_id
, either 0 or 1, representing the direction of traveled on a trip. While you could infer trip's the final destination given the route_pattern_name
, (or a given trip's trip_headsign
), a pattern's direction_id
refers to an additional passenger-facing name documented in the directions
table.
Let's take a look at the directions for the 89 bus.
%%sql
SELECT *
FROM directions
WHERE route_id = "89";
Here, we can see that the 89-2-0
and 89-_-0
patterns could also be refered to as "Outbound" and the 89-2-1
and 89-_-1
patterns could also be refered to as "Inbound."
With the combination of service_id
, route_id
, route_pattern_id
, and direction_id
, we can tell which routes are running which trips on a given date.
Which routes have exactly two patterns?
%%sql
/* Your code here */
Reveal solution
%%sql
SELECT route_id, COUNT(*) as n_patterns
FROM route_patterns
GROUP BY route_id
HAVING COUNT(*) = 2;
%%sql
SELECT *
FROM trips
LIMIT 3;
Once again, we see a service_id
column corresponding to a row in the calendar
table.
Additionally, we see:
trip_id
: a unique identifier for the triproute_id
: a unique identifier for a route in the routes
table, which contains data about each routeroute_pattern_id
: a unique identifier for a pattern in the route_patterns
table, which contains data about each combination of stops a route may takeYou may also notice there is no date field. That is because in GTFS, a trip
is not an individual trip as a rider would perceive it (i.e. “the Inbound 89 trip that occurs at 8:39am on October 27th, 2020”), but a trip that occurs once on every date defined for a given service.
Let's take a look at all of the services which include trips by the 89 bus.
%%sql
SELECT DISTINCT service_id
FROM trips
WHERE route_id = "89";
As a reminder, these services can include many trips from many different routes.
%%sql
SELECT COUNT(*)
FROM (
SELECT DISTINCT route_id
FROM trips
WHERE service_id == "FallWeekday"
);
Now, let's take a look at all of the trips for the 89 bus on the FallWeekday
service.
%%sql
SELECT *
FROM trips
WHERE service_id == "FallWeekday" AND route_id = "89";
Finally, let's look at only trips on the Sullivan Sq to Davis Square pattern (recall this is 89-2-0
as seen above).
%%sql
SELECT *
FROM trips
WHERE service_id == "FallWeekday" AND route_pattern_id = "89-2-0";
For each service running 89 trips, how many trips are on the Sullivan Sq to Clarendon Hill pattern?
%%sql
/* Your code here */
Reveal solution
%%sql
SELECT service_id, COUNT(*) as n_trips
FROM TRIPS
WHERE route_pattern_id = "89-_-0"
GROUP BY service_id;
When a vehicle should arrive and depart a given stop for a given trip
GTFS Table: stop_times
MBTA V3 API Resource: Schedule
The stop_times
table includes a row for every scheduled stop on every scheduled trip in a GTFS Feed. If that sounds like a lot of rows, it is! This is by far the largest table in the feed. Here, we can see there are 2,019,434 stop_times
in this version of the feed.
%%sql
SELECT COUNT(*)
FROM stop_times;
Let's take a look at the first few rows.
%%sql
SELECT *
FROM stop_times
LIMIT 10;
Each trip_id
corresponds to a row in the trips
table, and there should be one row for every stop that trip makes.
The arrival_time
and departure_time
indicate when a vehicle should arrive and depart the stop identified by stop_id
(more on stops later).
The stop_sequence
field indicates the order of stops for a particular trip. The values must increase along the trip but do not need to be consecutive.
So if we have a given trip_id
, we can see all of the stops in order for that trip.
%%sql
SELECT *
FROM stop_times
WHERE trip_id = "45295219"
ORDER BY stop_sequence;
Data about the physical stops referenced by stop_id
can be found in the stops
table.
At what time should the trip 45692465
arrive at its final stop?
%%sql
/* Your code here */
Reveal solution
%%sql
SELECT arrival_time
FROM stop_times
WHERE trip_id = "45692465"
AND stop_sequence = (
SELECT MAX(stop_sequence)
FROM stop_times
WHERE trip_id = "45692465"
GROUP BY trip_id
);
%%sql
SELECT *
FROM stops
LIMIT 10;
This information can be joined with the stop_times
table to construct a readable schedule for a given trip.
%%sql
SELECT st.trip_id as trip_id,
st.arrival_time as arrival_time,
st.departure_time as departure_time,
st.stop_id as stop_id,
st.stop_sequence as stop_sequence,
s.stop_name as stop_name
FROM stop_times as st
JOIN stops as s
ON st.stop_id = s.stop_id
WHERE trip_id = "45771950"
ORDER BY stop_sequence;
Display the following fields for the last trip of the day for the 89 bus on the Sullivan Sq to Clarendon Hill pattern on Fall weekdays:
stop_id
stop_name
arrival_time
departure_time
%%sql
/* Your code here */
Reveal solution
%%sql
SELECT st.stop_id as stop_id, s.stop_name as stop_name, MAX(st.arrival_time) as arrival_time, MAX(st.departure_time) as departure_time
FROM stop_times as st
JOIN stops as s
ON st.stop_id == s.stop_id
JOIN trips as t
ON st.trip_id == t.trip_id
WHERE t.route_id == "89"
AND t.service_id == "FallWeekday"
AND t.route_pattern_id == "89-_-1"
GROUP BY st.stop_sequence, s.stop_name
ORDER BY stop_sequence
Open up a new notebook and explore the current feed!