#!/usr/bin/env python # coding: utf-8 # # GTFS Onboarding Exercise # ## Introduction # # The [MBTA’s GTFS Feed](https://github.com/mbta/gtfs-documentation/blob/master/reference/gtfs.md) consists of a series of [text files in a standard format](https://developers.google.com/transit/gtfs/reference) that represent the system’s static schedule (along with other geographic information). This data ultimately underlies the schedule data available through the [MBTA's V3 API](https://www.mbta.com/developers/v3-api), and is referenced by data included in our [GTFS-realtime](https://www.mbta.com/developers/gtfs-realtime) feeds (not covered in this guide). # # Throughout this guide, you will gain familiarity with the data tables included in the feed, organized by their corresponding API resources. For the purposes of this guide, we will be using the GTFS Feed active on [October 14, 2020](https://cdn.mbta.com/archive/archived_feeds.txt). # # In the following sections, you will cover the these concepts: # * [Services](#Services): A set of dates on which trips run for one or more routes # * [Routes](#Routes): Data about individual routes # * [Route Patterns](#Route-Patterns): Subsets of a route, representing different possible patterns of where trips may serve # * [Trips](#Trips): The journey of a particular vehicle through a set of stops # * [Schedules](#Schedules): When a vehicle should arrive and depart a given stop for a given trip # * [Stops](#Stops): Details for a specific stop # # At the end of each section, there will be a question to test your understanding. If you'd like to see an example solution, click the ellipsis under "Reveal Solution" # # ![](img/reveal.png) # # To open a table of contents, click this button in the sidebar. # # ![](img/toc.png) # # ### Resources # * [GTFS Documentation](https://github.com/google/transit/blob/master/gtfs/spec/en/reference.md) # * [MBTA GTFS Documentation](https://github.com/mbta/gtfs-documentation/blob/master/reference/gtfs.md) # * [MBTA V3 API Documentation](https://api-v3.mbta.com/docs/swagger/index.html) # ## Getting Oriented # ### Jupyter Notebook # You can simply follow along with this guide and run the sample code by selecting code cells and pressing the "Run" button in the toolbar or pressing `Shift + Return`. To learn more about the Jupyter Notebook and its ecoystem, see its [documentation](https://jupyter.org/documentation). # ### SQL # A SQLite database with tables corresponding to each table in the [MBTA GTFS Documenation](https://github.com/mbta/gtfs-documentation/blob/master/reference/gtfs.md) has been loaded for your convenience. Data exploration will be performed in SQL throughout this guide. # In[ ]: ## Select this cell and press Shift + Return get_ipython().run_line_magic('load_ext', 'sql') get_ipython().run_line_magic('sql', 'sqlite:///feed.db') # You can now execute SQL queries by adding `%%sql` to the top of code cells. # In[ ]: ## Select this cell and press Shift + Return get_ipython().run_line_magic('sql', 'SELECT * FROM feed_info;') # **Note**: While SQL is convenient to use for exploring tabular data like the GTFS Feed, it's important to note that it is not required. In fact, SQL is not _actually_ used anywhere for creating or querying our GTFS feed in our various systems. You should feel free to learn the underlying concepts and explore the data using any tool with which you feel comfortable. # ### Python # Additionally, you can execute arbitrary Python code if you would rather explore the files that way. The `pandas` library has been imported for your convenience. # In[ ]: ## Select this cell and press Shift + Return import pandas as pd print(2 + 2) # ### Raw Data # The raw `.txt` file are available to download or inspect in the `feed` directory. For example, see [/feed/feed_info.txt](feed/feed_info.txt). Feel free to explore these files in Excel, Google Sheets, or your favorite spreadsheet software. # ## Services # *A set of dates on which trips run for one or more routes* # # **GTFS Tables**: [`calendar`](feed/calendar.txt), [`calendar_dates`](feed/calendar_dates.txt), [`calendar_attributes`](feed/calendar_attributes.txt) # # **MBTA V3 API Resource**: [`Service`](https://api-v3.mbta.com/docs/swagger/index.html#/Service/ApiWeb_ServiceController_index) # # When a rider is trying to figure out if and when a particular route is running trips, they probably have to ask themselves a few questions: # * What is the day of week? # * What's the date? # * Is it a holiday? # # As an example, let's look at the 89 bus using MBTA.com's [Schedule Finder](https://www.mbta.com/schedules/89/line?schedule_direction%5Bdirection_id%5D=0&schedule_direction%5Borigin%5D=place-sull), or the [PDF schedule](https://cdn.mbta.com/sites/default/files/route_pdfs/2020-fall/R089.pdf). On MBTA.com, note the four different schedules to choose from (pictured below): # * Weekday Schedule (Fall) # * Saturday Schedule (Fall) # * Sunday Schedule (Fall) # * Thanksgiving Day, Nov 26 # # ![](img/services.png) # # As you can see in the Schedule Finder and PDF schedules, the major difference between these services is in the number and frequency of scheduled trips, though in some cases the pattern, or combination of stops, can also vary. If you've ever heard someone refer to the "Fall weekday schedule" or a "Holiday schedule," they are referring to this concept of a service. # # So how do we know on which dates these services are active and running trips, and how is this represented in our GTFS data? Through a combination of the data in the `calendar`, `calendar_attributes`, and `calendar_dates` tables. # ### calendar # First, let's take a look at the first few rows of the `calendar` table. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM calendar\nLIMIT 3;\n') # The `service_id` is just a unique identifier for a service. # # For each column representing a day of week (`monday`, `tuesday`, etc.), a value of `1` indicates that the service **does** run trips on that day of week, and a value of `0` indicates that the service **does not** run trips on that day of week. # # The `start_date` and `end_date` fields indicate the date range for which the service is active. # # So looking at service `BUS420-10-Wdy-02`, we can say it runs trips on weekdays only from October 2, 2020 to December 18, 2020. # # So if we wanted to find services running trips only on weekdays, we could run this query: # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM calendar\nWHERE monday = 1\nAND tuesday = 1\nAND wednesday = 1\nAND thursday = 1\nAND friday = 1\nAND saturday = 0\nAND sunday = 0\nORDER BY service_id\nLIMIT 3;\n') # And if we want to find services ending before December 1, 2020, we could run this query: # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM calendar\nWHERE end_date < "20201201"\nORDER BY service_id\nLIMIT 3;\n') # ### calendar_attributes # We include additional data about services in the `calendar_attributes` table. Let's take a look at the first few rows. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM calendar_attributes\nLIMIT 3;\n') # As you can see, this table includes a `service_id` column, which corresponds to a row in the `calendar` table, so we can combine these two tables to view human readable names and descriptions for each service. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT calendar.service_id, calendar_attributes.service_schedule_name, calendar_attributes.service_description\nFROM calendar\nJOIN calendar_attributes on calendar.service_id = calendar_attributes.service_id\nLIMIT 3;\n') # ### calendar_dates # Rows in the `calendar_dates` table define exceptions to the default service patterns defined in the `calendar` table. Let's take a look at the first few rows. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM calendar_dates\nLIMIT 10;\n') # Like `calendar_attributes`, the `service_id` in `calendar_dates` corresponds to a row in the `calendar` table. # # An `exception_type` of `1` indicates that we have added the service and its trips on the specified `date`. # # An `exception_type` of `2` indicates that we have removed the service and its trips on the specified `date`. # # As seen above, on October 16, 2020, we added service `RTL420-5-Wdy-01-OrgSkpNrhStnNrd` and removed service `RTL420-5-Wdy-01`. # ### Test Your Knowledge # What is the `service_id`, `service_description`, and `end_date` for the Saturday service which has the earliest end date? # In[ ]: get_ipython().run_cell_magic('sql', '', '\n/* Your code here */\n') # Reveal solution # In[ ]: get_ipython().run_cell_magic('sql', '', '\nSELECT c.service_id, ca.service_description, c.end_date\nFROM calendar as c\nJOIN calendar_attributes as ca\nON c.service_id = ca.service_id\nWHERE monday = 0\nAND tuesday = 0\nAND wednesday = 0\nAND thursday = 0\nAND friday = 0\nAND saturday = 1\nAND sunday = 0\nORDER BY end_date\nLIMIT 1;\n\n') # Which service has the most days added to the feed as exceptions? # In[ ]: get_ipython().run_cell_magic('sql', '', '\n/* Your code here */\n') # Reveal solution # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT service_id, COUNT(*) as added_days\nFROM calendar_dates\nWHERE exception_type = 1\nGROUP BY service_id\nORDER BY added_days DESC\nLIMIT 1;\n') # So now that we can identify which services are running trips on a given date, how can we identify *which* trips for *which* routes are running? # ## Routes # *Data about individual routes* # # **GTFS Table:** [routes](feed/routes.txt) # # **MBTA V3 API Resource**: [Route](https://api-v3.mbta.com/docs/swagger/index.html#/Route/ApiWeb_RouteController_show) # # First, let's take a look at the first few rows of the `routes` table. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM routes\nLIMIT 5;\n') # `route_id` is a unique identifier for individual routes, which is referenced in other tables, much like `service_id` is referenced in `calendar_attributes` and `calendar_dates`. # # `route_desc` categorizes a route's level of service. Possible values for the MBTA implementation include: # * `Commuter Rail` # * `Rapid Transit` # * `Local Bus` # * `Key Bus` # * `Supplemental Bus` # * `Community Bus` # * `Commuter Bus` # * `Ferry` # * `Rail Replacement Bus` # # Let's look at the entry for the 89 bus. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM routes\nWHERE route_id = "89";\n') # ### Test Your Knowledge # How many routes are categorized as `Local Bus`? # In[ ]: get_ipython().run_cell_magic('sql', '', '\n/* Your code here */\n') # Reveal solution # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT COUNT(*)\nFROM routes\nWHERE route_desc = "Local Bus";\n') # Each individual route has one entry in the `routes` table. But you may be wondering, how do we represent the different patterns a route may take? We know a trip on a route might vary by a direction (e.g. Inbound vs Outbound) or in the pattern of stops the trip might take, like the 89 bus. # # ![](img/rp.png) # # How do we know which patterns exist for a given route, and which pattern a given trip follows? The simplest way is to identify a given trip's `route_pattern`. # ## Route Patterns # *Subsets of a route, representing different possible patterns of where trips may serve* # # **GTFS Tables**: [route_patterns](feed/route_patterns.txt), [directions](feed/directions.txt) # # **MBTA V3 API Resource**: [RoutePattern](https://api-v3.mbta.com/docs/swagger/index.html#/RoutePattern/ApiWeb_RoutePatternController_show) # ### route_patterns # Let's take a look at the `route_patterns` for the 89 bus. # In[ ]: get_ipython().run_cell_magic('sql', '', '\nSELECT * \nFROM route_patterns\nWHERE route_id = "89";\n') # Here, you can see four distinct patterns for the Orange Line: # * `89-2-0` for the Sullivan Sq to Davis Station pattern # * `89-_-0` for the Sullivan Sq to Clarendon Hill pattern # * `89-2-1` for the Davis Square Station to Sullivan Sq pattern # * `89-_-1` for the Clarendon Hill to Sullivan Sq pattern # # You can also see that each pattern has a `direction_id`, either 0 or 1, representing the direction of traveled on a trip. While you could infer trip's the final destination given the `route_pattern_name`, (or a given trip's `trip_headsign`), a pattern's `direction_id` refers to an additional passenger-facing name documented in the `directions` table. # ### directions # # Let's take a look at the directions for the 89 bus. # In[ ]: get_ipython().run_cell_magic('sql', '', '\nSELECT *\nFROM directions\nWHERE route_id = "89";\n') # Here, we can see that the `89-2-0` and `89-_-0` patterns could also be refered to as "Outbound" and the `89-2-1` and `89-_-1` patterns could also be refered to as "Inbound." # # With the combination of `service_id`, `route_id`, `route_pattern_id`, and `direction_id`, we can tell which routes are running which trips on a given date. # ### Test Your Knowledge # Which routes have exactly two patterns? # In[ ]: get_ipython().run_cell_magic('sql', '', '\n/* Your code here */\n') # Reveal solution # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT route_id, COUNT(*) as n_patterns\nFROM route_patterns\nGROUP BY route_id\nHAVING COUNT(*) = 2;\n') # ## Trips # *The journey of a particular vehicle through a set of stops* # # **GTFS Table**: [trips](feed/trips.txt) # # **MBTA V3 API Resource**: [`Trip`](https://api-v3.mbta.com/docs/swagger/index.html#/Trip/ApiWeb_TripController_show) # # The `trips` table contains data for each individual trip in a GTFS feed. Let's look at the first few rows. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM trips\nLIMIT 3;\n') # Once again, we see a `service_id` column corresponding to a row in the `calendar` table. # # Additionally, we see: # * `trip_id`: a unique identifier for the trip # * `route_id`: a unique identifier for a route in the `routes` table, which contains data about each route # * `route_pattern_id`: a unique identifier for a pattern in the `route_patterns` table, which contains data about each combination of stops a route may take # # You may also notice there is no date field. That is because in GTFS, a `trip` is not an individual trip as a rider would perceive it (i.e. “the Inbound 89 trip that occurs at 8:39am on October 27th, 2020”), but a trip that occurs once on every date defined for a given service. # # Let's take a look at all of the services which include trips by the 89 bus. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT DISTINCT service_id\nFROM trips\nWHERE route_id = "89";\n') # As a reminder, these services can include many trips from many different routes. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT COUNT(*)\nFROM (\n SELECT DISTINCT route_id\n FROM trips\n WHERE service_id == "FallWeekday"\n);\n') # Now, let's take a look at all of the trips for the 89 bus on the `FallWeekday` service. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM trips\nWHERE service_id == "FallWeekday" AND route_id = "89";\n') # Finally, let's look at only trips on the Sullivan Sq to Davis Square pattern (recall this is `89-2-0` as seen above). # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM trips\nWHERE service_id == "FallWeekday" AND route_pattern_id = "89-2-0";\n') # ### Test Your Knowledge # For each service running 89 trips, how many trips are on the Sullivan Sq to Clarendon Hill pattern? # In[ ]: get_ipython().run_cell_magic('sql', '', '/* Your code here */\n') # Reveal solution # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT service_id, COUNT(*) as n_trips\nFROM TRIPS\nWHERE route_pattern_id = "89-_-0"\nGROUP BY service_id;\n') # ## Schedules # *When a vehicle should arrive and depart a given stop for a given trip* # # **GTFS Table**: [`stop_times`](feed/stop_times.txt) # # **MBTA V3 API Resource**: [`Schedule`](https://api-v3.mbta.com/docs/swagger/index.html#/Schedule/ApiWeb_ScheduleController_index) # # The `stop_times` table includes a row for every scheduled stop on every scheduled trip in a GTFS Feed. If that sounds like a lot of rows, it is! This is by far the largest table in the feed. Here, we can see there are 2,019,434 `stop_times` in this version of the feed. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT COUNT(*)\nFROM stop_times;\n') # Let's take a look at the first few rows. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM stop_times\nLIMIT 10;\n') # Each `trip_id` corresponds to a row in the `trips` table, and there should be one row for every stop that trip makes. # # The `arrival_time` and `departure_time` indicate when a vehicle should arrive and depart the stop identified by `stop_id` (more on stops later). # # The `stop_sequence` field indicates the order of stops for a particular trip. The values must increase along the trip but do not need to be consecutive. # # So if we have a given `trip_id`, we can see all of the stops in order for that trip. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM stop_times\nWHERE trip_id = "45295219"\nORDER BY stop_sequence;\n') # Data about the physical stops referenced by `stop_id` can be found in the `stops` table. # ### Test Your Knowledge # At what time should the trip `45692465` arrive at its final stop? # In[ ]: get_ipython().run_cell_magic('sql', '', '\n/* Your code here */\n') # Reveal solution # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT arrival_time\nFROM stop_times\nWHERE trip_id = "45692465"\nAND stop_sequence = (\n SELECT MAX(stop_sequence)\n FROM stop_times\n WHERE trip_id = "45692465"\n GROUP BY trip_id\n);\n') # ## Stops # *Details for a specific stop* # # **GTFS Table**: [`stops`](feed/stops.txt) # # **MBTA V3 API Resource**: [`Stop`](https://api-v3.mbta.com/docs/swagger/index.html#/Stop/ApiWeb_StopController_show) # # Let's take a look at the first few rows of the `stops` table. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT *\nFROM stops\nLIMIT 10;\n') # This information can be joined with the `stop_times` table to construct a readable schedule for a given trip. # In[ ]: get_ipython().run_cell_magic('sql', '', 'SELECT st.trip_id as trip_id, \n st.arrival_time as arrival_time, \n st.departure_time as departure_time, \n st.stop_id as stop_id,\n st.stop_sequence as stop_sequence,\n s.stop_name as stop_name\nFROM stop_times as st\nJOIN stops as s \nON st.stop_id = s.stop_id\nWHERE trip_id = "45771950"\nORDER BY stop_sequence;\n') # ### Test Your Knowledge # Display the following fields for the last trip of the day for the 89 bus on the Sullivan Sq to Clarendon Hill pattern on Fall weekdays: # * `stop_id` # * `stop_name` # * `arrival_time` # * `departure_time` # In[ ]: get_ipython().run_cell_magic('sql', '', '\n/* Your code here */\n') # Reveal solution # In[ ]: get_ipython().run_cell_magic('sql', '', '\nSELECT st.stop_id as stop_id, s.stop_name as stop_name, MAX(st.arrival_time) as arrival_time, MAX(st.departure_time) as departure_time\nFROM stop_times as st\nJOIN stops as s\nON st.stop_id == s.stop_id\nJOIN trips as t\nON st.trip_id == t.trip_id\nWHERE t.route_id == "89"\nAND t.service_id == "FallWeekday"\nAND t.route_pattern_id == "89-_-1"\nGROUP BY st.stop_sequence, s.stop_name\nORDER BY stop_sequence\n') # ## What's next? # Open up a [new notebook](GTFS_Exploration.ipynb) and explore the current feed!