Inspired by Geospatial Operations at Scale with Dask we're going to look at a the NYC Taxi dataset from a geospatial angle. This also helps us prove out recent performance improvements to GeoPandas (which was recently cythonized) and a new dask-geopandas parallel implementation.
from dask.distributed import Client
client = Client(processes=False)
client
Client
|
Cluster
|
We will segment our data by the official taxi zones available here.
import geopandas as gpd
zones = gpd.read_file('taxi_zones.shp').to_crs({'init' :'epsg:4326'})
zones['zone'] = zones.zone.astype('category')
zones['borough'] = zones.borough.astype('category')
zones.head()
I am densified (external_values, 263 elements) I am densified (5 elements) I am densified (5 elements)
OBJECTID | Shape_Leng | Shape_Area | zone | LocationID | borough | geometry | |
---|---|---|---|---|---|---|---|
0 | 1 | 0.116357 | 0.000782 | Newark Airport | 1 | EWR | POLYGON ((-74.18445299999996 40.6949959999999,... |
1 | 2 | 0.433470 | 0.004866 | Jamaica Bay | 2 | Queens | (POLYGON ((-73.82337597260663 40.6389870471767... |
2 | 3 | 0.084341 | 0.000314 | Allerton/Pelham Gardens | 3 | Bronx | POLYGON ((-73.84792614099985 40.87134223399991... |
3 | 4 | 0.043567 | 0.000112 | Alphabet City | 4 | Manhattan | POLYGON ((-73.97177410965318 40.72582128133705... |
4 | 5 | 0.092146 | 0.000498 | Arden Heights | 5 | Staten Island | POLYGON ((-74.17421738099989 40.56256808599987... |
We can easily plot with matplotlib, which is builtin to geopandas
%matplotlib inline
zones.plot(column='borough', categorical=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f5762442668>
But for most of this notebook we'll be using Bokeh
from bokeh.io import output_notebook
from bokeh.models import GeoJSONDataSource, HoverTool, CategoricalColorMapper, LinearColorMapper
from bokeh.plotting import figure, show
from bokeh.palettes import Category10
output_notebook()
geo_source = GeoJSONDataSource(geojson=zones.to_json())
factors = zones.borough.drop_duplicates()
color_mapper = CategoricalColorMapper(factors=factors.tolist(), palette=Category10[10])
fig = figure()
fig.patches(xs='xs', ys='ys', alpha=0.9, source=geo_source,
color={'field': 'borough', 'transform': color_mapper},
)
hover = HoverTool(
point_policy='follow_mouse',
tooltips='<p><b>Borough</b>: @borough</p><p><b>Zone</b>: @zone</p>'
)
fig.add_tools(hover)
fig.xaxis.visible = False
fig.yaxis.visible = False
fig.grid.visible = False
show(fig)
We read a few columns of the 2015 NYC Taxi and Limousine Commission data with dask.dataframe, and then perform a spatial join with dask-geopandas. Finally we write the result to the parquet format for future analysis.
We'll hold onto only those columns that we need for future computations.
import dask.dataframe as dd
import dask_geopandas as dg
import pandas as pd
import dask
columns = ['pickup_latitude', 'pickup_longitude', 'passenger_count',
'fare_amount', 'tip_amount', 'payment_type']
df = dd.read_csv('csv/yellow_tripdata_2015-*.csv', usecols=columns)
gf = df.set_geometry(df[['pickup_longitude', 'pickup_latitude']],
crs={'init' :'epsg:4326'})
gf2 = dg.sjoin(gf, zones[['zone', 'borough', 'geometry']])
full = gf2[['zone', 'passenger_count', 'fare_amount', 'tip_amount', 'payment_type']]
full['payment_type'] = full.payment_type.astype('u1')
full.to_parquet('nyc-zones.parquet')
full = dd.read_parquet('nyc-zones.parquet')
Warning: CRS does not match
full.head()
zone | passenger_count | fare_amount | tip_amount | payment_type | |
---|---|---|---|---|---|
0 | Penn Station/Madison Sq West | 1 | 12.0 | 3.25 | 1 |
1 | SoHo | 1 | 14.5 | 2.00 | 1 |
2 | Morningside Heights | 1 | 9.5 | 0.00 | 2 |
3 | TriBeCa/Civic Center | 1 | 3.5 | 0.00 | 2 |
4 | Midtown North | 1 | 15.0 | 0.00 | 2 |
Lets start off by reproducing Ravi's plot from his blogpost.
This computes the number of rides in each taxi zone. We plot these on a log scale because the number of rides is so disparate in different parts of the city. Zones in downtown Manhattan have millions of rides per year while some larger zones in Staten Island have only a few tens of rides.
result = full.passenger_count.groupby(full.zone).count().compute()
result.name = 'count'
joined = gpd.GeoDataFrame(pd.merge(result.to_frame(), zones, left_index=True, right_on='zone'))
figures = {}
from bokeh.palettes import viridis
from bokeh.models import ColorBar, LogColorMapper
geo_source = GeoJSONDataSource(geojson=joined.to_json())
color_mapper = LogColorMapper(viridis(256))
fig = figure(title='Number of rides')
fig.patches(xs='xs', ys='ys', alpha=0.9, source=geo_source,
color={'field': 'count', 'transform': color_mapper},
line_width=1, line_alpha=0.5, line_color='black')
hover = HoverTool(
point_policy='follow_mouse',
tooltips=('<div><b>Borough</b>: @borough</div>'
'<div><b>Zone</b>: @zone</div>'
'<div><b>Rides</b>: @count</div>')
)
fig.add_tools(hover)
color_bar = ColorBar(color_mapper=color_mapper,
location=(0, 0),
label_standoff=12)
fig.add_layout(color_bar, 'right')
fig.xaxis.visible = False
fig.yaxis.visible = False
fig.grid.visible = False
figures['ride-counts'] = fig
show(fig)
W-1005 (SNAPPED_TOOLBAR_ANNOTATIONS): Snapped toolbars and annotations on the same side MAY overlap visually: Figure(id='72e74163-642d-4682-b0c4-b9766bf7ccd8', ...)
Regardless of the number of rides we can still compute the average tip fraction spatially over the city. We find generally the same 20% tip throughout the city, with oddly high percentages in the West Village and Dyker Heights.
df = full[(full.tip_amount > 0) & (full.fare_amount > 0)]
tips = df.tip_amount.groupby(df.zone).sum()
fares = df.fare_amount.groupby(df.zone).sum()
result = tips / fares
result = result.compute()
result.name = 'tip_fraction'
joined = gpd.GeoDataFrame(pd.merge(result.to_frame(), zones, left_index=True, right_on='zone'))
joined.head()
I am densified (5 elements) I am densified (5 elements)
tip_fraction | OBJECTID | Shape_Leng | Shape_Area | zone | LocationID | borough | geometry | |
---|---|---|---|---|---|---|---|---|
2 | 0.275410 | 3 | 0.084341 | 0.000314 | Allerton/Pelham Gardens | 3 | Bronx | POLYGON ((-73.84792614099985 40.87134223399991... |
3 | 0.201290 | 4 | 0.043567 | 0.000112 | Alphabet City | 4 | Manhattan | POLYGON ((-73.97177410965318 40.72582128133705... |
4 | 0.238670 | 5 | 0.092146 | 0.000498 | Arden Heights | 5 | Staten Island | POLYGON ((-74.17421738099989 40.56256808599987... |
5 | 0.350325 | 6 | 0.150491 | 0.000606 | Arrochar/Fort Wadsworth | 6 | Staten Island | POLYGON ((-74.06367318899999 40.60219816599994... |
6 | 0.212477 | 7 | 0.107417 | 0.000390 | Astoria | 7 | Queens | POLYGON ((-73.90413637799996 40.76752031699986... |
from bokeh.palettes import viridis
from bokeh.models import ColorBar
geo_source = GeoJSONDataSource(geojson=joined.to_json())
color_mapper = LogColorMapper(viridis(256))
fig = figure(title='Average Tip Fraction')
fig.patches(xs='xs', ys='ys', alpha=0.9, source=geo_source,
color={'field': 'tip_fraction', 'transform': color_mapper},
line_width=1, line_alpha=0.5, line_color='black')
hover = HoverTool(
point_policy='follow_mouse',
tooltips=('<div><b>Borough</b>: @borough</div>'
'<div><b>Zone</b>: @zone</div>'
'<div><b>Tip Fraction</b>: @tip_fraction</div>')
)
fig.add_tools(hover)
color_bar = ColorBar(color_mapper=color_mapper,
location=(0, 0),
label_standoff=12)
fig.add_layout(color_bar, 'right')
fig.xaxis.visible = False
fig.yaxis.visible = False
fig.grid.visible = False
figures['tip-fraction'] = fig
show(fig)
W-1005 (SNAPPED_TOOLBAR_ANNOTATIONS): Snapped toolbars and annotations on the same side MAY overlap visually: Figure(id='ee81443a-4819-45b6-9174-edfc431c1cc2', ...)
Finally lets look at the prevalance of cash tips. We find that the outer boroughs (Brooklyn, Queens, Long Island) tend to prefer cash while Manhattan and the airports tend to prefer credit.
cash = full.payment_type == 2
result = cash.groupby(full.zone).mean().compute()
result.head()
zone Allerton/Pelham Gardens 0.561162 Alphabet City 0.334731 Arden Heights 0.435897 Arrochar/Fort Wadsworth 0.500000 Astoria 0.539167 Name: payment_type, dtype: float64
joined = gpd.GeoDataFrame(pd.merge(result.to_frame(), zones, left_index=True, right_on='zone'))
joined.head()
I am densified (5 elements) I am densified (5 elements)
payment_type | OBJECTID | Shape_Leng | Shape_Area | zone | LocationID | borough | geometry | |
---|---|---|---|---|---|---|---|---|
2 | 0.561162 | 3 | 0.084341 | 0.000314 | Allerton/Pelham Gardens | 3 | Bronx | POLYGON ((-73.84792614099985 40.87134223399991... |
3 | 0.334731 | 4 | 0.043567 | 0.000112 | Alphabet City | 4 | Manhattan | POLYGON ((-73.97177410965318 40.72582128133705... |
4 | 0.435897 | 5 | 0.092146 | 0.000498 | Arden Heights | 5 | Staten Island | POLYGON ((-74.17421738099989 40.56256808599987... |
5 | 0.500000 | 6 | 0.150491 | 0.000606 | Arrochar/Fort Wadsworth | 6 | Staten Island | POLYGON ((-74.06367318899999 40.60219816599994... |
6 | 0.539167 | 7 | 0.107417 | 0.000390 | Astoria | 7 | Queens | POLYGON ((-73.90413637799996 40.76752031699986... |
from bokeh.palettes import viridis, RdBu
from bokeh.models import ColorBar
geo_source = GeoJSONDataSource(geojson=joined.to_json())
color_mapper = LinearColorMapper(RdBu[11])
fig = figure(title='Fraction of rides that tip in cash')
fig.patches(xs='xs', ys='ys', alpha=0.9, source=geo_source,
color={'field': 'payment_type', 'transform': color_mapper},
line_width=1, line_alpha=0.5, line_color='black')
hover = HoverTool(
point_policy='follow_mouse',
tooltips=('<div><b>Borough</b>: @borough</div>'
'<div><b>Zone</b>: @zone</div>'
'<div><b>Cash Fraction</b>: @payment_type</div>')
)
fig.add_tools(hover)
color_bar = ColorBar(color_mapper=color_mapper,
location=(0, 0),
label_standoff=12)
fig.add_layout(color_bar, 'right')
fig.xaxis.visible = False
fig.yaxis.visible = False
fig.grid.visible = False
figures['cash-fraction'] = fig
show(fig)
W-1005 (SNAPPED_TOOLBAR_ANNOTATIONS): Snapped toolbars and annotations on the same side MAY overlap visually: Figure(id='4733bcdf-cf01-4817-828a-f92a7dc3fd03', ...)
I use the code below to embed these plots into my blog.
from bokeh.embed import components
script, divs = components(list(figures.values()))
with open('/home/mrocklin/workspace/blog/_posts/work/2017-09-09-accelerating-geopandas-1.md', 'at') as f:
for div in divs:
f.write(div)
f.write('\n\n')
f.write(script)