GeoSpatial Analysis of NYC Taxi data¶

Inspired by Geospatial Operations at Scale with Dask we're going to look at a the NYC Taxi dataset from a geospatial angle. This also helps us prove out recent performance improvements to GeoPandas (which was recently cythonized) and a new dask-geopandas parallel implementation.

In [1]:

from dask.distributed import Client
client = Client(processes=False)
client

Out[1]:

Client

Scheduler: inproc://192.168.50.100/25890/1
Dashboard: http://localhost:8787

Cluster

Workers: 1
Cores: 4
Memory: 16.69 GB

Load Taxi Zone data with GeoPandas¶

We will segment our data by the official taxi zones available here.

In [2]:

import geopandas as gpd
zones = gpd.read_file('taxi_zones.shp').to_crs({'init' :'epsg:4326'})
zones['zone'] = zones.zone.astype('category')
zones['borough'] = zones.borough.astype('category')
zones.head()

I am densified (external_values, 263 elements)
I am densified (5 elements)
I am densified (5 elements)

Out[2]:

	OBJECTID	Shape_Leng	Shape_Area	zone	LocationID	borough	geometry
0	1	0.116357	0.000782	Newark Airport	1	EWR	POLYGON ((-74.18445299999996 40.6949959999999,...
1	2	0.433470	0.004866	Jamaica Bay	2	Queens	(POLYGON ((-73.82337597260663 40.6389870471767...
2	3	0.084341	0.000314	Allerton/Pelham Gardens	3	Bronx	POLYGON ((-73.84792614099985 40.87134223399991...
3	4	0.043567	0.000112	Alphabet City	4	Manhattan	POLYGON ((-73.97177410965318 40.72582128133705...
4	5	0.092146	0.000498	Arden Heights	5	Staten Island	POLYGON ((-74.17421738099989 40.56256808599987...

Plot with matplotlib¶

We can easily plot with matplotlib, which is builtin to geopandas

In [3]:

%matplotlib inline
zones.plot(column='borough', categorical=True)

Out[3]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f5762442668>

Plot with Bokeh¶

But for most of this notebook we'll be using Bokeh

In [4]:

from bokeh.io import output_notebook
from bokeh.models import GeoJSONDataSource, HoverTool, CategoricalColorMapper, LinearColorMapper
from bokeh.plotting import figure, show
from bokeh.palettes import Category10
output_notebook()

Loading BokehJS ...

In [5]:

geo_source = GeoJSONDataSource(geojson=zones.to_json())

factors = zones.borough.drop_duplicates()
color_mapper = CategoricalColorMapper(factors=factors.tolist(), palette=Category10[10])

fig = figure()
fig.patches(xs='xs', ys='ys', alpha=0.9, source=geo_source, 
          color={'field': 'borough', 'transform': color_mapper},
         )

hover = HoverTool(
    point_policy='follow_mouse',
    tooltips='<p><b>Borough</b>: @borough</p><p><b>Zone</b>: @zone</p>'
)
fig.add_tools(hover)

fig.xaxis.visible = False
fig.yaxis.visible = False
fig.grid.visible = False

show(fig)

Full dataset¶

We read a few columns of the 2015 NYC Taxi and Limousine Commission data with dask.dataframe, and then perform a spatial join with dask-geopandas. Finally we write the result to the parquet format for future analysis.

We'll hold onto only those columns that we need for future computations.

In [6]:

import dask.dataframe as dd
import dask_geopandas as dg
import pandas as pd
import dask

columns = ['pickup_latitude', 'pickup_longitude', 'passenger_count', 
           'fare_amount', 'tip_amount', 'payment_type']
df = dd.read_csv('csv/yellow_tripdata_2015-*.csv', usecols=columns)
gf = df.set_geometry(df[['pickup_longitude', 'pickup_latitude']], 
                     crs={'init' :'epsg:4326'})
gf2 = dg.sjoin(gf, zones[['zone', 'borough', 'geometry']])
full = gf2[['zone', 'passenger_count', 'fare_amount', 'tip_amount', 'payment_type']]
full['payment_type'] = full.payment_type.astype('u1')
full.to_parquet('nyc-zones.parquet')
full = dd.read_parquet('nyc-zones.parquet')

Warning: CRS does not match

In [7]:

full.head()

Out[7]:

	zone	passenger_count	fare_amount	tip_amount	payment_type
0	Penn Station/Madison Sq West	1	12.0	3.25	1
1	SoHo	1	14.5	2.00	1
2	Morningside Heights	1	9.5	0.00	2
3	TriBeCa/Civic Center	1	3.5	0.00	2
4	Midtown North	1	15.0	0.00	2

Counts by region¶

Lets start off by reproducing Ravi's plot from his blogpost.

This computes the number of rides in each taxi zone. We plot these on a log scale because the number of rides is so disparate in different parts of the city. Zones in downtown Manhattan have millions of rides per year while some larger zones in Staten Island have only a few tens of rides.

In [8]:

result = full.passenger_count.groupby(full.zone).count().compute()
result.name = 'count'
joined = gpd.GeoDataFrame(pd.merge(result.to_frame(), zones, left_index=True, right_on='zone'))

In [9]:

figures = {}

In [10]:

from bokeh.palettes import viridis
from bokeh.models import ColorBar, LogColorMapper

geo_source = GeoJSONDataSource(geojson=joined.to_json())

color_mapper = LogColorMapper(viridis(256))

fig = figure(title='Number of rides')
fig.patches(xs='xs', ys='ys', alpha=0.9, source=geo_source, 
            color={'field': 'count', 'transform': color_mapper}, 
            line_width=1, line_alpha=0.5, line_color='black')

hover = HoverTool(
    point_policy='follow_mouse',
    tooltips=('<div><b>Borough</b>: @borough</div>'
              '<div><b>Zone</b>: @zone</div>'
              '<div><b>Rides</b>: @count</div>')
)
fig.add_tools(hover)

color_bar = ColorBar(color_mapper=color_mapper, 
                     location=(0, 0), 
                     label_standoff=12)
fig.add_layout(color_bar, 'right')

fig.xaxis.visible = False
fig.yaxis.visible = False
fig.grid.visible = False

figures['ride-counts'] = fig

show(fig)

W-1005 (SNAPPED_TOOLBAR_ANNOTATIONS): Snapped toolbars and annotations on the same side MAY overlap visually: Figure(id='72e74163-642d-4682-b0c4-b9766bf7ccd8', ...)

Tip fraction¶

Regardless of the number of rides we can still compute the average tip fraction spatially over the city. We find generally the same 20% tip throughout the city, with oddly high percentages in the West Village and Dyker Heights.

In [11]:

df = full[(full.tip_amount > 0) & (full.fare_amount > 0)]
tips = df.tip_amount.groupby(df.zone).sum()
fares = df.fare_amount.groupby(df.zone).sum()
result = tips / fares
result = result.compute()

In [12]:

result.name = 'tip_fraction'

In [13]:

joined = gpd.GeoDataFrame(pd.merge(result.to_frame(), zones, left_index=True, right_on='zone'))
joined.head()

I am densified (5 elements)
I am densified (5 elements)

Out[13]:

	tip_fraction	OBJECTID	Shape_Leng	Shape_Area	zone	LocationID	borough	geometry
2	0.275410	3	0.084341	0.000314	Allerton/Pelham Gardens	3	Bronx	POLYGON ((-73.84792614099985 40.87134223399991...
3	0.201290	4	0.043567	0.000112	Alphabet City	4	Manhattan	POLYGON ((-73.97177410965318 40.72582128133705...
4	0.238670	5	0.092146	0.000498	Arden Heights	5	Staten Island	POLYGON ((-74.17421738099989 40.56256808599987...
5	0.350325	6	0.150491	0.000606	Arrochar/Fort Wadsworth	6	Staten Island	POLYGON ((-74.06367318899999 40.60219816599994...
6	0.212477	7	0.107417	0.000390	Astoria	7	Queens	POLYGON ((-73.90413637799996 40.76752031699986...

In [14]:

from bokeh.palettes import viridis
from bokeh.models import ColorBar

geo_source = GeoJSONDataSource(geojson=joined.to_json())

color_mapper = LogColorMapper(viridis(256))

fig = figure(title='Average Tip Fraction')
fig.patches(xs='xs', ys='ys', alpha=0.9, source=geo_source, 
            color={'field': 'tip_fraction', 'transform': color_mapper}, 
            line_width=1, line_alpha=0.5, line_color='black')

hover = HoverTool(
    point_policy='follow_mouse',
    tooltips=('<div><b>Borough</b>: @borough</div>'
              '<div><b>Zone</b>: @zone</div>'
              '<div><b>Tip Fraction</b>: @tip_fraction</div>')
)
fig.add_tools(hover)

color_bar = ColorBar(color_mapper=color_mapper, 
                     location=(0, 0), 
                     label_standoff=12)
fig.add_layout(color_bar, 'right')

fig.xaxis.visible = False
fig.yaxis.visible = False
fig.grid.visible = False

figures['tip-fraction'] = fig

show(fig)

W-1005 (SNAPPED_TOOLBAR_ANNOTATIONS): Snapped toolbars and annotations on the same side MAY overlap visually: Figure(id='ee81443a-4819-45b6-9174-edfc431c1cc2', ...)

Cash vs Credit¶

Finally lets look at the prevalance of cash tips. We find that the outer boroughs (Brooklyn, Queens, Long Island) tend to prefer cash while Manhattan and the airports tend to prefer credit.

In [15]:

cash = full.payment_type == 2
result = cash.groupby(full.zone).mean().compute()

In [16]:

result.head()

Out[16]:

zone
Allerton/Pelham Gardens    0.561162
Alphabet City              0.334731
Arden Heights              0.435897
Arrochar/Fort Wadsworth    0.500000
Astoria                    0.539167
Name: payment_type, dtype: float64

In [17]:

joined = gpd.GeoDataFrame(pd.merge(result.to_frame(), zones, left_index=True, right_on='zone'))
joined.head()

I am densified (5 elements)
I am densified (5 elements)

Out[17]:

	payment_type	OBJECTID	Shape_Leng	Shape_Area	zone	LocationID	borough	geometry
2	0.561162	3	0.084341	0.000314	Allerton/Pelham Gardens	3	Bronx	POLYGON ((-73.84792614099985 40.87134223399991...
3	0.334731	4	0.043567	0.000112	Alphabet City	4	Manhattan	POLYGON ((-73.97177410965318 40.72582128133705...
4	0.435897	5	0.092146	0.000498	Arden Heights	5	Staten Island	POLYGON ((-74.17421738099989 40.56256808599987...
5	0.500000	6	0.150491	0.000606	Arrochar/Fort Wadsworth	6	Staten Island	POLYGON ((-74.06367318899999 40.60219816599994...
6	0.539167	7	0.107417	0.000390	Astoria	7	Queens	POLYGON ((-73.90413637799996 40.76752031699986...

In [18]:

from bokeh.palettes import viridis, RdBu
from bokeh.models import ColorBar

geo_source = GeoJSONDataSource(geojson=joined.to_json())

color_mapper = LinearColorMapper(RdBu[11])

fig = figure(title='Fraction of rides that tip in cash')
fig.patches(xs='xs', ys='ys', alpha=0.9, source=geo_source, 
            color={'field': 'payment_type', 'transform': color_mapper}, 
            line_width=1, line_alpha=0.5, line_color='black')

hover = HoverTool(
    point_policy='follow_mouse',
    tooltips=('<div><b>Borough</b>: @borough</div>'
              '<div><b>Zone</b>: @zone</div>'
              '<div><b>Cash Fraction</b>: @payment_type</div>')
)
fig.add_tools(hover)

color_bar = ColorBar(color_mapper=color_mapper, 
                     location=(0, 0), 
                     label_standoff=12)
fig.add_layout(color_bar, 'right')

fig.xaxis.visible = False
fig.yaxis.visible = False
fig.grid.visible = False

figures['cash-fraction'] = fig

show(fig)

W-1005 (SNAPPED_TOOLBAR_ANNOTATIONS): Snapped toolbars and annotations on the same side MAY overlap visually: Figure(id='4733bcdf-cf01-4817-828a-f92a7dc3fd03', ...)

I use the code below to embed these plots into my blog.

from bokeh.embed import components

script, divs = components(list(figures.values()))
with open('/home/mrocklin/workspace/blog/_posts/work/2017-09-09-accelerating-geopandas-1.md', 'at') as f:
    for div in divs:
        f.write(div)
        f.write('\n\n')
    f.write(script)