HoloViews makes it easy to work with even high-dimensional datasets, and the standard mechanisms discussed so far work well as long as you select a small enough subset of the data to display at any one time. However, some datasets are just inherently large, even for a single frame of data, and cannot safely be transferred for display in any standard web browser. Luckily, HoloViews makes it simple to use the separate Datashader library together with any of the plotting extension libraries, including Bokeh and Matplotlib. Datashader is designed to complement standard plotting libraries by providing faithful visualizations of very large datasets, focusing on revealing the overall distribution rather than individual data points.
Datashader's computations are accelerated with Numba, making it fast to work with datasets of millions or billions of datapoints stored in Dask dataframes. Dask dataframes provide an API that is functionally equivalent to Pandas but allows working with data out of core and scaling out to many processors across compute clusters. Here we will use Dask to load a large Parquet-format file of taxi trip coordinates.
import holoviews as hv
import dask.dataframe as dd
import datashader as ds
import geoviews as gv
from holoviews.operation.datashader import datashade, rasterize
hv.extension('bokeh')
As a first step we will load a large dataset using Dask. If you have followed the setup instructions, you will have downloaded a large Parquet-format file containing 12 million taxi trips. Let's load this data using Dask to create a dataframe ddf:
ddf = dd.read_parquet('../data/nyc_taxi_wide.parq').persist()
print('%s Rows' % len(ddf))
print('Columns:', list(ddf.columns))
points = hv.Points(ddf, kdims=['dropoff_x', 'dropoff_y'])
We could now simply type points, and Bokeh would attempt to display this data as a standard Bokeh plot. Before doing that, however, remember that we have 12 million rows of data, and no current plotting program will handle that well in a web browser! Instead of letting Bokeh see this data, let's convert it to something far more tractable using the datashade operation. This operation aggregates the data onto a 2D grid, applies shading to assign a pixel color to each bin in the grid, and builds an RGB Element (just a fixed-size image) that we can safely display in a browser:
%opts RGB [width=600 height=500 bgcolor="black"]
datashade(points)
If you zoom in, you will notice that the plot re-renders depending on the zoom level, which allows the full dataset to be explored interactively even though only an image of it is ever sent to the browser. This works because datashade is a dynamic operation that also declares some linked streams. These linked streams are automatically instantiated, and they dynamically supply the plot_size, x_range, and y_range from the Bokeh plot to the operation based on your current viewport as you zoom or pan:
datashade.streams
# Exercise: Plot the taxi pickup locations ('pickup_x' and 'pickup_y' columns)
# Warning: Don't try to display hv.Points() directly; it's too big! Use datashade() or rasterize() for any display
# Optional: Change the cmap on the datashade operation to inferno
from datashader.colors import inferno
Using the GeoViews (geographic) extension for HoloViews, we can display a map in the background: declare a Bokeh WMTSTileSource, pass it to the gv.WMTS Element, and overlay the datashaded points on top of it:
%opts RGB [xaxis=None yaxis=None]
from bokeh.models import WMTSTileSource
url = 'https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg'
wmts = WMTSTileSource(url=url)
gv.WMTS(wmts) * datashade(points)
# Exercise: Overlay the taxi pickup data on top of the Wikipedia tile source
wiki_url = 'https://maps.wikimedia.org/osm-intl/{Z}/{X}/{Y}@2x.png'
So far we have simply been counting taxi dropoffs, but our dataset is much richer than that. We have information about a number of variables, including the base cost of a taxi ride, fare_amount. Datashader provides a number of aggregator functions, which you can supply to the datashade operation. Here we use the ds.mean aggregator to compute the average cost of a trip at each dropoff location:
selected = points.select(fare_amount=(None, 1000))
selected.data = selected.data.persist()
gv.WMTS(wmts) * datashade(selected, aggregator=ds.mean('fare_amount'))
# Exercise: Use the ds.min or ds.max aggregator to visualize ``tip_amount`` by dropoff location
# Optional: Eliminate outliers by using select
Because datashading happens only at the last moment, just before visualization, you can use any of the techniques shown in previous sections to select, filter, or group your data before visualizing it, such as grouping it by the hour of day:
%opts Image [width=600 height=500 logz=True xaxis=None yaxis=None tools=['hover'] colorbar=True]
taxi_ds = hv.Dataset(ddf)
grouped = taxi_ds.to(hv.Points, ['dropoff_x', 'dropoff_y'], groupby=['dropoff_hour'], dynamic=True)
rasterize(grouped).redim.values(dropoff_hour=range(24))
Here we're using rasterize(), which is similar to datashade() except that instead of using Datashader to map the counts per pixel into a color for each pixel, it lets Bokeh do that mapping, in your local browser. Although letting Bokeh do the colormapping means that we can't use Datashader-specific features such as the default eq_hist histogram normalization, it does mean that we can use Bokeh features like colorbars and hover support. Here, if you hover over a given pixel, it will tell you the count of trips that fall into that pixel, along with the location in Web Mercator coordinates.
# Exercise: Facet the trips in the morning hours as an NdLayout using rasterize(grouped.layout())
# Hint: You can reuse the existing grouped variable or select a subset before using the .to method
As you can see, Datashader requires a few extra steps, but it makes it practical to work with even quite large datasets on an ordinary laptop. On a 16GB machine, datasets 10X or 100X the size of the one used here should be quite practical, as illustrated on the Datashader web site.
Next, we go on to show how to create live plots of dynamically updated data (i.e. streaming data).