Interactive Visualization with Bokeh, HoloViews, and Datashader¶

Developer(s): Keith Bechtol (@bechtol)
Maintainer(s): Leanne Guy (@leannep)
Level: Intermediate
Last Verified to Run: 2021-08-20
Verified Stack Release: w_2021_33

This notebook demonstrates a few of the interactive features of the Bokeh, HoloViews, and Datashader plotting packages in the notebook environment. These packages are part of the PyViz set of Python tools intended for visualization use cases in a web browser, and can be used to create quite sophisticated dashboard-like interactive displays and widgets. The goal of this notebook is to provide an introduction and starting point from which to create more advanced, custom interactive visualizations. To get inspired, check out this beautiful example notebook using HSC data created with the qa_explorer tools.

Learning Objectives¶

After working through and studying this notebook you should be able to:

Use bokeh to create interactive figures with brushing and linking between multiple plots
Use holoviews and datashader to create two-dimensional histograms with dynamic binning to efficiently explore large datasets

Other techniques that are demonstrated, but not emphasized, in this notebook are:

Use parquet to efficiently access large amounts of data

Logistics¶

This notebook is intended to be run on a LARGE instance of the RSP at lsst-lsp-stable.ncsa.illinois.edu or data.lsst.cloud from a local git clone of the StackClub repo.

Note that occasionally the notebook may seem to stall, or the interactive features may seem disabled. If this happens, usually a restart of the kernel fixes the issue. You might also check that you are on a large instance of the JupyterLab environment on the RSP. In some examples shown in this notebook, the order in which the cells are run is important for understanding the interactive features, so you may want to re-run the set of cells in a given section if you encounter unexpected behavior.

Setup¶

You can find the Stack version by using eups list -s on the terminal command line.

In [ ]:

# Site, host, and stack version
! echo $EXTERNAL_INSTANCE_URL
! echo $HOSTNAME
! eups list -s | grep lsst_distrib

In [ ]:

import numpy as np
import os.path
import astropy.io.fits as pyfits

import bokeh
from bokeh.io import output_file, output_notebook, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource, Range1d, HoverTool, Selection
from bokeh.plotting import figure  # output_file is already imported from bokeh.io above

import holoviews as hv
from holoviews import streams
from holoviews.operation.datashader import datashade, dynspread, rasterize
from holoviews.plotting.util import process_cmap
hv.extension('bokeh')

In [ ]:

# Need this line to display bokeh plots inline in the notebook
output_notebook()

In [ ]:

# What version of holoviews are we using
print(hv.__version__)
import datashader as dsh
print(dsh.__version__)

In [ ]:

#Ignore all warnings 
import warnings
warnings.filterwarnings('ignore')

Prelude: Data Sample¶

The data in the following example come from the Dark Energy Survey Data Release 1 (DES DR1). The input data for this example were obtained with the M2 globular cluster database query in Appendix C of the DES DR1 paper from the DES Data Release page.

In [ ]:

basename = 'dr1_m2_dered_test.fits'
dirname  = os.path.expandvars('$HOME/DATA/')
filename = os.path.join(dirname,basename)

if not os.path.exists(dirname):
    !mkdir -p {dirname}
    
if not os.path.exists(filename):
    !curl https://lsst.ncsa.illinois.edu/~kadrlica/data/{basename} -o {filename}

In [ ]:

reader = pyfits.open(filename)
data = reader[1].data
reader.close()

data = data[data['MAG_AUTO_G_DERED'] < 26.]
print(len(data))

Part 1: Brushing and linking between scatter plots with Bokeh¶

First, an example with brushing and linking between two panels showing different representations of the same dataset. A selection applied to either panel will highlight the selected points in the other panel.

Based on http://bokeh.pydata.org/en/latest/docs/user_guide/interaction/linking.html#linked-brushing

In [ ]:

ra_target, dec_target = 323.36, -0.82

mag = data['MAG_AUTO_G_DERED']
color = data['MAG_AUTO_G_DERED'] - data['MAG_AUTO_R_DERED']

# create a column data source for the plots to share
source = ColumnDataSource(data=dict(x0=data['RA'] - ra_target,
                                    y0=data['DEC'] - dec_target,
                                    x1=color,
                                    y1=mag,
                                    ra=data['RA'],
                                    dec=data['DEC'],
                                    coadd_object_id=data['COADD_OBJECT_ID']))

In [ ]:

# Create a custom hover tool on both panels
hover_left = HoverTool(tooltips=[("(RA,DEC)", "(@ra, @dec)"),
                                 ("(g-r,g)", "(@x1, @y1)"),
                                 ("coadd_object_id", "@coadd_object_id")])
hover_right = HoverTool(tooltips=[("(RA,DEC)", "(@ra, @dec)"),
                                  ("(g-r,g)", "(@x1, @y1)"),
                                  ("coadd_object_id", "@coadd_object_id")])
TOOLS = "box_zoom,box_select,lasso_select,reset,help"
TOOLS_LEFT = [hover_left, TOOLS]
TOOLS_RIGHT = [hover_right, TOOLS]

In [ ]:

# create a new plot and add a renderer
left = figure(tools=TOOLS_LEFT, plot_width=500, plot_height=500, output_backend="webgl",
              title='Spatial: Centered on (RA, Dec) = (%.2f, %.2f)'%(ra_target, dec_target))
left.circle('x0', 'y0', hover_color='firebrick', source=source,
            selection_fill_color='steelblue', selection_line_color='steelblue',
            nonselection_fill_color='silver', nonselection_line_color='silver')
left.x_range = Range1d(0.3, -0.3)
left.y_range = Range1d(-0.3, 0.3)
left.xaxis.axis_label = 'Delta RA'
left.yaxis.axis_label = 'Delta DEC'

# create another new plot and add a renderer
right = figure(tools=TOOLS_RIGHT, plot_width=500, plot_height=500, output_backend="webgl",
               title='CMD')
right.circle('x1', 'y1', hover_color='firebrick', source=source,
             selection_fill_color='steelblue', selection_line_color='steelblue',
             nonselection_fill_color='silver', nonselection_line_color='silver')
right.x_range = Range1d(-0.5, 2.5)
right.y_range = Range1d(26., 16.)
right.xaxis.axis_label = 'g - r'
right.yaxis.axis_label = 'g'

p = gridplot([[left, right]])

# The plots can be exported as html files with data embedded
#output_file("bokeh_m2_example.html", title="M2 Example")

show(p)

Use the hover tool to see information about individual datapoints (e.g., the coadd_object_id). This information should appear automatically as you hover the mouse over the datapoints. Notice the data points highlighted in red on one panel with the hover tool are also highlighted on the other panel.

Next, click on the selection box icon (with a "+" sign) or the selection lasso icon found in the upper right corner of the figure. Use the selection box and selection lasso to make various selections in either panel by clicking and dragging on either panel. The selected data points will be displayed in the other panel.
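
The linking works because both panels draw from the same ColumnDataSource, so a selection is simply a set of row indices that applies to every column. The NumPy sketch below mimics that idea without Bokeh. The arrays and the box limits are made up for illustration, and it is not meant to describe how Bokeh is implemented internally.

```python
import numpy as np

# Synthetic stand-ins for the shared columns (hypothetical values, not DES data)
rng = np.random.default_rng(seed=0)
n = 1000
x0 = rng.uniform(-0.3, 0.3, n)    # Delta RA
y0 = rng.uniform(-0.3, 0.3, n)    # Delta DEC
x1 = rng.normal(0.6, 0.3, n)      # g - r
y1 = rng.uniform(16.0, 26.0, n)   # g

# A box selection drawn on the "spatial" panel yields row indices...
in_box = (np.abs(x0) < 0.05) & (np.abs(y0) < 0.05)
selected = np.nonzero(in_box)[0]

# ...and the same indices pick out the matching rows for the "CMD" panel
print('%i points selected' % len(selected))
print('Mean g - r of selection: %.3f' % x1[selected].mean())
```

Because the indices refer to rows rather than to a particular pair of axes, any number of panels built on the same columns can share one selection.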

Introducing HoloViews Linked Streams¶

If we want to do subsequent calculations with the set of selected points, we can use HoloViews linked streams for custom interactivity. The following visualization is a modification of this example.

For this visualization, as in the example above, use the selection box and selection lasso to select datapoints on the left panel. The selected points should appear in the right panel.

Finally, notice that as you change the selection on the left panel, the mean x- and y-values for the selected datapoints are shown in the title of the right panel.

In [ ]:

%%opts Points [tools=['box_select', 'lasso_select']]

# Declare some points
points = hv.Points((data['RA'] - ra_target, data['DEC'] - dec_target))

# Declare points as source of selection stream
selection = streams.Selection1D(source=points)

# Write function that uses the selection indices to slice points and compute stats
def selected_info(index):
    selected = points.iloc[index]
    if index:
        label = 'Mean x, y: %.3f, %.3f' % tuple(selected.array().mean(axis=0))
    else:
        label = 'No selection'
    return selected.relabel(label).options(color='red')

# Combine points and DynamicMap
# Notice the interesting syntax used here: the "+" sign makes side-by-side panels
points + hv.DynamicMap(selected_info, streams=[selection])

In the next cell, we access the indices of the selected datapoints. We could use these indices to select a subset of the full sample for further examination.

In [ ]:

print(selection.index)
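
As a sketch of how such an index list could be used, the example below applies it to a small synthetic structured array. Here `catalog` is a hypothetical stand-in for `data`, and `index` is a made-up example of what `selection.index` might contain after a box select.

```python
import numpy as np

# Synthetic stand-in for the FITS table (hypothetical values)
catalog = np.zeros(5, dtype=[('RA', 'f8'), ('DEC', 'f8'), ('MAG_AUTO_G_DERED', 'f8')])
catalog['RA'] = [323.30, 323.35, 323.36, 323.40, 323.42]
catalog['DEC'] = [-0.90, -0.85, -0.82, -0.80, -0.75]
catalog['MAG_AUTO_G_DERED'] = [18.2, 21.5, 19.9, 23.1, 24.0]

index = [1, 2, 4]   # hypothetical contents of selection.index

subset = catalog[index]   # fancy indexing keeps only the selected rows
if len(subset) > 0:
    mean_g = subset['MAG_AUTO_G_DERED'].mean()
    print('Selected %i objects, mean g = %.2f' % (len(subset), mean_g))
else:
    print('No selection')
```

The guard on an empty selection matters in practice, since the index list is empty until something has been selected in the plot.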

Intermission: Rapid Data Access with Parquet¶

For the next example, we want to use a much larger dataset. Let's open up some data from Gaia Data Release 2 (Gaia DR2) with Parquet.

In [ ]:

import glob
import pandas as pd
import pyarrow.parquet as pq

In [ ]:

infiles = sorted(glob.glob('/project/shared/data/gaia_dr2/gaia_source_with_rv.parquet/*.parquet'))
print('There are %i total files in the directory'%(len(infiles)))

In [ ]:

%%time
df_array = []
for ii in range(0, len(infiles)):
    print(infiles[ii])
    columns = ['ra', 'dec', 'phot_g_mean_mag'] # 'phot_g_mean_mag', 'phot_bp_mean_mag', 'phot_rp_mean_mag']
    df_array.append(pq.read_table(infiles[ii], columns=columns).to_pandas())
df = pd.concat(df_array)

In [ ]:

print('Dataframe contains %.2f M rows'%(len(df) / 1.e6))
print(df.columns.values)

Part 2: Visualizing Larger Datasets with Datashader¶

The interactive features of Bokeh work well with datasets up to a few tens of thousands of data points. To efficiently explore larger datasets, we'd like to use another visualization model that offers better scalability, namely Datashader.

In the examples below, notice that as one zooms in on the datashaded two-dimensional histograms, the bin sizes are dynamically adjusted to show finer or coarser granularity in the distribution. This allows one to interactively explore large datasets without having to manually adjust the bin sizes while panning and zooming. Zoom in all the way and you can see individual points (i.e., bins contain either zero or one count). If you zoom in far enough, the individual points are represented by extremely small pixels in datashader that are difficult to see. A solution is to wrap the datashade call in dynspread, as in the cell below, which preserves a finite size for the plotted points.

In this particular example, as we zoom in, we can see that the Gaia dataset has been sharded into narrow stripes in declination.

The next cell also uses the concept of linked Streams in HoloViews for custom interactivity, in this case to create a selection box. We'll use that selection box tool in the following cell.

In [ ]:

#%%opts Points [tools=['box_select']]
points = hv.Points((df.ra, df.dec)) # Create a holoviews object to hold and plot data
#points = hv.Points(np.random.multivariate_normal((0, 0), [[1, 0.1], [0.1, 1]], (1000,))) # If you wanted a simple synthetic dataset

# Create the linked streams instance
boundsxy = (0, 0, 0, 0)
box = streams.BoundsXY(source=points, bounds=boundsxy)
bounds = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box]) 

# Apply the datashader
from holoviews.plotting.util import process_cmap
dynspread(datashade(points, cmap=process_cmap("Viridis", provider="bokeh"))) * bounds
# The "*" syntax puts multiple plot elements on the same panel
#datashade(points, cmap=bokeh.palettes.Viridis256) * bounds
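
To build some intuition for the dynamic binning, the NumPy sketch below re-bins a synthetic point set onto a fixed-size pixel grid each time the viewport changes. This is only an illustration of the idea: the `rebin` helper, the grid size, and the data are hypothetical, and Datashader's actual pipeline is more elaborate.

```python
import numpy as np

# Synthetic, uniformly distributed positions (not Gaia data)
rng = np.random.default_rng(seed=1)
ra = rng.uniform(0.0, 360.0, 200_000)
dec = rng.uniform(-90.0, 90.0, 200_000)

def rebin(x, y, x_range, y_range, width=400, height=300):
    """Histogram the points inside the viewport onto a width x height grid."""
    counts, _, _ = np.histogram2d(x, y, bins=(width, height),
                                  range=(x_range, y_range))
    return counts

full = rebin(ra, dec, (0.0, 360.0), (-90.0, 90.0))    # whole sky
zoom = rebin(ra, dec, (100.0, 110.0), (10.0, 15.0))   # after zooming in

# Same number of pixels in both views, but each pixel covers far less sky after zooming
print('Full view: %.3f deg per pixel in RA, %i points' % (360.0 / 400, full.sum()))
print('Zoomed view: %.3f deg per pixel in RA, %i points' % (10.0 / 400, zoom.sum()))
```

The number of bins stays tied to the size of the canvas, so the effective bin width shrinks with the viewport and the histogram never needs a manually chosen bin size.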

Next we add callback functionality to the plot above and retrieve the indices of the selected points. First, use the box selection tool to create a selection box for the two-dimensional histogram above. Then run the cell below to count the number of datapoints within the selection region.

In [ ]:

selection = (points.data.x > box.bounds[0]) \
    & (points.data.y > box.bounds[1]) \
    & (points.data.x < box.bounds[2]) \
    & (points.data.y < box.bounds[3])
print('The selection box contains %i datapoints'%(np.sum(selection)))
if np.sum(selection) > 0:
    print('\nHere are some of the selected indices...')
    print(np.nonzero(selection.values)[0])

Another option is to make a second linked plot paired with the box selection on the two-dimensional histogram.

In [ ]:

# First, create a holoviews dataset instance. Here we label some of the columns.
kdims = [('ra', 'RA(deg)'), ('dec', 'Dec(deg)')]
vdims = [('phot_g_mean_mag', 'G(mag)')]
ds = hv.Dataset(df, kdims, vdims)
ds

In [ ]:

points = hv.Points(ds)

#boundsxy = (0, 0, 0, 0)
boundsxy = (np.min(ds.data['ra']), np.min(ds.data['dec']), np.max(ds.data['ra']), np.max(ds.data['dec']))
box = streams.BoundsXY(source=points, bounds=boundsxy)
box_plot = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box])

In [ ]:

# This function defines the custom callback functionality to update the linked histogram
def update_histogram(bounds=boundsxy):  # default to the initial bounds tuple, not the earlier `bounds` DynamicMap
    
    selection = (ds.data['ra'] > bounds[0]) & \
                (ds.data['dec'] > bounds[1]) & \
                (ds.data['ra'] < bounds[2]) & \
                (ds.data['dec'] < bounds[3])
    
    selected_mag = ds.data.loc[selection]['phot_g_mean_mag']
    
    frequencies, edges = np.histogram(selected_mag)
    
    hist = hv.Histogram((np.log(frequencies), edges))
    return hist
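
One caveat in the callback above: `np.histogram` can return empty bins, and `np.log(0)` evaluates to `-inf`. A possible guard, shown here as a sketch with a hypothetical helper named `log_counts`, is to take the logarithm only where the counts are positive.

```python
import numpy as np

def log_counts(frequencies):
    """Natural log of bin counts, with empty bins set to NaN instead of -inf."""
    frequencies = np.asarray(frequencies, dtype=float)
    out = np.full(frequencies.shape, np.nan)
    # Only evaluate the log where the count is positive; other entries keep NaN
    np.log(frequencies, out=out, where=frequencies > 0)
    return out

mags = np.array([15.1, 15.3, 15.2, 19.8, 19.9])   # synthetic magnitudes with a gap
frequencies, edges = np.histogram(mags, bins=5)
result = log_counts(frequencies)
print(frequencies)
print(result)
```

Whether NaN or another fill value (for example zero) is the better choice depends on how you want empty bins to be drawn in the histogram.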

In [ ]:

%%output size=150
dmap = hv.DynamicMap(update_histogram, streams=[box])
datashade(points, cmap=process_cmap("Viridis", provider="bokeh")) * box_plot + dmap

Notice that when you select different regions of the left panel with the box select tool, the histogram on the right is updated.

Part 3: Images¶

The next example demonstrates image visualization at the pixel level with datashader.

In [ ]:

# Select the dataset to use
# DC2 WFD coadd   
URL = os.getenv('EXTERNAL_INSTANCE_URL', '')  # empty string if unset, so the check below fails cleanly
if URL.endswith('data.lsst.cloud'): # IDF
    repo = "s3://butler-us-central1-dp01"
elif URL.endswith('ncsa.illinois.edu'): # NCSA
    repo = "/repo/dc2"
else:
    raise Exception(f"Unrecognized URL: {URL}")

collection='2.2i/runs/DP0.1'
datasetType = "deepCoadd"
dataId = {'tract': 4226, 'band': 'i', 'patch': (0) + 7*(4) }

# The holoviews image frame size (will depend on image size)
frame='[height=512 width=600]'

In [ ]:

# Create the Butler and get the image
from lsst.daf.butler import Butler 
butler = Butler(repo,collections=collection)

image = butler.get(datasetType, dataId=dataId)

In [ ]:

%%opts Image $frame
%%opts Bounds (color='white')
#%%output size=200

# Create the Image
bounds_img = (0, 0, image.getDimensions()[0], image.getDimensions()[1])
img = hv.Image(np.log10(image.image.array), 
               bounds=bounds_img).options(colorbar=True, 
                                          cmap=bokeh.palettes.Viridis256,
                                          # logz=True
                                         )

boundsxy = (0, 0, 0, 0)
box = streams.BoundsXY(source=img, bounds=boundsxy)
bounds = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box])

rasterize(img) * bounds

As with the histograms, it is possible to use interactive callback features on the image plots, such as the selection box.

In [ ]:

box
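
The bounds held by the stream are ordered (left, bottom, right, top), so they can be turned into an array slice for follow-up calculations. The sketch below uses a hypothetical helper, `cutout_from_bounds`, and a small synthetic array. It assumes the image spans (0, 0, width, height) as in the cell above and that row 0 of the array is drawn at the top of the image; check that assumption against your own data before relying on it.

```python
import numpy as np

def cutout_from_bounds(array, bounds):
    """Slice a 2D array using (left, bottom, right, top) plot bounds.

    Assumes the array is displayed over (0, 0, n_cols, n_rows) with
    row 0 at the top, so y values are counted from the last row.
    """
    n_rows, n_cols = array.shape
    left, bottom, right, top = bounds
    col_lo = int(np.clip(np.floor(left), 0, n_cols))
    col_hi = int(np.clip(np.ceil(right), 0, n_cols))
    row_lo = int(np.clip(n_rows - np.ceil(top), 0, n_rows))
    row_hi = int(np.clip(n_rows - np.floor(bottom), 0, n_rows))
    return array[row_lo:row_hi, col_lo:col_hi]

pixels = np.arange(20).reshape(4, 5)            # synthetic 4 x 5 "image"
cutout = cutout_from_bounds(pixels, (1, 0, 3, 2))
empty = cutout_from_bounds(pixels, (0, 0, 0, 0))  # the initial bounds of the stream

if cutout.size > 0:
    print('Cutout shape:', cutout.shape, 'sum:', cutout.sum())
```

With the real image, the same helper could be called with `box.bounds` and `image.image.array` after drawing a selection box.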

Here's another version of the image with a tap stream instead of box select. Click on the image to place an 'X' marker.

In [ ]:

%%opts Image  $frame
%%opts Points (color='white' marker='x' size=20)

posxy = hv.streams.Tap(source=img, x=0.5 * image.getDimensions()[0], y=0.5 * image.getDimensions()[1])
marker = hv.DynamicMap(lambda x, y: hv.Points([(x, y)]), streams=[posxy])

rasterize(img) * marker

'X' marks the spot! What's the value at that location? Execute the next cell to find out.

In [ ]:

print('The value at position (%.3f, %.3f) is %.3f'%(posxy.x, posxy.y, image.image.array[-int(posxy.y), int(posxy.x)]))
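
The lookup above converts a plot position into array indices. A more defensive version is sketched below with a hypothetical helper, `pixel_at`, that clamps the position to the array and makes the row flip explicit. It assumes the image spans (0, 0, width, height) and that row 0 of the array is drawn at the top; under that assumption the bottom row covers 0 <= y < 1, so the result can differ by one row from the `-int(posxy.y)` expression.

```python
import numpy as np

def pixel_at(array, x, y):
    """Return the array value under plot position (x, y).

    Assumes the array is displayed over (0, 0, n_cols, n_rows) with
    row 0 at the top of the image.
    """
    n_rows, n_cols = array.shape
    col = int(np.clip(np.floor(x), 0, n_cols - 1))
    row_from_bottom = int(np.clip(np.floor(y), 0, n_rows - 1))
    row = n_rows - 1 - row_from_bottom
    return array[row, col]

pixels = np.arange(12).reshape(3, 4)         # synthetic 3 x 4 "image"
bottom_left = pixel_at(pixels, 0.5, 0.5)     # last row, first column
top_right = pixel_at(pixels, 3.9, 2.9)       # first row, last column
print(bottom_left, top_right)
```

With the real image, this could be called as `pixel_at(image.image.array, posxy.x, posxy.y)`.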