Tutorial 02. Annotating your Data

Preliminaries¶

In this Jupyter notebook, we will use the NumPy, Pandas, and HoloViews libraries to work with our data, under the standard abbreviations np, pd, and hv. To show how HoloViews works, we'll focus on the native HoloViews API rather than the simplified Pandas-style hvPlot interface, but you can use whichever you prefer.

In [ ]:

import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh')

hv.extension('bokeh') loads the Bokeh plotting support in HoloViews; hv.extension('matplotlib') would use the Matplotlib support instead. If pyviz is installed correctly as described at pyviz.org, you should see a HoloViews logo and a small Bokeh logo when you run the cell above.

Annotating data to make it visualizable¶

By default, Python will typically display your data in a numerical form. For instance, let's say you have a simple Pandas dataframe containing 41 samples of a quadratic function y=100-x^2, at different values of x:

In [ ]:

xs = np.arange(-10, 10.5, 0.5)
ys = 100-xs**2

df = pd.DataFrame(dict(x=xs, y=ys))

This dataframe will display as a table of numbers:

In [ ]:

df

It's important to be able to access this numerical representation, but it's awkward to work with, hard to take in at a glance, and even this tiny dataset won't fit on a single screen in table form. In many cases, a graphical representation would be much more natural to work with, but visualizing anything but the simplest data typically requires writing a complex series of plotting commands that are distracting when exploring data. HoloViews provides an alternative approach: instead of having plotting as a separate task, annotate your data to make it instantly visualizable on demand in whatever context it ends up. You can then switch between graphical and numerical representations whenever you wish.

In this case, we know that the data is a set of samples from a continuous function of x, and we can capture that information by declaring a HoloViews object of type Curve:

In [ ]:

simple_curve = hv.Curve(df,'x','y')
print(simple_curve)

:Curve [x] (y) is HoloViews's shorthand for saying that the data in df is a set of samples from a continuous function y of one independent variable x, and simple_curve simply pairs your dataframe df with this semantic declaration.

Once we've captured this crucial bit of metadata, HoloViews now knows enough about this object to represent it graphically, as it will do by default in a Jupyter notebook:

In [ ]:

simple_curve

This Bokeh plot is much more convenient to examine than a column of numbers, because it conveys the entire set of data in a compact, easily appreciated, interactively explorable format. HoloViews knew that a continuous curve like this is the right representation for what would otherwise be just a table of numbers, because we explicitly declared the element type as hv.Curve. Crucially, simple_curve itself is not a plot; it is a simple wrapper around your data that happens to have a convenient graphical representation. The full dataframe will always be available as simple_curve.data, for any numerical computations you would like to do:

In [ ]:

simple_curve.data.tail()

As you can see, with HoloViews you don't have to select between plotting your data and working with it numerically. Any HoloViews object will let you do both conveniently; you can simply choose whatever representation is the most appropriate way to approach the task you are doing. This approach is very different from a traditional plotting program, where the objects you create (e.g. a Matplotlib figure or a native Bokeh plot) are a dead end from an analysis perspective, useful only for plotting. This tutorial will show you how to create visualizable Elements like the Curve object above, and some of the ways you can annotate them to make sure the visual representation accurately conveys the properties of your data.

HoloViews Elements¶

Elements are HoloViews' most basic, core primitives. Elements allow you to give a dataset a semantically meaningful visual representation, while preserving the raw data you supplied and allowing various analysis operations to be applied to it. Nearly every Element can be constructed in the same way as you saw for Curve above (apart from Histogram and certain annotations):

hv.Element(data, kdims=None, vdims=None, **kwargs)

This standard signature brings together five kinds of information:

Element: any of the dozens of element types shown in the reference gallery.
data: your data in one of a number of formats described below, such as tabular dataframes or multidimensional gridded Xarray or NumPy arrays.
kdims: "key dimension(s)", also called independent variables or index dimensions in other contexts---the values for which your data was measured.
vdims: "value dimension(s)", also called dependent variables or measurements---what was measured or recorded for each value of the key dimensions.
kwargs: optional keyword arguments specific to that Element type (rarely needed).

For simple_curve, we selected column x as the key dimension and y as the value dimension, corresponding to how we generated the data (y as a function of x). Other element types might expect multiple key dimensions as a list, or allow multiple value dimensions (to show multiple data points for the same keys). We supplied a Pandas dataframe for the data, but we could just as well have supplied the original NumPy arrays xs and ys, Python lists, or a Python dictionary:

In [ ]:

hv.Curve((xs,ys))  +  hv.Curve((list(xs),list(ys)))  +  hv.Curve({'x':xs,'y':ys})

(We will discuss the + operator below, but here it just lets us put several HoloViews objects side by side). Each of these data formats would be represented in slightly different textual forms by Python, but as you can see, the graphical representation is the same for all of them because they all represent the same datapoints on a continuous curve.

Of course, maybe you don't want to treat these samples as a curve; perhaps for your purposes you want to see the individual data points, or maybe you want to draw attention to the area under the curve. You can provide this information to HoloViews by selecting the appropriate Element, such as Scatter, Area, or [Bars](http://holoviews.org/reference/elements/bokeh/Bars.html). Try creating some of those elements here as an exercise!

In [ ]:

# Exercise: Instead of hv.Curve(df,'x','y'), try hv.Area or hv.Scatter to choose how your dataframe appears

Annotating an element¶

Wrapping your data (df or (xs,ys) here) as a HoloViews element is sufficient to make it visualizable, but there are many other aspects of the data that we can capture to convey more about its meaning to HoloViews. For instance, we might want to specify what the x-axis and y-axis actually correspond to, in the real world. Perhaps this parabola is the trajectory of a ball thrown into the air, in which case we could declare the object as:

In [ ]:

trajectory = hv.Curve(df, ('x','Horizontal distance'), ('y','Height'))
trajectory

Here x is a short and convenient tab-completeable Python identifier for the dimension that must match the column name when using dataframes, while Horizontal distance is a human-readable label conveying more about what it means than the original "x".

Even though the additional information we provided is a description of the data, not parameters of a plotting object, HoloViews is designed to reveal the properties of your data accurately, and so the axes now update to show what these dimensions represent. We'll look more at Dimensions like x below, including how to add units or other additional semantic information.

In [ ]:

# Exercise: Take a look at trajectory.vdims

Casting between elements¶

The type of an element is a declaration of important facts about your data, which gives HoloViews the appropriate hint required to generate a suitable visual representation from it. For instance, calling it a Curve is a declaration from the user that the data consists of samples from a continuous function, which is why HoloViews plots it as a connected object. If we convert to an hv.Scatter object instead, the same set of data will show up as separated points, because "Scatter" does not make an assumption that the data is meant to be continuous:

In [ ]:

hv.Scatter(simple_curve)  +  hv.Area(simple_curve)

Casting the same data between different Element types in this way is often useful as a way to see your data differently, particularly if you are not certain of a single best way to interpret the data. Casting preserves your declared metadata as much as possible, propagating your declarations from the original object to the new one.

In [ ]:

# How do you predict the representation for hv.Scatter(trajectory) will differ from
# hv.Scatter(simple_curve) above? Try it!

In [ ]:

# Also try casting the trajectory to an area then back to a curve.

Gridded data¶

Most HoloViews elements fall into one of two categories, depending on whether they accept tabular or gridded data. The Curve above was constructed from columns of x- and y-values, while an Image can be constructed from a 2D NumPy array:

In [ ]:

x = np.linspace(0, 10, 500)
y = np.linspace(0, 10, 500)
xx, yy = np.meshgrid(x, y)

arr = np.sin(xx)*np.cos(yy)

As above, we know that this data was sampled from a continuous function, but this time it's a function of two key dimensions, so we declare it as an hv.Image object:

In [ ]:

image = hv.Image(arr)
image

A frequently useful method available on all types of elements is .hist, which adjoins a plot displaying the distribution of values:

In [ ]:

image.hist()

As you can see, the default names for the dimensions of an Image are x and y for the two key dimensions and z for the value dimension (shown as color). Overriding those names is easy:

In [ ]:

hv.Image(arr, ['xaxis','yaxis'], 'h').hist()

In [ ]:

# Exercise: Try visualizing different two-dimensional arrays.
# You can try a new function entirely or simple modifications of the existing one
# E.g., explore the effect of squaring and cubing the arguments to sine and cosine

Working with dimensions in dataframes¶

In each case above, we've been plotting all of the data provided to an object. A Pandas DataFrame very often has many columns of data available, more than will fit into a single Element:

In [ ]:

economic_data = pd.read_csv('../data/macro.csv')
economic_data.tail()

Here country and year are possible key dimensions, determining what measurements were taken, and the rest of the columns are possible value dimensions, i.e., the results of those measurements. Notice that the dataframe itself makes no distinction between the two types of dimension; this is information that a human must supply to map the values into something meaningfully visualizable.

As an example, let's build an element that helps us understand how the percentage growth in US GDP varies over time. As our dataframe contains GDP growth data for lots of countries, let us select the United States from the table and create a Curve element from it:

In [ ]:

US_data = economic_data[economic_data['country'] == 'United States'] # Select data for the US only
US_data.tail()

In [ ]:

growth_curve = hv.Curve(US_data, 'year', 'growth')
growth_curve

In [ ]:

# Exercise: Plot the unemployment (unem) by year

If we want to clarify that 'growth' is 'GDP growth', we could do so in the original call using hv.Curve(US_data, 'year', ('growth','GDP growth')), which is a convenient shortcut for creating a Dimension object with a label. We can also supply dimension labels after the fact using the redim method:

In [ ]:

gdp_growth = growth_curve.redim.label(growth='GDP growth', year='Year')
gdp_growth

Here the redim method associates a dimension label with each of the two dimensions (the key dimension year and the value dimension growth), creating and returning a new element called gdp_growth (you can check for yourself that growth_curve is unchanged).

The redim utility lets you easily change other dimension parameters, and as an example let's give our GDP growth dimension the appropriate unit:

In [ ]:

gdp_growth.redim.unit(growth='%')

In [ ]:

# Exercise: Use redim.unit to give the year dimension a better unit 
# For instance, relabel to 'Time' then give the unit as 'years'

Statistical elements¶

So far we have mostly considered elements that map data values directly onto the screen, whether a curve where the x- and y-coordinates are drawn on the screen or an Image where the luminance of each pixel is drawn on the screen. Many other types of visualizations instead summarize the data, typically by computing statistics from it.

One of the most common such summaries is a Histogram. To compute a standard histogram we can use the np.histogram function and wrap the return value in a Histogram element:

In [ ]:

hv.Histogram(np.histogram(economic_data.growth, density=True), kdims='Growth')

Alternatively, we can let HoloViews estimate the distribution itself using kernel density estimation by declaring a Distribution element, which accepts data along with the key dimension (a column of a dataframe, in this case) whose distribution we want to visualize:

In [ ]:

hv.Distribution(economic_data, 'growth')

Another option for summarizing a dataset is using a box plot, which we can declare using a BoxWhisker element, specifying the 'country' as a kdim and growth as the vdim, giving us an overview of the distribution of growth percentages for each country:

In [ ]:

hv.BoxWhisker(economic_data, 'country', 'growth')
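
If you want the numbers behind such a summary, Pandas can compute comparable per-group statistics directly. This is only a rough numerical counterpart, sketched on a made-up dataframe with hypothetical countries 'A' and 'B'; it is not how HoloViews computes the box plot internally.

```python
import pandas as pd

# Invented data: two hypothetical countries with a few growth values each.
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'growth':  [1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 2.0, 4.0],
})

# describe() reports count, mean, std, min, quartiles, and max per group.
summary = toy.groupby('country')['growth'].describe()
median_a = float(summary.loc['A', '50%'])
median_b = float(summary.loc['B', '50%'])
print(summary[['25%', '50%', '75%']])
```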

Composing elements together¶

As you saw above, we very often want to combine multiple elements into a single plot, both to save space and to show how things are related. In this section, we introduce the two composition operators + and *, which build Layout and Overlay objects.

Layouts¶

Layouts are useful for grouping related data side by side:

In [ ]:

layout = trajectory + hv.Scatter(trajectory) + hv.Area(trajectory) + hv.Spikes(trajectory)
layout.cols(2)

Putting items together like this saves space, but even more importantly it acts as a declaration to HoloViews that these objects are related to each other. HoloViews thus ensures that all of them will share the same axis ranges for all dimensions that match, and zooming or panning on any one of them will make the corresponding change to the others (try it!). The next section of this tutorial will describe how to change or override this behavior when appropriate.

If we look closely, we can see that the result of applying the + operator is an hv.Layout object (with a hint that a two-column layout is desired):

In [ ]:

print(layout)

Now let us build a new layout by selecting elements from layout:

In [ ]:

layout.Curve.I + layout.Spikes.I

We see that a Layout lets us pick component elements via two levels of tab-completable attribute access. By default, the type of the element defines the first level of access, and the second level uses Roman numerals (because Python identifiers cannot start with numbers).

These two levels correspond to another type of semantic declaration that applies to the elements directly (rather than their dimensions), called group and label. Specifically, group allows you to declare what kind of thing this object is, while label allows you to label which specific object it is. What you put in those declarations, if anything, will form the title of the plot:

In [ ]:

cannonball = trajectory.relabel('Cannonball', group='Trajectory')
integral = hv.Area(cannonball).relabel('Filled')
labelled_layout = cannonball + integral
labelled_layout 

In [ ]:

# Exercise: Try out the tab-completion of labelled_layout to build a new layout swapping the position of these elements

In [ ]:

# Optional: Try using two levels of dictionary-style access to grab the cannonball trajectory

Overlays¶

Layout places objects side by side, allowing it to collect (almost!) any HoloViews objects that you want to indicate are related. Another operator * allows you to overlay elements into a single plot, if they live in the same space (with matching dimensions, and preferably with similar ranges over those dimensions). The result of * is an Overlay:

In [ ]:

trajectory * hv.Spikes(trajectory)

The indexing system of Overlay is identical to that of Layout.

In [ ]:

# Exercise: Make an overlay of the Spikes object from layout on top of the filled trajectory area of labelled_layout

One thing that is specific to Overlays is the use of color cycles to automatically differentiate between elements of the same type and group:

In [ ]:

tennis_ball = cannonball.clone((xs, 0.5*np.array(ys)), label='Tennis Ball')
cannonball + tennis_ball + (cannonball * tennis_ball)

Here we use the clone method to make a shallower tennis-ball trajectory: the clone method creates a new object that preserves semantic metadata while allowing overrides (in this case we override the input data and the label).

As you can see, HoloViews can determine that the two overlaid curves will be distinguished by color, and so it also provides a legend so that the mapping from color to data is clear. In an Overlay, the title will contain whatever is shared between the elements in the plot, and for the third plot here, the title is just the group name "Trajectory" because the two labels differ.

In [ ]:

# Optional Exercise: 
# 1. Create a thrown_ball curve with half the height of tennis_ball by cloning it and assigning the label 'Thrown ball'
# 2. Add thrown_ball to the overlay

Slicing and selecting¶

HoloViews elements can easily be sliced using array-style syntax or using the .select method. The following example shows how we can slice the cannonball trajectory into its ascending and descending components:

In [ ]:

ascending  = cannonball[-10.0:0.5].relabel('ascending')
descending = cannonball.select(x=(0,None)).relabel('descending')
ascending * descending

As before, the color automatically cycles when overlaying two elements of the same type, HoloViews adds a legend when the different components are labelled, and the plot's title is "Trajectory" because the group name is all that is common to these elements.

Note that the slicing in HoloViews is done in the continuous space of the dimension and not in the integer space of individual data samples. In this instance, the slice is over the Horizontal distance dimension and we can see that the slicing semantics follow the usual Python convention of an inclusive lower bound and an exclusive upper bound.

This example also illustrates why we typically use simple identifiers for dimension names and reserve longer descriptions for the dimension labels: certain methods such as the select method shown above accept dimension names as keywords.

Onwards¶

In later tutorials, we will see how elements and the principles of composition extend to containers, which makes exploring more complex data quick, easy, and interactive. Before we examine the container types, the next tutorial will look at how to customize the appearance of elements, change the plotting extension, and specify output formats.

For a quick demonstration related to what we will be covering, hit the kernel restart button (⟳) in the toolbar for this notebook, change hv.extension('bokeh') to hv.extension('matplotlib') in the first cell, and rerun the notebook!