Notebook

Tutorial 04. Working with tabular data

As we have already discovered, HoloViews elements are simple wrappers around your data that provide a semantically meaningful representation. The real power of HoloViews becomes most evident when working with larger, multi-dimensional datasets, whether they are tabular like in a database or CSV file, or gridded like large datasets of images.

Tabular data (also called columnar data) is one of the most common, general, and versatile data formats, corresponding to how data is laid out in a spreadsheet. There are many different ways to put data into a tabular format, but for interactive analysis having tidy data provides flexibility and simplicity. Here we will show how to make your data tidy as a first step, but see hvPlot for convenient ways to work with non-tidy data directly.

In this tutorial all the information you have learned in the previous sections will finally really pay off. We will discover how to facet data and use different element types to explore and visualize the data contained in a real dataset, using many of the same libraries introduced earlier along with some statistics methods from SciPy:

In [ ]:

import numpy as np
import scipy.stats as ss
import pandas as pd
import holoviews as hv
hv.extension('bokeh')
%opts Curve Scatter Bars [tools=['hover']]

What is tabular, tidy data?¶

In [ ]:

macro_df = pd.read_csv('../data/macro.csv')
macro_df.head()

For tidy data, the columns of the table represent variables or dimensions and the rows represent observations.

The opposite of tidy data is so-called wide data. To see what wide data looks like, we can use the pandas pivot_table method:

In [ ]:

wide = macro_df.pivot_table('unem', 'year', 'country')
wide.head(5)

In this wide format we can see that each column represents the unemployment figures for one country indexed, and each row a particular year. A wide table can represent data very concisely in some cases, but it is difficult to work with in practice because it does not make dimensions easily accessible for plotting or analysis. To go from wide to tidy data you can use the pd.melt function:

In [ ]:

melted = pd.melt(wide.reset_index(), id_vars='year',  value_name='unemployment')
melted.head()

Declaring Dimensions on a Dataset¶

A HoloViews Dataset is similar to a HoloViews Element, without having any specific metadata that lets it visualize itself. A Dataset is useful for specifying a set of Dimensions that apply to the data, which will later be inherited by any visualizable elements that you create from the Dataset.

A HoloViews Dimension is the same concept as a dependent or independent variable in mathematics. In HoloViews such variables are called value dimensions and key dimensions (respectively). In macro_df, the 'country' and 'year' are the independent variables, so when we create a Dataset we will declare those as key dimensions. The remaining columns will automatically be inferred to be value dimensions:

In [ ]:

macro = hv.Dataset(macro_df, ['country', 'year'])
macro

One of the first things we'll want to do with our Dimensions is to give them more sensible labels using redim.label:

In [ ]:

macro = macro.redim.label(growth='GDP Growth', unem='Unemployment', year='Year', country='Country')

Notice how HoloViews differs from using a plotting library directly in this case -- here we can annotate the data once to capture our knowledge about what those columns represent, and those annotations will then feed directly into any plots later derived from this data. In a plotting program, you would normally need to supply such metadata every single time you make a new plot, which is tedious, error prone, and discourages easy exploration. In the rest of this tutorial we'll see how we can explore and reveal this annotated data.

Groupby¶

The great thing about a tidy tabular Dataset is that we can easily group the data by a particular variable, allowing us to plot or analyze each subset separately. Let's say for instance that we want to break the macro-economic data down by 'year'. Using the groupby method we can easily split the Dataset into subsets by year:

In [ ]:

print(macro.groupby('Year'))

The resulting object has a top-level data structure called a HoloMap, indexed by year. A HoloMap (like its dynamically generated equivalent DynamicMap) is a potentially many-dimensional indexable container for Elements and Datasets, allowing them to be explored easily.

However, we cannot visualize this particular HoloMap, because a Dataset has no visual representation. We haven't yet told HoloViews whether the various dependent variables here are continuous, discrete, binned, or any of the other properties the data could have that would allow a specific Element to be chosen.

Mapping dimensions to elements¶

Luckily, as soon as you choose what sort of Element you want a given column to be, you can make this data visualizable using the convenient .to method, which allows us to group the dataset and map dimensions to elements in a single step.

The .to method of a Dataset takes up to four main arguments:

The element you want to convert to
The key dimensions (i.e., independent variables) to display
The dependent variables to display, if any
The dimensions to group by, if any

As a first simple example let's go through such a declaration:

We will use a Curve, to declare that the variables are continuous
Our independent variable will be the 'year'
Our dependent variable will be 'unem'
We will groupby the 'country'.

In [ ]:

curves = macro.to(hv.Curve, 'year', 'unem', groupby='country')
print(curves)
curves

If you look at the printed output you will see that instead of a simple Curve we got a HoloMap of Curve Elements for each country. Each Curve is now visualizable, but we haven't told HoloViews what to do with the Country, and so to make the entire structure visualizable, HoloViews creates a widget where you can select which country you want to view. Additional value dimensions would result in additional widgets here.

Alternatively we could also group by the year and view the unemployment rate by country as Bars instead. If we simply want to groupby all remaining key dimensions (in this case just the year) we can leave out the groupby argument:

In [ ]:

%%opts Bars [width=600 xrotation=45]
bars = macro.sort('country').to(hv.Bars, 'country', 'unem')
bars

In [ ]:

# Exercise: Create a hv.HeatMap using ``macro.to``, declaring kdims 'year' and 'country', and vdims 'growth'
# You'll need to declare ``width`` and/or ``xrotation`` plot options for HeatMap to make the plot readable
# You can also add ``tools=['hover']`` to get more info on each cell

Displaying distributions¶

Often we want to summarize the distribution of values, e.g. to reveal the distribution of unemployment rates for each OECD country across all measurements. This means we want to ignore the 'year' dimension in our dataset, letting it be summarized instead. To stop HoloViews from grouping by the extra variable, we pass an empty list to the groupby argument. In this case we can easily declare the BoxWhisker directly, but omitting a key dimension from the groupby can be useful in cases when there are more dimensions:

In [ ]:

%%opts BoxWhisker [width=800 xrotation=30] (box_fill_color=Palette('Category20'))
macro.to(hv.BoxWhisker, 'country', 'growth', groupby=[])
# Is equivalent to:
hv.BoxWhisker(macro, kdims=['country'], vdims=['growth'])

In [ ]:

# Exercise: Display the distribution of GDP growth by year using the BoxWhisker element

Faceting dimensions¶

Once the data has been grouped into a HoloMap as we did above, we can further use the grouping capabilities by using the .grid, .layout and .overlay methods to lay the groups out on the page rather than flipping through them with a set of widgets.

NdOverlay¶

In [ ]:

%%opts Scatter [width=800 height=400 size_index='growth'] (color=Palette('Category20') size=5)
%%opts NdOverlay [legend_position='left']
ndoverlay = macro.to(hv.Scatter, 'year', ['unem', 'growth']).overlay()
print(ndoverlay)
ndoverlay.relabel('OECD Unemployment 1960 - 1990')

GridSpace¶

In [ ]:

%%opts GridSpace [shared_yaxis=True]
subset = macro.select(country=['Austria', 'Belgium', 'Netherlands', 'West Germany'])
grid = subset.to(hv.Bars, 'year', 'unem').grid()
print(grid)
grid

To understand what is actually going on here, let's rewrite this example in a slightly different way. Instead of using the convenient .to or .groupby methods, we can express the same thing by explicitly iterating over the countries we want to look at, selecting the subset of the data for that country using the .select and then passing these plots to the container we want.

In the example above that means we select by 'country' on the macro Dataset, pass the selection to Bars elements, and declare the key and value dimension to display. We then pass the dictionary of Bars elements to the GridSpace container and declare the kdim of the container as 'Country':

In [ ]:

countries = ['Austria', 'Belgium', 'Netherlands', 'West Germany']
hv.GridSpace({country: hv.Bars(macro.select(country=country), 'year', 'unem') for country in countries},
             kdims=['Country'])

As you can see, .to is much simpler and less error-prone in practice.

NdLayout¶

In [ ]:

%%opts Curve [width=200 height=200]
ndlayout = subset.to(hv.Curve, 'year', 'unem').layout()
print(ndlayout)
ndlayout

In [ ]:

## Exercise: Recreate the plot above using hv.NdLayout and using macro.select just as we did for the GridSpace above

Aggregating¶

Another common operation is computing aggregates (summary transformations that collapse some dimensions of the data down to scalar values like a mean or a variance). We can compute and visualize these easily using the aggregate method. The aggregate method lets you declare the dimension(s) to aggregate by and a function to aggregate with (with an optional secondary function to compute the spread if desired). Once we have computed the aggregate we can simply pass it to the Curve and ErrorBars:

In [ ]:

%%opts Curve [width=600]
agg = macro.reindex(vdims=['growth']).aggregate('year', function=np.mean, spreadfn=np.std)
hv.Curve(agg) * hv.ErrorBars(agg)

In [ ]:

# Exercise: Display aggregate GDP growth by country, building it up in a series of steps
# Step 1. First, aggregate the data by country rather than by year, using
# np.mean and ss.sem as the function and spreadfn, respectively, then
# make a `Bars` element from the resulting ``agg``

In [ ]:

# Step 2: You should now have a bars plot, but with no error bars. Now add ErrorBars as above. 
# Hint: You'll want to make the plot wider and use an xrotation to see the labels clearly

Onward¶

Go through the Tabular Data getting started and user guide.
Learn about slicing, indexing and sampling in the Indexing and Selecting Data user guide.

The next section shows a similar approach, but for working with gridded data, in multidimensional array formats.