boost-histogram logo

Using boost-histogram

Henry Schreiner, Hans Dembenski

Setup (done for you on binder):

conda env create
conda activate bh-talk
jupyter lab

The interesting parts of the environment:

  • boost-histogram 0.10.0: This is the current boost-histogram.
  • hist 2.0.0a5: You have to have a recent alpha release of Hist.
  • uproot4: The new version that will become default by the end of the year.
  • mplhep: The matplotlib HEP helper library

Note:

boost-histogram has minimal, lightweight requirements and works almost anywhere. Anything beyond NumPy and boost-histogram is just for our demos later.

In [ ]:
import boost_histogram as bh
import numpy as np
import matplotlib.pyplot as plt
import mplhep

IPython setup

For the demos, let's see the last expression so it's easier to follow what's happening, even if it is an assignment. The default here is 'last_expr', but 'last_expr_or_assign' is better for demos. Also see 'last', 'all', and 'none'.

In [ ]:
%config InteractiveShell.ast_node_interactivity="last_expr_or_assign"

See the SciPy 2020 talk on YouTube for general overview!

YouTube video snapshot


What is boost-histogram?

The Python ecosystem is missing a good Histogram object. NumPy can perform a histogram operation, but it does not produce an object, and there are limitations and performance issues. The closest thing we have to a histogram is in ROOT.

1: Basic histograms

Let's start with the basics. We will create a histogram using boost-histogram and fill it.

1.1: Data

Let's make a 1D dataset to run on.

In [ ]:
data1 = np.random.normal(3.5, 2.5, size=1_000_000)

1.2: Histogram definition

Now, let's make a 1D histogram

In [ ]:
hist1 = bh.Histogram(bh.axis.Regular(40, -2, 10))

1.3 Histogram fill

Like ROOT, we can fill after we make a histogram, as many times as we want. You can fill single values, but to take advantage of the performance, you should fill with arrays.

In [ ]:
hist1.fill(data1)

You can see that the histogram has been filled. Let's explicitly check to see how many entries are in the histogram:

In [ ]:
hist1.sum()

What happened to the missing items? They are in the underflow and overflow bins:

In [ ]:
hist1.sum(flow=True)

Like ROOT, we have overflow bins by default. We can turn them off, but they enable some powerful things like projections.

Let's plot this:

In [ ]:
plt.bar(hist1.axes[0].centers, hist1.view(), width=hist1.axes[0].widths);

Note: You can leave off the .view() if you want to - histograms conform to the buffer protocol. Also, you can select the axes before or after calling .centers; this is very useful for ND histograms.

From now on, let's be lazy, and use mplhep, which natively supports boost-histogram now.

In [ ]:
mplhep.histplot(hist1);

You should think of boost-histogram like NumPy: No plotting is built in, but the data is easy to access.

Unlike ROOT, Histograms are built of of basic building blocks: 1 or more axes, and a storage. The storage holds accumulators, which can be simple doubles or ints, or more complex things that hold extra information about the operation (which might not even be a sum! (generalized histograms). Here's what it looks like for 1D:

1D histogram drawing

If we go to 2D, this generalizes - we have the same API for ND!

2D histogram

2: Drop-in replacement for NumPy

To start using this yourself, you don't even need to change your code. Let's try the NumPy adapters.

In [ ]:
bins2, edges2 = bh.numpy.histogram(data1, bins=10)
In [ ]:
b2, e2 = np.histogram(data1, bins=10)
In [ ]:
bins2 - b2
In [ ]:
e2 - edges2

Not bad! Let's start moving to the boost-histogram API, so we can use the plotting function we just learned about:

In [ ]:
hist2 = bh.numpy.histogram(data1, bins="auto", histogram=bh.Histogram)
mplhep.histplot(hist2);

Now we can move over to boost-histogram one step at a time! Just to be complete, we can also go back to a Numpy tuple from a Histogram object:

In [ ]:
b2p, e2p = bh.numpy.histogram(data1, bins=10, histogram=bh.Histogram).to_numpy()
b2p == b2

And, while "recently" NumPy optimized 1D regular binning, we still beat optimized NumPy:

In [ ]:
%%timeit
bh.numpy.histogram(data1, bins=100)
In [ ]:
%%timeit
np.histogram(data1, bins=100)

3: More dimensions

The same API works for multiple dimensions.

In [ ]:
hist3 = bh.Histogram(bh.axis.Regular(150, -1.5, 1.5), bh.axis.Regular(100, -1, 1))
In [ ]:
data2d = [np.random.normal(size=1_000_000) for _ in range(2)]
In [ ]:
hist3.fill(*data2d)
In [ ]:
plt.pcolormesh(*hist3.axes.edges.T, hist3.view().T);

This is transposed because of differing indexing conventions.

Let's try a 3D histogram

In [ ]:
data3d = [np.random.normal(size=1_000_000) for _ in range(3)]

hist3d = bh.Histogram(
    bh.axis.Regular(150, -5, 5),
    bh.axis.Regular(100, -5, 5),
    bh.axis.Regular(100, -5, 5),
)

hist3d.fill(*data3d)

Let's project to the first two axes:

In [ ]:
mplhep.hist2dplot(hist3d[:, :, sum]);

4: UHI

Diagram of UHI

Let's explore the boost-histogram UHI syntax. We will start with a 1D histogram:

In [ ]:
h = bh.Histogram(bh.axis.Regular(100, -3.5, 3.5))

Fill it with some mildly interesting data:

In [ ]:
data = np.concatenate([
    np.random.normal(-.75,.3, 1_000_000),
    np.random.normal(.75,.3, 750_000),
    np.random.normal(-1.5,.2, 200_000),
])

h.fill(data)
In [ ]:
mplhep.histplot(h);

I can see that I want x from -2 to 0, in data coordinates:

In [ ]:
mplhep.histplot(h[bh.loc(-2):bh.loc(0)]);

What's the contents of a bin?

In [ ]:
h[bh.loc(-.75)]

To get the coordinates manually, you could do (not UHI):

In [ ]:
h.axes[0].index(.75)

How about reducing a histogram? Let's try the previous 2D Histogram

In [ ]:
mplhep.histplot(hist3[:, sum])
mplhep.histplot(hist3[sum, :]);

Let's look at one part and rebin:

In [ ]:
mplhep.hist2dplot(hist3[: 50 : bh.rebin(2), 50 :: bh.rebin(2)]);

What is the value at (-.75, .5)?

In [ ]:
hist3[bh.loc(-0.75), bh.loc(0.5)]

Distribution is state-of-the-art Scikit-HEP style

The Scikit-HEP developer pages were heavily influenced by boost-histogram.

Platforms

We now support ARM and PowerPC wheels!

Platforms

Ecosystem