Notebook

Uhepp
Sharing plots in a self-contained format

Lightning talk at PyHEP 2022 • Frank Sauerburger

Additional resources¶

Python package: https://pypi.org/project/uhepp/
Package documentation: https://uhepp.readthedocs.io/en/latest/
GitLab repsitory: https://gitlab.cern.ch/fsauerbu/uhepp
Online plot hub: https://uhepp.org
Online hub source code: https://gitlab.sauerburger.com/frank/uhepp.org
Feedback and contributions always welcome

How to use this notebook¶

This talk:
- Go through the notebook
- Explain as I go
Verbose comments and full-text between the cells
- Go though the notebook at your own pace

Why and what?¶

Traditionally, histograms in ROOT files: Full information (split by sample, all bins, ...)
But: Requires custom script to plot
- Gets out of sync
- Rednering to PDF/PNG version destroys information
Solution: Uhepp storage format (JSON/YAML) combines raw data and display settings
- Allows non-destructive changes
- Generic plotting code
- Retreive numeric values

In [ ]:

import matplotlib.pyplot as plt
import uproot
import uhepp

Inspecting the toy data file¶

Inspecting the root file. It contains three TH1F histograms, one for

measured data,
expected background, and
expected signal.

In [ ]:

root_file = uproot.open("toy_data.root")

In [ ]:

root_file.keys()

In [ ]:

data_raw = root_file["data"].values(flow=True)
data_raw

In [ ]:

data_raw.shape

In [ ]:

data_stat = root_file["data"].errors(flow=True)
data_yield = uhepp.Yield(data_raw, data_stat)

In [ ]:

bkg_raw = root_file["bkg"].values(flow=True)
bkg_stat = root_file["bkg"].errors(flow=True)
bkg_yield = uhepp.Yield(bkg_raw, bkg_stat)

In [ ]:

sig_raw = root_file["signal"].values(flow=True)
sig_stat = root_file["signal"].errors(flow=True)
sig_yield = uhepp.Yield(sig_raw, sig_stat)

In [ ]:

bin_edges = root_file["data"].to_numpy()[1]

Naive plot¶

First attempt to visualize the histograms using plat matplotlib functions.

Since the data is already histogrammed, we can use the trick and pass the histogram data as weights to hist().

In [ ]:

plt.hist(bin_edges[:-1], bin_edges, weights=data_raw[1:-1], histtype='step', label="Data")
plt.hist(bin_edges[:-1], bin_edges, weights=bkg_raw[1:-1], histtype='step', label="Background")
plt.hist(bin_edges[:-1], bin_edges, weights=sig_raw[1:-1], histtype='step', label="Signal")
plt.legend()
None

Create your first plot¶

Creating a plot with uhepp involved two tightly coupled but still separated tasks. First, we need to add the raw histogram contents. Afterward, we can set up the visual appearance of the plot.

In [ ]:

len(bin_edges)

Note there are 41 bin edges for 40 regular bins. The yield objects create above contains 42 bins, including the underflow and overflow bins.

In [ ]:

hist = uhepp.UHeppHist("$m$", bin_edges)

The raw data is stored in the yields attribute. It is another dictionary mapping arbitrary internal names to the binned data. The binned data is stored as Yield objects. The yield objects couple the value with its uncertainties. A yield object comes close to a ROOT TH1 object (with some key distinctions). Yield objects can be added, scales, etc., while propagating uncertainties. Let's first create the yield objects from the sample data.

Finally, let's add the yields to our histogram.

In [ ]:

hist.yields = {
    "sig":  sig_yield,
    "bkg":  bkg_yield,
    "data": data_yield
}

The names that we use here as dictionary keys, can be arbitrary strings. You are encouraged to use descriptive names, which makes editing the histogram much easier. We will use these names later to refer to the yields when we specify the main plot's content or the ratio plot. In this example, we've created a single background entry. In a real-world histogram, you would have several different physics processes with one entry per process. You are encouraged to use a fine-grained process list here. Merging two or more yield entries in the visual specification is easy.

Visual settings¶

From the visual point of view, uhepp histograms are made up of a list of stacks. A stack consists of stack items. The bin contents of stack items within a stack are added (stacked). Separate stacks are down on top of each other. A stack in this context does not only refer to a bar histogram (stepfilled). It could be only the outline of bar histogram (step) or points usually used for measured data.

Let's start by setting up a stack for the signal and background expectation (here mc for Monte Carlo). This stack consists of one stack item for the signal and one for the background.

In [ ]:

background_si = uhepp.StackItem(["bkg"], "Background")
signal_si = uhepp.StackItem(["sig"], "Signal", color="red")

The creation of a StackItem takes at least two arguments. The first argument is a list of yield names we've defined in the previous section. Here you see the real power of separating the yield data from their stack definition. By passing multiple processes to the first argument, we can merge histograms with a single line of Python code. The second argument is used as the label in the legend. This is also the time to specify the color, line style, and other style settings. We choose to have the signal in red and leave the background color up to the system.

The stack items need to be combined into a Stack. This tells the renderer to stack the bars of each stack item vertically. Finally, we can add the mc_stack to the stack list of our histogram object.

In [ ]:

mc_stack = uhepp.Stack([background_si, signal_si])
hist.stacks.append(mc_stack)

We proceed similarly with data. Here we have a single item in the stack. The type of the stack should be points.

In [ ]:

data_si = uhepp.StackItem(["data"], "Data")

data_stack = uhepp.Stack([data_si], bartype="points")
hist.stacks.append(data_stack)

To improve the output, we can also make the label of the x-axis more verbose. In the beginning, we've set the symbol already to m. Here we add a verbose label Mass and the unit GeV. On top, we can add some metadata. Some tools use the filename attribute to have a default filename, for example, when you render the plot to graphics file.

In [ ]:

hist.variable = "Mass"
hist.unit = "GeV"
hist.filename = "higgs_mass_dist"
hist.author = "Your name"

Now it is time to see how far we've got.

In [ ]:

hist.render()

So far so good. However, we realize that a ratio plot between data and the background plus signal model would have been nice. Adding a ratio plot is a matter of adding a RatioItem to the ratio list property. The two mandatory arguments of RatioItem are two lists of yield names: First, the one of the numerator. Second, the one for the denominator.

In [ ]:

ratio_si = uhepp.RatioItem(["data"], ["sig", "bkg"], bartype="points")
hist.ratio = [ratio_si]

hist.render()

Can you update the above example, to have the comparison of data against the background-only model?

You might interject that the binning is a bit too fine, and the statistical power of each bin is relatively low. No problem. We can rebin the histogram on the fly during rendering. Simply set an alternative binning. The original histograms are not modified, so you can always undo this step without rerunning your analysis framework.

In [ ]:

coarser_bins = hist.bin_edges[::4]  # Keep only every fourth bin edge
hist.rebin_edges = coarser_bins
hist.render()

To save the result as a graphics file, call the render method with a filename.

In [ ]:

hist.render("mass_dist.pdf")

Local storage¶

At this point, we might want to save the result not only in a graphics file but also as a uhepp plot. The whole point of uhepp is that you can save the intermediate result between raw data and graphics files such that you can make modifications to the visual appearance without reprocessing your data set.

The uhepp format does not define the syntax of the file format, only its semantics. You can use any semi-structured format that supports lists and maps. Popular choices are JSON or YAML. We will use JSON in this tutorial.

In [ ]:

hist.to_json("mass_hist.json")

Uhep plots stored in JSON or YAML can be modifed with many popular tools. You could even change labels or style settings manually with a text editor. Loading previously saved histograms is also easy.

In [ ]:

# Could be in a new Python session
import uhepp
loaded_hist = uhepp.from_json("mass_hist.json")
loaded_hist.render()

Think about your recent work. Was there a moment when you wanted to modify the labels, axis ranges, collaboration badges, color scheme, binning, histogram content etc., after the plots have been produced? Had you stored the intermediate data in uhepp format, these tasks are trivial.

Online storage¶

In high-energy physics, a common task is to processes enormous data sets. Often this is only possible with a computing cluster. The resulting histograms and plots are stored on a remote file system. To see what the analysis code did, one has to copy the files to the local computer.

Today, high-energy experiments can only be built and maintained by large collaborations. The actual analysis of the data takes place in a team. Showing histograms to colleagues is a daily recurring task. To me, it was both tedious and unsatisfactory to prepare a PDF plot books.

Wouldn't it be nice to have a central hub for all plots? Computing nodes can push histograms in uhepp format to the hub. The analyzer can review plots in realtime; colleagues can browse through the plots online, extract yields, download them, or even upload modified (e.g., rebinned) versions themself. Students could also benefit from a central hub. On the hub, a student can collect plots and histograms throughout the whole time as a student. Naturally, things like color schemes, process compositions, etc., evolve during the time as a student, however, in the thesis, it would be nice to present the material uniformly. If all the material is stored on the hub in uhepp format, it's easy to rebrand, recolor the histogram.

Go to https://uhepp.org/login and login using your CERN account and CERN's Single Sign-On.
Currently, signing-up is limited to ATLAS members. You can set up your own hub. The source code is located at https://gitlab.sauerburger.com/frank/uhepp.org
To use the API, you need an API token that is linked to your account. Go to https://uhepp.org/tokens/new and create a token. Follow the instructions given on the token creation page. You need to add the token and the API endpoint to your environment variables.
Plots on the uhepp hub are organized in collections. We will push the toy plot from the previous section to a new collection. Navigate to https://uhepp.org/c/new and create a new collection. On the collection page, you see the ID of the collection. We need to specify this ID when we want to push or pull plots. For this tutorial, we assume the ID is 1.

Now, you are ready to push your first plot.

In [ ]:

receipt = hist.push(1)
receipt

The return value of push() is a push receipt. By default, it renders to the URL where you can view the plot online. Every plot has a unique identifier (uuid). You can see the uuid in the URL. The push receipt gives you programmatic access to the generated uuid.

In [ ]:

receipt.uuid

If you have the uuid of a plot, you can also download it. The uhepp module provides the pull() method for this purpose.

In [ ]:

# This could be a new shell again or a different compute
import uhepp
uuid = 'a378d2b0-cde2-4266-be9b-85945d94880d'
hist = uhepp.pull(uuid)

In [ ]:

hist.render()

At this point, we might decide that the binning should be a bit finer again, and signal should be C1 orange instead of red. So let's change this.

In [ ]:

hist.rebin_edges = hist.bin_edges[::2]  # Drop only every second bin edges
hist.stacks[0].content[1].color = 'C1'  # 0-th stack, 1-st stack-item
hist.render()

Additional features¶

Additional features not shown in this tutorial

Systematic uncertainies
Optional metadata
Custom graphs

UheppSharing plots in a self-contained format

Additional resources¶

How to use this notebook¶

Why and what?¶

Inspecting the toy data file¶

Naive plot¶

Create your first plot¶

Visual settings¶

Local storage¶

Online storage¶

Additional features¶

Uhepp
Sharing plots in a self-contained format