Lightning talk at PyHEP 2022 • Frank Sauerburger
import matplotlib.pyplot as plt
import uproot
import uhepp
Inspecting the root file. It contains three TH1F histograms, one for
root_file = uproot.open("toy_data.root")
root_file.keys()
data_raw = root_file["data"].values(flow=True)
data_raw
data_raw.shape
data_stat = root_file["data"].errors(flow=True)
data_yield = uhepp.Yield(data_raw, data_stat)
bkg_raw = root_file["bkg"].values(flow=True)
bkg_stat = root_file["bkg"].errors(flow=True)
bkg_yield = uhepp.Yield(bkg_raw, bkg_stat)
sig_raw = root_file["signal"].values(flow=True)
sig_stat = root_file["signal"].errors(flow=True)
sig_yield = uhepp.Yield(sig_raw, sig_stat)
bin_edges = root_file["data"].to_numpy()[1]
First attempt to visualize the histograms using plat matplotlib functions.
Since the data is already histogrammed, we can use the trick and pass the histogram data as weights
to hist()
.
plt.hist(bin_edges[:-1], bin_edges, weights=data_raw[1:-1], histtype='step', label="Data")
plt.hist(bin_edges[:-1], bin_edges, weights=bkg_raw[1:-1], histtype='step', label="Background")
plt.hist(bin_edges[:-1], bin_edges, weights=sig_raw[1:-1], histtype='step', label="Signal")
plt.legend()
None
Creating a plot with uhepp involved two tightly coupled but still separated tasks. First, we need to add the raw histogram contents. Afterward, we can set up the visual appearance of the plot.
len(bin_edges)
Note there are 41 bin edges for 40 regular bins. The yield objects create above contains 42 bins, including the underflow and overflow bins.
hist = uhepp.UHeppHist("$m$", bin_edges)
The raw data is stored in the yields attribute. It is another dictionary mapping arbitrary internal names to the binned data. The binned data is stored as Yield objects. The yield objects couple the value with its uncertainties. A yield object comes close to a ROOT TH1 object (with some key distinctions). Yield objects can be added, scales, etc., while propagating uncertainties. Let's first create the yield objects from the sample data.
Finally, let's add the yields to our histogram.
hist.yields = {
"sig": sig_yield,
"bkg": bkg_yield,
"data": data_yield
}
The names that we use here as dictionary keys, can be arbitrary strings. You are encouraged to use descriptive names, which makes editing the histogram much easier. We will use these names later to refer to the yields when we specify the main plot's content or the ratio plot. In this example, we've created a single background entry. In a real-world histogram, you would have several different physics processes with one entry per process. You are encouraged to use a fine-grained process list here. Merging two or more yield entries in the visual specification is easy.
From the visual point of view, uhepp histograms are made up of a list of stacks. A stack consists of stack items. The bin contents of stack items within a stack are added (stacked). Separate stacks are down on top of each other. A stack in this context does not only refer to a bar histogram (stepfilled
). It could be only the outline of bar histogram (step
) or points
usually used for measured data.
Let's start by setting up a stack for the signal and background expectation (here mc
for Monte Carlo). This stack consists of one stack item for the signal
and one for the background
.
background_si = uhepp.StackItem(["bkg"], "Background")
signal_si = uhepp.StackItem(["sig"], "Signal", color="red")
The creation of a StackItem
takes at least two arguments. The first argument is a list of yield names we've defined in the previous section. Here you see the real power of separating the yield data from their stack definition. By passing multiple processes to the first argument, we can merge histograms with a single line of Python code. The second argument is used as the label in the legend. This is also the time to specify the color, line style, and other style settings. We choose to have the signal in red and leave the background color up to the system.
The stack items need to be combined into a Stack
. This tells the renderer to stack the bars of each stack item vertically. Finally, we can add the mc_stack
to the stack list of our histogram object.
mc_stack = uhepp.Stack([background_si, signal_si])
hist.stacks.append(mc_stack)
We proceed similarly with data. Here we have a single item in the stack. The type of the stack should be points
.
data_si = uhepp.StackItem(["data"], "Data")
data_stack = uhepp.Stack([data_si], bartype="points")
hist.stacks.append(data_stack)
To improve the output, we can also make the label of the x-axis more verbose. In the beginning, we've set the symbol already to m
. Here we add a verbose label Mass
and the unit GeV
. On top, we can add some metadata. Some tools use the filename attribute to have a default filename, for example, when you render the plot to graphics file.
hist.variable = "Mass"
hist.unit = "GeV"
hist.filename = "higgs_mass_dist"
hist.author = "Your name"
Now it is time to see how far we've got.
hist.render()
So far so good. However, we realize that a ratio plot between data and the background
plus signal
model would have been nice. Adding a ratio plot is a matter of adding a RatioItem
to the ratio
list property. The two mandatory arguments of RatioItem
are two lists of yield names: First, the one of the numerator. Second, the one for the denominator.
ratio_si = uhepp.RatioItem(["data"], ["sig", "bkg"], bartype="points")
hist.ratio = [ratio_si]
hist.render()
Can you update the above example, to have the comparison of data against the background-only model?
You might interject that the binning is a bit too fine, and the statistical power of each bin is relatively low. No problem. We can rebin the histogram on the fly during rendering. Simply set an alternative binning. The original histograms are not modified, so you can always undo this step without rerunning your analysis framework.
coarser_bins = hist.bin_edges[::4] # Keep only every fourth bin edge
hist.rebin_edges = coarser_bins
hist.render()
To save the result as a graphics file, call the render
method with a filename.
hist.render("mass_dist.pdf")
At this point, we might want to save the result not only in a graphics file but also as a uhepp plot. The whole point of uhepp is that you can save the intermediate result between raw data and graphics files such that you can make modifications to the visual appearance without reprocessing your data set.
The uhepp format does not define the syntax of the file format, only its semantics. You can use any semi-structured format that supports lists and maps. Popular choices are JSON or YAML. We will use JSON in this tutorial.
hist.to_json("mass_hist.json")
Uhep plots stored in JSON or YAML can be modifed with many popular tools. You could even change labels or style settings manually with a text editor. Loading previously saved histograms is also easy.
# Could be in a new Python session
import uhepp
loaded_hist = uhepp.from_json("mass_hist.json")
loaded_hist.render()
Think about your recent work. Was there a moment when you wanted to modify the labels, axis ranges, collaboration badges, color scheme, binning, histogram content etc., after the plots have been produced? Had you stored the intermediate data in uhepp format, these tasks are trivial.
In high-energy physics, a common task is to processes enormous data sets. Often this is only possible with a computing cluster. The resulting histograms and plots are stored on a remote file system. To see what the analysis code did, one has to copy the files to the local computer.
Today, high-energy experiments can only be built and maintained by large collaborations. The actual analysis of the data takes place in a team. Showing histograms to colleagues is a daily recurring task. To me, it was both tedious and unsatisfactory to prepare a PDF plot books.
Wouldn't it be nice to have a central hub for all plots? Computing nodes can push histograms in uhepp format to the hub. The analyzer can review plots in realtime; colleagues can browse through the plots online, extract yields, download them, or even upload modified (e.g., rebinned) versions themself. Students could also benefit from a central hub. On the hub, a student can collect plots and histograms throughout the whole time as a student. Naturally, things like color schemes, process compositions, etc., evolve during the time as a student, however, in the thesis, it would be nice to present the material uniformly. If all the material is stored on the hub in uhepp format, it's easy to rebrand, recolor the histogram.
1
.Now, you are ready to push your first plot.
receipt = hist.push(1)
receipt
The return value of push()
is a push receipt. By default, it renders to the URL where you can view the plot online. Every plot has a unique identifier (uuid). You can see the uuid in the URL. The push receipt gives you programmatic access to the generated uuid.
receipt.uuid
If you have the uuid of a plot, you can also download it. The uhepp module provides the pull()
method for this purpose.
# This could be a new shell again or a different compute
import uhepp
uuid = 'a378d2b0-cde2-4266-be9b-85945d94880d'
hist = uhepp.pull(uuid)
hist.render()
At this point, we might decide that the binning should be a bit finer again, and signal should be C1
orange instead of red. So let's change this.
hist.rebin_edges = hist.bin_edges[::2] # Drop only every second bin edges
hist.stacks[0].content[1].color = 'C1' # 0-th stack, 1-st stack-item
hist.render()
Additional features not shown in this tutorial