In this chapter, you will learn about various workspace manipulations, including how to convert HistFactory XML+ROOT workspaces to pyhf. We'll cover some common pitfalls, such as the location of the ROOT files and how to set the base path for the conversion.
Note that producing the XML+ROOT files is not covered in detail in this tutorial, as it requires ROOT (though ROOT is installed in the Binder instance).
If you want to practice extracting the HistFactory files from the workspace, first create the workspace like so:
# Need to be in the directory containing config directory
from os import chdir
from pathlib import Path
_top_level_dir = Path.cwd()
chdir(_top_level_dir.joinpath("data", "multichannel_histfactory"))
! hist2workspace config/example.xml
and you'll notice a few new files being made!
$ ls -lhF results/
total 136K
-rw-r--r-- 1 jovyan jovyan 40K Nov 8 21:01 example_channel1_GaussExample_model.root
-rw-r--r-- 1 jovyan jovyan 38K Nov 8 21:01 example_channel2_GaussExample_model.root
-rw-r--r-- 1 jovyan jovyan 47K Nov 8 21:01 example_combined_GaussExample_model.root
-rw-r--r-- 1 jovyan jovyan 503 Nov 8 21:01 example_GaussExample.root
-rw-r--r-- 1 jovyan jovyan 26 Nov 8 21:01 example_results.table
! ls -lhF results/
In particular, example_combined_GaussExample_model.root is the file that contains the RooStats::HistFactory::Measurement object:
$ root results/example_combined_GaussExample_model.root
------------------------------------------------------------
| Welcome to ROOT 6.18/04 https://root.cern |
| (c) 1995-2019, The ROOT Team |
| Built for macosx64 on Sep 11 2019, 15:38:23 |
| From tags/v6-18-04@v6-18-04 |
| Try '.help', '.demo', '.license', '.credits', '.quit'/'.q' |
------------------------------------------------------------
root [0]
Attaching file results/example_combined_GaussExample_model.root as _file0...
RooFit v3.60 -- Developed by Wouter Verkerke and David Kirkby
Copyright (C) 2000-2013 NIKHEF, University of California & Stanford University
All rights reserved, please read http://roofit.sourceforge.net/license.txt
(TFile *) 0x7ffaa30d2130
root [1] .ls
TFile** results/example_combined_GaussExample_model.root
TFile* results/example_combined_GaussExample_model.root
KEY: RooWorkspace combined;1 combined
KEY: TProcessID ProcessID0;1 e1e9272e-fddb-11ea-86b3-1556a8c0beef
KEY: TDirectoryFile channel1_hists;1 channel1_hists
KEY: TDirectoryFile channel2_hists;1 channel2_hists
KEY: RooStats::HistFactory::Measurement GaussExample;1
from which you can extract out the necessary XML files as well:
root [2] GaussExample->PrintXML()
Printing XML Files for measurement: GaussExample
Printing XML Files for channel: channel1
Finished printing XML files
Printing XML Files for channel: channel2
Finished printing XML files
Finished printing XML files
To do this programmatically, you can either write a ROOT macro
// printXML.C
int printXML() {
TFile* _file0 = TFile::Open("results/example_combined_GaussExample_model.root");
_file0->Get<RooStats::HistFactory::Measurement>("GaussExample")->PrintXML();
return 0;
}
and run it
$ root -l -b -q printXML.C
but we can also do the same with PyROOT in just as many lines
import ROOT
_file0 = ROOT.TFile.Open("results/example_combined_GaussExample_model.root")
_file0.GaussExample.PrintXML()
which dumps them into the same directory you ran from:
$ ls -lhF
total 24K
drwxr-xr-x 2 jovyan jovyan 4.0K Nov 8 19:52 config/
drwxr-xr-x 2 jovyan jovyan 4.0K Nov 8 19:52 data/
-rw-r--r-- 1 jovyan jovyan 1.1K Nov 8 21:01 GaussExample_channel1.xml
-rw-r--r-- 1 jovyan jovyan 794 Nov 8 21:01 GaussExample_channel2.xml
-rw-r--r-- 1 jovyan jovyan 459 Nov 8 21:01 GaussExample.xml
drwxr-xr-x 2 jovyan jovyan 4.0K Nov 8 21:01 results/
! ls -lhF
chdir(_top_level_dir)
! pyhf --help
Let's focus for now on pyhf xml2json, which requires that you have installed pyhf[xmlio] (pyhf with the xmlio option).
python -m pip install pyhf[xmlio]
Again, the online documentation for this option is found here.
! pyhf xml2json --help
Let's remind ourselves of what the top-level XML file looks like, as this is the ENTRYPOINT_XML.
! tail -n +15 data/multichannel_histfactory/config/example.xml | cat -n
So to explain these options:
basedir specifies the base directory that all the paths in the XML files are referenced with respect to. As you can see from lines 3, 4, and 5, this should be the directory containing results/ and config/.
output-file specifies the output JSON file. If one is not specified, the JSON is printed to the screen, which you can redirect into a file if you want (pyhf xml2json ... > workspace.json).
hide-progress disables the progress bars when running the script... but we like progress bars 🙂
Let's go ahead and run this command, but we won't specify the output file so it goes to the screen. We'll also disable the progress tracking, just so we have a nicer output for this tutorial.
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | cat -n
Only 130 lines for the entire workspace! Not too shabby. If we look through a couple of pieces, we find the channel channel1 and the signal sample in channel1.
Similarly, if we continue down to the second half of this JSON, we hit line 72, which specifies a list of measurements for this workspace. In fact, we only have one measurement, called GaussExample, with the parameter of interest defined as SigXsecOverSM. This measurement also specifies additional parameter configuration, such as details for the luminosity modifier (parameter name lumi).
Nearly at the end, the next part of this specification is the observations (observed data) on line 113. Each observation corresponds to a channel: channel1 has two bins, and channel2 also has two bins.
Finally, we have a version, which specifies the version of the schema used for the JSON HistFactory specification. In this case, we're using 1.0.0, which is defined by https://pyhf.readthedocs.io/en/v0.7.4/schemas/1.0.0/workspace.json, which in turn refers to https://pyhf.readthedocs.io/en/v0.7.4/schemas/1.0.0/defs.json.
What's really nice about the schema definition is that it allows anyone to write their own tooling/scripting to build up the workspace and quickly check if it matches the schema. This will get you 90% of the way there in having a valid workspace to work with.
There are some additional checks that the schema cannot perform, such as detecting name conflicts or ensuring that all samples in a channel have the same binning structure. The good news is that these checks can be done simply by loading the workspace into a pyhf.Workspace object, which performs the schema validation as well as the additional checks.
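As a rough illustration of the kind of check a schema cannot express, the sketch below verifies that every sample in a channel has the same number of bins as that channel's observation. It uses only the standard library on a small hand-written spec. The helper name check_binning and the toy numbers are assumptions made for illustration; this is not how pyhf implements its checks.

```python
import copy


def check_binning(spec):
    """Return a list of binning problems found in a workspace-like dict."""
    problems = []
    # Number of observed bins per channel, keyed by channel name.
    observed = {obs["name"]: len(obs["data"]) for obs in spec["observations"]}
    for channel in spec["channels"]:
        name = channel["name"]
        nbins = {s["name"]: len(s["data"]) for s in channel["samples"]}
        # All samples in one channel must agree on the number of bins.
        if len(set(nbins.values())) > 1:
            problems.append(f"{name}: samples disagree on binning: {nbins}")
        # The samples must also match the observation for that channel.
        if name not in observed:
            problems.append(f"{name}: no observation for this channel")
        elif any(n != observed[name] for n in nbins.values()):
            problems.append(f"{name}: samples do not match {observed[name]} observed bins")
    return problems


# Toy spec with made-up numbers (hypothetical, not the tutorial's workspace).
toy_spec = {
    "channels": [
        {
            "name": "channel1",
            "samples": [
                {"name": "signal", "data": [20.0, 10.0], "modifiers": []},
                {"name": "bkg", "data": [100.0, 0.0], "modifiers": []},
            ],
        }
    ],
    "observations": [{"name": "channel1", "data": [122.0, 112.0]}],
}

# Break the binning on purpose: the signal sample now has three bins.
broken_spec = copy.deepcopy(toy_spec)
broken_spec["channels"][0]["samples"][0]["data"].append(5.0)

print(check_binning(toy_spec))
print(check_binning(broken_spec))
```

A check like this is cheap to run in your own tooling before handing the spec to pyhf.Workspace, which remains the authoritative validation.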
Speaking of pyhf.Workspace objects...
Let's do the exact same thing, but from the Python interpreter:
import pyhf
import pyhf.readxml # not imported by default!
spec = pyhf.readxml.parse(
"data/multichannel_histfactory/config/example.xml", "data/multichannel_histfactory"
)
We're not going to dump this out, since we already did that above. Let's just load it into a pyhf.Workspace object, because we can.
ws = pyhf.Workspace(spec)
print(f" channels: {ws.channels}")
print(f" nbins: {ws.channel_nbins}")
print(f" samples: {ws.samples}")
print(f" modifiers: {ws.modifiers}")
print(f"observations: {ws.observations}")
Already, we're seeing a lot of information about this workspace, as it's rather inspectable. Remember, this is not a model. What we call a 'model' is the combination of the channel specification with a measurement; that is, a measurement of a workspace uniquely defines a model. A measurement might choose a particular parameter of interest or set specific parameters as constant during the fit. These configurations are all stored in the measurements key we saw above. We'll explore more about models in the next chapter.
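To make the role of the measurements key a little more concrete, here is a minimal sketch that looks up a measurement by name in a spec dictionary and reads its parameter of interest. The helper select_measurement is hypothetical, the toy values are made up, and falling back to the first listed measurement when no name is given is an assumption of this sketch.

```python
def select_measurement(spec, name=None):
    """Return the measurement called `name`, or the first one if name is None."""
    measurements = spec["measurements"]
    if name is None:
        # Assumption of this sketch: the first measurement is the default.
        return measurements[0]
    for measurement in measurements:
        if measurement["name"] == name:
            return measurement
    raise KeyError(f"no measurement named {name!r}")


# Toy measurement block with illustrative values only.
toy_spec = {
    "measurements": [
        {
            "name": "GaussExample",
            "config": {
                "poi": "SigXsecOverSM",
                "parameters": [{"name": "lumi", "fixed": True, "inits": [1.0]}],
            },
        }
    ]
}

measurement = select_measurement(toy_spec, "GaussExample")
poi = measurement["config"]["poi"]
# Names of parameters that this measurement holds constant in the fit.
fixed = [p["name"] for p in measurement["config"]["parameters"] if p.get("fixed")]
print(poi, fixed)
```

The same channel specification paired with a different entry in this list would describe a different model.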
Let's move on to more things we can do with the command line.
Now that we have a working command for converting our XML to JSON, let's go ahead and take advantage of the JSON output by piping it to pyhf inspect, which will print out a nice summary of our workspace.
! pyhf inspect --help
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
pyhf inspect
Immediately, we get a lot of useful information. We can see the number of channels, samples, parameters, and modifiers. Then we get a breakdown of the channels (and the number of bins for each channel), the samples, and the parameters. Finally, we see a list of the measurements defined in the workspace, with (*) marking the default measurement that is used if one is not specified.
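Because the workspace is plain JSON, a similar summary can be sketched with the standard library alone. The helper summarize below is hypothetical, the toy spec is made up, and counting modifiers as unique (name, type) pairs is an assumption of this sketch rather than a statement about what pyhf inspect does internally.

```python
def summarize(spec):
    """Collect channel, sample, modifier, and measurement names from a spec dict."""
    channels = {}
    samples = set()
    modifiers = set()
    for channel in spec["channels"]:
        # Take the bin count from the first sample of the channel.
        channels[channel["name"]] = len(channel["samples"][0]["data"])
        for sample in channel["samples"]:
            samples.add(sample["name"])
            for modifier in sample["modifiers"]:
                modifiers.add((modifier["name"], modifier["type"]))
    return {
        "channels": channels,
        "samples": sorted(samples),
        "modifiers": sorted(modifiers),
        "measurements": [m["name"] for m in spec["measurements"]],
    }


def _toy_channel(name, signal, bkg):
    """Build one toy channel with a signal and a background sample."""
    mu = {"name": "SigXsecOverSM", "type": "normfactor", "data": None}
    lumi = {"name": "lumi", "type": "lumi", "data": None}
    return {
        "name": name,
        "samples": [
            {"name": "signal", "data": signal, "modifiers": [mu]},
            {"name": "bkg", "data": bkg, "modifiers": [lumi]},
        ],
    }


# Toy spec with illustrative numbers only.
toy_spec = {
    "channels": [
        _toy_channel("channel1", [20.0, 10.0], [100.0, 0.0]),
        _toy_channel("channel2", [30.0, 95.0], [50.0, 0.0]),
    ],
    "measurements": [{"name": "GaussExample", "config": {"poi": "SigXsecOverSM", "parameters": []}}],
}

summary = summarize(toy_spec)
for key, value in summary.items():
    print(f"{key:>12}: {value}")
```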
Could the number of parameters and modifiers differ?
There comes a time when you need to compare two workspaces to determine what changed between them. Depending on how each workspace was generated, you might first need to "sort" it. pyhf sort is a utility that normalizes the workspace for you, so that operations like calculating a checksum (pyhf digest) give the same result for workspaces with the same content.
For simple workspaces like the ones we're using in this tutorial, most of the content is already sorted; however, this is generally not true in the real world. In the sorted output, notice how bkg is now the first sample and signal is the second.
! pyhf sort --help
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
pyhf sort
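The idea behind this normalization can be sketched in a few lines: order every list in the spec by name so that two specs with the same content end up identical. The function normalize below is a hypothetical stand-in written for illustration, and the choice of sort keys is an assumption of this sketch, not pyhf's implementation.

```python
import copy


def normalize(spec):
    """Return a copy of the spec with channels, samples, and modifiers sorted."""
    out = copy.deepcopy(spec)
    out["channels"].sort(key=lambda channel: channel["name"])
    for channel in out["channels"]:
        channel["samples"].sort(key=lambda sample: sample["name"])
        for sample in channel["samples"]:
            sample["modifiers"].sort(key=lambda m: (m["name"], m["type"]))
    out["observations"].sort(key=lambda obs: obs["name"])
    return out


signal = {"name": "signal", "data": [20.0, 10.0], "modifiers": []}
bkg = {"name": "bkg", "data": [100.0, 0.0], "modifiers": []}
observations = [{"name": "channel1", "data": [122.0, 112.0]}]

# Two toy specs with the same content but a different sample order.
spec_a = {"channels": [{"name": "channel1", "samples": [signal, bkg]}], "observations": observations}
spec_b = {"channels": [{"name": "channel1", "samples": [bkg, signal]}], "observations": observations}

print(spec_a == spec_b)
print(normalize(spec_a) == normalize(spec_b))
print([s["name"] for s in normalize(spec_a)["channels"][0]["samples"]])
```

The two raw specs compare as different because list order matters, while their normalized copies compare as equal.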
Next up is a way to determine if two workspaces are equivalent, simply by comparing their computed digests. Note that the digest is based on the exact contents of the workspace and makes no allowance for floating-point differences. That is, 2.19999999 and 2.2000001 are treated as different values in the digest calculation, just as they are in Python. We'll show here why sorting is very important.
! pyhf digest --help
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
pyhf digest
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
pyhf sort | \
pyhf digest
Remember that sorting switched the ordering of the samples, which is why the two digests differ.
The sha256 algorithm is used to compute the checksum for this workspace. This means that one can "normalize" all workspaces and then compute the digest, so that workspaces with the same content yield the same digest and, barring hash collisions, workspaces with different content do not. As with all the command line functionality you've seen so far, there are equivalent ways to do this through Python.
print(f"Unsorted: {pyhf.utils.digest(ws)}")
print(f"Sorted: {pyhf.utils.digest(pyhf.Workspace.sorted(ws))}")
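To make the caveats about floating-point values and ordering concrete, the following sketch computes a SHA-256 checksum over a JSON serialization with sorted keys. The helper toy_digest is hypothetical, and hashing a key-sorted JSON dump is an assumption about how such a digest can be built, not a copy of pyhf's code.

```python
import hashlib
import json


def toy_digest(obj):
    """SHA-256 hex digest of a JSON dump with sorted dictionary keys."""
    payload = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf8")
    return hashlib.sha256(payload).hexdigest()


sample = {"name": "bkg", "data": [2.19999999]}
reordered_keys = {"data": [2.19999999], "name": "bkg"}
nearby_float = {"name": "bkg", "data": [2.2000001]}

# Dictionary key order is irrelevant because the keys are sorted on dump.
same_for_key_order = toy_digest(sample) == toy_digest(reordered_keys)
# Nearly equal floats serialize to different text, so the digests differ.
differs_for_floats = toy_digest(sample) != toy_digest(nearby_float)
# List order is preserved by JSON, which is why sorting the workspace matters.
differs_for_list_order = toy_digest(["signal", "bkg"]) != toy_digest(["bkg", "signal"])

print(same_for_key_order, differs_for_floats, differs_for_list_order)
```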
Sometimes you want to manipulate workspaces by removing channels or samples or systematics (or measurements). This can be useful when trying to debug fits, or to build background-only workspaces, or to clean up a workspace.
! pyhf prune --help
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
pyhf prune -c channel1 | \
pyhf inspect
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
pyhf prune -s signal | \
pyhf inspect
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
pyhf prune -m uncorrshape_signal | \
pyhf inspect
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
pyhf prune -t shapesys | \
pyhf inspect
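Conceptually, pruning is just filtering lists in the JSON. The sketch below removes samples by name and modifiers by type from a toy spec dictionary. The helpers prune_samples and prune_modifier_types, the modifier name, and the numbers are all hypothetical; for real workspaces the pyhf prune command shown above is the tool to use.

```python
import copy


def prune_samples(spec, names):
    """Return a copy of the spec without the samples listed in `names`."""
    out = copy.deepcopy(spec)
    for channel in out["channels"]:
        channel["samples"] = [s for s in channel["samples"] if s["name"] not in names]
    return out


def prune_modifier_types(spec, types):
    """Return a copy of the spec without modifiers whose type is in `types`."""
    out = copy.deepcopy(spec)
    for channel in out["channels"]:
        for sample in channel["samples"]:
            sample["modifiers"] = [m for m in sample["modifiers"] if m["type"] not in types]
    return out


# Toy spec; the modifier name and all numbers are made up.
toy_spec = {
    "channels": [
        {
            "name": "channel1",
            "samples": [
                {
                    "name": "signal",
                    "data": [20.0, 10.0],
                    "modifiers": [{"name": "SigXsecOverSM", "type": "normfactor", "data": None}],
                },
                {
                    "name": "bkg",
                    "data": [100.0, 0.0],
                    "modifiers": [{"name": "toy_shape", "type": "shapesys", "data": [5.0, 0.0]}],
                },
            ],
        }
    ]
}

background_only = prune_samples(toy_spec, {"signal"})
no_shapesys = prune_modifier_types(toy_spec, {"shapesys"})

remaining_samples = [s["name"] for s in background_only["channels"][0]["samples"]]
remaining_types = [
    m["type"] for s in no_shapesys["channels"][0]["samples"] for m in s["modifiers"]
]
print(remaining_samples, remaining_types)
```

Both helpers work on a deep copy, so the original spec is left untouched and the pruned variants can be compared against it.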
In addition to removing items, you might want to rename your channels, samples, modifiers, or measurement names. This can be useful for creating modifier correlations, or removing modifier correlations, or just cleaning up your workspace to get it ready for publication.
! pyhf rename --help
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
pyhf rename -c channel1 SR -c channel2 CR | \
pyhf inspect
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
pyhf rename -s bkg background | \
pyhf inspect
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
pyhf rename -m uncorrshape_signal corrshape -m uncorrshape_control corrshape | \
pyhf inspect
! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
pyhf rename --measurement GaussExample FitConfig | \
pyhf inspect
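Renaming can be pictured the same way: walk the JSON and replace names according to a mapping. The sketch below gives two modifiers the same name, which is the mechanism behind creating a correlation between them. The helper rename_modifiers and the names syst_signal, syst_control, and syst are hypothetical, and a full rename would also have to update any matching parameter entries in the measurements, which this sketch leaves out.

```python
import copy


def rename_modifiers(spec, mapping):
    """Return a copy of the spec with modifier names replaced via `mapping`."""
    out = copy.deepcopy(spec)
    for channel in out["channels"]:
        for sample in channel["samples"]:
            for modifier in sample["modifiers"]:
                # Names missing from the mapping are kept unchanged.
                modifier["name"] = mapping.get(modifier["name"], modifier["name"])
    return out


def _toy_channel(name, modifier_name):
    """Build a toy one-sample channel carrying a single normsys modifier."""
    modifier = {"name": modifier_name, "type": "normsys", "data": {"hi": 1.05, "lo": 0.95}}
    return {"name": name, "samples": [{"name": "bkg", "data": [100.0, 0.0], "modifiers": [modifier]}]}


# Toy spec with one independently named modifier per channel.
toy_spec = {
    "channels": [
        _toy_channel("channel1", "syst_signal"),
        _toy_channel("channel2", "syst_control"),
    ]
}

renamed = rename_modifiers(toy_spec, {"syst_signal": "syst", "syst_control": "syst"})
names_before = {m["name"] for c in toy_spec["channels"] for s in c["samples"] for m in s["modifiers"]}
names_after = {m["name"] for c in renamed["channels"] for s in c["samples"] for m in s["modifiers"]}
print(sorted(names_before), sorted(names_after))
```

Mapping one shared name back to distinct names per channel would do the reverse and remove the correlation.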