
Workspace Manipulations

In this chapter, you will learn about various workspace manipulations, including how to convert HistFactory XML+ROOT workspaces to pyhf. We'll cover some common pitfalls, such as the locations of the ROOT files and how to set the base path for the conversion.

Getting the XML+ROOT

Note that getting the XML+ROOT won't necessarily be covered as part of the tutorial, as it requires ROOT (though ROOT is installed in the Binder instance).

If you want to practice extracting the HistFactory files from the workspace, first create the workspace like so:

In [ ]:

# Need to be in the directory containing config directory
from os import chdir
from pathlib import Path

_top_level_dir = Path.cwd()
chdir(_top_level_dir.joinpath("data", "multichannel_histfactory"))

In [ ]:

! hist2workspace config/example.xml

and you'll notice a few new files being made!

$ ls -lhF results/
total 136K
-rw-r--r-- 1 jovyan jovyan 40K Nov  8 21:01 example_channel1_GaussExample_model.root
-rw-r--r-- 1 jovyan jovyan 38K Nov  8 21:01 example_channel2_GaussExample_model.root
-rw-r--r-- 1 jovyan jovyan 47K Nov  8 21:01 example_combined_GaussExample_model.root
-rw-r--r-- 1 jovyan jovyan 503 Nov  8 21:01 example_GaussExample.root
-rw-r--r-- 1 jovyan jovyan  26 Nov  8 21:01 example_results.table

In [ ]:

! ls -lhF results/

In particular, example_combined_GaussExample_model.root is the file that contains the RooStats::HistFactory::Measurement object:

$ root results/example_combined_GaussExample_model.root 
   ------------------------------------------------------------
  | Welcome to ROOT 6.18/04                  https://root.cern |
  |                               (c) 1995-2019, The ROOT Team |
  | Built for macosx64 on Sep 11 2019, 15:38:23                |
  | From tags/v6-18-04@v6-18-04                                |
  | Try '.help', '.demo', '.license', '.credits', '.quit'/'.q' |
   ------------------------------------------------------------

root [0] 
Attaching file results/example_combined_GaussExample_model.root as _file0...

RooFit v3.60 -- Developed by Wouter Verkerke and David Kirkby 
                Copyright (C) 2000-2013 NIKHEF, University of California & Stanford University
                All rights reserved, please read http://roofit.sourceforge.net/license.txt

(TFile *) 0x7ffaa30d2130
root [1] .ls
TFile**		results/example_combined_GaussExample_model.root	
 TFile*		results/example_combined_GaussExample_model.root	
  KEY: RooWorkspace	combined;1	combined
  KEY: TProcessID	ProcessID0;1	e1e9272e-fddb-11ea-86b3-1556a8c0beef
  KEY: TDirectoryFile	channel1_hists;1	channel1_hists
  KEY: TDirectoryFile	channel2_hists;1	channel2_hists
  KEY: RooStats::HistFactory::Measurement	GaussExample;1

from which you can also extract the necessary XML files:

root [2] GaussExample->PrintXML()
Printing XML Files for measurement: GaussExample
Printing XML Files for channel: channel1
Finished printing XML files
Printing XML Files for channel: channel2
Finished printing XML files
Finished printing XML files

To do this programmatically, you can either write a ROOT macro

// printXML.C
int printXML() {
    TFile* _file0 = TFile::Open("results/example_combined_GaussExample_model.root");
    _file0->Get<RooStats::HistFactory::Measurement>("GaussExample")->PrintXML();

    return 0;
}

and run it

$ root -l -b -q printXML.C

but we can also do the same with PyROOT in as many lines

In [ ]:

import ROOT

_file0 = ROOT.TFile.Open("results/example_combined_GaussExample_model.root")
_file0.GaussExample.PrintXML()

which dumps them into the same directory you ran from:

$ ls -lhF
total 24K
drwxr-xr-x 2 jovyan jovyan 4.0K Nov  8 19:52 config/
drwxr-xr-x 2 jovyan jovyan 4.0K Nov  8 19:52 data/
-rw-r--r-- 1 jovyan jovyan 1.1K Nov  8 21:01 GaussExample_channel1.xml
-rw-r--r-- 1 jovyan jovyan  794 Nov  8 21:01 GaussExample_channel2.xml
-rw-r--r-- 1 jovyan jovyan  459 Nov  8 21:01 GaussExample.xml
drwxr-xr-x 2 jovyan jovyan 4.0K Nov  8 21:01 results/

In [ ]:

! ls -lhF

In [ ]:

chdir(_top_level_dir)

XML to JSON

via the command line

So pyhf comes with a lot of nifty utilities you can access. The documentation for the command line can be found via pyhf --help or online.

In [ ]:

! pyhf --help

Let's focus for now on pyhf xml2json which requires that you have installed pyhf[xmlio] (pyhf with the xmlio option).

python -m pip install pyhf[xmlio]

Again, the online documentation for this option is found here.

In [ ]:

! pyhf xml2json --help

Let's remind ourselves of what the top-level XML file looks like, as this is the ENTRYPOINT_XML.

In [ ]:

! tail -n +15 data/multichannel_histfactory/config/example.xml | cat -n

So to explain these options:

basedir specifies the base directory with respect to which all the XML files are referenced. As you can see from lines 3, 4, and 5, this should be the directory containing results/ and config/
output-file specifies the output JSON file. If one is not specified, this will print to the screen, which you can redirect into a file if you want (pyhf xml2json ... > workspace.json)
hide-progress will disable showing the progress bars when running the script... but we like progress bars 🙂

Let's go ahead and run this command, but we won't specify the output file so it goes to the screen. We'll also disable the progress tracking, just so we have a nicer output for this tutorial.

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | cat -n

Only 130 lines for the entire workspace! Not too shabby. If we look through a couple of pieces:

line 2: specify a list of channels
line 5: specify the samples for channel1
lines 6-10: specify the expected event rate for the signal sample in channel1
line 11: specify a list of modifiers (i.e. parameters that modify the sample)

Similarly, if we continue down to the second half of this JSON, we hit line 72 which specifies a list of measurements for this workspace. In fact, we only have one measurement called GaussExample with the parameter of interest defined as SigXsecOverSM. This measurement also specifies additional parameter configuration such as details for the luminosity modifier (parameter name lumi).

Nearly at the end, the next part of this specification is for the observations (observed data) on line 113. Each observation corresponds to a channel, where channel1 has two bins, and channel2 also has two bins.

Finally, we have a version which specifies the version of the schema used for the HistFactory JSON. In this case, we're using 1.0.0, which is defined by https://pyhf.readthedocs.io/en/v0.7.4/schemas/1.0.0/workspace.json and in turn refers to https://pyhf.readthedocs.io/en/v0.7.4/schemas/1.0.0/defs.json.

What's really nice about the schema definition is that it allows anyone to write their own tooling/scripting to build up the workspace and quickly check if it matches the schema. This will get you 90% of the way there in having a valid workspace to work with.

There are some additional checks that cannot be done by the schema alone, such as detecting name conflicts or ensuring that all samples in a channel have the same binning structure. The good news is that these checks can be done simply by loading the workspace into a pyhf.Workspace object, which will do the schema validation as well as the additional checks.
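To make the binning check concrete, here is a minimal sketch of what such a check could look like on a plain dictionary. It is not how pyhf implements it; the helper name `channels_with_inconsistent_binning` and the numbers in `toy_spec` are made up for illustration, and only the `channels` / `samples` / `data` layout follows the JSON shown above.

```python
import copy

# Toy workspace fragment: channel and sample names mirror the tutorial,
# but the event rates are invented for illustration.
toy_spec = {
    "channels": [
        {
            "name": "channel1",
            "samples": [
                {"name": "signal", "data": [20.0, 10.0], "modifiers": []},
                {"name": "bkg", "data": [100.0, 0.0], "modifiers": []},
            ],
        },
        {
            "name": "channel2",
            "samples": [
                {"name": "bkg", "data": [30.0, 60.0], "modifiers": []},
            ],
        },
    ]
}


def channels_with_inconsistent_binning(spec):
    """Return the names of channels whose samples disagree on the number of bins."""
    mismatched = []
    for channel in spec["channels"]:
        bin_counts = {len(sample["data"]) for sample in channel["samples"]}
        if len(bin_counts) > 1:
            mismatched.append(channel["name"])
    return mismatched


# Break the toy workspace on purpose: bkg in channel1 now has three bins.
broken_spec = copy.deepcopy(toy_spec)
broken_spec["channels"][0]["samples"][1]["data"].append(5.0)

print(channels_with_inconsistent_binning(toy_spec))
print(channels_with_inconsistent_binning(broken_spec))
```

A mismatch like the one in `broken_spec` can pass a purely structural check of the JSON, which is why loading into a pyhf.Workspace is the more reliable test.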

Speaking of pyhf.Workspace objects...

via the Python interface

Let's do the exact same thing, but from the Python interpreter.

In [ ]:

import pyhf
import pyhf.readxml  # not imported by default!

In [ ]:

spec = pyhf.readxml.parse(
    "data/multichannel_histfactory/config/example.xml", "data/multichannel_histfactory"
)

So we're not going to dump this out. We already did that above. Let's just quickly go ahead and load it into a pyhf.Workspace object because we can.

In [ ]:

ws = pyhf.Workspace(spec)
print(f"    channels: {ws.channels}")
print(f"       nbins: {ws.channel_nbins}")
print(f"     samples: {ws.samples}")
print(f"   modifiers: {ws.modifiers}")
print(f"observations: {ws.observations}")

Already, we're seeing a lot of information about this workspace, as it's rather inspectable. Remember, this is not a model. What we call a 'model' combines the channel specification with a measurement; that is, a workspace together with one of its measurements uniquely defines a model. A model might choose a particular parameter of interest to measure or set specific parameters as constant during the fit. These configurations are all stored in the measurements key we saw above. We'll explore more about models in the next chapter.
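As a rough illustration of what a measurement contributes, the sketch below reads the parameter of interest and the fixed parameters out of a measurements list. The helper `describe_measurement` is hypothetical, and the toy entry (in particular the `fixed` flag and initial value for lumi) is an assumption for illustration rather than a copy of the converted workspace.

```python
# Toy "measurements" entry in the layout of the JSON shown earlier.
# The lumi settings here are assumed values, not taken from the tutorial data.
toy_measurements = [
    {
        "name": "GaussExample",
        "config": {
            "poi": "SigXsecOverSM",
            "parameters": [
                {"name": "lumi", "fixed": True, "inits": [1.0]},
            ],
        },
    }
]


def describe_measurement(measurements, name):
    """Return the POI and the names of the fixed parameters for one measurement."""
    for measurement in measurements:
        if measurement["name"] == name:
            config = measurement["config"]
            fixed = [
                parameter["name"]
                for parameter in config.get("parameters", [])
                if parameter.get("fixed", False)
            ]
            return {"poi": config["poi"], "fixed": fixed}
    raise KeyError(f"no measurement named {name!r}")


print(describe_measurement(toy_measurements, "GaussExample"))
```

The channel specification is the same regardless of which measurement you pick; only these settings change from one model to the next.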

Let's move on to more things we can do with the command line.

Workspace Inspection

Now that we have a working command for converting our XML to JSON, let's go ahead and take advantage of the JSON output by piping it to pyhf inspect which will print out a nice summary of our workspace.

In [ ]:

! pyhf inspect --help

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
  pyhf inspect

Immediately, we get a lot of useful information. We can see the number of channels, samples, parameters, and modifiers. Then we get a breakdown of the channels (and the number of bins for each channel), the samples, and the parameters. Finally, we see a list of measurements defined in the workspace, as well as the (*) denoting the default measurement if one is not specified.

Could the number of parameters and modifiers differ?

"Normalizing" a Workspace

There comes a time when you need to compare two workspaces to determine what changed between them. Depending on how each workspace was generated, you might need to "sort" it first. pyhf sort is a utility that will normalize the workspace for you, so that operations like calculating a checksum (pyhf digest) give the same result for workspaces that differ only in ordering.

Simple workspaces like the ones we're using in this tutorial are already nearly sorted... however, this is not true in the real world. Even here, notice how bkg is now the first sample and signal is the second sample after sorting.

In [ ]:

! pyhf sort --help

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
  pyhf sort
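To give a feel for what "normalizing" means, here is a minimal sketch that sorts a workspace-like dictionary by name at each level. It illustrates the idea only and is not pyhf's implementation; the function name `sort_spec` and the toy data are assumptions.

```python
import copy


def sort_spec(spec):
    """Return a copy of spec with channels, samples, modifiers and observations sorted."""
    normalized = copy.deepcopy(spec)
    normalized["channels"].sort(key=lambda channel: channel["name"])
    for channel in normalized["channels"]:
        channel["samples"].sort(key=lambda sample: sample["name"])
        for sample in channel["samples"]:
            sample["modifiers"].sort(key=lambda mod: (mod["name"], mod["type"]))
    normalized["observations"].sort(key=lambda obs: obs["name"])
    return normalized


# Toy workspace with everything deliberately out of order (numbers are invented).
unsorted_spec = {
    "channels": [
        {
            "name": "channel2",
            "samples": [{"name": "bkg", "data": [30.0, 60.0], "modifiers": []}],
        },
        {
            "name": "channel1",
            "samples": [
                {"name": "signal", "data": [20.0, 10.0], "modifiers": []},
                {"name": "bkg", "data": [100.0, 0.0], "modifiers": []},
            ],
        },
    ],
    "observations": [
        {"name": "channel2", "data": [35.0, 55.0]},
        {"name": "channel1", "data": [122.0, 12.0]},
    ],
}

sorted_spec = sort_spec(unsorted_spec)
print([channel["name"] for channel in sorted_spec["channels"]])
print([sample["name"] for sample in sorted_spec["channels"][0]["samples"]])
```

Sorting changes only the order of the entries, never their contents, so the sorted workspace should describe the same physics as the original.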

Computing a digest

Next up is a way to determine if two workspaces are equivalent, simply by comparing their computed digest. Note that this is based on the exact contents of the workspace, so floating-point values that are numerically close are not treated as equal. That is, 2.19999999 and 2.2000001 are different values in the digest calculation, just as they are in Python. We'll show here why sorting is very important.

In [ ]:

! pyhf digest --help

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
  pyhf digest

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
  pyhf sort | \
  pyhf digest

Remember that the ordering of the samples will have switched through the sorting.
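To see why the ordering and small floating-point differences both change the result, here is a stand-alone sketch that hashes a JSON serialization with sha256. It assumes the digest is a hash over a serialized form of the workspace; `toy_digest` is an illustrative helper, not pyhf's function, and the sample data are invented.

```python
import hashlib
import json


def toy_digest(obj):
    """Hash a JSON dump of obj with sha256 (illustrative helper only)."""
    serialized = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf8")
    return hashlib.sha256(serialized).hexdigest()


def by_name(sample):
    """Sort key: order samples alphabetically by name."""
    return sample["name"]


samples_a = [
    {"name": "signal", "data": [20.0, 10.0]},
    {"name": "bkg", "data": [100.0, 0.0]},
]
samples_b = list(reversed(samples_a))  # same content, different order

# Dictionary keys are sorted on serialization, but list order is preserved,
# so the two orderings hash differently.
order_matters = toy_digest(samples_a) != toy_digest(samples_b)

# After sorting both lists the same way, the digests agree.
same_after_sort = toy_digest(sorted(samples_a, key=by_name)) == toy_digest(
    sorted(samples_b, key=by_name)
)

# Nearly equal floats serialize to different text, hence different digests.
floats_differ = toy_digest([2.19999999]) != toy_digest([2.2000001])

print(order_matters, same_after_sort, floats_differ)
```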

The sha256 algorithm is used to compute the checksum for this workspace. This means that one can generally "normalize" all workspaces and then compute the digest, so that equivalent workspaces yield the same digest. As with all command line functionality you've seen so far, there are equivalent ways to do it through Python.

In [ ]:

print(f"Unsorted: {pyhf.utils.digest(ws)}")
print(f"Sorted:   {pyhf.utils.digest(pyhf.Workspace.sorted(ws))}")

"Pruning" away items

Sometimes you want to manipulate workspaces by removing channels or samples or systematics (or measurements). This can be useful when trying to debug fits, or to build background-only workspaces, or to clean up a workspace.

In [ ]:

! pyhf prune --help

prune channels

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
  pyhf prune -c channel1 | \
  pyhf inspect

prune samples

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
  pyhf prune -s signal | \
  pyhf inspect

prune modifiers

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
  pyhf prune -m uncorrshape_signal | \
  pyhf inspect

prune modifier types

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
  pyhf prune -t shapesys | \
  pyhf inspect
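The pruning commands above can be pictured as simple filters on the workspace dictionary. The sketch below is an illustration under that assumption, not pyhf's implementation; `prune_channels`, `prune_samples`, and the toy numbers are hypothetical.

```python
import copy


def prune_channels(spec, names):
    """Drop the named channels together with their observations."""
    pruned = copy.deepcopy(spec)
    pruned["channels"] = [c for c in pruned["channels"] if c["name"] not in names]
    pruned["observations"] = [
        o for o in pruned["observations"] if o["name"] not in names
    ]
    return pruned


def prune_samples(spec, names):
    """Drop the named samples from every channel."""
    pruned = copy.deepcopy(spec)
    for channel in pruned["channels"]:
        channel["samples"] = [
            s for s in channel["samples"] if s["name"] not in names
        ]
    return pruned


# Toy workspace: names mirror the tutorial, numbers are invented.
toy_spec = {
    "channels": [
        {
            "name": "channel1",
            "samples": [
                {"name": "signal", "data": [20.0, 10.0], "modifiers": []},
                {"name": "bkg", "data": [100.0, 0.0], "modifiers": []},
            ],
        },
        {
            "name": "channel2",
            "samples": [{"name": "bkg", "data": [30.0, 60.0], "modifiers": []}],
        },
    ],
    "observations": [
        {"name": "channel1", "data": [122.0, 12.0]},
        {"name": "channel2", "data": [35.0, 55.0]},
    ],
}

without_channel1 = prune_channels(toy_spec, {"channel1"})
background_only = prune_samples(toy_spec, {"signal"})
print([c["name"] for c in without_channel1["channels"]])
print([s["name"] for s in background_only["channels"][0]["samples"]])
```

Note that removing a channel also means removing its observation, otherwise the workspace would contain observed data for a channel that no longer exists.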

Renaming items

In addition to removing items, you might want to rename your channels, samples, modifiers, or measurement names. This can be useful for creating modifier correlations, or removing modifier correlations, or just cleaning up your workspace to get it ready for publication.

In [ ]:

! pyhf rename --help

rename channels

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
  pyhf rename -c channel1 SR -c channel2 CR | \
  pyhf inspect

rename samples

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
  pyhf rename -s bkg background | \
  pyhf inspect

rename modifiers

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
  pyhf rename -m uncorrshape_signal corrshape -m uncorrshape_control corrshape | \
  pyhf inspect
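Giving two modifiers the same name is how a correlation between them is expressed. As a hedged sketch of that idea on a plain dictionary, the helper below maps old modifier names to new ones. The function `rename_modifiers`, the modifier names `syst_signal` / `syst_control` / `syst_shared`, and their type are all made up for illustration. It only touches sample modifiers; a fuller version would also need to update any measurement parameter settings that refer to the old names. Whether the result is a valid model depends on the modifier types involved.

```python
import copy


def rename_modifiers(spec, mapping):
    """Return a copy of spec with modifier names replaced according to mapping."""
    renamed = copy.deepcopy(spec)
    for channel in renamed["channels"]:
        for sample in channel["samples"]:
            for modifier in sample["modifiers"]:
                modifier["name"] = mapping.get(modifier["name"], modifier["name"])
    return renamed


# Toy workspace with two independently named modifiers (hypothetical names).
toy_spec = {
    "channels": [
        {
            "name": "channel1",
            "samples": [
                {
                    "name": "bkg",
                    "data": [100.0, 0.0],
                    "modifiers": [
                        {"name": "syst_signal", "type": "normsys",
                         "data": {"hi": 1.05, "lo": 0.95}},
                    ],
                }
            ],
        },
        {
            "name": "channel2",
            "samples": [
                {
                    "name": "bkg",
                    "data": [30.0, 60.0],
                    "modifiers": [
                        {"name": "syst_control", "type": "normsys",
                         "data": {"hi": 1.05, "lo": 0.95}},
                    ],
                }
            ],
        },
    ]
}

correlated = rename_modifiers(
    toy_spec, {"syst_signal": "syst_shared", "syst_control": "syst_shared"}
)
modifier_names = {
    modifier["name"]
    for channel in correlated["channels"]
    for sample in channel["samples"]
    for modifier in sample["modifiers"]
}
print(modifier_names)
```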

rename measurements

In [ ]:

! pyhf xml2json --basedir data/multichannel_histfactory data/multichannel_histfactory/config/example.xml --hide-progress | \
  pyhf rename --measurement GaussExample FitConfig | \
  pyhf inspect