Notebook

PackedSelection Updates¶

In coffea, PackedSelection is a class that can store several boolean arrays in a memory-efficient mannner and evaluate arbitrary combinations of boolean requirements in an CPU-efficient way. Supported inputs include 1D numpy or awkward arrays. This makes it a good tool to form analysis in signal and control regions, and to implement cutflow or "N-1" plots.

This was previously only a boolean array evaluation class. Now more features are being added to include support for a full cutflow of an analysis. My work is in this repository

I'm gonna base this notebook on the tutorial on Coffea given by link Nick Manganelli in this hackathon.

We'll use the same small sample file of 40 Drell-Yan events to demonstrate the utilities. First of all, let's read this file using our NanoAODSchema as our schema.

In [1]:

import numpy as np
import awkward as ak
from matplotlib import pyplot as plt
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema


fname = "coffea/tests/samples/nano_dy.root"
# or remote file just takes way longer to read atm
# fname = "https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root"
events = NanoEventsFactory.from_root(
    fname,
    schemaclass=NanoAODSchema.v6,
    metadata={"dataset": "DYJets"},
).events()

events

Out[1]:

[{Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 ...,
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...}]
----------------------------------------------------------------
type: 40 * event

Now let's import PackedSelection, create an instance of it and add some boolean arrays to it.

In [2]:

from coffea.analysis_tools import PackedSelection

selection = PackedSelection()

selection.add_multiple({"twoElectron": ak.num(events.Electron) == 2,
                        "eleOppSign": ak.sum(events.Electron.charge, axis=1) == 0,
                        "noElectron": ak.num(events.Electron) == 0,
                        "twoMuon": ak.num(events.Muon) == 2,
                        "muOppSign": ak.sum(events.Muon.charge, axis=1) == 0, 
                        "noMuon": ak.num(events.Muon) == 0, 
                        "leadPt20": ak.any(events.Electron.pt >= 20.0, axis=1) | ak.any(events.Muon.pt >= 20.0, axis=1)})

print(selection.names)

['twoElectron', 'eleOppSign', 'noElectron', 'twoMuon', 'muOppSign', 'noMuon', 'leadPt20']

The only thing that is different so far is that we can call add_multiple to add multiple boolean arrays at once using a dictionary of name: selection pairs instead of calling the add method multiple times. As before, we can still normally do stuff like:

In [3]:

print(selection.all("twoElectron", "noMuon", "leadPt20"))
print(selection.require(twoElectron=True, noMuon=True, eleOppSign=False))

[False False  True False False False False False False False False False
 False False False False False False False False  True  True False False
 False False False False False False False False False False False False
 False False False False]
[False False False  True False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False]

Now let's move unto new features. We are now able to perform an N-1 style selection using the nminusone(*names) function. This will perform an N-1 style selection by using as "N" the provided names and will exclude each named cut one at a time in order. In the end it will also peform a selection using all N cuts.

In [4]:

nminusone = selection.nminusone("twoElectron", "noMuon", "leadPt20")
nminusone

Out[4]:

NminusOne(selections=('twoElectron', 'noMuon', 'leadPt20'))

This returns an NminusOne object which has the following methods: result(), print(), yieldhist() and plot_vars().

Let's look at the results of the N-1 selection

In [5]:

res = nminusone.result()
print(type(res), res._fields)

<class 'coffea.analysis_tools.NminusOneResult'> ('labels', 'nev', 'masks')

This is just a namedtuple with the attributes labels, nev and masks. So we can say:

In [6]:

labels, nev, masks = res
labels, nev, masks

Out[6]:

(['initial', 'N - twoElectron', 'N - noMuon', 'N - leadPt20', 'N'],
 [40, 10, 3, 5, 3],
 [array([False,  True,  True, False, False, False, False, False, False,
          True, False, False, False, False, False,  True,  True, False,
         False, False,  True,  True, False, False, False, False, False,
          True, False,  True, False, False, False,  True, False, False,
         False, False, False, False]),
  array([False, False,  True, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False,  True,  True, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False]),
  array([False, False,  True,  True, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
          True, False,  True,  True, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False]),
  array([False, False,  True, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False,  True,  True, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False])])

The labels is a list of labels of each mask that is applied, nev is a list of the number of events that survives each mask and masks is a list of boolean masks (arrays) of which events survive each selection. You can also choose to print the statistics of your N-1 selection in a similar fashion to RDataFrame:

In [7]:

nminusone.print()

N-1 selection stats:
Ignoring twoElectron         : pass = 10                                  all = 40                                  -- eff = 25.0 %
Ignoring noMuon              : pass = 3                                   all = 40                                  -- eff = 7.5 %
Ignoring leadPt20            : pass = 5                                   all = 40                                  -- eff = 12.5 %
All cuts                     : pass = 3                                   all = 40                                  -- eff = 7.5 %

Or get a histogram of your total event yields. This just returns a hist.Hist object and we can plot it with its backends to mplhep.

In [8]:

h, labels = nminusone.yieldhist()
h.plot1d()
plt.xticks(plt.gca().get_xticks(), labels, rotation = 45);

Finally, we can ask from this object to create histograms of different variables, masking them with our N-1 selection. What it will output is a list of histograms, one for each requested variable, where the x-axis is the distribution of the variable, and the y-axis is the mask that was applied. It is essentially slices of how the variable distribution evolves as each N-1 or N selection is applied. It does also return a list of labels of the masks to keep track.

Note that the variables are parsed using a dictonary of name: array pairs and that the arrays will of course be flattened to be histogrammed.

In [9]:

hs, labels = nminusone.plot_vars({'Ept': events.Electron.pt,
                                  'Ephi': events.Electron.phi})
hs, labels

Out[9]:

([Hist(
    Regular(20, 5.8189, 60.0685, name='Ept'),
    IntCategory([0, 1, 2, 3, 4], name='N-1'),
    storage=Double()) # Sum: 60.0,
  Hist(
    Regular(20, -2.93116, 3.11866, name='Ephi'),
    IntCategory([0, 1, 2, 3, 4], name='N-1'),
    storage=Double()) # Sum: 60.0],
 ['initial', 'N - twoElectron', 'N - noMuon', 'N - leadPt20', 'N'])

And we can actually plot those histograms using again the mplhep backend.

In [10]:

for h in hs:
    h.plot2d()
    plt.yticks(plt.gca().get_yticks(), labels, rotation = 0);
    plt.show()

Cutflow is implemented in a similar manner to the N-1 selection. We just have to use the cutflow(*names) function which will return a Cutflow object

In [11]:

cutflow = selection.cutflow("noMuon", "twoElectron", "leadPt20")
cutflow

Out[11]:

Cutflow(selections=('noMuon', 'twoElectron', 'leadPt20'))

The methods of this object are similar to the NminusOne object. The only difference is that now we seperate things in either "onecut" or "cutflow". "onecut" represents results where each cut is applied alone, while "cutflow" represents results where the cuts are applied cumulatively in order.

In [12]:

res = cutflow.result()
print(type(res), res._fields)
labels, nevonecut, nevcutflow, masksonecut, maskscutflow = res
labels, nevonecut, nevcutflow, masksonecut, maskscutflow

<class 'coffea.analysis_tools.CutflowResult'> ('labels', 'nevonecut', 'nevcutflow', 'masksonecut', 'maskscutflow')

Out[12]:

(['initial', 'noMuon', 'twoElectron', 'leadPt20'],
 [40, 28, 5, 17],
 [40, 28, 5, 3],
 [array([ True,  True,  True,  True, False, False, False,  True,  True,
          True, False,  True,  True,  True, False,  True,  True,  True,
          True,  True,  True,  True,  True, False, False,  True, False,
          True, False,  True, False, False,  True,  True, False,  True,
          True,  True,  True,  True]),
  array([False, False,  True,  True, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
          True, False,  True,  True, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False]),
  array([False,  True,  True, False,  True,  True,  True, False, False,
          True, False, False, False, False, False,  True,  True, False,
         False, False,  True,  True, False,  True,  True, False,  True,
          True, False,  True, False,  True, False,  True, False, False,
         False, False, False, False])],
 [array([ True,  True,  True,  True, False, False, False,  True,  True,
          True, False,  True,  True,  True, False,  True,  True,  True,
          True,  True,  True,  True,  True, False, False,  True, False,
          True, False,  True, False, False,  True,  True, False,  True,
          True,  True,  True,  True]),
  array([False, False,  True,  True, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
          True, False,  True,  True, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False]),
  array([False, False,  True, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False,  True,  True, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False])])

As you can see, again we have the same labels, nev and masks only now we have two "versions" of them since they've been split into "onecut" and "cutflow". You can again print the statistics of the cutflow exactly like RDataFrame.

In [13]:

cutflow.print()

Cutflow stats:
Cut noMuon              : pass = 28                                  cumulative pass = 28                                  all = 40                                  --  eff = 70.0 %                -- cumulative eff = 70.0 %
Cut twoElectron         : pass = 5                                   cumulative pass = 5                                   all = 40                                  --  eff = 12.5 %                -- cumulative eff = 12.5 %
Cut leadPt20            : pass = 17                                  cumulative pass = 3                                   all = 40                                  --  eff = 42.5 %                -- cumulative eff = 7.5 %

Again you can extract yield hists only now there are two of them.

In [14]:

honecut, hcutflow, labels = cutflow.yieldhist()

honecut.plot1d(yerr=0)
plt.xticks(plt.gca().get_xticks(), labels, rotation = 45)
plt.show()

hcutflow.plot1d(yerr=0)
plt.xticks(plt.gca().get_xticks(), labels, rotation = 45)
plt.show()

And finally, plot_vars is also there while now it returns two lists of histograms, one for "onecut" and one for "cutflow". Those can of course e plotted in a similar fashion.

In [15]:

h1, h2, labels = cutflow.plot_vars({'ept': events.Electron.pt,
                                    'ephi': events.Electron.phi})
h1, h2, labels

Out[15]:

([Hist(
    Regular(20, 5.8189, 60.0685, name='ept'),
    IntCategory([0, 1, 2, 3], name='onecut'),
    storage=Double()) # Sum: 73.0,
  Hist(
    Regular(20, -2.93116, 3.11866, name='ephi'),
    IntCategory([0, 1, 2, 3], name='onecut'),
    storage=Double()) # Sum: 73.0],
 [Hist(
    Regular(20, 5.8189, 60.0685, name='ept'),
    IntCategory([0, 1, 2, 3], name='cutflow'),
    storage=Double()) # Sum: 63.0,
  Hist(
    Regular(20, -2.93116, 3.11866, name='ephi'),
    IntCategory([0, 1, 2, 3], name='cutflow'),
    storage=Double()) # Sum: 63.0],
 ['initial', 'noMuon', 'twoElectron', 'leadPt20'])

Now, PackedSelection can also operate in delayed or lazy mode and full support for dask_awkward arrays. Use is still the same, but everything now is a delayed dask type object which can be computed whenever the user wants to. This can be done by either calling .compute() on the object or dask.compute(*things).

PackedSelection can be initialized to operate in delayed mode by adding a delayed dask_awkward array for the first time instead of a materialized numpy or awkward one. I would like to note that we only support delayed dask_awkward arrays and not dask.array arrays. Please convert your dask arrays to dask_awkward via dask_awkward.from_dask_array(array). I would also like to note that you cannot mix materialized and delayed arrays in the same PackedSelection. Let's convert our events to a delayed array and perform the exact same things.

In [16]:

import dask_awkward as dak
import dask.array as da
import dask

dakevents = dak.from_awkward(events, 1)
dakevents

Out[16]:

dask.awkward<from-awkward, npartitions=1>

Now dakevents is a delayed dask_awkward version of our events and if we compute it we get our normal events

In [17]:

dakevents.compute()

Out[17]:

[{Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 ...,
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...},
 {Flag: {ecalBadCalibFilterV2: True, ...}, GenPart: [...], ...}]
----------------------------------------------------------------
type: 40 * event

Now we have to use dask_awkward instead of awkward and dakevents instead of events. For instance, dak.num(dakevents.Electron) == 2 will be a delayed array that has to be computed

In [18]:

print(dak.num(dakevents.Electron) == 2)
(dak.num(dakevents.Electron) == 2).compute()

dask.awkward<equal, npartitions=1>

Out[18]:

[False,
 False,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 ...,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False]
---------------
type: 40 * bool

Now let's add the same (now delayed) arrays to PackedSelection

In [19]:

selection = PackedSelection()

selection.add_multiple({"twoElectron": dak.num(dakevents.Electron) == 2,
                        "eleOppSign": dak.sum(dakevents.Electron.charge, axis=1) == 0,
                        "noElectron": dak.num(dakevents.Electron) == 0,
                        "twoMuon": dak.num(dakevents.Muon) == 2,
                        "muOppSign": dak.sum(dakevents.Muon.charge, axis=1) == 0, 
                        "noMuon": dak.num(dakevents.Muon) == 0, 
                        "leadPt20": dak.any(dakevents.Electron.pt >= 20.0, axis=1) | dak.any(dakevents.Muon.pt >= 20.0, axis=1)})

print(selection.names)

['twoElectron', 'eleOppSign', 'noElectron', 'twoMuon', 'muOppSign', 'noMuon', 'leadPt20']

Now, the same functions will return dask_awkward objects that have to be computed.

In [20]:

selection.all("twoElectron", "noMuon", "leadPt20")

Out[20]:

dask.awkward<equal, npartitions=1>

dask in the background builds a DAG and knows how to peform these calculations but doesn't do so until we call compute. We can visualize this graph by calling visualize.

In [21]:

selection.all("twoElectron", "noMuon", "leadPt20").visualize()

Out[21]:

When computing those arrays we should get the same arrays that we got when operating in eager mode.

In [22]:

print(selection.all("twoElectron", "noMuon", "leadPt20").compute())
print(selection.require(twoElectron=True, noMuon=True, eleOppSign=False).compute())

[False, False, True, False, False, ..., False, False, False, False, False]
[False, False, False, True, False, ..., False, False, False, False, False]

Now, N-1 and cutflow will just return only delayed objects that must be computed

In [23]:

nminusone = selection.nminusone("twoElectron", "noMuon", "leadPt20")
nminusone

Out[23]:

NminusOne(selections=('twoElectron', 'noMuon', 'leadPt20'))

It is again an NminusOne object which has the same methods.

Let's look at the results of the N-1 selection in the same way

In [24]:

labels, nev, masks = nminusone.result()
labels, nev, masks

Out[24]:

(['initial', 'N - twoElectron', 'N - noMuon', 'N - leadPt20', 'N'],
 [dask.awkward<sum-agg, type=Scalar, dtype=int64>,
  dask.awkward<sum-agg, type=Scalar, dtype=int64>,
  dask.awkward<sum-agg, type=Scalar, dtype=int64>,
  dask.awkward<sum-agg, type=Scalar, dtype=int64>,
  dask.awkward<sum-agg, type=Scalar, dtype=int64>],
 [dask.awkward<equal, npartitions=1>,
  dask.awkward<equal, npartitions=1>,
  dask.awkward<equal, npartitions=1>,
  dask.awkward<equal, npartitions=1>])

Now however, you can see that everything is a dask awkward object (apart from the labels of course). If we compute them we should get the same things as before and indeed we do

In [25]:

dask.compute(*nev), dask.compute(*masks)

Out[25]:

((40, 10, 3, 5, 3),
 (<Array [False, True, True, False, ..., False, False, False] type='40 * bool'>,
  <Array [False, False, True, False, ..., False, False, False] type='40 * bool'>,
  <Array [False, False, True, True, ..., False, False, False] type='40 * bool'>,
  <Array [False, False, True, False, ..., False, False, False] type='40 * bool'>))

We can again print the statistics, however for this to happen, the object must of course compute the delayed nev list

In [26]:

nminusone.print()

N-1 selection stats:
Ignoring twoElectron         : pass = 10                                  all = 40                                  -- eff = 25.0 %
Ignoring noMuon              : pass = 3                                   all = 40                                  -- eff = 7.5 %
Ignoring leadPt20            : pass = 5                                   all = 40                                  -- eff = 12.5 %
All cuts                     : pass = 3                                   all = 40                                  -- eff = 7.5 %

And now if we call result() again, the nev list is materialized

In [27]:

nminusone.result().nev

Out[27]:

[40, 10, 3, 5, 3]

Again the histogram of your total event yields works. This time it is returns a hist.dask.Hist object.

In [28]:

h, labels = nminusone.yieldhist()
h

Out[28]:

IntCategory([0, 1, 2, 3, 4], growth=True, name='N-1')

Double() Σ=0.0

It appears empty because it hasn't been computed yet. Let's do that.

In [29]:

h.compute()

Out[29]:

IntCategory([0, 1, 2, 3, 4], growth=True, name='N-1')

Double() Σ=61.0

Notice that this doesn't happen in place as h is still not computed.

In [30]:

Out[30]:

IntCategory([0, 1, 2, 3, 4], growth=True, name='N-1')

Double() Σ=0.0

We can again plot this histogram only we have to call plot on the computed one, otherwise it will just be empty.

In [31]:

h.compute().plot1d()
plt.xticks(plt.gca().get_xticks(), labels, rotation = 45);

And we got exactly the same thing. Same logic applies to the plot_vars function. Remember to use dakevents now and not events.

In [32]:

hs, labels = nminusone.plot_vars({'Ept': dakevents.Electron.pt,
                                  'Ephi': dakevents.Electron.phi})
hs, labels

Out[32]:

([Hist(
    Regular(20, 5.8189, 60.0685, name='Ept'),
    IntCategory([0, 1, 2, 3, 4], name='N-1'),
    storage=Double()) # (has staged fills),
  Hist(
    Regular(20, -2.93116, 3.11866, name='Ephi'),
    IntCategory([0, 1, 2, 3, 4], name='N-1'),
    storage=Double()) # (has staged fills)],
 ['initial', 'N - twoElectron', 'N - noMuon', 'N - leadPt20', 'N'])

Those histograms are also delayed and have to be computed before plotting them

Exactly the same things apply to the cutflow in delayed mode. However, I will not repeat this for the sake of time and space in this notebook.

All this requires more testing, documentation, and unittests, but I hope to be able to start a Pull Request to merge into the coffea release repository very soon.