NanoEvents awkward1 demo

NanoEvents is a Coffea utility to wrap the CMS NanoAOD or similar flat nTuple structure into a single awkward array with appropriate object methods (such as Lorentz vector methods), cross references, and nested objects, all lazily accessed from the source ROOT TTree via uproot.

NanoEvents is in an experimental stage. It has been available for awkward0 for about 6 months and was recently ported to awkward1. Here we demo an awkward1-based NanoEvents array.

It can be instantiated using the NanoEventsFactory:

In [ ]:
import awkward1 as ak
from coffea.nanoevents import NanoEventsFactory

fname = "https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root"
cache = {}
factory = NanoEventsFactory(fname, cache=cache)
events = factory.events()

The events object is an awkward array, which at its top level is a record array with one record for each "collection", where a collection is a grouping of column (TBranch) names, categorized based on the available columns as follows:

  • one branch exists named name and no branches start with name_, interpreted as a single flat array;
  • one branch exists named name, one named n{name}, and no branches start with name_, interpreted as a single jagged array;
  • no branch exists named n{name} and many branches start with name_*, interpreted as a flat table; or
  • one branch exists named n{name} and many branches start with name_*, interpreted as a jagged table.

Any ROOT TTree that follows such a naming convention should be readable as a NanoEvents array.
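
The four rules above can be sketched in plain Python. This is a simplified illustration, not the code NanoEvents actually uses: the function name `classify_collections` and the example branch list are hypothetical, and real files may contain edge cases this sketch ignores.

```python
def classify_collections(branches):
    """Group branch names into collections following the four naming rules."""
    branches = set(branches)
    # Candidate collection names: the part before the first underscore
    candidates = {b.split("_", 1)[0] for b in branches}

    kinds = {}
    for name in sorted(candidates):
        has_counter = ("n" + name) in branches
        has_self = name in branches
        has_members = any(b.startswith(name + "_") for b in branches)
        if has_members:
            kinds[name] = "jagged table" if has_counter else "flat table"
        elif has_self:
            kinds[name] = "jagged array" if has_counter else "flat array"

    # Counter branches (n{name}) belong to their collection, not to themselves
    for name in list(kinds):
        target = name[1:]
        if name.startswith("n") and kinds.get(target, "").startswith("jagged"):
            del kinds[name]
    return kinds


# Hypothetical branch list, for illustration only
example = ["run", "nJet", "Jet_pt", "Jet_eta",
           "Generator_x1", "Generator_x2", "nPSWeight", "PSWeight"]
kinds = classify_collections(example)
```

With this input, `Jet` is classified as a jagged table, `Generator` as a flat table, `PSWeight` as a jagged array, and `run` as a flat array.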

For example, in the file we opened, the branches:

Generator_binvar
Generator_scalePDF
Generator_weight
Generator_x1
Generator_x2
Generator_xpdf1
Generator_xpdf2
Generator_id1
Generator_id2

are grouped into one sub-record named Generator which can be accessed using either getitem or getattr syntax, i.e. events["Generator"] or events.Generator.

In [ ]:
events.Generator.id1
In [ ]:
# all column names can be listed with:
events.Generator.columns
In [ ]:
# In CMS NanoAOD, each TBranch title is a help string, which is carried into the NanoEvents
# e.g. executing the following cell should produce a help pop-up "id of first parton"
events.Generator.id1?

Based on a collection's name, some collections acquire additional methods, which are extra features exposed by the code in the mixin classes of the nanoaod.methods module. For example, although events.GenJet has the columns:

In [ ]:
events.GenJet.columns

we can access additional attributes associated with each generated jet, because the jets can be interpreted as Lorentz vectors:

In [ ]:
events.GenJet.energy

We can call more complex methods, like computing the distance $\Delta R = \sqrt{\Delta \eta^2 + \Delta \phi ^2}$ between two LorentzVector objects:

In [ ]:
# find distance between leading jet and all electrons in each event
events.Jet[:, 0].delta_r(events.Electron)
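
For reference, the $\Delta R$ formula can be written out for a single pair of objects in plain Python. This is a scalar sketch, not the vectorized NanoEvents method; the helper names are hypothetical, and wrapping $\Delta \phi$ into $[-\pi, \pi)$ is the usual convention, assumed here rather than taken from the coffea source.

```python
import math


def delta_phi(phi1, phi2):
    """Signed azimuthal difference, wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi


def delta_r(eta1, phi1, eta2, phi2):
    """Distance in the eta-phi plane: sqrt(d_eta^2 + d_phi^2)."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))
```

The wrap matters for objects on either side of the $\phi = \pm\pi$ boundary: without it, two nearby objects at $\phi = 3.0$ and $\phi = -3.0$ would appear to be almost 6 radians apart.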

The mapping from collection name to methods is controlled by NanoEventsFactory.default_mixins and can be overridden with new mappings in the NanoEventsFactory constructor, if desired. Additional methods provide convenience functions for interpreting some branches, e.g.

In [ ]:
# unpacked Jet_jetId flags
events.Jet.isTight
In [ ]:
# unpacked GenPart_statusFlags
events.GenPart.hasFlags(['isPrompt', 'isLastCopy'])
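
Unpacking a flag branch amounts to testing bits of an integer. The sketch below shows the idea for a single particle. The bit positions in `FLAG_BITS` are assumptions made for illustration and should be checked against the branch documentation; the function `has_flags` is hypothetical, and it assumes that all requested flags must be set.

```python
# Assumed bit positions, for illustration only; check the TBranch title
FLAG_BITS = {"isPrompt": 0, "isLastCopy": 13}


def has_flags(status_flags, names):
    """Return True if every named flag bit is set in the packed integer."""
    mask = 0
    for name in names:
        mask |= 1 << FLAG_BITS[name]
    return (status_flags & mask) == mask


packed = (1 << 0) | (1 << 13)  # a particle with both bits set
both = has_flags(packed, ["isPrompt", "isLastCopy"])
only_prompt = has_flags(1 << 0, ["isPrompt", "isLastCopy"])
```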

CMS NanoAOD also contains pre-computed cross-references for some types of collections. For example, there is a TBranch Electron_genPartIdx which indexes the GenPart collection per event to give the matched generated particle, and -1 if no match is found. NanoEvents transforms these indices into an awkward indexed array pointing to the collection, so that one can directly access the matched particle using getattr syntax:

In [ ]:
events.Electron.matched_gen.pdgId
In [ ]:
events.Muon.matched_jet.pt
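
Conceptually, a cross-reference is a per-event lookup in which -1 becomes a missing value. A minimal pure-Python sketch of that idea, using nested lists in place of awkward arrays and made-up values (the function name `follow_index` is hypothetical):

```python
def follow_index(indices, targets):
    """Per event, replace each index by the referenced target, or None for -1."""
    return [
        [event_targets[i] if i >= 0 else None for i in event_indices]
        for event_indices, event_targets in zip(indices, targets)
    ]


# Made-up example: two events, electron -> generated-particle indices
electron_gen_idx = [[2, -1], [0]]
genpart_pdg_id = [[23, 11, -11], [11, 22]]
matched_pdg_id = follow_index(electron_gen_idx, genpart_pdg_id)
```

Here `matched_pdg_id` is `[[-11, None], [11]]`: the second electron of the first event has no match.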

For generated particles, the parent index is similarly mapped:

In [ ]:
events.GenPart.parent.pdgId

In addition, using the parent index, a helper method computes the inverse mapping, namely, children. As such, one can find particle siblings with:

In [ ]:
events.GenPart.parent.children.pdgId
# notice this is a doubly-jagged array
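
The inverse mapping can be illustrated for one event in plain Python. This is a sketch of the idea only, with a hypothetical function name and made-up indices, not the NanoEvents implementation:

```python
def children_from_parents(parent_idx):
    """Invert a parent-index list: children[i] lists the particles whose parent is i."""
    children = [[] for _ in parent_idx]
    for child, parent in enumerate(parent_idx):
        if parent >= 0:  # -1 marks a particle with no parent
            children[parent].append(child)
    return children


parent_idx = [-1, 0, 0, 1]  # made-up event with four particles
children = children_from_parents(parent_idx)
# Siblings of particle 1 are the children of its parent (including itself)
siblings_of_1 = children[parent_idx[1]]
```

Each particle maps to a list of children, which is why the result gains one level of nesting per event.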

Since one often wants to skip over repeated copies of the same particle in a decay sequence, a helper method distinctParent is also available. Here we use it to find the parent particle ID for all prompt electrons:

In [ ]:
events.GenPart[
    (abs(events.GenPart.pdgId) == 11)
    & events.GenPart.hasFlags(['isPrompt', 'isLastCopy'])
].distinctParent.pdgId
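
One plausible reading of "distinct parent" is: walk up the parent chain until the particle ID changes. The sketch below implements that reading for a single event. It is an assumption about the behavior, not the actual NanoEvents code, and the function name and example values are made up.

```python
def distinct_parent(i, parent_idx, pdg_id):
    """Index of the first ancestor of particle i with a different pdgId, or -1."""
    p = parent_idx[i]
    while p >= 0 and pdg_id[p] == pdg_id[i]:
        p = parent_idx[p]
    return p


# Made-up chain: Z (23) -> e (11) -> e (11) -> e (11)
pdg_id = [23, 11, 11, 11]
parent_idx = [-1, 0, 1, 2]
ancestor = distinct_parent(3, parent_idx, pdg_id)
```

For the last electron copy, `ancestor` is 0, the Z boson, rather than the intermediate electron copies.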

Events can be filtered like any other awkward array using boolean fancy-indexing:

In [ ]:
mmevents = events[ak.num(events.Muon) == 2]
zmm = mmevents.Muon[:, 0] + mmevents.Muon[:, 1]
zmm.mass
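
Adding two Lorentz vectors and taking the mass corresponds to the following arithmetic, shown for a single pair in plain Python. The helper names are hypothetical and the inputs are made up; the conversion from $(p_T, \eta, \phi, m)$ to Cartesian components is the standard one.

```python
import math


def four_vector(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to Cartesian (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return energy, px, py, pz


def invariant_mass(p1, p2):
    """Mass of the sum of two four-vectors: sqrt(E^2 - |p|^2)."""
    energy, px, py, pz = (a + b for a, b in zip(p1, p2))
    m2 = energy * energy - px * px - py * py - pz * pz
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding errors


# Two back-to-back massless particles, 45 GeV each (made-up values)
mu1 = four_vector(45.0, 0.0, 0.0, 0.0)
mu2 = four_vector(45.0, 0.0, math.pi, 0.0)
pair_mass = invariant_mass(mu1, mu2)
```

For this back-to-back configuration the pair mass comes out at 90 GeV, up to floating-point rounding.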

One can assign new variables to the arrays, with some caveats:

  • Assignment must use setitem (events["path", "to", "name"] = value)
  • A variable assigned to a sliced events array is not accessible from the original events array
  • New variables are not visible from cross-references
In [ ]:
mmevents["Electron", "myvar2"] = mmevents.Electron.pt + zmm.mass
mmevents.Electron.myvar2

Just to demonstrate that everything is lazily accessed, here are all the cache items that have built up through the execution of this demo:

In [ ]:
print("\n".join(sorted(cache.keys())))
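
The general pattern behind this behavior is a read-through cache: a column is loaded only on first access and then kept. The sketch below illustrates the pattern with a hypothetical `LazyColumns` class and made-up loaders; it is not how NanoEvents or uproot implement it.

```python
class LazyColumns:
    """Load each column on first access and keep the result in a shared cache."""

    def __init__(self, loaders, cache):
        self._loaders = loaders  # column name -> zero-argument loading function
        self._cache = cache      # dict shared with the caller, as in the demo

    def __getitem__(self, name):
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
        return self._cache[name]


# Made-up loaders standing in for reads from a file
demo_cache = {}
columns = LazyColumns(
    {"Muon_pt": lambda: [30.0, 25.0], "Jet_pt": lambda: [50.0]},
    demo_cache,
)
muon_pt = columns["Muon_pt"]  # only this column is loaded
```

After the single access, `demo_cache` contains `Muon_pt` but not `Jet_pt`, which mirrors why the cache listing only shows the columns this demo touched.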