NanoEvents is a Coffea utility to wrap the CMS NanoAOD or similar flat nTuple structure into a single awkward array with appropriate object methods (such as Lorentz vector methods), cross references, and nested objects, all lazily accessed from the source ROOT TTree via uproot.
NanoEvents is in an experimental stage, and has been available in awkward0 for about 6 months. It was ported to awkward1 quite recently. Here we demo an awkward1-based NanoEvents array.
It can be instantiated using the NanoEventsFactory:
import awkward1 as ak
from coffea.nanoevents import NanoEventsFactory
fname = "https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root"
cache = {}
factory = NanoEventsFactory(fname, cache=cache)
events = factory.events()
The events object is an awkward array, which at its top level is a record array with one record for each "collection", where a collection is a grouping of column (TBranch) names, categorized based on the available columns as follows:
- one branch exists named name and no branches start with name_, interpreted as a single flat array;
- one branch exists named name, one named n{name}, and no branches start with name_, interpreted as a single jagged array;
- no branch exists named n{name} and many branches start with name_*, interpreted as a flat table; or
- one branch exists named n{name} and many branches start with name_*, interpreted as a jagged table.
Any ROOT TTree that follows such a naming convention should be readable as a NanoEvents array.
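The naming rules above can be sketched in plain Python. This is only an illustration of the convention, not the code NanoEvents uses: the function classify_collections and the sample branch list are hypothetical, and the sketch assumes that the collection name is everything before the first underscore.

```python
def classify_collections(branch_names):
    """Group branch names into collections following the naming rules.

    Hypothetical helper for illustration; not part of coffea.
    """
    names = set(branch_names)
    # Table prefixes: the text before the first underscore (assumption).
    prefixes = {b.split("_", 1)[0] for b in names if "_" in b}
    # "nX" is a counts branch if X is a table prefix or a bare branch.
    counts = {
        b for b in names
        if b.startswith("n") and (b[1:] in prefixes or b[1:] in names)
    }
    result = {}
    for name in prefixes:
        result[name] = "jagged table" if "n" + name in counts else "flat table"
    for name in names - counts:
        if "_" in name or name in prefixes:
            continue  # already handled as part of a table
        result[name] = "jagged array" if "n" + name in counts else "flat array"
    return result


# Made-up branch list covering all four cases
branches = [
    "run",
    "nPSWeight", "PSWeight",
    "Generator_x1", "Generator_x2",
    "nJet", "Jet_pt", "Jet_eta",
]
kinds = classify_collections(branches)
```

With this input, run is a flat array, PSWeight a jagged array, Generator a flat table, and Jet a jagged table.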
For example, in the file we opened, the branches:
Generator_binvar
Generator_scalePDF
Generator_weight
Generator_x1
Generator_x2
Generator_xpdf1
Generator_xpdf2
Generator_id1
Generator_id2
are grouped into one sub-record named Generator, which can be accessed using either getitem or getattr syntax, i.e. events["Generator"] or events.Generator.
events.Generator.id1
# all column names can be listed with:
events.Generator.columns
# In CMS NanoAOD, each TBranch title is a help string, which is carried into the NanoEvents
# e.g. executing the following cell should produce a help pop-up "id of first parton"
events.Generator.id1?
Based on a collection's name, some collections acquire additional methods, which are extra features exposed by the code in the mixin classes of the nanoaod.methods module. For example, although events.GenJet has the columns:
events.GenJet.columns
we can access additional attributes associated with each generated jet, because the jets can be interpreted as Lorentz vectors:
events.GenJet.energy
We can call more complex methods, like computing the distance $\Delta R = \sqrt{\Delta \eta^2 + \Delta \phi ^2}$ between two LorentzVector objects:
# find distance between leading jet and all electrons in each event
events.Jet[:, 0].delta_r(events.Electron)
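For reference, the $\Delta R$ formula can be worked through for a single pair of objects in plain Python. This is a minimal sketch rather than the NanoEvents implementation: the function names and the sample values are made up, and the azimuthal difference is wrapped into $[-\pi, \pi)$, which is the usual convention.

```python
import math


def delta_phi(phi1, phi2):
    """Signed azimuthal difference wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi


def delta_r(eta1, phi1, eta2, phi2):
    """Distance in the eta-phi plane: sqrt(d_eta^2 + d_phi^2)."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


# One made-up event: a leading jet and two electrons, each as (eta, phi)
leading_jet = (0.0, 3.0)
electrons = [(0.0, -3.0), (1.0, 3.0)]
distances = [delta_r(*leading_jet, *ele) for ele in electrons]
```

The first electron is close to the jet even though the raw difference in phi is 6, because the wrapped difference is only $2\pi - 6$.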
The mapping from collection name to methods is controlled by NanoEventsFactory.default_mixins and can be overridden with new mappings in the NanoEventsFactory constructor, if desired. Additional methods provide convenience functions for interpreting some branches, e.g.
# unpacked Jet_jetId flags
events.Jet.isTight
# unpacked GenPart_statusFlags
events.GenPart.hasFlags(['isPrompt', 'isLastCopy'])
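Unpacking a flag branch amounts to testing bits of an integer. The sketch below shows the idea for a single value. The helper has_flags is hypothetical, and the bit positions in FLAG_BITS are assumptions for illustration; the authoritative layout is in the branch help string.

```python
# Assumed bit positions, for illustration only.
FLAG_BITS = {"isPrompt": 0, "isHardProcess": 7, "isLastCopy": 13}


def has_flags(status_flags, names):
    """True if every named flag bit is set in the packed integer."""
    mask = 0
    for name in names:
        mask |= 1 << FLAG_BITS[name]
    return (status_flags & mask) == mask


# A made-up packed value with bits 0 and 13 set
packed = (1 << 0) | (1 << 13)
prompt_last_copy = has_flags(packed, ["isPrompt", "isLastCopy"])
hard_process = has_flags(packed, ["isHardProcess"])
```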
CMS NanoAOD also contains pre-computed cross-references for some types of collections. For example, there is a TBranch Electron_genPartIdx which indexes the GenPart collection per event to give the matched generated particle, and -1 if no match is found. NanoEvents transforms these indices into an awkward indexed array pointing to the collection, so that one can directly access the matched particle using getattr syntax:
events.Electron.matched_gen.pdgId
events.Muon.matched_jet.pt
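What the cross-reference does can be sketched with nested Python lists. This is a conceptual illustration with made-up data, not how NanoEvents builds its indexed arrays: the function resolve is hypothetical, and a missing match (-1) is represented here by None.

```python
def resolve(indices, targets):
    """Per event, replace each index by the target it points to.

    An index of -1 means "no match" and becomes None.
    """
    return [
        [event_targets[i] if i >= 0 else None for i in event_indices]
        for event_indices, event_targets in zip(indices, targets)
    ]


# Made-up data for three events
electron_genPartIdx = [[2, -1], [], [0]]
genpart_pdgId = [[21, 23, 11], [22], [-11, 13]]
matched_pdgId = resolve(electron_genPartIdx, genpart_pdgId)
```

The result has the same jagged shape as the electron collection, with one entry (or None) per electron.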
For generated particles, the parent index is similarly mapped:
events.GenPart.parent.pdgId
In addition, using the parent index, a helper method computes the inverse mapping, namely children. As such, one can find particle siblings with:
events.GenPart.parent.children.pdgId
# notice this is a doubly-jagged array
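The inverse mapping can be sketched for a single event as follows. The function children_of and the parent indices are made up for illustration; the actual helper operates on whole awkward arrays.

```python
def children_of(parent_idx):
    """Invert a list of parent indices into lists of child indices."""
    children = [[] for _ in parent_idx]
    for child, parent in enumerate(parent_idx):
        if parent >= 0:  # -1 means the particle has no parent
            children[parent].append(child)
    return children


# One made-up event: particle 0 is the root, 1 and 2 are its children,
# and 3 is a child of 1.
parent_idx = [-1, 0, 0, 1]
children = children_of(parent_idx)

# parent.children: for each particle, the children of its parent.
# The list includes the particle itself.
siblings = [children[p] if p >= 0 else None for p in parent_idx]
```

Each particle maps to a list, so over many events the result is nested twice, which is why the array above is doubly jagged.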
Since one often wants to skip over repeated particles in a decay sequence, a helper method distinctParent is also available. Here we use it to find the parent particle ID for all prompt electrons:
events.GenPart[
(abs(events.GenPart.pdgId) == 11)
& events.GenPart.hasFlags(['isPrompt', 'isLastCopy'])
].distinctParent.pdgId
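One plausible reading of distinctParent is "the nearest ancestor whose pdgId differs from the particle's own". The sketch below implements that reading for a single event. It is an assumption about the behaviour, not the coffea implementation, and the function name and data are made up.

```python
def distinct_parent(i, parent_idx, pdg_id):
    """Index of the nearest ancestor with a different pdgId, or -1."""
    p = parent_idx[i]
    # Walk up the chain while the ancestor is a copy of the same particle.
    while p >= 0 and pdg_id[p] == pdg_id[i]:
        p = parent_idx[p]
    return p


# One made-up event: a Z (23) and a chain of three copies of an electron (11)
pdg_id = [23, 11, 11, 11]
parent_idx = [-1, 0, 1, 2]
last_copy_parent = distinct_parent(3, parent_idx, pdg_id)
```

For the last electron copy, the direct parent is another electron, while the distinct parent is the Z at index 0.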
Events can be filtered like any other awkward array using boolean fancy-indexing:
mmevents = events[ak.num(events.Muon) == 2]
zmm = mmevents.Muon[:, 0] + mmevents.Muon[:, 1]
zmm.mass
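The selection and the four-vector sum can be spelled out in plain Python for a few made-up events. This is a sketch of the arithmetic only: the function names are hypothetical, and each muon is represented as a (pt, eta, phi, mass) tuple.

```python
import math


def p4(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to Cartesian (px, py, pz, E)."""
    px, py, pz = pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return (px, py, pz, energy)


def pair_mass(a, b):
    """Invariant mass of the sum of two four-vectors."""
    px, py, pz, energy = (x + y for x, y in zip(p4(*a), p4(*b)))
    m2 = energy * energy - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))  # guard against rounding below zero


# Made-up events: each muon is (pt, eta, phi, mass)
muons = [
    [(45.0, 0.0, 0.0, 0.0), (45.0, 0.0, math.pi, 0.0)],
    [(30.0, 1.2, 0.5, 0.0)],
    [],
]
# Keep only events with exactly two muons, then sum the pair
dimuon = [ev for ev in muons if len(ev) == 2]
masses = [pair_mass(ev[0], ev[1]) for ev in dimuon]
```

Only the first event survives the selection, and its two back-to-back massless muons combine to a mass of about 90.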
One can assign new variables to the arrays, with some caveats:
- assignment must use setitem syntax (events["path", "to", "name"] = value);
- assignment to a sliced events array won't be accessible from the original variable.
mmevents["Electron", "myvar2"] = mmevents.Electron.pt + zmm.mass
mmevents.Electron.myvar2
Just to demonstrate that everything is lazily accessed, here are all the cache items that have built up through the execution of this demo:
print("\n".join(sorted(cache.keys())))
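The idea of lazy access with a user-supplied cache can be illustrated with a small stand-in class. LazyColumns and its loaders are hypothetical and much simpler than what NanoEvents and uproot do. The sketch only shows that a column is read on first access and that the cache then holds exactly the columns that were touched.

```python
class LazyColumns:
    """Read a column on first access and keep it in a shared cache."""

    def __init__(self, loaders, cache):
        self._loaders = loaders  # column name -> zero-argument reader
        self._cache = cache      # dict supplied by the caller

    def __getitem__(self, name):
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
        return self._cache[name]


reads = []  # records which columns were actually read


def make_loader(name, values):
    def load():
        reads.append(name)
        return values
    return load


demo_cache = {}
columns = LazyColumns(
    {
        "Jet_pt": make_loader("Jet_pt", [50.0, 32.0]),
        "Jet_eta": make_loader("Jet_eta", [0.1, -1.3]),
    },
    demo_cache,
)
columns["Jet_pt"]
columns["Jet_pt"]  # second access is served from the cache
```

After these two accesses the cache contains only Jet_pt, and the reader for that column has run once.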