#!/usr/bin/env python
# coding: utf-8

# # Reading data with coffea NanoEvents
# 
# This is a rendered copy of [nanoevents.ipynb](https://github.com/scikit-hep/coffea/blob/master/binder/nanoevents.ipynb). You can optionally run it interactively on [binder at this link](https://mybinder.org/v2/gh/coffeateam/coffea/master?filepath=binder%2Fnanoevents.ipynb)
# 
# NanoEvents is a Coffea utility to wrap flat nTuple structures (such as the CMS [NanoAOD](https://www.epj-conferences.org/articles/epjconf/pdf/2019/19/epjconf_chep2018_06021.pdf) format) into a single awkward array with appropriate object methods (such as Lorentz vector methods$^*$), cross references, and nested objects, all lazily accessed$^\dagger$ from the source ROOT TTree via uproot. The interpretation of the TTree data is configurable via [schema objects](https://coffea-hep.readthedocs.io/en/latest/modules/coffea.nanoevents.html#classes), which are community-supplied  for various source file types. These schema objects allow a richer interpretation of the file contents than the [uproot.lazy](https://uproot4.readthedocs.io/en/latest/uproot4.behaviors.TBranch.lazy.html) methods. Currently available schemas include:
# 
#    - `BaseSchema`, which provides a simple representation of the input TTree, where each branch is available verbatim as `events.branch_name`, effectively the same behavior as `uproot.lazy`.  Any branches that uproot supports at "full speed" (i.e. that are fully split and either flat or single-jagged) can be read by this schema;
#    - `NanoAODSchema`, which is optimized to provide all methods and cross-references in CMS NanoAOD format;
#    - `PFNanoAODSchema`, which builds a double-jagged particle flow candidate colllection `events.jet.constituents` from compatible PFNanoAOD input files;
#    - `TreeMakerSchema` which is designed to read TTrees made by [TreeMaker](https://github.com/TreeMaker/TreeMaker), an alternative CMS nTuplization format;
#    - `PHYSLITESchema`, for the ATLAS DAOD_PHYSLITE derivation, a compact centrally-produced data format similar to CMS NanoAOD; and
#    - `DelphesSchema`, for reading Delphes fast simulation [nTuples](https://cp3.irmp.ucl.ac.be/projects/delphes/wiki/WorkBook/RootTreeDescription).
# 
# We welcome contributions for new schemas, and can assist with the design of them.
# 
# $^*$ Vector methods are currently made possible via the [coffea vector](https://coffea-hep.readthedocs.io/en/latest/modules/coffea.nanoevents.methods.vector.html) methods mixin class structure. In a future version of coffea, they will instead be provided by the dedicated scikit-hep [vector](https://vector.readthedocs.io/en/latest/) library, which provides a more rich feature set. The coffea vector methods predate the release of the vector library.
# 
# $^\dagger$ _Lazy_ access refers to only fetching the needed data from the (possibly remote) file when a sub-array is first accessed. The sub-array is then _materialized_ and subsequent access of the sub-array uses a cached value in memory. As such, fully materializing a `NanoEvents` object may require a significant amount of memory.
# 
# 
# In this demo, we will use NanoEvents to read a small CMS NanoAOD sample. The events object can be instantiated as follows:

# In[1]:


import awkward as ak
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

NanoAODSchema.warn_missing_crossrefs = False

fname = "https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root"
events = NanoEventsFactory.from_root(
    {fname: "Events"},
    schemaclass=NanoAODSchema,
    metadata={"dataset": "DYJets"},
).events()


# In the factory constructor, we also pass the desired schema version (the latest version of NanoAOD can be built with `schemaclass=NanoAODSchema`) for this file and some extra metadata that we can later access with `events.metadata`. In a later example, we will show how to set up this metadata in coffea processors where the `events` object is pre-created for you. Consider looking at the [from_root](https://coffea-hep.readthedocs.io/en/latest/api/coffea.nanoevents.NanoEventsFactory.html#coffea.nanoevents.NanoEventsFactory.from_root) class method to see all optional arguments.
# 
# The `events` object is an awkward array, which at its top level is a record array with one record for each "collection", where a collection is a grouping of fields (TBranches) based on the naming conventions of [NanoAODSchema](https://coffea-hep.readthedocs.io/en/latest/api/coffea.nanoevents.NanoAODSchema.html). For example, in the file we opened, the branches:
# ```
# Generator_binvar
# Generator_scalePDF
# Generator_weight
# Generator_x1
# Generator_x2
# Generator_xpdf1
# Generator_xpdf2
# Generator_id1
# Generator_id2
# ```
# are grouped into one sub-record named `Generator` which can be accessed using either getitem or getattr syntax, i.e. `events["Generator"]` or `events.Generator`. e.g.

# In[2]:


events.Generator.id1.compute()


# In[3]:


# all names can be listed with:
events.Generator.fields


# In CMS NanoAOD, each TBranch has a self-documenting help string embedded in the title field, which is carried into the NanoEvents, e.g. executing the following cell should produce a help pop-up:
# ```
# Type:            Array
# String form:     [1, -1, -1, 21, 21, 4, 2, -2, 2, 1, 3, 1, ... -1, -1, 1, -2, 2, 1, 2, -2, -1, 2, 1]
# Length:          40
# File:            ~/src/awkward-1.0/awkward1/highlevel.py
# Docstring:       id of first parton
# Class docstring: ...
# ```
# where the `Docstring` shows information about the content of this array.

# In[4]:


get_ipython().run_line_magic('pinfo', 'events.Generator.id1')


# Based on a collection's name or contents, some collections acquire additional _methods_, which are extra features exposed by the code in the mixin classes of the `coffea.nanoevents.methods` modules. For example, although `events.GenJet` has the fields:

# In[5]:


events.GenJet.fields


# we can access additional attributes associated to each generated jet by virtue of the fact that they can be interpreted as [Lorentz vectors](https://coffea-hep.readthedocs.io/en/latest/api/coffea.nanoevents.methods.vector.LorentzVector.html#coffea.nanoevents.methods.vector.LorentzVector):

# In[6]:


events.GenJet.energy.compute()


# We can call more complex methods, like computing the distance $\Delta R = \sqrt{\Delta \eta^2 + \Delta \phi ^2}$ between two LorentzVector objects:

# In[7]:


# find distance between leading jet and all electrons in each event
dr = events.Jet[:, 0].delta_r(events.Electron)
dr.compute()


# In[8]:


# find minimum distance
ak.min(dr, axis=1).compute()


# In[9]:


# a convenience method for this operation on all jets is available
print(events.Jet.nearest(events.Electron).compute())


# The assignment of methods classes to collections is done inside the schema object during the initial creation of the array, governed by the awkward array's `__record__` parameter and the associated behavior. See [ak.behavior](https://awkward-array.readthedocs.io/en/latest/ak.behavior.html) for a more detailed explanation of array behaviors.
# 
# Additional methods provide convenience functions for interpreting some branches, e.g. CMS NanoAOD packs several jet identification flag bits into a single integer, `jetId`. By implementing the bit-twiddling in the [Jet mixin](https://github.com/scikit-hep/coffea/blob/7045c06b9448d2be4315e65d432e6d8bd117d6d7/coffea/nanoevents/methods/nanoaod.py#L279-L282), the analsyis code becomes more clear:

# In[10]:


print(events.Jet.jetId.compute())
print(events.Jet.isTight.compute())


# We can also define convenience functions to unpack and apply some mask to a set of flags, e.g. for generated particles:

# In[11]:


print(f"Raw status flags: {events.GenPart.statusFlags.compute()}")
events.GenPart.hasFlags(['isPrompt', 'isLastCopy']);


# CMS NanoAOD also contains pre-computed cross-references for some types of collections. For example, there is a TBranch `Electron_genPartIdx` which indexes the `GenPart` collection per event to give the matched generated particle, and `-1` if no match is found. NanoEvents transforms these indices into an awkward _indexed array_ pointing to the collection, so that one can directly access the matched particle using getattr syntax:

# In[12]:


events.Electron.matched_gen.pdgId.compute()


# In[13]:


events.Muon[ak.num(events.Muon)>0].matched_jet.pt.compute()


# For generated particles, the parent index is similarly mapped:

# In[14]:


events.GenPart.parent.pdgId.compute()


# In addition, using the parent index, a helper method computes the inverse mapping, namely, `children`. As such, one can find particle siblings with:

# In[15]:


events.GenPart.parent.children.pdgId.compute()
# notice this is a doubly-jagged array


# Since often one wants to shortcut repeated particles in a decay sequence, a helper method `distinctParent` is also available. Here we use it to find the parent particle ID for all prompt electrons:

# In[16]:


events.GenPart[
    (abs(events.GenPart.pdgId) == 11)
    & events.GenPart.hasFlags(['isPrompt', 'isLastCopy'])
].distinctParent.pdgId.compute()


# Events can be filtered like any other awkward array using boolean fancy-indexing

# In[17]:


mmevents = events[ak.num(events.Muon) == 2]
zmm = mmevents.Muon[:, 0] + mmevents.Muon[:, 1]
zmm.mass.compute()


# In[18]:


# a convenience method is available to sum vectors along an axis:
mmevents.Muon.sum(axis=1).mass.compute()


# As expected for this sample, most of the dimuon events have a pair invariant mass close to that of a Z boson. But what about the last event? Let's take a look at the generator information:

# In[19]:


print(mmevents[-1].Muon.matched_gen.pdgId.compute())
print(mmevents[-1].Muon.matched_gen.hasFlags(["isPrompt"]).compute())


# So they are real generated muons, but they are not prompt (i.e. from the initial decay of a heavy resonance)
# 
# Let's look at their parent particles:

# In[20]:


mmevents[-1].Muon.matched_gen.parent.pdgId.compute()


# aha! They are muons coming from tau lepton decays, and hence a fair amount of the Z mass is carried away by the neutrinos:

# In[21]:


print(mmevents[-1].Muon.matched_gen.sum().mass.compute())
print(mmevents[-1].Muon.matched_gen.parent.sum().mass.compute())


# One can assign new variables to the arrays, with some caveats:
# 
#  * Assignment must use setitem (`events["path", "to", "name"] = value`)
#  * Assignment to a sliced `events` won't be accessible from the original variable
#  * New variables are not visible from cross-references

# In[22]:


mmevents["Electron", "myvariable"] = mmevents.Electron.pt + zmm.mass
mmevents.Electron.myvariable.compute()