Files, data, event processing

Quick intro to the following packages¶

uproot - ROOT I/O library in pure Python and NumPy.
awkward-array - manipulate arrays of complex data structures as easily as NumPy.

ROOT I/O in pure Python and NumPy

What is `uproot`?¶

Effectively what connects HEP data (ROOT format) with the Python scientific ecosystem around NumPy!

uproot provides very fast, efficient, and convenient access to ROOT trees.

Pure Python + NumPy implementation of ROOT I/O.
An array-centric view of ROOT TTree data:
- Branches of simple types are simple arrays.
- Branches of complex types are “jagged arrays”.
High performance for large baskets, despite Python’s slowness (because all per-entry operations are performed in NumPy).
Greatest benefits: simplicity, minimal installation, set-up, and affinity with machine learning interfaces.

Why does it exist?¶

To extract columnar data (branches) from a ROOT file without invoking the

event-handling infrastructure of the ROOT framework. 3. To express the semantics and conventions of the ROOT file format independently of ROOT, in lieu of a formal specification.

Why Python + NumPy?¶

As stressed several times, the scientific Python ecosystem, including much of ML, is designed around a fundamental abstraction called the NumPy array.
Working with computer scientists is easier when you can say, "pip install uproot".
Implemented correctly, Python + NumPy doesn't have to be slow.
- Finding the columnar data in a ROOT file may be done in slow Python, as long as

decompression and array manipulations are done by compiled code, see the now-old-ish performance study below.

1. Getting data from a simple ROOT TTree¶

By "simple" we mean a file without jagged structures, or nested structures with branch sizes depending on an event-by-event basis.

In [ ]:

import uproot

In [ ]:

f = uproot.open('data/sample_simple-example.root')

f

ROOT files, directories, and trees are like Python dicts with keys() and values().

In [ ]:

f.keys()

In [ ]:

t = f["events"]
t.keys()

In [ ]:

t['M']

Uproot's main purpose is to read branches from ROOT files as NumPy arrays:

In [ ]:

t['M'].array()

All branches can be looked at with t.arrays(). A subset is specified e.g. as t.arrays(['Run', 'Event']):

In [ ]:

t.arrays()

One can now start performing calculations. But let's avoid explicit loops and rather exploit the ecosystem ...

Create a Pandas DataFrame

In [ ]:

import pandas

df = t.arrays(library="pd")
df

In [ ]:

t.arrays(['Run', 'Event', 'pt1', 'pt2'], library="pd").head()

2. Data with jagged structure¶

In [ ]:

f = uproot.open('data/sample_muons.root')

f.keys()

In [ ]:

branches = f['Events'].arrays()
                               
branches

The jagged structure here comes from the number of muons per event, which is variable:

In [ ]:

branches['nMuon']

This becomes evident when checking for example the $p_T$ of all muons:

In [ ]:

branches['Muon_pt']

Print the $p_T$ for the muons in the first 10 events to trivially see the jagged structure:

In [ ]:

print(' \n'.join([str(elm) for elm in branches['Muon_pt'][:10]]))

We will get back to jagged arrays in a sec. Let's first show that uproot also has (limited) writing functionality.

Manipulate arrays of complex data structures as easily as NumPy

What is `awkward-array`?¶

Awkward Array is a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms.

Example - di-muons in a sample of selected muons¶

As seen above, the number of muons varies on an event-by-event basis:

In [ ]:

branches['nMuon']

Can we see any structure in those events containing a $\mu^+ \mu^-$ pair? Interesting since many particles are known to decay as such, e.g. $J/\psi \to \mu^+ \mu^-$.

Let's first investigate the sample a bit further.

In [ ]:

import awkward as ak

In [ ]:

branches

In [ ]:

ak.type(branches)

1000000 * means that there are a 100k events, "Muon_pt": var * means that the contents of the "Muon_pt" field are jagged: there's a variable number of them per event.

We could look at a few of these as Python lists and dicts.

In [ ]:

ak.to_list(branches[:3])

In [ ]:

branches['nMuon'][0]

Nothing like a visual inspection, though:

In [ ]:

import matplotlib.pyplot as plt

plt.hist(branches['nMuon'], bins=10, range=(0, 10))
plt.xlabel('Number of muons in event')
plt.ylabel('Number of events');

Quickly do the same with Scikit-HEP histogramming tools (raison d'être of the Hist package)?

In [ ]:

from hist import Hist
Hist.new.Reg(10, 0 ,10, name="nMuon", label="Number of muons in event").Double().fill(branches['nMuon'])

The above even tells us something we did not realise from the other plot - there are 110 overflow entries. Indeed

In [ ]:

ak.count(branches['nMuon'][branches['nMuon'] >=10])

How many muon entries are there in total?

In [ ]:

# ak.num gives the number of elements in each nested list
len(branches['Muon_pt']), ak.sum(branches['nMuon']), sum(ak.num(branches['Muon_pt']))  # 235286 muons in 1e5 events

In [ ]:

sum(branches.nMuon)

In [ ]:

ak.type(branches.Muon_pt)

Plot the $p_T$ and $\eta$ of all muons:

In [ ]:

plt.hist(ak.flatten(branches.Muon_pt), bins=100, range=(0, 100))
plt.xlabel('Muon pT')
plt.ylabel('Number of candidates')
plt.yscale('log');

In [ ]:

plt.hist(ak.flatten(branches.Muon_eta), bins=100, range=(-2.5, 2.5))
plt.xlabel('Muon $\eta$')
plt.ylabel('Number of candidates');

In [ ]:

%%timeit
len(ak.flatten(branches.Muon_pt))

In [ ]:

%%timeit
sum(branches.nMuon)

Selections are done via masks. Let's create one that singles out events with a single muon:

In [ ]:

branches['nMuon'] == 1

In [ ]:

single_muon_mask = branches['nMuon'] == 1

print("There are {} single-muon events.".format(sum(single_muon_mask)))

Just checking:

In [ ]:

len(branches['Muon_pt'][single_muon_mask]), branches['Muon_pt'][single_muon_mask]

In [ ]:

plt.hist(ak.flatten(branches['Muon_pt'][single_muon_mask]), bins=100, range=(0, 100))
plt.xlabel('Muon $p_{\mathrm{T}}$ [MeV]')
plt.ylabel('Number of single muons / 1 MeV')
plt.yscale('log')
plt.show()

Mask to select muons within $|\eta| <2$:

In [ ]:

eta_mask = abs(branches['Muon_eta']) < 2
eta_mask

In [ ]:

ak.sum(ak.flatten(eta_mask))

In [ ]:

ak.num(eta_mask)

In [ ]:

ak.sum(eta_mask)

Again, a visual inspection never harms.

In [ ]:

plt.hist(ak.flatten(branches['Muon_eta']), bins=50, range=(-2.5, 2.5))
plt.title('No selection')
plt.xlabel('Muon $\eta$')
plt.ylabel('Number of muons')
plt.show()

plt.hist(ak.flatten(branches['Muon_eta'][eta_mask]), bins=50, range=(-2.5, 2.5))
plt.title('With $|\eta| < 2$ selection')
plt.xlabel('Muon $\eta$')
plt.ylabel('Number of muons')
plt.show()

In [ ]:

len(single_muon_mask & eta_mask)

In [ ]:

plt.hist([ak.flatten(branches['Muon_pt'][single_muon_mask & eta_mask]),
          ak.flatten(branches['Muon_pt'][single_muon_mask & ~eta_mask])],
         label=['$|\eta| < 2$', '$|\eta| \geq 2$'],
         density=True,
         bins=25, range=(0, 50))
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.ylabel('Number of single muons / 2 GeV')
plt.legend()
plt.show()

At last, concentrate on 2-muon events:

In [ ]:

two_muons_mask = branches['nMuon'] == 2
two_muons_sample = branches[two_muons_mask]

In [ ]:

sum(two_muons_mask)

In [ ]:

two_muons_sample['Muon_pt']

In [ ]:

import uproot3_methods

first_muon_p4 = uproot3_methods.TLorentzVectorArray.from_ptetaphim(two_muons_sample['Muon_pt'][:,0],
                                                                   two_muons_sample['Muon_eta'][:,0],
                                                                   two_muons_sample['Muon_phi'][:,0],
                                                                   two_muons_sample['Muon_mass'][:,0]
                                                                 )
first_muon_p4

In [ ]:

second_muon_p4 = uproot3_methods.TLorentzVectorArray.from_ptetaphim(two_muons_sample['Muon_pt'][:,1],
                                                                    two_muons_sample['Muon_eta'][:,1],
                                                                    two_muons_sample['Muon_phi'][:,1],
                                                                    two_muons_sample['Muon_mass'][:,1]
                                                                 )
second_muon_p4

In [ ]:

len(first_muon_p4)

In [ ]:

first_muon_p4.delta_r(second_muon_p4)

In [ ]:

plt.hist(first_muon_p4.delta_r(second_muon_p4), bins=100)
plt.xlabel('$\Delta R$ between muons')
plt.ylabel('Number of two-muon events')
plt.show()

Are we done? No! We have not checked that the 2 muons have opposite charge ... Further refine to $\mu^+\mu^-$ pairs - you see where're getting ;-):

In [ ]:

sum_p4 = first_muon_p4 + second_muon_p4
opposite_sign_muons_mask = two_muons_sample['Muon_charge'][:, 0] != two_muons_sample['Muon_charge'][:, 1]
dimuon_p4 = sum_p4[opposite_sign_muons_mask]
dimuon_p4

In [ ]:

import numpy as np

figsize_l, figsize_h = plt.rcParams["figure.figsize"]
plt.figure(figsize=(figsize_l*2.5, figsize_h*3.))

(yvals, binedges, patches) = plt.hist(dimuon_p4.mass, bins=np.logspace(np.log10(0.1), np.log10(1000), 200))

plt.xlabel('Dimuon invariant mass [GeV]')
plt.ylabel('Number of dimuon candidates')
plt.xscale('log')
plt.yscale('log')

import particle.literals as lpart
from hepunits import GeV
    
list_particles = [getattr(lpart,name) for name in ('eta', 'rho_770_0', 'omega_782','phi_1020','Jpsi_1S', 'Z_0')]

for p in list_particles:
    # Not a very clever way to "distribute" the particle labels but it will do
    binnumber = np.searchsorted(binedges, p.mass/GeV)
    scaling = 1.02
    if p.name == 'rho(770)0':
        x = p.mass/GeV - 0.2
        scaling = 0.9
    elif p.name == 'omega(782)':
        x = p.mass/GeV + 0.3
    else:
        x = p.mass/GeV
    plt.text(x, yvals[binnumber-1]*scaling, '${}$'.format(p.latex_name), horizontalalignment='center', fontsize=15)
        
plt.show()

Important note

These 2 packages provide a very extensive functionality set! Be sure to check their repositories and documentation.

awkward-array: https://github.com/scikit-hep/awkward-1.0.
uproot: https://github.com/scikit-hep/uproot4.

Files, data, event processing

Quick intro to the following packages¶

ROOT I/O in pure Python and NumPy

What is uproot?¶

Why does it exist?¶

Why Python + NumPy?¶

1. Getting data from a simple ROOT TTree¶

2. Data with jagged structure¶

Manipulate arrays of complex data structures as easily as NumPy

What is awkward-array?¶

Example - di-muons in a sample of selected muons¶

Important note

What is `uproot`?¶

What is `awkward-array`?¶