uproot
- ROOT I/O library in pure Python and NumPy.awkward-array
- manipulate arrays of complex data structures as easily as NumPy.
uproot
?¶Effectively what connects HEP data (ROOT format) with the Python scientific ecosystem around NumPy!
uproot
provides very fast, efficient, and convenient access to ROOT trees.
event-handling infrastructure of the ROOT framework. 3. To express the semantics and conventions of the ROOT file format independently of ROOT, in lieu of a formal specification.
decompression and array manipulations are done by compiled code, see the now-old-ish performance study below.
By "simple" we mean a file without jagged structures, or nested structures with branch sizes depending on an event-by-event basis.
import uproot
f = uproot.open('data/sample_simple-example.root')
f
ROOT files, directories, and trees are like Python dicts with keys() and values().
f.keys()
t = f["events"]
t.keys()
t['M']
Uproot's main purpose is to read branches from ROOT files as NumPy arrays:
t['M'].array()
All branches can be looked at with t.arrays()
. A subset is specified e.g. as t.arrays(['Run', 'Event'])
:
t.arrays()
One can now start performing calculations. But let's avoid explicit loops and rather exploit the ecosystem ...
Create a Pandas DataFrame
import pandas
df = t.arrays(library="pd")
df
t.arrays(['Run', 'Event', 'pt1', 'pt2'], library="pd").head()
f = uproot.open('data/sample_muons.root')
f.keys()
branches = f['Events'].arrays()
branches
The jagged structure here comes from the number of muons per event, which is variable:
branches['nMuon']
This becomes evident when checking for example the $p_T$ of all muons:
branches['Muon_pt']
Print the $p_T$ for the muons in the first 10 events to trivially see the jagged structure:
print(' \n'.join([str(elm) for elm in branches['Muon_pt'][:10]]))
We will get back to jagged arrays in a sec. Let's first show that uproot
also has (limited) writing functionality.
awkward-array
?¶Awkward Array is a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms.
As seen above, the number of muons varies on an event-by-event basis:
branches['nMuon']
Can we see any structure in those events containing a $\mu^+ \mu^-$ pair? Interesting since many particles are known to decay as such, e.g. $J/\psi \to \mu^+ \mu^-$.
Let's first investigate the sample a bit further.
import awkward as ak
branches
ak.type(branches)
1000000 *
means that there are a 100k events, "Muon_pt": var *
means that the contents of the "Muon_pt"
field are jagged: there's a variable number of them per event.
We could look at a few of these as Python lists and dicts.
ak.to_list(branches[:3])
branches['nMuon'][0]
Nothing like a visual inspection, though:
import matplotlib.pyplot as plt
plt.hist(branches['nMuon'], bins=10, range=(0, 10))
plt.xlabel('Number of muons in event')
plt.ylabel('Number of events');
Quickly do the same with Scikit-HEP histogramming tools (raison d'être of the Hist
package)?
from hist import Hist
Hist.new.Reg(10, 0 ,10, name="nMuon", label="Number of muons in event").Double().fill(branches['nMuon'])
The above even tells us something we did not realise from the other plot - there are 110 overflow entries. Indeed
ak.count(branches['nMuon'][branches['nMuon'] >=10])
How many muon entries are there in total?
# ak.num gives the number of elements in each nested list
len(branches['Muon_pt']), ak.sum(branches['nMuon']), sum(ak.num(branches['Muon_pt'])) # 235286 muons in 1e5 events
sum(branches.nMuon)
ak.type(branches.Muon_pt)
Plot the $p_T$ and $\eta$ of all muons:
plt.hist(ak.flatten(branches.Muon_pt), bins=100, range=(0, 100))
plt.xlabel('Muon pT')
plt.ylabel('Number of candidates')
plt.yscale('log');
plt.hist(ak.flatten(branches.Muon_eta), bins=100, range=(-2.5, 2.5))
plt.xlabel('Muon $\eta$')
plt.ylabel('Number of candidates');
%%timeit
len(ak.flatten(branches.Muon_pt))
%%timeit
sum(branches.nMuon)
Selections are done via masks. Let's create one that singles out events with a single muon:
branches['nMuon'] == 1
single_muon_mask = branches['nMuon'] == 1
print("There are {} single-muon events.".format(sum(single_muon_mask)))
Just checking:
len(branches['Muon_pt'][single_muon_mask]), branches['Muon_pt'][single_muon_mask]
plt.hist(ak.flatten(branches['Muon_pt'][single_muon_mask]), bins=100, range=(0, 100))
plt.xlabel('Muon $p_{\mathrm{T}}$ [MeV]')
plt.ylabel('Number of single muons / 1 MeV')
plt.yscale('log')
plt.show()
Mask to select muons within $|\eta| <2$:
eta_mask = abs(branches['Muon_eta']) < 2
eta_mask
ak.sum(ak.flatten(eta_mask))
ak.num(eta_mask)
ak.sum(eta_mask)
Again, a visual inspection never harms.
plt.hist(ak.flatten(branches['Muon_eta']), bins=50, range=(-2.5, 2.5))
plt.title('No selection')
plt.xlabel('Muon $\eta$')
plt.ylabel('Number of muons')
plt.show()
plt.hist(ak.flatten(branches['Muon_eta'][eta_mask]), bins=50, range=(-2.5, 2.5))
plt.title('With $|\eta| < 2$ selection')
plt.xlabel('Muon $\eta$')
plt.ylabel('Number of muons')
plt.show()
len(single_muon_mask & eta_mask)
plt.hist([ak.flatten(branches['Muon_pt'][single_muon_mask & eta_mask]),
ak.flatten(branches['Muon_pt'][single_muon_mask & ~eta_mask])],
label=['$|\eta| < 2$', '$|\eta| \geq 2$'],
density=True,
bins=25, range=(0, 50))
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.ylabel('Number of single muons / 2 GeV')
plt.legend()
plt.show()
At last, concentrate on 2-muon events:
two_muons_mask = branches['nMuon'] == 2
two_muons_sample = branches[two_muons_mask]
sum(two_muons_mask)
two_muons_sample['Muon_pt']
import uproot3_methods
first_muon_p4 = uproot3_methods.TLorentzVectorArray.from_ptetaphim(two_muons_sample['Muon_pt'][:,0],
two_muons_sample['Muon_eta'][:,0],
two_muons_sample['Muon_phi'][:,0],
two_muons_sample['Muon_mass'][:,0]
)
first_muon_p4
second_muon_p4 = uproot3_methods.TLorentzVectorArray.from_ptetaphim(two_muons_sample['Muon_pt'][:,1],
two_muons_sample['Muon_eta'][:,1],
two_muons_sample['Muon_phi'][:,1],
two_muons_sample['Muon_mass'][:,1]
)
second_muon_p4
len(first_muon_p4)
first_muon_p4.delta_r(second_muon_p4)
plt.hist(first_muon_p4.delta_r(second_muon_p4), bins=100)
plt.xlabel('$\Delta R$ between muons')
plt.ylabel('Number of two-muon events')
plt.show()
Are we done? No! We have not checked that the 2 muons have opposite charge ... Further refine to $\mu^+\mu^-$ pairs - you see where're getting ;-):
sum_p4 = first_muon_p4 + second_muon_p4
opposite_sign_muons_mask = two_muons_sample['Muon_charge'][:, 0] != two_muons_sample['Muon_charge'][:, 1]
dimuon_p4 = sum_p4[opposite_sign_muons_mask]
dimuon_p4
import numpy as np
figsize_l, figsize_h = plt.rcParams["figure.figsize"]
plt.figure(figsize=(figsize_l*2.5, figsize_h*3.))
(yvals, binedges, patches) = plt.hist(dimuon_p4.mass, bins=np.logspace(np.log10(0.1), np.log10(1000), 200))
plt.xlabel('Dimuon invariant mass [GeV]')
plt.ylabel('Number of dimuon candidates')
plt.xscale('log')
plt.yscale('log')
import particle.literals as lpart
from hepunits import GeV
list_particles = [getattr(lpart,name) for name in ('eta', 'rho_770_0', 'omega_782','phi_1020','Jpsi_1S', 'Z_0')]
for p in list_particles:
# Not a very clever way to "distribute" the particle labels but it will do
binnumber = np.searchsorted(binedges, p.mass/GeV)
scaling = 1.02
if p.name == 'rho(770)0':
x = p.mass/GeV - 0.2
scaling = 0.9
elif p.name == 'omega(782)':
x = p.mass/GeV + 0.3
else:
x = p.mass/GeV
plt.text(x, yvals[binnumber-1]*scaling, '${}$'.format(p.latex_name), horizontalalignment='center', fontsize=15)
plt.show()
These 2 packages provide a very extensive functionality set! Be sure to check their repositories and documentation.
awkward-array
: https://github.com/scikit-hep/awkward-1.0.uproot
: https://github.com/scikit-hep/uproot4.