Tai Sakuma
AlphaTwirl is a python library for summarizing event data into multivariate categorical data. qtwirl is one-function interface to AlphaTwirl. In this notebook, I will shows you by example how to summarize event data in flat ROOT trees and create Pandas' data frames with qtwirl.
We start by importing some common modules.
import sys
import numpy as np
import pandas as pd
Import the function qtwirl()
from the package qtwirl
.
from qtwirl import qtwirl
We will run a simple example of using qtwirl. We will read a ROOT file and create a 1-D histogram of a discrete variable.
The file used in this example is stored on github. This file contains one tree, named tree
, with 1,000 events. Let us store the URL to the file to the variable filapath
.
filepath = 'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_01.root'
Now, we run the first example of qtwirl()
.
The function qtwirl()
requires only a few arguments but also accepts many optional arguments. Here, we use three arguments.
data
: the path to the input data file. This can be a list of paths if you have multiple input files.tree_name
: the name of the tree with event data to read in the file.reader_cfg
: a dict or a list of dicts—specifing how to summarize the event data. In this first example, it is simply a dict with one item. This argument can be long and complex.This example counts the number of the events for each value of njets
. In other words, it creates a histogram of njets
.
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=dict(key_name='njets'))
The cell above doesn't print anything (except a progress bar).
The return value results
is a Padans' data frame.
type(results)
expected output: pandas.core.frame.DataFrame
.
Look at results
.
results
expected output:
njets | n | nvar | |
---|---|---|---|
0 | 0 | 20 | 20 |
1 | 1 | 84 | 84 |
2 | 2 | 139 | 139 |
3 | 3 | 198 | 198 |
4 | 4 | 203 | 203 |
5 | 5 | 152 | 152 |
6 | 6 | 100 | 100 |
7 | 7 | 44 | 44 |
8 | 8 | 43 | 43 |
9 | 9 | 10 | 10 |
10 | 10 | 7 | 7 |
The n
is the number of events with each value of njets
. The nvar
is the sum of the square of the weights. In this example, n
and nvar
have the same value in each row because the weight is one. The nvar
is useful when you need statistical uncertainty of weighted histograms.
The result of this simple example is a 1-D histogram. The histogram is created by the split-apply-combine strategy.
key_name
, which can be a variable name or a list of variable names.With split-apply-combine strategy, AlphaTwirl summarizes event data into multivariate categorical data. Histograms are particular cases of multivariate categorical data.
I will show a few more simple examples.
In the previous example, we used a discrete variable, njets
, to specify groups into which we split data. We can also use a continous variable if we bin it, i.e., put each value into one of pre-arranged intervals.
You can specify binnings by any function that takes a value and returns a discrete value. Here, we use one of the binning functors defined in AlphaTwirl, RoundLog
.
from alphatwirl.binning import RoundLog
The bin boundaries in RoundLog
are separated by a equal width in log scale, which is convenint for variables with steeply falling distributions, such as jet pT.
Bin boundaries are fully determined if the log of width and a boundary are specified.
binning = RoundLog(0.2, 100)
The log10 of the bin width is 0.2 and 100 is a boundary. Therefore, the bin boudaries are
$$ [\cdots, 10^{1.8}, 10^{2.0}, 10^{2.2}, 10^{2.4}, \cdots]. $$In RoundLog
, the lower edges are closed and the upper edges are open; the intervales are
The functor RoundLog
uses the lower edge of a bin as the bin lable. So, if you call binning
with a value, it returns the lower edge of the bin that the value belongs. For example,
binning(200)
In my environment, I got 158.48931924611142
. Because of the floating-point arithmetic, you might get a slighly different value.
The value 200 is within the interval $[10^{2.2}, 10^{2.4})$. Therefre, the lower edge of this interval $10^{2.2} \approx 158.49$ is returned.
By default, RoundLog
has no lowest or highest bin. So, binning
puts any positive value in an interal. For example,
binning(0.0005)
I got 0.00039810717055349654
, which is within the interval $[10^{-3.4}, 10^{-3.2})$.
$\log(0)$ is mathematically undefined. binning(0)
returns 0
.
binning(0)
expected output: 0
binning()
returns None
for a negative value.
print(binning(-1))
expected output: None
You can also give a very large number to binning()
.
binning(1000000000000000000000000000000000)
I got 9.999999999998528e+32
, which is approximately $10^{33.0}$, the lower edge of the interval $[10^{33.0}, 10^{33.2})$.
The largest possible value is determined by the system. For exampe, you can try giving a value larger than sys.float_info.max
.
print(binning(np.float128(sys.float_info.max)*1.0001))
I got None
. In the above code, I used np.float128
before multiplying by 1.0001
because sys.float_info.max*1.0001
is already inf
. The binning(float('inf'))
returns None
.
You can specify the lowest and highest bins of RoundLog
.
For example,
binning = RoundLog(0.2, 100, min=50, underflow_bin=0)
The lowest bin, specified by the option min
, is the bin that 50 belongs, which is around 39.8 as you can see by executing binning(50)
.
binning(50)
In my environment, I got 39.810717055349734
.
If you give a value smaller than this value, this function returns 0
, the underflow bin specified above.
binning(30)
expected output: 0
After so much about RoundLog
, we come back to qtwirl.
Now, as an example of summarizing a continous variable into categorical data, we create a 1-D histogram of met
with binnings defined by RoundLog
.
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=dict(key_name='met', key_binning=RoundLog(0.2, 100, min=50, underflow_bin=0)))
results
The results inlcude all non-empty bins and the next bins of all non-empty bins. In the above example, the last bin (met=1000.00) is empty but is included because its previous bin (min=630.96) is not empty. The next bins of all non-empty bins are included so that you can find the upper edges of all non-empty bins. Notice that upper edges of empty bins have no infomratin about data.
In the above evempls, we created one data frame at a time.
You can also create multiple data frames with a single execution of qtwirl. You can give a list of configs to reader_cfg
. Here, we create the two histograms at once.
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=[
dict(key_name='njets'),
dict(key_name='met', key_binning=RoundLog(0.2, 100, min=50, underflow_bin=0))])
The results
are a list of two data frames.
results[0]
results[1]
So far, we have been creating 1-D histograms, 1-D categorical data. You can also create multi-dimensional categorical data by giving a list of variable names to key_name
and a list of binning funcitons to key_binning
. If an input is a discrete variable and unnecessary to bin, you can give None
to the correspoinding element of the list for key_binning
.
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=dict(
key_name=('njets', 'met'),
key_binning=(None, RoundLog(0.2, 100, min=50, underflow_bin=0))))
results
In principle, you can create any dimensions of categorical data.
In the case of two dimensions, the results might be easier to browse if you pivot them.
results.pivot(index='met', columns='njets', values='n').fillna(0)
In the above example, we created histograms of the variables njets
and met
, i.e., we defined the groups into which we split data by the variables njets
and met
. These variables are scalars, i.e., they each have one value per event. Events can also have arrays, e.g., jet_pt
. Here, I will show how to specify indices of arrays.
You can give the array index to key_index
. The next example creates the histogram of the 1st element of jet_pt
.
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=dict(key_name='jet_pt', key_index=0, key_binning=RoundLog(0.2, 100, min=50, underflow_bin=0)))
results
Instead of a specific element of an array, you can use all elments inclusively, by giving the wildcard '*'
to key_index
.
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=dict(key_name='jet_pt', key_index='*', key_binning=RoundLog(0.2, 100, min=50, underflow_bin=0)))
results
If you are giving a list to key_name
, you need also give a list to key_index
. When some variables in key_name
are scalars and don't need indices, you can give None
to the indices for those variables. In the following exmample, we give None
to the index for met
.
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=dict(
key_name=('met', 'jet_pt'),
key_index=(None, 0),
key_binning=(RoundLog(0.25, 100, min=10, underflow_bin=0), RoundLog(0.2, 100, min=50, underflow_bin=0))))
results.pivot(index='met', columns='jet_pt', values='n').fillna(0)
In general, you need to give lists with the same length to key_name
, key_index
, and key_binnning
unless you omit or give None
.
Note: Internally, a scalar variable is treated as an array with one element. So, you get the same results, if you give 0
instead of None
to the index for a scalar variable.
If you use the wildcard '*'
to two variables, all possible pairs of the two variables will be used. We will see an example here.
In the next example, we use another binning functor, Round
, from AlphaTwirl.
from alphatwirl.binning import Round
Round
works in a similar way to RoundLog
. While RoundLog
uses the same bin width in the log scale, Round
uses the same bin width.
We use the wildcard to jet_pt
and jet_eta
.
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=dict(
key_name=('jet_pt', 'jet_eta'),
key_index=('*', '*'),
key_binning=(RoundLog(0.25, 100, min=10, underflow_bin=0), Round(1, 0.5))))
results.pivot(index='jet_pt', columns='jet_eta', values='n').fillna(0)
The results are a 2-D histogram of all possible pairs of jet_pt
and jet_eta
. All possible pairs can be useful for some pairs of variables but not for this particular pair. We will make a more sensible 2-D histogram of pT and eta of inclusive jets in the next example.
It makes more sense to use the same index for both jet_pt
and jet_eta
, which can be accomplisehd by back references as in the following example.
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=dict(
key_name=('jet_pt', 'jet_eta'),
key_index=('(*)', '\\1'),
key_binning=(RoundLog(0.25, 100, min=10), Round(1))))
results.pivot(index='jet_pt', columns='jet_eta', values='n').fillna(0)
The syntax for the back reference is inspired by the regeular expression. The index for jet_eta
in the example is '\\1'
. This instructs AlphaTwirl to use the same index as the first wildcard wihtin the parenthes '(*)'
. In general, '\\n'
means to use the same index as the n-th '(*)'
.
In AlphaTwirl, the conditions of event selections can be specified as a nested combinations of All
and Any
in python dicts.
dict(All=('ev: ev.ht[0] >= 400',
'ev: ev.mht[0] >= 200',
dict(Any=('ev: ev.nJet[0] == 1',
dict(All=('ev: ev.nJet[0] >= 2',
'ev: ev.minChi[0] >= 0.7'))
))))
Events need to satisfy all conditions listed in a value for the key All
. Events need to satisfy at least one of the conditions listed in a value for the key Any
.
Note: Strings like 'ev: ev.ht[0] >= 400'
will be parsed and will be given to python lambda. In the case of 'ev: ev.ht[0] >= 400'
, ev
is the event object in AlphaTwirl; the value of the first element of the attribue ht
needs to be greater than or equal to 400
.
Note: Strings are used instead of lambdas themselves because lambdas are not picklable.
The following example shows how to apply event selections. It uses a simple condition, dict(All=('ev: ev.njets[0] > 4', )
. It creates the same histograms before and after the event selection.
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=[
dict(key_name=('njets', 'met'),
key_binning=(None, RoundLog(0.2, 100, min=50, underflow_bin=0))),
dict(selection_cfg=dict(All=('ev: ev.njets[0] > 4', ))),
dict(key_name=('njets', 'met'),
key_binning=(None, RoundLog(0.2, 100, min=50, underflow_bin=0)))
])
The histogram before the event selection.
results[0].pivot(index='met', columns='njets', values='n').fillna(0)
The histogram after the event selection.
results[1].pivot(index='met', columns='njets', values='n').fillna(0)
An example of adding variables
from scribblers.essentials import FuncOnNumpyArrays
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=[
dict(reader=FuncOnNumpyArrays(
src_arrays=['jet_pt'],
out_name='ht',
func=np.sum)),
dict(key_name='ht',
key_binning=RoundLog(0.2, 100, min=50, underflow_bin=0)),
])
results[0]
filepath = [
'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_01.root',
'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_02.root',
'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_04.root',
]
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=dict(key_name='njets'))
results
filepath = [
'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_01.root',
'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_02.root',
'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_03_zombie.root',
'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_04.root',
]
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=dict(key_name='njets'))
results
filepath = [
'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_01.root',
'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_02.root',
'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_03_zombie.root',
'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_04.root',
]
results = qtwirl(
filepath, tree_name='tree',
reader_cfg=dict(key_name='njets'),
skip_error_files=False)
results