Quick start of AlphaTwirl

Tai Sakuma

AlphaTwirl is a python library for summarizing event data into multivariate categorical data. qtwirl is one-function interface to AlphaTwirl. In this notebook, I will shows you by example how to summarize event data in flat ROOT trees and create Pandas' data frames with qtwirl.


Table of contents


➼ Preparation

➻ Import common modules

We start by importing some common modules.

In [ ]:
import sys
import numpy as np
import pandas as pd

➻ Import the function qtwirl

Import the function qtwirl() from the package qtwirl.

In [ ]:
from qtwirl import qtwirl

➼ The first example

➻ 1-D histogram of discrete variable

We will run a simple example of using qtwirl. We will read a ROOT file and create a 1-D histogram of a discrete variable.

The file used in this example is stored on github. This file contains one tree, named tree, with 1,000 events. Let us store the URL to the file to the variable filapath.

In [ ]:
filepath = 'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_01.root'

Now, we run the first example of qtwirl().

The function qtwirl() requires only a few arguments but also accepts many optional arguments. Here, we use three arguments.

  • data: the path to the input data file. This can be a list of paths if you have multiple input files.
  • tree_name: the name of the tree with event data to read in the file.
  • reader_cfg: a dict or a list of dicts—specifing how to summarize the event data. In this first example, it is simply a dict with one item. This argument can be long and complex.

This example counts the number of the events for each value of njets. In other words, it creates a histogram of njets.

In [ ]:
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=dict(key_name='njets'))

The cell above doesn't print anything (except a progress bar).

The return value results is a Padans' data frame.

In [ ]:
type(results)

expected output: pandas.core.frame.DataFrame.

Look at results.

In [ ]:
results

expected output:

njets n nvar
0 0 20 20
1 1 84 84
2 2 139 139
3 3 198 198
4 4 203 203
5 5 152 152
6 6 100 100
7 7 44 44
8 8 43 43
9 9 10 10
10 10 7 7

The n is the number of events with each value of njets. The nvar is the sum of the square of the weights. In this example, n and nvar have the same value in each row because the weight is one. The nvar is useful when you need statistical uncertainty of weighted histograms.

Histograms in terms of the split-apply-combine strategy

The result of this simple example is a 1-D histogram. The histogram is created by the split-apply-combine strategy.

  • split: split the data into groups in terms of the variables specified by key_name, which can be a variable name or a list of variable names.
  • apply: apply a function to summarize the data in each group. The default function in AlphaTwirl is to count the number of the entries in the group. Thus, by default, the summary of the data in each group is the number of the entries in the group.
  • combine: combine the summaries of data from all groups. By default, qtwirl combines summaries into Pandas' data frames.

With split-apply-combine strategy, AlphaTwirl summarizes event data into multivariate categorical data. Histograms are particular cases of multivariate categorical data.


➼ More simple examples

I will show a few more simple examples.

➻ continous variable and binnings

In the previous example, we used a discrete variable, njets, to specify groups into which we split data. We can also use a continous variable if we bin it, i.e., put each value into one of pre-arranged intervals.

Functor RoundLog

You can specify binnings by any function that takes a value and returns a discrete value. Here, we use one of the binning functors defined in AlphaTwirl, RoundLog.

In [ ]:
from alphatwirl.binning import RoundLog

The bin boundaries in RoundLog are separated by a equal width in log scale, which is convenint for variables with steeply falling distributions, such as jet pT.

Bin boundaries are fully determined if the log of width and a boundary are specified.

In [ ]:
binning = RoundLog(0.2, 100)

The log10 of the bin width is 0.2 and 100 is a boundary. Therefore, the bin boudaries are

$$ [\cdots, 10^{1.8}, 10^{2.0}, 10^{2.2}, 10^{2.4}, \cdots]. $$

In RoundLog, the lower edges are closed and the upper edges are open; the intervales are

$$ \cdots, [10^{1.8}, 10^{2.0}), [10^{2.0}, 10^{2.2}), [10^{2.2}, 10^{2.4}), \cdots. $$

The functor RoundLog uses the lower edge of a bin as the bin lable. So, if you call binning with a value, it returns the lower edge of the bin that the value belongs. For example,

In [ ]:
binning(200)

In my environment, I got 158.48931924611142. Because of the floating-point arithmetic, you might get a slighly different value.

The value 200 is within the interval $[10^{2.2}, 10^{2.4})$. Therefre, the lower edge of this interval $10^{2.2} \approx 158.49$ is returned.

By default, RoundLog has no lowest or highest bin. So, binning puts any positive value in an interal. For example,

In [ ]:
binning(0.0005)

I got 0.00039810717055349654, which is within the interval $[10^{-3.4}, 10^{-3.2})$.

$\log(0)$ is mathematically undefined. binning(0) returns 0.

In [ ]:
binning(0)

expected output: 0

binning() returns None for a negative value.

In [ ]:
print(binning(-1))

expected output: None

You can also give a very large number to binning().

In [ ]:
binning(1000000000000000000000000000000000)

I got 9.999999999998528e+32, which is approximately $10^{33.0}$, the lower edge of the interval $[10^{33.0}, 10^{33.2})$.

The largest possible value is determined by the system. For exampe, you can try giving a value larger than sys.float_info.max.

In [ ]:
print(binning(np.float128(sys.float_info.max)*1.0001))

I got None. In the above code, I used np.float128 before multiplying by 1.0001 because sys.float_info.max*1.0001 is already inf. The binning(float('inf')) returns None.

You can specify the lowest and highest bins of RoundLog.

For example,

In [ ]:
binning = RoundLog(0.2, 100, min=50, underflow_bin=0)

The lowest bin, specified by the option min, is the bin that 50 belongs, which is around 39.8 as you can see by executing binning(50).

In [ ]:
binning(50)

In my environment, I got 39.810717055349734.

If you give a value smaller than this value, this function returns 0, the underflow bin specified above.

In [ ]:
binning(30)

expected output: 0

back to qtwirl

After so much about RoundLog, we come back to qtwirl.

Now, as an example of summarizing a continous variable into categorical data, we create a 1-D histogram of met with binnings defined by RoundLog.

In [ ]:
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=dict(key_name='met', key_binning=RoundLog(0.2, 100, min=50, underflow_bin=0)))
results

The results inlcude all non-empty bins and the next bins of all non-empty bins. In the above example, the last bin (met=1000.00) is empty but is included because its previous bin (min=630.96) is not empty. The next bins of all non-empty bins are included so that you can find the upper edges of all non-empty bins. Notice that upper edges of empty bins have no infomratin about data.

➻ Multiple data frames

In the above evempls, we created one data frame at a time.

You can also create multiple data frames with a single execution of qtwirl. You can give a list of configs to reader_cfg. Here, we create the two histograms at once.

In [ ]:
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=[
        dict(key_name='njets'),
        dict(key_name='met', key_binning=RoundLog(0.2, 100, min=50, underflow_bin=0))])

The results are a list of two data frames.

In [ ]:
results[0]
In [ ]:
results[1]

➻ Multi-dimensional categorical data

So far, we have been creating 1-D histograms, 1-D categorical data. You can also create multi-dimensional categorical data by giving a list of variable names to key_name and a list of binning funcitons to key_binning. If an input is a discrete variable and unnecessary to bin, you can give None to the correspoinding element of the list for key_binning.

In [ ]:
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=dict(
        key_name=('njets', 'met'),
        key_binning=(None, RoundLog(0.2, 100, min=50, underflow_bin=0))))
results

In principle, you can create any dimensions of categorical data.

In the case of two dimensions, the results might be easier to browse if you pivot them.

In [ ]:
results.pivot(index='met', columns='njets', values='n').fillna(0)

➼ Examples of reading array and indices

In the above example, we created histograms of the variables njets and met, i.e., we defined the groups into which we split data by the variables njets and met. These variables are scalars, i.e., they each have one value per event. Events can also have arrays, e.g., jet_pt. Here, I will show how to specify indices of arrays.

➻ Specific element

You can give the array index to key_index. The next example creates the histogram of the 1st element of jet_pt.

In [ ]:
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=dict(key_name='jet_pt', key_index=0, key_binning=RoundLog(0.2, 100, min=50, underflow_bin=0)))
results

➻ All elements—wildcard

Instead of a specific element of an array, you can use all elments inclusively, by giving the wildcard '*' to key_index.

In [ ]:
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=dict(key_name='jet_pt', key_index='*', key_binning=RoundLog(0.2, 100, min=50, underflow_bin=0)))
results

➻ Multiple dimensions

If you are giving a list to key_name, you need also give a list to key_index. When some variables in key_name are scalars and don't need indices, you can give None to the indices for those variables. In the following exmample, we give None to the index for met.

In [ ]:
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=dict(
        key_name=('met', 'jet_pt'),
        key_index=(None, 0),
        key_binning=(RoundLog(0.25, 100, min=10, underflow_bin=0), RoundLog(0.2, 100, min=50, underflow_bin=0))))
results.pivot(index='met', columns='jet_pt', values='n').fillna(0)

In general, you need to give lists with the same length to key_name, key_index, and key_binnning unless you omit or give None.

Note: Internally, a scalar variable is treated as an array with one element. So, you get the same results, if you give 0 instead of None to the index for a scalar variable.

➻ All combinations

If you use the wildcard '*' to two variables, all possible pairs of the two variables will be used. We will see an example here.

In the next example, we use another binning functor, Round, from AlphaTwirl.

In [ ]:
from alphatwirl.binning import Round

Round works in a similar way to RoundLog. While RoundLog uses the same bin width in the log scale, Round uses the same bin width.

We use the wildcard to jet_pt and jet_eta.

In [ ]:
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=dict(
        key_name=('jet_pt', 'jet_eta'),
        key_index=('*', '*'),
        key_binning=(RoundLog(0.25, 100, min=10, underflow_bin=0), Round(1, 0.5))))
results.pivot(index='jet_pt', columns='jet_eta', values='n').fillna(0)

The results are a 2-D histogram of all possible pairs of jet_pt and jet_eta. All possible pairs can be useful for some pairs of variables but not for this particular pair. We will make a more sensible 2-D histogram of pT and eta of inclusive jets in the next example.

➻ Back references

It makes more sense to use the same index for both jet_pt and jet_eta, which can be accomplisehd by back references as in the following example.

In [ ]:
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=dict(
        key_name=('jet_pt', 'jet_eta'),
        key_index=('(*)', '\\1'),
        key_binning=(RoundLog(0.25, 100, min=10), Round(1))))
results.pivot(index='jet_pt', columns='jet_eta', values='n').fillna(0)

The syntax for the back reference is inspired by the regeular expression. The index for jet_eta in the example is '\\1'. This instructs AlphaTwirl to use the same index as the first wildcard wihtin the parenthes '(*)'. In general, '\\n' means to use the same index as the n-th '(*)'.

➼ Event selections

In AlphaTwirl, the conditions of event selections can be specified as a nested combinations of All and Any in python dicts.

dict(All=('ev: ev.ht[0] >= 400',
                  'ev: ev.mht[0] >= 200',
                  dict(Any=('ev: ev.nJet[0] == 1',
                       dict(All=('ev: ev.nJet[0] >= 2',
                                 'ev: ev.minChi[0] >= 0.7'))
        ))))

Events need to satisfy all conditions listed in a value for the key All. Events need to satisfy at least one of the conditions listed in a value for the key Any.

Note: Strings like 'ev: ev.ht[0] >= 400' will be parsed and will be given to python lambda. In the case of 'ev: ev.ht[0] >= 400', ev is the event object in AlphaTwirl; the value of the first element of the attribue ht needs to be greater than or equal to 400.

Note: Strings are used instead of lambdas themselves because lambdas are not picklable.

The following example shows how to apply event selections. It uses a simple condition, dict(All=('ev: ev.njets[0] > 4', ). It creates the same histograms before and after the event selection.

In [ ]:
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=[
        dict(key_name=('njets', 'met'),
             key_binning=(None, RoundLog(0.2, 100, min=50, underflow_bin=0))),
        dict(selection_cfg=dict(All=('ev: ev.njets[0] > 4', ))),
        dict(key_name=('njets', 'met'),
             key_binning=(None, RoundLog(0.2, 100, min=50, underflow_bin=0)))
    ])

The histogram before the event selection.

In [ ]:
results[0].pivot(index='met', columns='njets', values='n').fillna(0)

The histogram after the event selection.

In [ ]:
results[1].pivot(index='met', columns='njets', values='n').fillna(0)

➼ Adding variables—scribblers

An example of adding variables

In [ ]:
from scribblers.essentials import FuncOnNumpyArrays
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=[
        dict(reader=FuncOnNumpyArrays(
            src_arrays=['jet_pt'],
            out_name='ht',
            func=np.sum)),
        dict(key_name='ht',
             key_binning=RoundLog(0.2, 100, min=50, underflow_bin=0)),
    ])
results[0]

➼ multiple input files

➻ First example

In [ ]:
filepath = [
    'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_01.root',
    'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_02.root',
    'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_04.root',
]
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=dict(key_name='njets'))
results

➻ Skip zombie files

In [ ]:
filepath = [
    'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_01.root',
    'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_02.root',
    'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_03_zombie.root',
    'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_04.root',
]
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=dict(key_name='njets'))
results

➻ Don't skip zombie files

In [ ]:
filepath = [
    'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_01.root',
    'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_02.root',
    'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_03_zombie.root',
    'https://github.com/alphatwirl/qtwirl/raw/v0.06.0/tests/data/root/sample_chain_04.root',
]
results = qtwirl(
    filepath, tree_name='tree',
    reader_cfg=dict(key_name='njets'),
    skip_error_files=False)
results

➼ Concurrency

coming..

➻ Multiprocessing

coming..

➻ HTCondor

coming..

In [ ]: