Efficient Python analysis with dynamic C++ and just-in-time compilation

PyROOT aims to combine the convenience of the Python language with the efficiency of C++ implementations. Its dynamic C++ bindings, powered by the C++ interpreter cling, make efficient C++ implementations conveniently usable from Python.

In [1]:

import ROOT
import numpy as np
np.random.seed(1234)

Welcome to JupyROOT 6.22/00

Just-in-time compilation of C++ functions

Just-in-time compilation (jitting) lets you use the power of the C++ compiler to optimize computation-heavy operations. Just as numpy implements its actual functionality under the hood in C(++), PyROOT allows you to do the same dynamically.

In [2]:

ROOT.gInterpreter.Declare('''
float largest_sum(float* v1, float* v2, std::size_t size){
    float r = -999.f;
    for (size_t i1 = 0; i1 < size; i1++) {
        for (size_t i2 = 0; i2 < size; i2++) {
            const auto tmp = v1[i1] + v2[i2];
            if (tmp > r) r = tmp;
        }
    }
    return r;
}
''');
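Because the nested loop visits every pair of elements, the value it returns is the sum of the two individual maxima, as long as that sum exceeds the starting value of -999. The following pure-Python sketch illustrates this with made-up input lists. The names `brute_force_largest_sum`, `a`, and `b` are illustrative and not part of the notebook, and the sketch is meant as a cross-check of what the kernel computes rather than a replacement for it.

```python
import random

def brute_force_largest_sum(x1, x2):
    # Same nested loop as the C++ kernel, but starting from -inf
    # instead of the fixed value -999 (an assumption of this sketch)
    r = float("-inf")
    for e1 in x1:
        for e2 in x2:
            tmp = e1 + e2
            if tmp > r:
                r = tmp
    return r

random.seed(1234)
a = [random.gauss(0.0, 1.0) for _ in range(100)]
b = [random.gauss(0.0, 1.0) for _ in range(100)]

# The largest pairwise sum is reached by pairing the two largest elements
assert brute_force_largest_sum(a, b) == max(a) + max(b)
```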

As example inputs, we generate two numpy arrays with random numbers.

In [3]:

size = 100
v1 = np.random.randn(size).astype(np.float32)
v2 = np.random.randn(size).astype(np.float32)

Next, we benchmark the runtime:

In [4]:

%%timeit
ROOT.largest_sum(v1, v2, size)

25.5 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

How does the C++ kernel compare to a pure Python implementation?

In [5]:

def largest_sum(x1, x2):
    r = -999.0
    for e1 in x1:
        for e2 in x2:
            tmp = e1 + e2
            if tmp > r: r = tmp
    return r

The Python implementation is roughly a factor of 90 slower (2.25 ms versus 25.5 µs per call)!

In [6]:

%%timeit
largest_sum(v1, v2)

2.25 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
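The `%%timeit` magic is only available in IPython and Jupyter. In a plain Python script, a comparable measurement can be sketched with the standard library's `timeit` module. The snippet below is a minimal sketch that uses plain Python lists instead of the numpy arrays above; the names `x1`, `x2`, `number`, and `best_per_call` and the repeat counts are arbitrary choices, and the absolute timings will differ from the numbers quoted in this notebook.

```python
import random
import timeit

def largest_sum(x1, x2):
    # Pure-Python kernel, identical in logic to the notebook cell above
    r = -999.0
    for e1 in x1:
        for e2 in x2:
            tmp = e1 + e2
            if tmp > r:
                r = tmp
    return r

random.seed(1234)
x1 = [random.gauss(0.0, 1.0) for _ in range(100)]
x2 = [random.gauss(0.0, 1.0) for _ in range(100)]

# Run the call `number` times per measurement and repeat the measurement 5 times
number = 20
times = timeit.repeat(lambda: largest_sum(x1, x2), number=number, repeat=5)

# The minimum is the least disturbed by other activity on the machine
best_per_call = min(times) / number
print(f"best of 5: {best_per_call * 1e6:.1f} µs per call")
```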

Loading of precompiled functions

Further performance improvements can be expected from precompiling the functionality with compiler optimizations and interfacing the resulting functions via PyROOT.

In [7]:

%%bash
echo 'Header:'
cat analysis.hxx
echo
echo 'Source:'
cat analysis.cxx
g++ -Ofast -shared -o libanalysis.so analysis.cxx

Header:
#include <cstddef>
float optimized_largest_sum(float* v1, float* v2, std::size_t size);

Source:
# include "analysis.hxx"

float optimized_largest_sum(float* v1, float* v2, std::size_t size){
    float r = -999.f;
    for (size_t i1 = 0; i1 < size; i1++) {
        for (size_t i2 = 0; i2 < size; i2++) {
            const auto tmp = v1[i1] + v2[i2];
            if (tmp > r) r = tmp;
        }
    }
    return r;
}

You can interactively include the header and load the functionality from the shared library.

In [8]:

ROOT.gInterpreter.Declare('#include "analysis.hxx"')
ROOT.gSystem.Load('libanalysis.so');
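The trailing semicolon in the cell above suppresses the return value of `gSystem.Load`. Assuming the convention documented for `TSystem::Load` (0 on success, 1 if the library was already loaded, negative values on failure), a small wrapper can turn a failed load into an exception. The helper `load_or_raise` and the stand-in class `FakeSystem` are hypothetical names introduced here so that the sketch runs without a ROOT installation.

```python
def load_or_raise(gsystem, library):
    """Load a shared library via gsystem.Load and raise on a negative status."""
    status = gsystem.Load(library)
    if status < 0:
        raise RuntimeError(f"loading {library!r} failed with status {status}")
    return status

class FakeSystem:
    """Stand-in for ROOT.gSystem, used only to demonstrate the wrapper."""
    def __init__(self, status):
        self.status = status

    def Load(self, library):
        return self.status

# Simulated successful load
print(load_or_raise(FakeSystem(0), "libanalysis.so"))

# Simulated failure, for example a library that cannot be found
try:
    load_or_raise(FakeSystem(-1), "libmissing.so")
except RuntimeError as err:
    print(err)
```

With ROOT available, the call would read `load_or_raise(ROOT.gSystem, 'libanalysis.so')`.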

Compiling with optimizations (here -Ofast) improves the runtime again, by about a factor of 5!

In [9]:

%%timeit
ROOT.optimized_largest_sum(v1, v2, size)

4.83 µs ± 468 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Finally, we can show that all implementations come to the same result:

In [10]:

print('PyROOT:', ROOT.largest_sum(v1, v2, size))
print('Native Python:', largest_sum(v1, v2))
print('PyROOT (optimized):', ROOT.optimized_largest_sum(v1, v2, size))

PyROOT: 4.7567291259765625
Native Python: 4.756729
PyROOT (optimized): 4.7567291259765625
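The printed values differ only in the number of digits shown: the native Python loop accumulates numpy float32 values, which print in a shortened single-precision form, while the values returned through PyROOT presumably arrive as Python floats and print with double-precision digits. A comparison in code should therefore use a tolerance rather than exact or string equality. The sketch below uses the values printed above as literals so that it runs without ROOT; the dictionary `results` and the relative tolerance of 1e-6 are assumptions chosen to match single-precision accuracy.

```python
import math

# Values copied from the printed output above
results = {
    "PyROOT": 4.7567291259765625,
    "Native Python": 4.756729,
    "PyROOT (optimized)": 4.7567291259765625,
}

reference = results["PyROOT"]
for name, value in results.items():
    # Relative tolerance suited to single-precision (float32) results
    assert math.isclose(value, reference, rel_tol=1e-6), name

print("all implementations agree within the tolerance")
```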