Starting from version 0.19, heyoka.py can compile just-in-time (JIT) multivariate vector functions defined via the [expression system](<./The expression system.ipynb>). On the one hand, JIT compilation can greatly increase the performance of function evaluation with respect to a Python implementation of the same function. On the other hand, the JIT compilation process is computationally expensive and thus JIT compilation is most useful when a function needs to be evaluated repeatedly with different input values (so that the initial overhead of JIT compilation can be absorbed by the evaluation performance increase).
As an initial example, we will JIT compile the simple function
$$ f\left(x, y \right) = x^2 - y^2. $$
Let us begin, as usual, with the introduction of the symbolic variables:
import heyoka as hy
x,y = hy.make_vars("x", "y")
Next, we define the function to be compiled:
sym_func = x**2-y**2
We can now proceed to JIT compile sym_func via the make_cfunc() function. make_cfunc() takes as a mandatory input argument the list of symbolic expressions representing the outputs of a vector function. In this case, we only have a single output, sym_func:
cf = hy.make_cfunc([sym_func])
The value returned by make_cfunc() is a callable function object which accepts as input a NumPy array representing the values to use in the evaluation of sym_func:
# Evaluate for x=1 and y=5.
cf([1,5])
array([-24.])
The value returned by cf is also a NumPy array containing the outputs of the compiled function. In this specific case, we have a single output and thus an array with a single element is returned.
Because when we created cf we passed only the list of output expressions to make_cfunc(), heyoka.py used lexicographic ordering to infer the order of the input variables during evaluation: $x$ is the first variable, $y$ the second. If you do not want to rely on the default lexicographic ordering, you can pass to make_cfunc() an explicit ordering for the input variables:
# Explicitly specify the order of the input variables:
# first y, then x.
cf2 = hy.make_cfunc([sym_func], vars=[y,x])
# Evaluate for x=1 and y=5.
cf2([5,1])
array([-24.])
make_cfunc() accepts additional keyword arguments, the most important of which is the boolean flag compact_mode (defaulting to False). Similarly to the [adaptive Taylor integrators](<./Customising the adaptive integrator.ipynb>), you should enable compact_mode if you want to compile extremely large symbolic expressions that would otherwise result in excessively long compilation times. The downside of compact_mode is a slight performance degradation due to the different code generation model adopted during the JIT compilation process.
The function object returned by make_cfunc() also accepts several optional keyword arguments. It is possible, for instance, to pass as the outputs argument a pre-allocated NumPy array into which the result of the evaluation will be written. This is useful to avoid the overhead of allocating new memory for the return value, if such memory is already available:
import numpy as np
# Pre-allocate a NumPy array to store the
# result of the evaluation.
ret_arr = np.zeros((1,))
# Evaluate, specifying that the result
# will be written into ret_arr.
cf([1,5], outputs=ret_arr)
ret_arr
array([-24.])
If the compiled function references external parameters, the parameter values will have to be supplied during evaluation via the pars keyword argument:
# A function with 3 parameters.
sym_func_par = hy.par[0]*x**2-hy.par[1]*y**2+hy.par[2]
# Compile it.
cf_par = hy.make_cfunc([sym_func_par])
# Evaluate, specifying the parameter values.
cf_par([1,5], pars=[-1, -2, -3])
array([46.])
As seen in the tutorial on [non-autonomous ODEs](<./Non-autonomous systems.ipynb>), heyoka.py uses a special placeholder variable called heyoka.time to represent time (i.e., the independent variable) in ODEs.
Outside the context of ODE solving, there is usually no compelling reason to use the special heyoka.time variable instead of a regular variable. However, it can sometimes be useful to numerically evaluate the right-hand side of an ODE system (e.g., for validation/sanity checking). Thus, starting with heyoka.py 0.21, expressions containing heyoka.time can also be compiled.
In a time-dependent compiled function, it is mandatory to supply the time value via the time keyword argument:
# Create a time-dependent function.
sym_func_tm = x**2-y**2+hy.time
# Compile it.
cf_tm = hy.make_cfunc([sym_func_tm])
# Evaluate for x=1, y=5 and time=6.
cf_tm([1,5], time=6)
array([-18.])
An important feature of compiled functions is the ability to be evaluated over batches of input variables in a single evaluation. Let us see a simple example:
# Evaluate cf for x=[1,2,3] and y=[5,6,7].
cf([[1,2,3],[5,6,7]])
array([[-24., -32., -40.]])
Because we now passed the two-dimensional array $[[1,2,3],[5,6,7]]$ as input argument (rather than a one-dimensional array, as in the previous examples), cf is evaluated for multiple values of $x$ ($\left[1,2,3\right]$) and $y$ ($\left[5,6,7\right]$). The result is also a two-dimensional array in which the first dimension is the number of outputs (1 in this case) and the second dimension is the number of evaluation points (3 in this case).
Because heyoka.py makes extensive use of the SIMD instructions available in modern processors, a single batched evaluation will perform considerably better than multiple unbatched evaluations.
Batched evaluations of functions containing parameters and/or time are also supported:
# Batched evaluation with parameters.
cf_par([[1,2,3],[5,6,7]], pars=[[-1,-1.1,-1.2],[-2,-2.1,-2.2],[-3,-3.1,-3.2]])
array([[46. , 68.1, 93.8]])
# Batched evaluation with time.
cf_tm([[1,2,3],[5,6,7]], time=[6,6.1,6.2])
array([[-18. , -25.9, -33.8]])
In order to assess the performance of heyoka.py's compiled functions, we will consider the evaluation of the Hamiltonian of the [restricted three-body problem](<./The restricted three-body problem.ipynb>), a 6-dimensional scalar function defined as:
$$ \mathcal{H}\left(p_x,p_y,p_z,x,y,z\right) = \frac{1}{2}\left( p_x^2+p_y^2+p_z^2 \right) +yp_x-xp_y-\frac{1-\mu}{r_{PS}}-\frac{\mu}{r_{PJ}}, $$
where
$$ \begin{align} \mu &= 0.1, \\ r_{PS} &=\sqrt{\left( x-\mu \right)^2+y^2+z^2}, \\ r_{PJ} &=\sqrt{\left( x -\mu + 1 \right)^2+y^2+z^2}. \end{align} $$
$\mathcal{H}$ will be evaluated on a set of $10^8$ randomly-generated evaluation points:
# Deterministic seeding.
rng = np.random.default_rng(42)
# Generate 10**8 evaluation points randomly.
nevals = 100000000
inputs = rng.uniform(size=(6, nevals))
# The mu parameter.
mu = .1
Let us begin with a NumPy-based evaluation:
# Evaluation of H via NumPy.
def Ham_np(px,py,pz,x,y,z):
rps = np.sqrt((x-mu)**2 + (y**2+z**2))
rpj = np.sqrt((x-mu+1)**2 + (y**2+z**2))
return .5*(px**2+py**2+pz**2) + y*px - x*py - (1-mu)/rps - mu/rpj
# Extract the function arguments from
# the inputs array.
px_arr = inputs[0,:]
py_arr = inputs[1,:]
pz_arr = inputs[2,:]
x_arr = inputs[3,:]
y_arr = inputs[4,:]
z_arr = inputs[5,:]
# Time it.
%timeit Ham_np(px_arr,py_arr,pz_arr,x_arr,y_arr,z_arr)
2.12 s ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Let us now try with heyoka.py. First we define the symbolic variables and the mathematical expression of $\mathcal{H}$:
px,py,pz,x,y,z=hy.make_vars("px", "py", "pz", "x", "y", "z")
rps = hy.sqrt((x-mu)**2 + (y**2+z**2))
rpj = hy.sqrt((x-mu+1)**2 + (y**2+z**2))
Ham_sym = .5*(px**2+py**2+pz**2) + y*px - x*py - (1-mu)/rps - mu/rpj
Then we compile Ham_sym:
Ham_cf = hy.make_cfunc([Ham_sym], vars=[px,py,pz,x,y,z])
We can now time Ham_cf:
%timeit Ham_cf(inputs)
209 ms ± 4.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We can see how heyoka.py's compiled function is about 10 times faster than the NumPy-based implementation. We can also appreciate the effect of providing an externally-allocated outputs array:
# Pre-allocate the outputs array.
outputs = np.zeros((1, nevals))
%timeit Ham_cf(inputs,outputs=outputs)
178 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The effect is not dramatic but measurable nevertheless.
As a last benchmark, we will be performing the same evaluation with JAX. Similarly to heyoka.py, JAX offers the possibility to JIT compile Python functions, so we expect similar performance to heyoka.py. Note that, in order to perform a fair comparison, for the execution of this notebook we enabled 64-bit floats in JAX and we used JAX's CPU backend forcing a single thread of execution (JAX by default uses multiple threads of execution, but heyoka.py's compiled functions do not yet support multithreaded execution).
Let us see the JAX code:
import jax
import jax.numpy as jnp
# Turn inputs into a JAX array of float64.
jinputs = jnp.array(inputs, dtype=jnp.float64)
# Fetch the function arguments from jinputs.
jpx_arr = jinputs[0,:]
jpy_arr = jinputs[1,:]
jpz_arr = jinputs[2,:]
jx_arr = jinputs[3,:]
jy_arr = jinputs[4,:]
jz_arr = jinputs[5,:]
# The function for the evaluation of the Hamiltonian.
def Ham_jnp(jpx,jpy,jpz,jx,jy,jz):
rps = jnp.sqrt((jx-mu)**2 + (jy**2+jz**2))
rpj = jnp.sqrt((jx-mu+1)**2 + (jy**2+jz**2))
return .5*(jpx**2+jpy**2+jpz**2) + jy*jpx - jx*jpy - (1-mu)/rps - mu/rpj
# Compile it.
Ham_jnp_jit = jax.jit(Ham_jnp)
# Warm up.
Ham_jnp_jit(jpx_arr,jpy_arr,jpz_arr,jx_arr,jy_arr,jz_arr).block_until_ready();
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
We are now ready to run the benchmark:
%timeit Ham_jnp_jit(jpx_arr,jpy_arr,jpz_arr,jx_arr,jy_arr,jz_arr).block_until_ready()
304 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We can indeed see that JAX's performance is similar to heyoka.py's, although heyoka.py retains a performance edge.