try:
    import google.colab
except ImportError:
    pass
else:
    !pip install zfit numpy tensorflow_probability
import tensorflow as tf
Depending on the CPU used and the TensorFlow binaries, what we will see (not in the Jupyter Notebook) are a bunch of messages, including the following:
tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
What could they mean?
AVX stands for Advanced Vector Extensions; these are instructions that the CPU can perform. They are specializations for vector operations (remember? SIMD, CPU instruction set, etc.).
Why do they appear?
The code that we are using was not compiled with these flags on. This means TensorFlow assumes that the CPU does not support these instructions and uses non-optimized ones instead. The reason is that this allows the binary (= compiled code) to also run on a CPU that does not support them, while we only lose some speed. (Yes, technically TensorFlow can be faster when compiled natively on your computer, but that takes time and effort.)
# import numba # numba fails, not compatible with current python
import numpy as np
import tensorflow_probability as tfp
import zfit
from zfit import z
Let's start with a simple comparison of Numpy, an AOT-compiled library, versus pure Python.
size1 = 100000
list1 = [np.random.uniform() for _ in range(size1)]
list2 = [np.random.uniform() for _ in range(size1)]
list_zeros = [0] * size1
ar1 = np.array(list1)
ar2 = np.random.uniform(size=size1) # way more efficient!
ar_zeros = np.zeros_like(ar1)  # the *_like functions are quite useful: create an array "like" the given object
# we could also create the array below, which is in general better (no initialization needed):
# ar_empty = np.empty(shape=size1, dtype=np.float64)
%%timeit -n10
for i in range(size1):
    list_zeros[i] = list1[i] + list2[i]
%%timeit -n10
ar1 + ar2
(Playground: we can also try assignments here or similar.)
Before we go deeper into the topic, we can already draw a conclusion:
=> there is no reason to ever add (numerical) arrays with a for loop in pure Python (except when using numba's jit)
As mentioned, TensorFlow is basically Numpy. Let's check that out
rnd1_np = np.random.uniform(size=10, low=0, high=10)
rnd1_np # adding a return value on the last line without assigning it prints the value
rnd1 = tf.random.uniform(shape=(10,),  # notice the "shape" argument: it's more picky than Numpy
                         minval=0,
                         maxval=10,
                         dtype=tf.float64)
rnd2 = tf.random.uniform(shape=(10,),
                         minval=0,
                         maxval=10,
                         dtype=tf.float64)
rnd1
This is in fact a wrapped Numpy array and can explicitly be converted to one
rnd1.numpy()
Other operations act as we would expect them to
rnd1 + 10
... and it converts itself (often) to Numpy when needed.
np.sqrt(rnd1)
We can slice it...
rnd1[1:3]
...expand it....
rnd1[None, :, None]
...and broadcast with the known (maybe slightly stricter) rules
matrix1 = rnd1[None, :] * rnd1[:, None]
matrix1
Many operations that exist in Numpy also exist in TensorFlow, sometimes with a different name.
The concept however is exactly the same: we have higher-level objects such as Tensors (or arrays) and call operations on them with arguments. Theoretically this is a "strong limitation" of what we can do; however, since we do math, there is only a limited set of operations we need, and in practice this suffices for 98% of the cases.
Therefore we won't dive too deep into the possibilities of TensorFlow/Numpy regarding operations, but it is suggested to read the API docs; many operations are self-explanatory. It may be surprising that there is also some support for more exotic elements such as RaggedTensors and SparseTensors with their respective operations.
Mostly, the differences and the terminology will be introduced.
tf.sqrt(rnd1)
tf.reduce_sum(matrix1, axis=0) # with the axis argument to specify over which to reduce
TensorFlow is more picky about dtypes than Numpy and does not automatically cast them. That's why we often get a dtype error. Solution: make sure to add x = tf.cast(x, dtype=tf.float64)
(or whatever dtype we want) to cast it into the right dtype.
One noticeable difference: TensorFlow uses float32 as the default for all operations. Neural networks work quite well with that (sometimes even with float16), but for (most) scientific use-cases we want float64. So yes, currently we have to define this in (too) many places.
Sidenote: zfit also uses tf.float64 throughout internally. A frequently seen problem is that custom shapes raise a dtype error.
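A minimal sketch of what that typically looks like (the tensors here are made up just for illustration):
x = tf.constant([1., 2., 3.])  # defaults to tf.float32
x = tf.cast(x, dtype=tf.float64)  # explicit cast to float64
x + tf.constant([4., 5., 6.], dtype=tf.float64)  # now the dtypes match and the addition works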
The idea of TensorFlow revolves around building a Graph$^{1)}$; the Eager (Numpy-like) execution is more of a "we get it for free" addition and does not lead the development ideas. This has one profound implication, namely that we cannot make an assignment to a Tensor, because a Tensor is a node in a graph. The logic just does not work (exception: tf.Variable
$^{2)}$). This does not mean that TensorFlow would not perform in-place operations behind the scenes - it very well does if it is safe to do so. Since TensorFlow knows the whole graph with all dependencies, this can be figured out.
Even though in eager mode assignments could work (as for Numpy arrays), they are forbidden for consistency (one of the great plus points of TensorFlow).
An advanced structure that can also be modified is the tf.TensorArray
, which acts like a list or ArrayList (in other languages). It is however advanced and not usually needed, so we won't go into the details here; a small sketch follows below the footnotes.
1) if you're familiar with TensorFlow 1, this statement won't surprise you, it was pretty obvious there; in TensorFlow 2, this is luckily more hidden.
2) more general, ResourceHandler
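As a small sketch (not needed in the following, the values are made up for illustration), a tf.TensorArray can be written to and then collected into a regular Tensor:
ta = tf.TensorArray(dtype=tf.float64, size=3)  # a list-like, writable structure
ta = ta.write(0, tf.constant(1., dtype=tf.float64))  # write returns the updated TensorArray
ta = ta.write(1, tf.constant(2., dtype=tf.float64))
ta = ta.write(2, tf.constant(3., dtype=tf.float64))
ta.stack()  # collect the entries into a regular Tensor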
rnd1_np[5] = 42
try:
    rnd1[5] = 42
except TypeError as error:
    print(error)
Let's do the same calculation as with Numpy. The result should be comparable: both are AOT compiled libraries specialized on numerical, vectorized operations.
rnd1_big = tf.random.uniform(shape=(size1,),  # notice the "shape" argument: it's more picky than Numpy
                             minval=0,
                             maxval=10,
                             dtype=tf.float64)
rnd2_big = tf.random.uniform(shape=(size1,),
                             minval=0,
                             maxval=10,
                             dtype=tf.float64)
%%timeit -n100
rnd1_big + rnd2_big
%%timeit -n100 # using numpy, same as before
ar1 + ar2
Looks about the same as Numpy. Let's compare with smaller arrays
rnd1_np = rnd1.numpy()
rnd2_np = rnd2.numpy()
rnd1_np
%%timeit -n100
rnd1_np + rnd2_np
%%timeit -n100
rnd1 + rnd2
We now see a significant difference in the runtime! This is because TensorFlow has a larger overhead than Numpy. As seen before, this is barely noticeable for larger arrays, but for very small calculations it becomes visible.
There is more overhead because TensorFlow tries to be "smarter" about many things than Numpy and does not simply directly execute the computation.
The cost is a slowdown on very small operations but a better scaling and improved performance with larger arrays and more complicated calculations.
# relative speeds may differ, depending on the hardware used.
# size_big = 10 # numpy faster
size_big = 20000 # sameish
# size_big = 100000 # TF faster
# size_big = 1000000 # TF faster
# size_big = 10000000 # TF faster
# size_big = 100000000 # TF faster
%%timeit -n10
tf.random.uniform(shape=(size_big,), dtype=tf.float64) + tf.random.uniform(shape=(size_big,), dtype=tf.float64)
%%timeit -n10
np.random.uniform(size=(size_big,)) + np.random.uniform(size=(size_big,))
In general, TensorFlow is more precise about what input arguments are required compared to Numpy: it does less automatic dtype casting and is more explicit about shapes. For example, integers don't work in the logarithm. However, this error message illustrates the kernel dispatch system of TensorFlow very well, so let's provoke it!
try:
    tf.math.log(5)
except (tf.errors.NotFoundError, tf.errors.InvalidArgumentError) as error:  # TF 2.4 error, TF 2.5 error
    print(error)
What we see here: TensorFlow searches the registered kernels and does not find one that supports this operation. The error lists the registered kernels, classified by device and by supported dtype.
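The fix, as discussed above, is to provide a supported dtype, for example by casting (a minimal sketch):
tf.math.log(tf.cast(5, dtype=tf.float64))  # works: a float64 kernel is registered for the Log op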
Let's see the JIT in action. Therefore, we use the example from the slides and start modifying it.
def add_log(x, y):
    print('running Python')
    tf.print("running TensorFlow")
    x_sq = tf.math.log(x)
    y_sq = tf.math.log(y)
    return x_sq + y_sq
As seen before, we can use it like Python. To make sure that we know when the actual Python is executed, we inserted a print and a tf.print
; the latter is a TensorFlow operation and is therefore expected to run every time we compute something.
add_log(4., 5.)
add_log(42., 52.)
As we see, both the Python and TensorFlow operation execute. Now we can do the same with a decorator. Note that so far we entered pure Python numbers, not Tensors. Since we ran in eager mode, this did not matter so far.
@tf.function(autograph=False)
def add_log_tf(x, y):
    print('running Python')
    tf.print("running TensorFlow")
    x_sq = tf.math.log(x)
    y_sq = tf.math.log(y)
    return x_sq + y_sq
add_log_tf(1., 2.)
add_log_tf(11., 21.) # again with different numbers
As we see, Python still runs: 11. is not equal to 1., and since TensorFlow does not convert plain Python numbers to Tensors, it retraces the function for every new value. Let's use it in the right way, with Tensors
add_log_tf(tf.constant(1.), tf.constant(2.)) # first compilation
add_log_tf(tf.constant(11.), tf.constant(22.))
Now only the TensorFlow operations get executed! Everything else became static. We can illustrate this more extremely here
@tf.function(autograph=False)
def add_rnd(x):
    print('running Python')
    tf.print("running TensorFlow")
    rnd_np = np.random.uniform()
    rnd_tf = tf.random.uniform(shape=())
    return x * rnd_np, x * rnd_tf
add_rnd(tf.constant(1.))
The first time, the numpy code was executed as well, no difference so far. However, running it a second time, only the TensorFlow parts can change
add_rnd(tf.constant(1.))
add_rnd(tf.constant(2.))
We now see it clearly: TensorFlow executes the function but only cares about the TensorFlow operations; everything else is regarded as static. This can be a large pitfall! If we executed this function without the decorator, we would get a different result, since Numpy would also sample a new random variable every time.
That said, we can build graphs that require thousands of lines of Python code to stick them together correctly. Function calls within function calls etc. are all possible, as the small sketch below illustrates.
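As a tiny sketch (the functions inner and outer are made up for illustration), decorated functions can call each other and the whole call chain ends up in a single graph:
@tf.function(autograph=False)
def inner(x):
    return tf.square(x) + 1.

@tf.function(autograph=False)
def outer(x):
    return inner(x) * 2.

outer(tf.constant(3.))  # the nested call is traced into one graph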
Tensors have a shape, similar to Numpy arrays. But Tensors have two kinds of shapes, a static and a dynamic one. The static shape is what can be inferred before executing the computation, while the dynamic shape is only known during the execution of the code. The latter typically arises with random variables and masking or cuts.
We can access the static shape with Tensor.shape
If the shape is fully known inside a graph, it will be the same as the dynamic one. If it is unknown, the unknown axes will be None.
@tf.function(autograph=False)
def func_shape(x):
    print(f"static shape: {x.shape}")  # static shape
    tf.print('dynamic shape ', tf.shape(x))  # dynamic shape
    x = x[x > 3.5]
    print(f"static shape cuts applied: {x.shape}")  # static shape
    tf.print('dynamic shape cuts applied', tf.shape(x))  # dynamic shape
func_shape(rnd1)
We can access the axes by indexing
rnd3 = rnd1[None, :] * rnd1[:, None]
rnd3
tf.shape(rnd3)
rnd3.shape[1]
TensorFlow offers the possibility to have stateful objects inside a compiled graph (which e.g. is not possible with Numba). The most commonly used one is the tf.Variable
. Technically, Variables are automatically captured when the function is compiled and belong to it.
var1 = tf.Variable(1.)
@tf.function(autograph=False)
def scale_by_var(x):
    print('running Python')
    tf.print("running TensorFlow")
    return x * var1
scale_by_var(tf.constant(1.))
scale_by_var(tf.constant(2.))
var1.assign(42.)
scale_by_var(tf.constant(1.))
As we see, the output changed. This is of course especially useful in the context of model fitting libraries, be it likelihoods or neural networks.
def add_rnd(x):
    print('running Python')
    tf.print("running TensorFlow")
    rnd_np = np.random.uniform()
    rnd_tf = tf.random.uniform(shape=())
    return x * rnd_np, x * rnd_tf
add_rnd(tf.constant(1.))
add_rnd(tf.constant(2.))
This means that we can use Numpy in a fully compatible way in eager mode, but not when decorated.
def try_np_sqrt(x):
    return np.sqrt(x)
try_np_sqrt(tf.constant(5.))
try_np_sqrt_tf = tf.function(try_np_sqrt, autograph=False) # equivalent to decorator
try:
    try_np_sqrt_tf(tf.constant(5.))
except NotImplementedError as error:
    print(error)
As we see, Numpy complains in graph mode, since it cannot handle the symbolic Tensor.
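If a Numpy-only function is really needed inside a graph, tf.py_function can wrap it as an escape hatch; a sketch (at the cost of speed and portability, not something we need here):
@tf.function(autograph=False)
def np_sqrt_wrapped(x):
    # the wrapped Numpy call runs eagerly in Python, outside the graph optimizations
    return tf.py_function(np.sqrt, inp=[x], Tout=tf.float32)

np_sqrt_wrapped(tf.constant(5.))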
Having the tf.function
decorator also means that we can't use arbitrary Python dynamicity. The following fails when decorated but works nicely otherwise:
def greater_python(x, y):
    if x > y:
        return True
    else:
        return False
greater_python(tf.constant(1.), tf.constant(2.))
This works in eager mode, but will fail with the graph decorator.
greater_python_tf = tf.function(greater_python, autograph=False)
try:
    greater_python_tf(tf.constant(1.), tf.constant(2.))
except Exception as error:
    print(error)
The error message hints at something: while this does not work now - Python does not yet know the value of the Tensors, so it can't decide whether the condition evaluates to True or False - there is the possibility of "autograph": it automatically converts (a subset of) Python to TensorFlow: while loops, for loops over Tensors and conditionals. However, this is usually less effective and more error-prone than using the tf.*
functions explicitly. Let's try it!
greater_python_tf_autograph = tf.function(greater_python, autograph=True)
greater_python_tf_autograph(tf.constant(1.), tf.constant(2.))
This now works seamlessly! But we can never be quite sure that it will.
To see what autograph generates, we can also convert the function explicitly.
code = tf.autograph.to_code(greater_python)
print(code)
In the end, performance is what matters, and a comparison would be nice. Let's do that and see how Numpy and TensorFlow compare.
nevents = 1000000
data_tf = tf.random.uniform(shape=(nevents,), dtype=tf.float64)
data_np = np.random.uniform(size=(nevents,))
def calc_np(x):
    x_init = x
    i = 42.
    x = np.sqrt(np.abs(x_init * (i + 1.)))
    x = np.cos(x - 0.3)
    x = np.power(x, i + 1)
    x = np.sinh(x + 0.4)
    x = x ** 2
    x = x / np.mean(x)
    x = np.abs(x)
    logx = np.log(x)
    x = np.mean(logx)

    x1 = np.sqrt(np.abs(x_init * (i + 1.)))
    x1 = np.cos(x1 - 0.3)
    x1 = np.power(x1, i + 1)
    x1 = np.sinh(x1 + 0.4)
    x1 = x1 ** 2
    x1 = x1 / np.mean(x1)
    x1 = np.abs(x1)
    logx = np.log(x1)
    x1 = np.mean(logx)

    x2 = np.sqrt(np.abs(x_init * (i + 1.)))
    x2 = np.cos(x2 - 0.3)
    x2 = np.power(x2, i + 1)
    x2 = np.sinh(x2 + 0.4)
    x2 = x2 ** 2
    x2 = x2 / np.mean(x2)
    x2 = np.abs(x2)
    logx = np.log(x2)
    x2 = np.mean(logx)

    return x + x1 + x2
calc_np_numba = calc_np # numba.jit(nopython=True, parallel=True)(calc_np)
def calc_tf(x):
    x_init = x
    i = 42.
    x = tf.sqrt(tf.abs(x_init * (tf.cast(i, dtype=tf.float64) + 1.)))
    x = tf.cos(x - 0.3)
    x = tf.pow(x, tf.cast(i + 1, tf.float64))
    x = tf.sinh(x + 0.4)
    x = x ** 2
    x = x / tf.reduce_mean(x)
    x = tf.abs(x)
    x = tf.reduce_mean(tf.math.log(x))

    x1 = tf.sqrt(tf.abs(x_init * (tf.cast(i, dtype=tf.float64) + 1.)))
    x1 = tf.cos(x1 - 0.3)
    x1 = tf.pow(x1, tf.cast(i + 1, tf.float64))
    x1 = tf.sinh(x1 + 0.4)
    x1 = x1 ** 2
    x1 = x1 / tf.reduce_mean(x1)
    x1 = tf.abs(x1)
    x1 = tf.reduce_mean(tf.math.log(x1))  # reduce to a scalar, as in the Numpy version

    x2 = tf.sqrt(tf.abs(x_init * (tf.cast(i, dtype=tf.float64) + 1.)))
    x2 = tf.cos(x2 - 0.3)
    x2 = tf.pow(x2, tf.cast(i + 1, tf.float64))
    x2 = tf.sinh(x2 + 0.4)
    x2 = x2 ** 2
    x2 = x2 / tf.reduce_mean(x2)
    x2 = tf.abs(x2)
    x2 = tf.reduce_mean(tf.math.log(x2))  # reduce to a scalar, as in the Numpy version

    return x + x1 + x2
calc_tf_func = tf.function(calc_tf, autograph=False)
%%timeit -n1 -r1 # compile time, just for curiosity
calc_tf_func(data_tf)
%%timeit -n1 -r1 # compile time, just for curiosity
calc_np_numba(data_np)
%timeit -n1 calc_np(data_np) # not compiled
%timeit -n1 calc_tf(data_tf) # not compiled
%%timeit -n1 -r7
calc_np_numba(data_np)
%%timeit -n1 -r7
calc_tf_func(data_tf)
We can now play around with these numbers. Depending on the size (we can go up to 10 million) and the parallelizability of the problem, the numbers differ.
In general:
=> there is no free lunch
Note: this has not run on a GPU; TensorFlow would automatically use one if it were available.
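A quick check (just a sketch) of whether TensorFlow sees a GPU at all:
tf.config.list_physical_devices('GPU')  # an empty list means everything above ran on the CPU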
def calc_tf2(x, n):
    sum_init = tf.zeros_like(x)
    for i in range(1, n + 1):
        x = tf.sqrt(tf.abs(x * (tf.cast(i, dtype=tf.float64) + 1.)))
        x = tf.cos(x - 0.3)
        x = tf.pow(x, tf.cast(i + 1, tf.float64))
        x = tf.sinh(x + 0.4)
        x = x ** 2
        x = x / tf.reduce_mean(x, axis=None)
        x = tf.abs(x)
        x = x - tf.reduce_mean(tf.math.log(x, name="Jonas_log"), name="Jonas_mean")  # name for ops, see later ;)
        sum_init += x
    return sum_init
calc_tf_func2 = tf.function(calc_tf2, autograph=False)
# @numba.njit(parallel=True)  # njit is equal to jit(nopython=True), meaning "compile everything or raise error"
def calc_numba2(x, n):
    sum_init = np.zeros_like(x)
    for i in range(1, n + 1):
        x = np.sqrt(np.abs(x * (i + 1.)))
        x = np.cos(x - 0.3)
        x = np.power(x, i + 1)
        x = np.sinh(x + 0.4)
        x = x ** 2
        x = x / np.mean(x)
        x = np.abs(x)
        x = x - np.mean(np.log(x))
        sum_init += x
    return sum_init
%%timeit -n1 -r1 #compile
calc_numba2(rnd1_big.numpy(), 1)
calc_numba2(rnd1_big.numpy(), 1)
%%timeit -n1 -r1 #compile
calc_tf_func2(rnd1_big, 1)
calc_tf_func2(rnd1_big, 1)
%%timeit -n10
calc_numba2(rnd1_big.numpy(), 1)
%%timeit -n10
calc_tf_func2(rnd1_big, 1)
calc_tf_func2(rnd1_big, 10)
calc_numba2(rnd1_big.numpy(), 10)
%%timeit -n10
calc_numba2(rnd1_big.numpy(), 10)
%%timeit -n10
calc_tf_func2(rnd1_big, 10)
While TensorFlow is independent of the Python control flow, it has its own functions for that, mainly tf.cond for conditionals and tf.while_loop for loops.
def true_fn():
    return 1.

def false_fn():
    return 0.
value = tf.cond(tf.greater(111., 42.), true_fn=true_fn, false_fn=false_fn)
value
We can create while loops in order to have some kind of repetitive task
def cond(x, y):
    return x > y

def body(x, y):
    return x / 2, y + 1

x, y = tf.while_loop(cond=cond,
                     body=body,
                     loop_vars=[100., 1.])
x, y
We can also map a function on each element. While this is not very efficient, it allows for high flexibility.
%%timeit -n1 -r1
tf.map_fn(tf.math.sin, rnd1_big[:min(10000, nevents)]) # takes a long time in full, 40+ secs
# strongly reduced data sample! Already takes seconds
That took a whole lot of time (remember, the sample is strongly reduced!). That's because if we run map_fn
in eager mode (i.e. not inside a tf.function
decorated function), it falls back to pure Python: a for-loop.
In comparison, vectorized_map
always compiles the function.
%%timeit -n1 -r1
tf.vectorized_map(tf.math.sin, rnd1_big) # can speedup things sometimes
@tf.function(autograph=False)
def do_map(func, tensor):
    return tf.map_fn(func, tensor)
do_map(tf.math.sin, rnd1_big)
@tf.function(autograph=False)
def do_map_vec(func, tensor):
    return tf.vectorized_map(func, tensor)
do_map_vec(tf.math.sin, rnd1_big)
%%timeit -n1
do_map(tf.math.sin, rnd1_big)
%%timeit -n1
do_map_vec(tf.math.sin, rnd1_big)
%%timeit -n100
tf.math.sin(rnd1_big)
As we can see, the generic mapping is surely not optimal. However, it works "always". vectorized_map
, on the other hand, gives a huge speedup and performs nearly as well as the native function! However, while this works nicely for this case, its applications are limited and depend heavily on the use-case; more complicated examples can easily result in a longer runtime and huge memory consumption. Caution is therefore advised when using this function.
TensorFlow allows us to calculate gradients automatically (automatic differentiation).
var2 = tf.Variable(2.)
with tf.GradientTape() as tape:
    tape.watch(var2)  # actually watches all variables already by default
    y = var2 ** 3
y
grad = tape.gradient(y, var2)
grad
This allows us to do many things with gradients, e.g. solve differential equations.
Following are a few ideas to check whether you understood things.
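One of them, as a small sketch (var3 is a hypothetical variable, not part of the examples above): tapes can be nested to obtain higher-order derivatives.
var3 = tf.Variable(3.)
with tf.GradientTape() as tape_outer:
    with tf.GradientTape() as tape_inner:
        y = var3 ** 3
    dy_dx = tape_inner.gradient(y, var3)  # 3 * x^2 -> 27
d2y_dx2 = tape_outer.gradient(dy_dx, var3)  # 6 * x -> 18
d2y_dx2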
We talked about the graph back and forth, but where is it?
The graph can be retrieved from a function that has already been traced.
concrete_func = calc_tf_func2.get_concrete_function(rnd1, 2)
concrete_func
graph = concrete_func.graph
graph
ops = graph.get_operations()
ops
log_op = ops[-6]
log_op
log_op.outputs
op_inputs_mean = ops[-4].inputs
op_inputs_mean
log_op.outputs[0] is op_inputs_mean[0]
The output of the log operation is the input to the mean operation! We can simply walk along the graph here. TensorFlow Graphs are no magic: they are simple objects that store their inputs, their outputs and their operation. That's it!
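As a small sketch (the operation names depend on the traced function), we can walk over the first few operations and print how they are wired together:
for op in ops[:5]:
    print(op.type, op.name, [inp.name for inp in op.inputs], [out.name for out in op.outputs])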