try:
    import google.colab
except ImportError:
    pass
else:
    !pip install zfit numpy tensorflow_probability
import tensorflow as tf
Depending on the CPU used and the TensorFlow binaries, what we will see (not in the Jupyter Notebook) are a bunch of messages, including the following:
tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
What could they mean?
AVX stands for Advanced Vector Extensions; these are instructions that the CPU can perform. They are specializations for vector operations (remember? SIMD, CPU instruction set, etc.).
Why do they appear?
The code that we are using was not compiled with these flags on. This means TensorFlow assumes that the CPU does not support these instructions and uses non-optimized ones instead. The reason is that this allows the binary (= compiled code) to also run on a CPU that does not support them, while we only lose some speed. (Yes, technically TensorFlow can be faster when compiled natively on your computer, but that takes time and effort.)
# import numba # numba fails, not compatible with current python
import numpy as np
import tensorflow_probability as tfp
import zfit
from zfit import z
Let's start with a simple comparison of Numpy, an AOT-compiled library, versus pure Python.
size1 = 100000
list1 = [np.random.uniform() for _ in range(size1)]
list2 = [np.random.uniform() for _ in range(size1)]
list_zeros = [0] * size1
ar1 = np.array(list1)
ar2 = np.random.uniform(size=size1) # way more efficient!
ar_zeros = np.zeros_like(ar1)  # the *_like functions are quite useful: create an array "like" the given object
# we could also create the array below, which is in general better (no initialization needed):
# ar_empty = np.empty(shape=size1, dtype=np.float64)
%%timeit -n10
for i in range(size1):
    list_zeros[i] = list1[i] + list2[i]
%%timeit -n10
ar1 + ar2
(Playground: we can also try assignments here or similar.)
Before we go deeper into the topic, we can already draw a conclusion:
=> there is no reason to ever add (numerical) arrays with a for loop in pure Python (except when using numba's jit)
As mentioned, TensorFlow is basically Numpy. Let's check that out
rnd1_np = np.random.uniform(size=10, low=0, high=10)
rnd1_np # adding a return value on the last line without assigning it prints the value
rnd1 = tf.random.uniform(shape=(10,),  # notice the "shape" argument: it's more picky than Numpy
                         minval=0,
                         maxval=10,
                         dtype=tf.float64)
rnd2 = tf.random.uniform(shape=(10,),
                         minval=0,
                         maxval=10,
                         dtype=tf.float64)
rnd1
This is in fact a wrapped Numpy array and can explicitly be converted to one
rnd1.numpy()
Other operations act as we would expect them to
rnd1 + 10
... and it converts itself (often) to Numpy when needed.
np.sqrt(rnd1)
We can slice it...
rnd1[1:3]
...expand it....
rnd1[None, :, None]
...and broadcast with the known (maybe slightly stricter) rules
matrix1 = rnd1[None, :] * rnd1[:, None]
matrix1
Many operations that exist in Numpy also exist in TensorFlow, sometimes with a different name.
The concept however is exactly the same: we have higher-level objects such as Tensors (or arrays) and call operations on them with arguments. Theoretically this is a "strong limitation" of what we can do; however, since we do math, there is only a limited set of operations we need, and in practice this suffices for 98% of the cases.
Therefore we won't dive too deep into the possibilities of TensorFlow/Numpy regarding operations, but it is suggested to read the API docs; many operations are self-explanatory. It may be surprising that there is also some support for more exotic elements such as RaggedTensors and SparseTensors with their respective operations.
Mostly, the differences and the terminology will be introduced.
tf.sqrt(rnd1)
tf.reduce_sum(matrix1, axis=0) # with the axis argument to specify over which to reduce
TensorFlow is more picky about dtypes than Numpy and does not automatically cast them. That's why we often get a dtype error. Solution: make sure to add x = tf.cast(x, dtype=tf.float64)
(or whatever dtype we want) to cast it into the right dtype.
One noticeable difference: TensorFlow uses float32 as the default for all operations. Neural networks work quite well with that (sometimes even with float16), but for (most) scientific use-cases we want float64. So yes, currently we have to define this in (too) many places.
Sidenote: zfit also uses tf.float64 throughout internally. A frequently seen problem is that custom shapes raise a dtype error.
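A minimal sketch of what that typically looks like (the tensors here are made up just for illustration):
x = tf.constant([1., 2., 3.])  # defaults to tf.float32
x = tf.cast(x, dtype=tf.float64)  # explicit cast to float64
x + tf.constant([4., 5., 6.], dtype=tf.float64)  # now the dtypes match and the addition works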
The idea of TensorFlow revolves around building a Graph$^{1)}$; the Eager (Numpy-like) execution is more of a "we get it for free" addition and does not lead the development ideas. This has one profound implication, namely that we cannot make an assignment to a Tensor, because a Tensor is a node in a graph. The logic just does not work (exception: tf.Variable
$^{2)}$). This does not mean that TensorFlow would not perform in-place operations behind the scenes - it very well does if it is safe to do so. Since TensorFlow knows the whole graph with all dependencies, this can be figured out.
Even though in eager mode assignments could work (as for Numpy arrays), they are forbidden for consistency (one of the great plus points of TensorFlow).
An advanced structure that can also be modified is the tf.TensorArray
, which acts like a list or ArrayList (in other languages). It is however advanced and not usually needed, so we won't go into the details here; a small sketch follows below the footnotes.
1) if you're familiar with TensorFlow 1, this statement won't surprise you, it was pretty obvious there; in TensorFlow 2, this is luckily more hidden.
2) more general, ResourceHandler
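As a small sketch (not needed in the following, the values are made up for illustration), a tf.TensorArray can be written to and then collected into a regular Tensor:
ta = tf.TensorArray(dtype=tf.float64, size=3)  # a list-like, writable structure
ta = ta.write(0, tf.constant(1., dtype=tf.float64))  # write returns the updated TensorArray
ta = ta.write(1, tf.constant(2., dtype=tf.float64))
ta = ta.write(2, tf.constant(3., dtype=tf.float64))
ta.stack()  # collect the entries into a regular Tensor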
rnd1_np[5] = 42
try:
    rnd1[5] = 42
except TypeError as error:
    print(error)
Let's do the same calculation as with Numpy. The result should be comparable: both are AOT compiled libraries specialized on numerical, vectorized operations.
rnd1_big = tf.random.uniform(shape=(size1,),  # notice the "shape" argument: it's more picky than Numpy
                             minval=0,
                             maxval=10,
                             dtype=tf.float64)
rnd2_big = tf.random.uniform(shape=(size1,),
                             minval=0,
                             maxval=10,
                             dtype=tf.float64)
%%timeit -n100
rnd1_big + rnd2_big
%%timeit -n100 # using numpy, same as before
ar1 + ar2
Looks about the same as Numpy. Let's compare with smaller arrays
rnd1_np = rnd1.numpy()
rnd2_np = rnd2.numpy()
rnd1_np
%%timeit -n100
rnd1_np + rnd2_np
%%timeit -n100
rnd1 + rnd2
We now see a significant difference in the runtime! This is because TensorFlow has a larger overhead than Numpy. As seen before, this is barely noticeable for larger arrays, but for very small calculations it becomes visible.
There is more overhead because TensorFlow tries to be "smarter" about many things than Numpy and does not simply directly execute the computation.
The cost is a slowdown on very small operations but a better scaling and improved performance with larger arrays and more complicated calculations.
# relative speeds may differ, depending on the hardware used.
# size_big = 10 # numpy faster
size_big = 20000 # sameish
# size_big = 100000 # TF faster
# size_big = 1000000 # TF faster
# size_big = 10000000 # TF faster
# size_big = 100000000 # TF faster
%%timeit -n10
tf.random.uniform(shape=(size_big,), dtype=tf.float64) + tf.random.uniform(shape=(size_big,), dtype=tf.float64)
%%timeit -n10
np.random.uniform(size=(size_big,)) + np.random.uniform(size=(size_big,))
In general, TensorFlow is more precise about what input arguments are required compared to Numpy: it does less automatic dtype casting and is more explicit about shapes. For example, integers don't work in the logarithm. However, this error message illustrates the kernel dispatch system of TensorFlow very well, so let's provoke it!
try:
    tf.math.log(5)
except (tf.errors.NotFoundError, tf.errors.InvalidArgumentError) as error:  # TF 2.4 error, TF 2.5 error
    print(error)
What we see here: TensorFlow searches the registered kernels and does not find one that supports this operation. The error lists the registered kernels, classified by device and by supported dtype.
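The fix, as discussed above, is to provide a supported dtype, for example by casting (a minimal sketch):
tf.math.log(tf.cast(5, dtype=tf.float64))  # works: a float64 kernel is registered for the Log op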
Let's see the JIT in action. Therefore, we use the example from the slides and start modifying it.
def add_log(x, y):
    print('running Python')
    tf.print("running TensorFlow")
    x_sq = tf.math.log(x)
    y_sq = tf.math.log(y)
    return x_sq + y_sq
As seen before, we can use it like Python. To make sure that we know when the actual Python is executed, we inserted a print and a tf.print
; the latter is a TensorFlow operation and is therefore expected to run every time we compute something.
add_log(4., 5.)
add_log(42., 52.)
As we see, both the Python and TensorFlow operation execute. Now we can do the same with a decorator. Note that so far we entered pure Python numbers, not Tensors. Since we ran in eager mode, this did not matter so far.
@tf.function(autograph=False)
def add_log_tf(x, y):
    print('running Python')
    tf.print("running TensorFlow")
    x_sq = tf.math.log(x)
    y_sq = tf.math.log(y)
    return x_sq + y_sq
add_log_tf(1., 2.)
add_log_tf(11., 21.) # again with different numbers
As we see, Python still runs: 11. is not equal to 1., and since TensorFlow does not convert plain Python numbers to Tensors, it retraces the function for every new value. Let's use it in the right way, with Tensors
add_log_tf(tf.constant(1.), tf.constant(2.)) # first compilation
add_log_tf(tf.constant(11.), tf.constant(22.))
Now only the TensorFlow operations get executed! Everything else became static. We can illustrate this more extremely here
@tf.function(autograph=False)
def add_rnd(x):
    print('running Python')
    tf.print("running TensorFlow")
    rnd_np = np.random.uniform()
    rnd_tf = tf.random.uniform(shape=())
    return x * rnd_np, x * rnd_tf
add_rnd(tf.constant(1.))
The first time, the numpy code was executed as well, no difference so far. However, running it a second time, only the TensorFlow parts can change
add_rnd(tf.constant(1.))
add_rnd(tf.constant(2.))
We now see it clearly: TensorFlow executes the function but only cares about the TensorFlow operations; everything else is regarded as static. This can be a large pitfall! If we executed this function without the decorator, we would get a different result, since Numpy would also sample a new random variable every time.
That said, we can build graphs that require thousands of lines of Python code to stick them together correctly. Function calls within function calls etc. are all possible, as the small sketch below illustrates.
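As a tiny sketch (the functions inner and outer are made up for illustration), decorated functions can call each other and the whole call chain ends up in a single graph:
@tf.function(autograph=False)
def inner(x):
    return tf.square(x) + 1.

@tf.function(autograph=False)
def outer(x):
    return inner(x) * 2.

outer(tf.constant(3.))  # the nested call is traced into one graph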
Tensors have a shape, similar to Numpy arrays. But Tensors have two kinds of shapes, a static and a dynamic one. The static shape is what can be inferred before executing the computation, while the dynamic shape is only known during the execution of the code. The latter typically arises with random variables and masking or cuts.
We can access the static shape with Tensor.shape
If the shape is fully known inside a graph, it will be the same as the dynamic one. If it is unknown, the unknown axes will be None.
@tf.function(autograph=False)
def func_shape(x):
    print(f"static shape: {x.shape}")  # static shape
    tf.print('dynamic shape ', tf.shape(x))  # dynamic shape
    x = x[x > 3.5]
    print(f"static shape cuts applied: {x.shape}")  # static shape
    tf.print('dynamic shape cuts applied', tf.shape(x))  # dynamic shape
func_shape(rnd1)
We can access the axes by indexing
rnd3 = rnd1[None, :] * rnd1[:, None]
rnd3
tf.shape(rnd3)
rnd3.shape[1]
TensorFlow offers the possibility to have stateful objects inside a compiled graph (which e.g. is not possible with Numba). The most commonly used one is the tf.Variable
. Technically, Variables are automatically captured when the function is compiled and belong to it.
var1 = tf.Variable(1.)
@tf.function(autograph=False)
def scale_by_var(x):
    print('running Python')
    tf.print("running TensorFlow")
    return x * var1
scale_by_var(tf.constant(1.))
scale_by_var(tf.constant(2.))
var1.assign(42.)
scale_by_var(tf.constant(1.))
As we see, the output changed. This is of course especially useful in the context of model fitting libraries, be it likelihoods or neural networks.
def add_rnd(x):
    print('running Python')
    tf.print("running TensorFlow")
    rnd_np = np.random.uniform()
    rnd_tf = tf.random.uniform(shape=())
    return x * rnd_np, x * rnd_tf
add_rnd(tf.constant(1.))
add_rnd(tf.constant(2.))
This means that we can use Numpy in a fully compatible way in eager mode, but not when decorated.
def try_np_sqrt(x):
    return np.sqrt(x)
try_np_sqrt(tf.constant(5.))
try_np_sqrt_tf = tf.function(try_np_sqrt, autograph=False) # equivalent to decorator
try:
    try_np_sqrt_tf(tf.constant(5.))
except NotImplementedError as error:
    print(error)
As we see, Numpy complains in graph mode, since it cannot handle the symbolic Tensor.
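If a Numpy-only function is really needed inside a graph, tf.py_function can wrap it as an escape hatch; a sketch (at the cost of speed and portability, not something we need here):
@tf.function(autograph=False)
def np_sqrt_wrapped(x):
    # the wrapped Numpy call runs eagerly in Python, outside the graph optimizations
    return tf.py_function(np.sqrt, inp=[x], Tout=tf.float32)

np_sqrt_wrapped(tf.constant(5.))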
Having the tf.function
decorator also means that we can't use arbitrary Python dynamicity. The following fails when decorated but works nicely otherwise:
def greater_python(x, y):
    if x > y:
        return True
    else:
        return False
greater_python(tf.constant(1.), tf.constant(2.))
This works in eager mode, but will fail with the graph decorator.
greater_python_tf = tf.function(greater_python, autograph=False)
try:
    greater_python_tf(tf.constant(1.), tf.constant(2.))
except Exception as error:
    print(error)
The error message hints at something: while this does not work now - Python does not yet know the value of the Tensors, so it can't decide whether the condition evaluates to True or False - there is the possibility of "autograph": it automatically converts (a subset of) Python to TensorFlow: while loops, for loops over Tensors and conditionals. However, this is usually less effective and more error-prone than using the tf.*
functions explicitly. Let's try it!
greater_python_tf_autograph = tf.function(greater_python, autograph=True)
greater_python_tf_autograph(tf.constant(1.), tf.constant(2.))
This now works seamlessly! But we can never be quite sure that it will.
To see what autograph generates, we can also convert the function explicitly.
code = tf.autograph.to_code(greater_python)
print(code)
In the end, performance is what matters, and a comparison would be nice. Let's do that and see how Numpy and TensorFlow compare.
nevents = 1000000
data_tf = tf.random.uniform(shape=(nevents,), dtype=tf.float64)
data_np = np.random.uniform(size=(nevents,))
def calc_np(x):
    x_init = x
    i = 42.
    x = np.sqrt(np.abs(x_init * (i + 1.)))
    x = np.cos(x - 0.3)
    x = np.power(x, i + 1)
    x = np.sinh(x + 0.4)
    x = x ** 2
    x = x / np.mean(x)
    x = np.abs(x)
    logx = np.log(x)
    x = np.mean(logx)

    x1 = np.sqrt(np.abs(x_init * (i + 1.)))
    x1 = np.cos(x1 - 0.3)
    x1 = np.power(x1, i + 1)
    x1 = np.sinh(x1 + 0.4)
    x1 = x1 ** 2
    x1 = x1 / np.mean(x1)
    x1 = np.abs(x1)
    logx = np.log(x1)
    x1 = np.mean(logx)

    x2 = np.sqrt(np.abs(x_init * (i + 1.)))
    x2 = np.cos(x2 - 0.3)
    x2 = np.power(x2, i + 1)
    x2 = np.sinh(x2 + 0.4)
    x2 = x2 ** 2
    x2 = x2 / np.mean(x2)
    x2 = np.abs(x2)
    logx = np.log(x2)
    x2 = np.mean(logx)

    return x + x1 + x2
calc_np_numba = calc_np # numba.jit(nopython=True, parallel=True)(calc_np)
def calc_tf(x):
    x_init = x
    i = 42.
    x = tf.sqrt(tf.abs(x_init * (tf.cast(i, dtype=tf.float64) + 1.)))
    x = tf.cos(x - 0.3)
    x = tf.pow(x, tf.cast(i + 1, tf.float64))
    x = tf.sinh(x + 0.4)
    x = x ** 2
    x = x / tf.reduce_mean(x)
    x = tf.abs(x)
    x = tf.reduce_mean(tf.math.log(x))

    x1 = tf.sqrt(tf.abs(x_init * (tf.cast(i, dtype=tf.float64) + 1.)))
    x1 = tf.cos(x1 - 0.3)
    x1 = tf.pow(x1, tf.cast(i + 1, tf.float64))
    x1 = tf.sinh(x1 + 0.4)
    x1 = x1 ** 2
    x1 = x1 / tf.reduce_mean(x1)
    x1 = tf.abs(x1)
    x1 = tf.reduce_mean(tf.math.log(x1))  # reduce to a scalar, as in the Numpy version

    x2 = tf.sqrt(tf.abs(x_init * (tf.cast(i, dtype=tf.float64) + 1.)))
    x2 = tf.cos(x2 - 0.3)
    x2 = tf.pow(x2, tf.cast(i + 1, tf.float64))
    x2 = tf.sinh(x2 + 0.4)
    x2 = x2 ** 2
    x2 = x2 / tf.reduce_mean(x2)
    x2 = tf.abs(x2)
    x2 = tf.reduce_mean(tf.math.log(x2))  # reduce to a scalar, as in the Numpy version

    return x + x1 + x2
calc_tf_func = tf.function(calc_tf, autograph=False)
%%timeit -n1 -r1 # compile time, just for curiosity
calc_tf_func(data_tf)
%%timeit -n1 -r1 # compile time, just for curiosity
calc_np_numba(data_np)
%timeit -n1 calc_np(data_np) # not compiled
%timeit -n1 calc_tf(data_tf) # not compiled
%%timeit -n1 -r7
calc_np_numba(data_np)
%%timeit -n1 -r7
calc_tf_func(data_tf)
We can now play around with these numbers. Depending on the size (we can go up to 10 million) and the parallelizability of the problem, the numbers differ.
In general:
=> there is no free lunch
Note: this has not run on a GPU; TensorFlow would automatically use one if it were available.
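A quick check (just a sketch) of whether TensorFlow sees a GPU at all:
tf.config.list_physical_devices('GPU')  # an empty list means everything above ran on the CPU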
def calc_tf2(x, n):
    sum_init = tf.zeros_like(x)
    for i in range(1, n + 1):
        x = tf.sqrt(tf.abs(x * (tf.cast(i, dtype=tf.float64) + 1.)))
        x = tf.cos(x - 0.3)
        x = tf.pow(x, tf.cast(i + 1, tf.float64))
        x = tf.sinh(x + 0.4)
        x = x ** 2
        x = x / tf.reduce_mean(x, axis=None)
        x = tf.abs(x)
        x = x - tf.reduce_mean(tf.math.log(x, name="Jonas_log"), name="Jonas_mean")  # name for ops, see later ;)
        sum_init += x
    return sum_init
calc_tf_func2 = tf.function(calc_tf2, autograph=False)
# @numba.njit(parallel=True)  # njit is equal to jit(nopython=True), meaning "compile everything or raise error"
def calc_numba2(x, n):
    sum_init = np.zeros_like(x)
    for i in range(1, n + 1):
        x = np.sqrt(np.abs(x * (i + 1.)))
        x = np.cos(x - 0.3)
        x = np.power(x, i + 1)
        x = np.sinh(x + 0.4)
        x = x ** 2
        x = x / np.mean(x)
        x = np.abs(x)
        x = x - np.mean(np.log(x))
        sum_init += x
    return sum_init
%%timeit -n1 -r1 #compile
calc_numba2(rnd1_big.numpy(), 1)
calc_numba2(rnd1_big.numpy(), 1)
%%timeit -n1 -r1 #compile
calc_tf_func2(rnd1_big, 1)
calc_tf_func2(rnd1_big, 1)
%%timeit -n10
calc_numba2(rnd1_big.numpy(), 1)
%%timeit -n10
calc_tf_func2(rnd1_big, 1)
calc_tf_func2(rnd1_big, 10)
calc_numba2(rnd1_big.numpy(), 10)
%%timeit -n10
calc_numba2(rnd1_big.numpy(), 10)
%%timeit -n10
calc_tf_func2(rnd1_big, 10)
While TensorFlow is independent of the Python control flow, it has its own functions for that, mainly tf.cond for conditionals and tf.while_loop for loops.
def true_fn():
    return 1.

def false_fn():
    return 0.
value = tf.cond(tf.greater(111., 42.), true_fn=true_fn, false_fn=false_fn)
value
We can create while loops in order to have some kind of repetitive task
def cond(x, y):
    return x > y

def body(x, y):
    return x / 2, y + 1

x, y = tf.while_loop(cond=cond,
                     body=body,
                     loop_vars=[100., 1.])
x, y
We can also map a function on each element. While this is not very efficient, it allows for high flexibility.
%%timeit -n1 -r1
tf.map_fn(tf.math.sin, rnd1_big[:min(10000, nevents)]) # takes a long time in full, 40+ secs
# strongly reduced data sample! Already takes seconds
That took a whole lot of time (remember, the sample is strongly reduced!). That's because if we run map_fn
in eager mode (i.e. not inside a tf.function
decorated function), it falls back to pure Python: a for-loop.
In comparison, vectorized_map
always compiles the function.
%%timeit -n1 -r1
tf.vectorized_map(tf.math.sin, rnd1_big) # can speedup things sometimes
@tf.function(autograph=False)
def do_map(func, tensor):
    return tf.map_fn(func, tensor)
do_map(tf.math.sin, rnd1_big)
@tf.function(autograph=False)
def do_map_vec(func, tensor):
    return tf.vectorized_map(func, tensor)
do_map_vec(tf.math.sin, rnd1_big)
%%timeit -n1
do_map(tf.math.sin, rnd1_big)
%%timeit -n1
do_map_vec(tf.math.sin, rnd1_big)
%%timeit -n100
tf.math.sin(rnd1_big)
As we can see, the generic mapping is surely not optimal. However, it works "always". vectorized_map
, on the other hand, gives a huge speedup and performs nearly as well as the native function! However, while this works nicely for this case, its applications are limited and depend heavily on the use-case; more complicated examples can easily result in a longer runtime and huge memory consumption. Caution is therefore advised when using this function.
TensorFlow allows us to calculate gradients automatically (automatic differentiation).
var2 = tf.Variable(2.)
with tf.GradientTape() as tape:
    tape.watch(var2)  # actually watches all variables already by default
    y = var2 ** 3
y
grad = tape.gradient(y, var2)
grad
This allows us to do many things with gradients, e.g. solve differential equations.
Following are a few ideas to check whether you understood things.
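One of them, as a small sketch (var3 is a hypothetical variable, not part of the examples above): tapes can be nested to obtain higher-order derivatives.
var3 = tf.Variable(3.)
with tf.GradientTape() as tape_outer:
    with tf.GradientTape() as tape_inner:
        y = var3 ** 3
    dy_dx = tape_inner.gradient(y, var3)  # 3 * x^2 -> 27
d2y_dx2 = tape_outer.gradient(dy_dx, var3)  # 6 * x -> 18
d2y_dx2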
We talked about the graph back and forth, but where is it?
The graph can be retrieved from a function that has already been traced.
concrete_func = calc_tf_func2.get_concrete_function(rnd1, 2)
concrete_func
graph = concrete_func.graph
graph
ops = graph.get_operations()
ops
log_op = ops[-6]
log_op
log_op.outputs
op_inputs_mean = ops[-4].inputs
op_inputs_mean
log_op.outputs[0] is op_inputs_mean[0]
The output of the log operation is the input to the mean operation! We can simply walk along the graph here. TensorFlow Graphs are no magic: they are simple objects that store their inputs, their outputs and their operation. That's it!
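As a small sketch (the operation names depend on the traced function), we can walk over the first few operations and print how they are wired together:
for op in ops[:5]:
    print(op.type, op.name, [inp.name for inp in op.inputs], [out.name for out in op.outputs])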