I came across some fresh and exciting data science tools in the past few days and felt very inspired to try them.

One package that has interested me for a long time is bokeh, as it almost perfectly matches the jupyter notebook with its interactive capabilities, and I definitely wanted to try an alternative to the excellent matplotlib package that I use all the time. Another really neat thing is that *bokeh* based documents stay alive even when they have been persisted into a plain html file, as all data necessary for the visualization is contained within that file.

Another package is datashader. The examples visualizing the taxi trips in Manhattan as well as the census data for the U.S. are very impressive, so I definitely wanted to play around with that as well.

Finally, as I am one of the big data guys, I was really impressed by dask and its distributed backend, which allows spreading the work necessary for large computations over a cluster of nodes via a common and well-known mapping api.

So, how to start exploring all this?

Following the provided examples is usually one way to learn about the capabilities. However, they are often somewhat specialized and minimalist, which makes it hard to transfer them directly to current real-world tasks. What usually helps me is to solve a well-known problem using the new tools in order to learn them.

As I've lately done some work on accelerating numerical computations within python using numba, with the example of speeding up the calculation of the Mandelbrot Set, it was quite obvious that this was the right path here as well.

So I ended up with a jupyter notebook to calculate the famous fractal again, this time using `bokeh` for displaying the image and adding interaction, `datashader` as a callback to the zoom events performing the `numba` accelerated calculations for every new selection, and the `distributed` backend to spread the work over a cluster of 160 CPUs.

That was fun!

And here is how this works.

It obviously all starts with importing the necessary modules. There is `numpy` and `numba` for the maths, `bokeh` and `datashader` for the plotting, and `distributed` for scaling out the work.

In [1]:

```
from functools import partial
import numpy as np
import numba
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
import datashader as ds
import datashader.transfer_functions as tf
from datashader.bokeh_ext import InteractiveImage
import xarray as xr
from distributed import Executor
output_notebook()
```

The first step is to define the initial view of the *Mandelbrot Set*, with the x coordinates defining the boundaries on the real axis and the y coordinates those on the imaginary one. A rather large value for the maximum number of iterations ensures enough work to be distributed on a cluster of nodes and enough detail when zooming in. Feel free to adjust this number to suit your environment.

In [2]:

```
xmin=-2.5
xmax=1.0
ymin=-1.25
ymax=1.25
maxiterations=np.uint64(5000)
```

The executor is the central instance organizing a pool of workers. It consists of a bunch of services, two of them being most prominent. The first one is the scheduler, which delegates the work packages to the cluster, and the second is the worker instance running on each node, which does all the heavy computations. Whilst workers may come and go within a `distributed` environment, the scheduler needs to be up and running, and remain accessible all the time in order to keep the system operable. The `distributed` package comes with a very useful helper script called `dask-ssh`, which spawns all necessary components on the nodes, given you have transparent (passwordless) access to all of them.

The command

```
dask-ssh linpl00{00,01,02,03,04,05,06,07,08,09,10} --nprocs 16
```

for example starts the scheduler instance on the first node of the cluster named `linpl0001`

and 16 instances of workers on all ten nodes of the cluster resulting in 160 workers all in all. Adjusting the `--nprocs`

parameter helps if the code you are distributing is not able to release the GIL. This trick is pretty comparable to using the `multiprocessing`

module instead of native `threading`

.

If you're not able to use a cluster of nodes as a backend for the computations, simply call the Executor with no arguments. This will start a scheduler and a worker instance in a separate thread on your computer. But of course this removes one dimension of fun from the whole thing. Keep that in mind.
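If you just want to see the map/gather pattern without any cluster at all, here is a minimal sketch using the standard library's `concurrent.futures` as a local stand-in for the `distributed` api (the names `chunked_square` and `chunks` are purely illustrative and not part of the notebook):

```python
from concurrent.futures import ThreadPoolExecutor

def chunked_square(chunk):
    # stand-in for chunked_mandel: process one chunk of inputs at a time
    return [x * x for x in chunk]

chunks = [[1, 2], [3, 4], [5, 6]]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() schedules one task per chunk and yields results in order,
    # just like mapping chunks onto the distributed scheduler
    results = list(pool.map(chunked_square, chunks))

print(results)  # [[1, 4], [9, 16], [25, 36]]
```

The ordering guarantee of `map` is what makes it possible to reassemble the rows of the image later without any bookkeeping.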

In [3]:

```
e=Executor("linpl0001:8786")
e
```

Out[3]:

<Executor: scheduler="linpl0001:8786" processes=160 cores=160>

`distributed` also serves a *status* page on the scheduler node behind port *8787*, which lets you follow what the scheduler is currently doing. This would be http://linpl0001:8787/status in my case.

The following function is the central function applied to each coordinate in the selected part of the complex number plane. A coordinate is part of the *Mandelbrot Set* if the series of numbers created by the iterations does not diverge. This is considered to be the case if the absolute value of the current result never exceeds 2 within the maximum number of iterations.
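The criterion can be checked on a few sample points with a plain-python version of the loop (`mandel_py` is just an illustrative name, not part of the notebook; the accelerated version follows the same logic):

```python
def mandel_py(c, maxiterations):
    # iterate z -> z**2 + c and return the iteration at which |z| exceeds 2;
    # if that never happens, the point is considered part of the set
    z = c
    for iteration in range(maxiterations):
        if abs(z) > 2.0:
            return iteration
        z = z**2 + c
    return maxiterations

print(mandel_py(0j, 50))      # 50 -> 0 never diverges, it is in the set
print(mandel_py(3 + 0j, 50))  # 0  -> |3| > 2 right away, not in the set
```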

With `@numba.jit` the pure python function is compiled into native machine code, speeding up the calculation a lot and also allowing the GIL to be released during the calculations.

In [4]:

```
@numba.jit("uint64(complex128,uint64)", nopython=True, nogil=True)
def mandel(c, maxiterations):
    z=c
    for iteration in range(maxiterations):
        if abs(z)>2.0:
            return iteration
        z=z**2+c
    return maxiterations
```

The `chunked_mandel` function below is the function that is actually called by the workers on the several nodes. It receives a part of the array of complex numbers as its input and returns the number of iterations done to check whether each coordinate is part of the *Mandelbrot Set*. This wrapper function helps to work on a split input array and therefore supports distributing the work load.

The `partial` helper is used to create a function with one parameter - the maximum number of iterations in this case - being fixed. This eases satisfying the calling convention for the mapping api we will be using later on.

In [5]:

```
mandel_p=partial(mandel, maxiterations=maxiterations)
def chunked_mandel(chunk):
    return [mandel_p(c) for c in chunk]
```

The `update_image` function is the callback function for datashader's `InteractiveImage`, which is triggered on every zoom action, for example. Within this function the current view of the *Mandelbrot Set* is calculated. To do this, two arrays `x` and `y` reflecting the resolution of the plot (`w` x `h`) are created for the real and imaginary axis. Based on these axes the array of complex numbers `c` is created, covering the current selection of the complex plane that shall be visualized. To support distributed calculations, a 1-dimensional view `cr` on the complex array is created, which is split into a list of chunks `cc` afterwards, with each chunk being one row of the original array `c`. By doing so every worker will perform the calculations for one row in the complex plane at a time, which is triggered via the `map` call applying the `chunked_mandel` function to each element in the list of chunks. The results can afterwards be collected via `gather` in the correct order.
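The grid construction and the row-wise chunking can be tried in isolation with plain `numpy` (tiny toy sizes `w=4`, `h=3` here, chosen only for illustration):

```python
import numpy as np

w, h = 4, 3
x = np.linspace(-2.5, 1.0, w, dtype=np.float64)
y = np.linspace(-1.25, 1.25, h, dtype=np.float64)

# broadcasting x against the column vector y builds an (h, w) grid
# of complex coordinates covering the selected part of the plane
c = x + y[:, None] * 1j
assert c.shape == (h, w)

# flatten and split into h chunks -- each chunk is one row of c,
# i.e. one work package for a worker
cr = c.ravel()
cc = np.array_split(cr, h)
assert len(cc) == h

# stacking the chunks back restores the original grid, which is why
# gathering the results in order reproduces the image row by row
assert np.array_equal(np.vstack(cc), c)
```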

The new resulting image is created using `datashader`'s `interpolate` transfer function, which expects an `xarray` data structure as input. That structure is itself created from the gathered results, transformed into a `numpy.array` by vertically stacking all intermediate results for the rows just calculated.

The color mapping has been chosen from `blue` over `red` and `yellow` to `black` for few to the maximum number of iterations, whilst the mapping function used for the colorization is logarithmic. Therefore the colors do not directly reflect the number of iterations, but give an impression of the order of magnitude the resulting function for the `Mandelbrot Set` calculation had after the last iteration. This colorization strategy is a common concept for achieving smoother color transitions in the view.
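The effect of the logarithmic mapping can be sketched with plain `numpy`; this is only a rough analogue of the `how="log"` option, not datashader's exact implementation:

```python
import numpy as np

iterations = np.array([1, 10, 100, 1000, 5000], dtype=np.float64)

# linear normalization: almost the entire color range is taken up
# by the largest iteration counts
linear = iterations / iterations.max()

# logarithmic normalization: spreads the small iteration counts apart,
# which is what produces the smoother color gradients near the boundary
logmapped = np.log1p(iterations) / np.log1p(iterations.max())
```

Note how `logmapped` assigns visibly different values to 1, 10 and 100 iterations, while `linear` squashes all of them close to zero.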

In [6]:

```
def update_image(x_range, y_range, w, h):
    xmin, xmax=x_range
    ymin, ymax=y_range
    x=np.linspace(xmin, xmax, w, dtype=np.float64)
    y=np.linspace(ymin, ymax, h, dtype=np.float64)
    c=x+y[:, None]*1j
    cr=c.ravel()
    cc=np.array_split(cr, h)
    futures=e.map(chunked_mandel, cc)
    dresults=e.gather(iter(futures))
    iterations=list(dresults)
    image=tf.interpolate(xr.DataArray(np.vstack(iterations)),
                         cmap=["blue", "red", "yellow", "black"],
                         how="log")
    return image
```

Finally the figure to display the results is created. I've chosen a view spanning the whole width of the jupyter notebook and selected controls for panning, zooming and resetting the view. The initial view coordinates define the initial axis ranges, and the look of the plot has been adjusted a bit.

After that, the `InteractiveImage` function is called with the plot `p` and the `update_image` function as its callback, allowing interactive exploration of the `Mandelbrot Set` in the jupyter notebook.

In [7]:

```
tools="pan,wheel_zoom,box_zoom,reset"
p = figure(tools=tools, plot_width=980, plot_height=600,
           x_range=(xmin, xmax), y_range=(ymin, ymax),
           outline_line_color=None, background_fill_color='blue',
           min_border=0, min_border_left=0, min_border_right=0,
           min_border_top=0, min_border_bottom=0)
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
InteractiveImage(p, update_image)
```

Out[7]: