I came across some fresh and exciting data science tools over the past few days and felt very inspired to try them.
One package that has interested me a lot for a long time is bokeh, as it almost perfectly matches the jupyter notebook with its interactive capabilities, and I definitely wanted to try an alternative to the excellent matplotlib package that I use all the time. Another really neat thing is that bokeh-based documents stay alive even when they have been persisted into a plain html file, as all data necessary for visualization is contained within that file.
Another package is datashader. The examples visualizing the taxi travels in Manhattan as well as the census data for the U.S. are very impressive. So I definitely wanted to play around with that as well.
Finally, as I am one of the big data guys, I was really impressed by dask and its distributed backend, which allows spreading the work necessary for large computations over a cluster of nodes via a very common and well-known mapping API.
So, how to start exploring all this?
Following the provided examples is usually one way to learn about a tool's capabilities. However, they are often rather specialized and minimalist, which makes it hard to transfer them directly to real-world tasks. What usually helps me learn is solving a well-known problem using the new tools.
As I've lately done some work on accelerating numerical computations within python using numba - using the example of speeding up the calculation of the Mandelbrot Set - it was quite obvious that this was the right path here as well.
So I ended up with a jupyter notebook to calculate the famous fractal again, this time using bokeh for displaying the image and adding interaction, datashader for triggering a callback on the zoom events that performs the numba-accelerated calculations for every new selection, and the distributed backend to spread the work over a cluster of 160 CPUs.
That was fun!
And here is how this works.
It obviously all starts with importing the necessary modules. There is numpy and numba for the maths, bokeh and datashader for the plotting, and distributed for scaling out work.
from functools import partial
import numpy as np
import numba
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
import datashader as ds
import datashader.transfer_functions as tf
from datashader.bokeh_ext import InteractiveImage
import xarray as xr
from distributed import Executor
output_notebook()
Next are some definitions. The following coordinates define the standard view of the Mandelbrot Set, with the x coordinates defining the boundaries on the real axis and the y coordinates the boundaries on the imaginary axis. A rather large value for the maximum number of iterations ensures enough work to be distributed on a cluster of nodes and enough detail when zooming in. Feel free to adjust this number to suit your environment.
xmin=-2.5
xmax=1.0
ymin=-1.25
ymax=1.25
maxiterations=np.uint64(5000)
The executor is the central instance organizing a pool of workers. It consists of several services, two of which are most prominent: the scheduler, which delegates the work packages to the cluster, and the worker instance running on each node, which does all the heavy computations. Whilst workers may come and go within a distributed environment, the scheduler needs to be up, running and accessible all the time in order to keep the system operable. The distributed package comes with a very useful helper script called dask-ssh which spawns all necessary components on the nodes, given you have transparent (passwordless) access to all of them.

The command

dask-ssh linpl00{00,01,02,03,04,05,06,07,08,09,10} --nprocs 16

for example starts the scheduler instance on the first node of the cluster, named linpl0001, and 16 worker instances on each of the ten nodes of the cluster, resulting in 160 workers all in all. Adjusting the --nprocs parameter helps if the code you are distributing is not able to release the GIL. This trick is pretty comparable to using the multiprocessing module instead of native threading.
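If dask-ssh does not fit your setup, the scheduler and the workers can also be started by hand. This is only a sketch, assuming the dask-scheduler and dask-worker scripts shipped with the distributed package and reusing my host name linpl0001 as an example:

```shell
# on the scheduler node (here: linpl0001); listens on port 8786 by default
dask-scheduler

# on every compute node, pointing the workers at the scheduler
dask-worker linpl0001:8786 --nprocs 16 --nthreads 1
```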
If you're not able to use a cluster of nodes as a backend for the computations, simply call the Executor with no arguments. This will start a scheduler and a worker instance in a separate thread on your computer. But this will, of course, remove one dimension of fun from the whole thing - keep that in mind.
e=Executor("linpl0001:8786")
e
<Executor: scheduler="linpl0001:8786" processes=160 cores=160>
The executor also allows you to visualize its current load and scheduling performance. Simply open the status page served by your scheduler node on port 8787 to follow what the scheduler is currently doing. This would be http://linpl0001:8787/status in my case.
The following function is the central function applied to each coordinate in the selected part of the complex number plane. A coordinate is part of the Mandelbrot Set if the series of numbers created by the iterations does not diverge. The series is considered to diverge as soon as the absolute value of the current result exceeds 2; if that has not happened after the maximum number of iterations, the coordinate is assumed to belong to the set.
With @numba.jit the pure python function is compiled into native machine code, speeding up the calculation a lot and also allowing the GIL to be released during the calculations.
@numba.jit("uint64(complex128,uint64)", nopython=True, nogil=True)
def mandel(c, maxiterations):
    z = c
    for iteration in range(maxiterations):
        if abs(z) > 2.0:
            return iteration
        z = z**2 + c
    return maxiterations
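As a quick sanity check, the same escape-time loop can be run as plain Python without the decorator - a throwaway sketch, not part of the notebook itself:

```python
def mandel_py(c, maxiterations):
    # same escape-time logic as mandel() above, minus the numba compilation
    z = c
    for iteration in range(maxiterations):
        if abs(z) > 2.0:
            return iteration
        z = z ** 2 + c
    return maxiterations

print(mandel_py(0 + 0j, 100))  # the origin never diverges -> 100
print(mandel_py(2 + 2j, 100))  # |2+2j| > 2 right away -> 0
```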
The chunked_mandel function below is the function that is actually called by the workers on the individual nodes. It receives a part of the array of complex numbers as its input and returns the number of iterations done to check whether each coordinate is part of the Mandelbrot Set. This wrapper function works on a split input array and therefore supports distributing the work load.

The partial helper is used to create a function with one parameter - the maximum number of iterations in this case - fixed. This eases satisfying the calling convention of the mapping API we will use later on.
mandel_p = partial(mandel, maxiterations=maxiterations)

def chunked_mandel(chunk):
    return [mandel_p(c) for c in chunk]
The update_image function is the callback function for datashader's InteractiveImage, which is triggered on every zoom action, for example. Within this function the current view of the Mandelbrot Set is calculated. To do this, two arrays x and y reflecting the resolution of the plot (w x h) are created for the real and imaginary axes. Based on these axes, the array of complex numbers c is created, covering the current selection of the complex plane that shall be visualized. To support distributed calculations, a one-dimensional view cr on the complex array is created, which is then split into a list of chunks cc, with each chunk being one row of the original array c. By doing so, every worker performs the calculations for one row of the complex plane at a time, triggered via the map call which applies the chunked_mandel function to each element in the list of chunks. The results can afterwards be collected in correct order via gather.
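The chunk/row correspondence can be checked locally with a tiny grid, using the builtin map as a stand-in for the executor's map - a sketch with made-up dimensions:

```python
import numpy as np

w, h = 4, 3                        # tiny made-up plot resolution
x = np.linspace(-2.0, 1.0, w)
y = np.linspace(-1.0, 1.0, h)
c = x + y[:, None] * 1j            # shape (h, w)

cr = c.ravel()                     # one-dimensional view, row after row
cc = np.array_split(cr, h)         # h chunks of length w

# every chunk is exactly one row of the original array
for chunk, row in zip(cc, c):
    assert np.array_equal(chunk, row)

# a plain map stands in for Executor.map; vstack restores the grid shape
rows = list(map(lambda chunk: [abs(z) for z in chunk], cc))
grid = np.vstack(rows)
assert grid.shape == (h, w)
```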
The new resulting image is created using datashader's interpolate transfer function, which expects an xarray data structure as input. This structure is itself created from the gathered results, transformed into a numpy array by vertically stacking the intermediate results for the rows just calculated.

The color mapping has been chosen to run from blue over red and yellow to black for few up to the maximum number of iterations, whilst the mapping function used for the colorization is logarithmic. Therefore the colors do not directly reflect the number of iterations, but rather give an impression of the order of magnitude of the iteration count at which the series diverged. This colorization strategy is a common way of achieving smoother color transitions in the view.
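Roughly speaking, a logarithmic mapping compresses large iteration counts like this - an illustration of the idea, not datashader's exact formula:

```python
import numpy as np

counts = np.array([1.0, 10.0, 100.0, 1000.0, 5000.0])  # made-up iteration counts
scaled = np.log1p(counts) / np.log1p(counts.max())     # normalized to [0, 1]
print(scaled)  # close neighbors at the low end get spread apart
```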
def update_image(x_range, y_range, w, h):
    xmin, xmax = x_range
    ymin, ymax = y_range
    x = np.linspace(xmin, xmax, w, dtype=np.float64)
    y = np.linspace(ymin, ymax, h, dtype=np.float64)
    c = x + y[:, None]*1j
    cr = c.ravel()
    cc = np.array_split(cr, h)
    futures = e.map(chunked_mandel, cc)
    dresults = e.gather(iter(futures))
    iterations = list(dresults)
    image = tf.interpolate(xr.DataArray(np.vstack(iterations)),
                           cmap=["blue", "red", "yellow", "black"],
                           how="log")
    return image
Finally the figure to display the results is created. I've chosen a view spanning the whole width of the jupyter notebook and selected controls for panning, zooming and resetting the view. The initial view coordinates define the initial axis ranges, and the look of the plot has been adjusted a bit.

After that the InteractiveImage function is called with the plot p and the update_image function as its callback, allowing interactive exploration of the Mandelbrot Set in the jupyter notebook.
tools="pan,wheel_zoom,box_zoom,reset"
p = figure(tools=tools, plot_width=980, plot_height=600,
           x_range=(xmin, xmax), y_range=(ymin, ymax),
           outline_line_color=None, background_fill_color='blue',
           min_border=0, min_border_left=0, min_border_right=0,
           min_border_top=0, min_border_bottom=0)
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
InteractiveImage(p, update_image)
Implementing an interactive Mandelbrot Set explorer using python, numba, bokeh, distributed and datashader may seem a bit over the top, but it results in a serious amount of fun. Of course explorers written in pure C++ and delivered as standalone applications are faster, but this implementation has its strengths as well: it is a working application, it helps in understanding several concepts of modern data visualization, and it simply scales.