This notebook investigates performance parallelizing scikit-image operations. We find that Dask provides a very small speedup, despite consuming 400% CPU. We benchmark various components of the computation and find that we are likely bound by something other than CPU processing, like memory access.

This work was done in collaboration with the following scikit-image developers at the SciPy 2017 sprints

- Emmanuelle Gouillart
- Stefan Van der Walt
- Nelle Varoquaux

This data is courtesy of Emmanuelle Gouillart: 250MB download

In [1]:

```
import numpy as np
dat = np.fromfile('Al8Cu_1000_g13_t4_200_250.vol',
dtype=np.float32).reshape((50, 1104, 1104))
dat /= abs(dat).max()
```

In [2]:

```
dat.shape
```

Out[2]:

(50, 1104, 1104)

In [3]:

```
%matplotlib inline
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10, 10))
plt.imshow(dat[40, :, :], cmap='gray')
```

Out[3]:

<matplotlib.image.AxesImage at 0x7f55f84ff710>

Lets look at a common operation, smoothing the image with a gaussian filter. We find that calling this normally on the image takes around 2.5 seconds.

In [4]:

```
import skimage.filters
%time _ = skimage.filters.gaussian(dat, sigma=3)
```

CPU times: user 2.24 s, sys: 64 ms, total: 2.3 s Wall time: 2.3 s

We split this array into four chunks in the large spatial dimensions, leaving the short dimension (50) full.

We then map the `skimage.filters.gaussian`

function on each block, providing an overlap of 9, which should provide enough space for the gaussian filter to be smooth over chunk transitions.

In [5]:

```
import dask.array as da
```

In [6]:

```
%%time
x = da.from_array(dat, chunks=(50, dat.shape[1] // 2, dat.shape[2] // 2), name=False)
y = x.map_overlap(skimage.filters.gaussian, depth=9, sigma=3, boundary='none')
```

CPU times: user 4 ms, sys: 0 ns, total: 4 ms Wall time: 5.18 ms

In [7]:

```
y
```

Out[7]:

dask.array<trim, shape=(50, 1104, 1104), dtype=float64, chunksize=(50, 552, 552)>

In [8]:

```
%time _ = y.compute()
```

CPU times: user 5.49 s, sys: 668 ms, total: 6.16 s Wall time: 1.94 s

We see a modest speedup (20-30%) which is far below the 300% speedup we might expect on our four-core machine.

The rest of this notebook tries to identify the burdens involved in this computation.

In [9]:

```
%%time
x = da.from_array(dat, chunks=(50, dat.shape[1] // 2, dat.shape[2] // 2), name=False)
y = x.map_overlap(skimage.filters.gaussian, depth=9, sigma=3, boundary='none')
```

CPU times: user 4 ms, sys: 0 ns, total: 4 ms Wall time: 6.46 ms

There are about 100 tasks for four chunks. We can probably bring this down.

In [10]:

```
len(y.dask)
```

Out[10]:

109

In [11]:

```
y.visualize(optimize_graph=True)
```

Out[11]: