IPython is a tool for interactive and exploratory computing. We have seen that IPython's kernel provides a mechanism for interactive remote computation, and we have extended this same mechanism for interactive remote parallel computation, simply by having multiple kernels.
At a high level, there are three basic components to parallel IPython:
These components live in the IPython.parallel
package and are installed with IPython.
This leads to a usage model where you can create as many engines as you want per compute node, and then control them all from your clients via a central 'controller' object that encapsulates the hub and schedulers:
The Engine is simply a remote Python namespace where your code is run, and is identical to the kernel used elsewhere in IPython.
It can do all the magics, pylab integration, and everything else you can do in a regular IPython session.
The Controller is a collection of processes:
Together, these processes provide a single connection point for your clients and engines. Each Scheduler is a small GIL-less function in C provided by pyzmq (the Python load-balancing scheduler being an exception).
The Hub can be viewed as an über-logger, which monitors all communication between clients and engines, and can log to a database (e.g. SQLite or MongoDB) for later retrieval or resubmission. The Hub is not involved in execution in any way, and a slow Hub cannot slow down submission of tasks.
All actions that can be performed on the engine go through a Scheduler. While the engines themselves block when user code is run, the schedulers hide that from the user to provide a fully asynchronous interface to a set of engines.
There is one primary object, the Client
, for connecting to a cluster.
For each execution model there is a corresponding View
,
and you determine how your work should be executed on the cluster by creating different views
or manipulating attributes of views.
The two basic views:
DirectView
class for explicitly running code on particular engine(s).LoadBalancedView
class for destination-agnostic scheduling.You can use as many views of each kind as you like, all at the same time.
The quickest way to get started is to visit the 'clusters' tab, and start some engines with the 'default' profile.
To follow along with this tutorial, you will need to start the IPython
controller and some IPython engines. The simplest way of doing this is
visit the 'clusters' tab,
and start some engines with the 'default' profile,
or to use the ipcluster
command:
$ ipcluster start -n 4
There isn't time to go into it here, but ipcluster can be used to start engines and the controller with various batch systems including:
More information on starting and configuring the IPython cluster in the IPython.parallel docs.
Once you have started the IPython controller and one or more engines, you are ready to use the engines to do something useful.
To make sure everything is working correctly, let's do a very simple demo:
from IPython import parallel
rc = parallel.Client()
rc.block = True
rc.ids
[0, 1, 2, 3]
Let's define a simple function
def mul(a,b):
return a*b
mul(5,6)
30
dview = rc[:]
dview
<DirectView [0, 1, 2, 3]>
e0 = rc[0]
e0
<DirectView 0>
What does it look like to call this function remotely?
Just turn f(*args, **kwargs)
into view.apply(f, *args, **kwargs)
!
e0.apply(mul, 5, 6)
30
And the same thing in parallel?
dview.apply(mul, 5, 6)
[30, 30, 30, 30]
Python has a builtin map for calling a function with a sequence of arguments
map(mul, range(1,10), range(2,11))
[2, 6, 12, 20, 30, 42, 56, 72, 90]
So how do we do this in parallel?
dview.map(mul, range(1,10), range(2,11))
[2, 6, 12, 20, 30, 42, 56, 72, 90]
We can also run code in strings with execute
dview.execute("import os")
dview.execute("a = os.getpid()")
<AsyncResult: finished>
And treat the view as a dict lets you access the remote namespace:
dview['a']
[19511, 19525, 19526, 19512]
When you do async execution, the calls return an AsyncResult object immediately
def wait_here(t):
import time, os
time.sleep(t)
return os.getpid()
ar = dview.apply_async(wait_here, 2)
ar.get()
[19511, 19525, 19526, 19512]