The Planetary Computer Hub is a JupyterHub paired with Dask Gateway for easily creating Dask clusters to distribute your computation on a cluster of machines.
Don't forget the client = cluster.get_client() line. That's what actually ensures the cluster will be used for computations using Dask. Otherwise, you'll end up using Dask's local scheduler. This will run the computation on using multiple threads on a single machine, rather than the cluster. When you're using a cluster, make sure to always use the Dashboard (more below). If you aren't seeing any tasks in the dashboard, you might have forgotten to create a Dask client.
The Dask Dashboard provides invaluable information on the activity of your cluster. Clicking the "Dashboard" link above will open the Dask dashboard a new browser tab.
We also include the dask-labextension for laying out the Dask dashboard as tabs in the Jupyterlab workspace.
To using the dask-labextension, copy the "Dashboard" address from the cluster repr, click the orange Dask logo on the lefthand navigation bar, and paste the dashboard address
You can close your cluster, freeing up its resources, by calling cluster.close().
Dask will add workers as necessary when a computation is submitted. As an example, we'll compute the minimum daily temperature averaged over all of Hawaii, using the Daymet dataset.
dask_gateway.GatewayCluster creates a cluster with some default settings, which might not be appropriate for your workload. For example, we might have a memory-intensive workload which requires more memory per CPU core. Or we might need to set environment variables on the workers.
To customize your cluster, create a Gateway object and then customize the options.
In a Jupyter Notebook, you can use the HTML widget to customize the options. Or using Python you can adjust the values programmatically. We'll ask for 16GiB of memory per worker.
In [8]:
cluster_options["worker_memory"]=16
Now create your cluster. Make sure to pass the cluster_options object to gateway.new_cluster.
The Dask documentation has much more information on using Dask for scalable computing. This JupyterHub deployment uses Dask Gateway to manage creating Dask clusters.