Dask is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
Dask can scale up to your full laptop capacity and out to a cloud cluster.
In the following lines of code, we're reading the NYC taxi cab data from 2015 and finding the mean tip amount. Don't worry about the code; this is just a quick demonstration. We'll go over all of this in the next notebook. :)
Note for learners: This might be heavy for Binder.
Note for instructors: Don't forget to open the Dask Dashboard!
import dask.dataframe as dd
from dask.distributed import Client

# Start a local Dask cluster and connect a client to it
client = Client()
client

# Lazily read the Parquet files from S3, keeping only the two columns we need
ddf = dd.read_parquet(
    "s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet",
    columns=["passenger_count", "tip_amount"],
    storage_options={"anon": True},
)

# Nothing has run yet; .compute() triggers the actual work on the cluster
result = ddf.groupby("passenger_count").tip_amount.mean().compute()
result
Dask provides multi-core and distributed parallel execution on larger-than-memory datasets.
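As a small hedged sketch of larger-than-memory execution (assuming Dask is installed), the array below would be 800 MB as a single NumPy array, but Dask splits it into chunks and only materializes chunks as needed:

```python
import dask.array as da

# A 10,000 x 10,000 array of ones, split into 1,000 x 1,000 chunks.
# Nothing is computed yet -- Dask only records a task graph.
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))
total = x.sum()          # still lazy: a recipe, not a result

# .compute() runs the graph, processing chunks in parallel and
# never holding the full array in memory at once.
print(total.compute())   # 100000000.0
```

Each chunk is an ordinary NumPy array, which is why the API feels familiar.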
We can think of Dask's APIs (also called collections) at a high and a low level:
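To make the high/low split concrete, here is a minimal sketch (assuming Dask is installed) pairing a high-level collection with the low-level `dask.delayed` interface:

```python
import dask
import dask.array as da

# High-level collection: looks like NumPy, but builds a task graph lazily.
x = da.arange(10, chunks=5)
print(x.sum().compute())  # 45

# Low-level interface: dask.delayed wraps plain Python functions lazily,
# so you can parallelize custom code that doesn't fit a collection.
@dask.delayed
def inc(i):
    return i + 1

@dask.delayed
def add(a, b):
    return a + b

result = add(inc(1), inc(2))  # builds a small task graph; inc calls can run in parallel
print(result.compute())       # 5
```

The high-level collections are themselves built on the same task-graph machinery the low-level interfaces expose.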
Most of the time when you are using Dask, you will be using a distributed scheduler, which exists in the context of a Dask cluster. The Dask cluster is structured as:
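A minimal local sketch of that client/scheduler/worker structure (assuming the `distributed` package is installed; a `LocalCluster` stands in here for a real multi-machine deployment):

```python
from dask.distributed import Client, LocalCluster

# One scheduler coordinating a pool of workers, all on this machine
# (processes=False keeps everything in-process for a quick demo).
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)  # the client submits work to the scheduler

# The scheduler farms this task out to one of the workers.
future = client.submit(sum, range(100))
print(future.result())  # 4950

client.close()
cluster.close()
```

Swapping `LocalCluster` for a cloud or HPC deployment class changes where the workers run, but the client code stays the same.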
In addition to the core Dask library and its distributed scheduler, the Dask ecosystem connects several additional initiatives, including:
Community libraries that have built-in dask integrations like:
Dask deployment libraries
... When we talk about the Dask project we include all these efforts as part of the community.
git clone https://github.com/dask/dask-tutorial
and then install necessary packages. There are three different ways to achieve this, pick the one that best suits you, and *only pick one option*. They are, in order of preference:
In the main repo directory
conda env create -f binder/environment.yml
conda activate dask-tutorial
You will need the following core libraries
conda install -c conda-forge ipycytoscape jupyterlab python-graphviz matplotlib zarr xarray pooch pyarrow s3fs scipy dask distributed dask-labextension
Note that these options will alter your existing environment, potentially changing the versions of packages you already have installed.
Each section is a Jupyter notebook. There's a mixture of text, code, and exercises.
Overview - Dask's place in the universe.
Dataframe - parallelized operations on many pandas dataframes spread across your cluster.
Array - blocked numpy-like functionality with a collection of numpy arrays spread across your cluster.
Delayed - the single-function way to parallelize general Python code.
Deployment/Distributed - Dask's scheduler for clusters, with details of how to view the UI.
Distributed Futures - non-blocking results that compute asynchronously.
Conclusion
If you haven't used JupyterLab, it's similar to the Jupyter Notebook. If you haven't used the Notebook, the quick intro is:
- `Enter` to edit a cell (like this markdown cell)
- `Esc` to change to command mode
- `Shift+Enter` to execute a cell and move to the next cell

The toolbar has commands for executing, converting, and creating cells.
Each notebook will have exercises for you to solve. You'll be given a blank or partially completed cell, followed by a hidden cell with a solution. For example:
Print the text "Hello, world!".
# Your code here
The next cell has the solution. Click the ellipsis to expand the solution, and always make sure to run the solution cell, in case later sections of the notebook depend on the output from the solution.
print("Hello, world!")
The `dask` tag on Stack Overflow, for usage questions