First we create some artificial data to work with. This creates a couple of CSV files in a directory, a common situation.
from utils import accounts_csvs
accounts_csvs(3, 1000000, 500)
import os, glob
filenames = os.path.join('data', 'accounts.*.csv')
print('Files created:\n%s' % '\n'.join(glob.glob(filenames)))
Dask dataframes can be created in many different ways. For more information, see the Dask DataFrame documentation.
Here we'll be using dd.read_csv. This looks almost exactly like the pandas.read_csv function, with a few differences:
- It accepts a glob string (accounts.*.csv) instead of just a single filename
- It returns a lazy Dask DataFrame rather than loading the data immediately
import dask.dataframe as dd
df = dd.read_csv(filenames)
df
df.head()
%time len(df)
# Pretty print out the Dataframe methods
import textwrap
print('\n'.join(textwrap.wrap(str([f for f in dir(df) if not f.startswith('_')]))))
df.dtypes
df.divisions
df.npartitions
df.visualize()
df.amount.mean().compute()
df.groupby(df.names).amount.mean().compute()
In Pandas, the index associates a value with each record/row of your data. Operations that align with the index, like loc, can be a bit faster as a result.
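As a small illustration (with a toy frame, not the accounts data; the names here are made up), an index lets pandas select rows by value with loc rather than by filtering a column:

```python
import pandas as pd

# A tiny frame indexed by name; loc looks rows up by index label
frame = pd.DataFrame({'amount': [10, 20, 30]},
                     index=['Alice', 'Bob', 'Alice'])

# Selects both 'Alice' rows directly via the index
print(frame.loc['Alice'].amount.sum())  # 40
```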
In dask.dataframe this index becomes even more important. A Dask DataFrame consists of several Pandas DataFrames. These dataframes are separated along the index by value. For example, when working with time series we may partition our large dataset by month.
By partitioning our data semantically (e.g. by month) rather than into fixed-size blocks (as in dask.array), we can be more efficient in operations that select along the index. For example, loc along a partitioned index only needs to look at the single partition that contains the requested data, since Dask can infer which partition holds a value from the divisions. Without divisions, every partition must be inspected, because Dask has no idea which one contains the value.
If the divisions are unknown, all the values in .divisions will be None.
df.divisions
df.known_divisions
However, if we set the index to some new column, then Dask will divide our data roughly evenly along that column and create new divisions for us. Warning: set_index triggers immediate computation.
%time df2 = df.set_index('names')
df2.divisions
df2.known_divisions
Operations like loc then only need to load the relevant partitions.
df2.loc['Edith'].amount.mean().compute()