Dask DataFrames¶

Dask dataframes are blocked Pandas dataframes

Dask Dataframes coordinate many Pandas dataframes, partitioned along an index. They support a large subset of the Pandas API.

Start Dask Client for Dashboard¶

Starting the Dask Client is optional. It provides a dashboard that is useful for gaining insight into the computation.

The link to the dashboard will become visible when you create the client below. We recommend having it open on one side of your screen while using your notebook on the other side. Arranging your windows this way can take some effort, but seeing both at the same time is very useful when learning.

In [ ]:

from dask.distributed import Client, progress
client = Client(n_workers=2, threads_per_worker=2, memory_limit='1GB')
client

Create Random Dataframe¶

We create a random timeseries of data with the following attributes:

It stores a record for every 10 seconds of the year 2000
It splits that year by month, keeping every month as a separate Pandas dataframe
Along with a datetime index it has columns for names, ids, and numeric values
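
As a rough size estimate for the attributes listed above, the arithmetic can be sketched with the standard library. This assumes the span from 2000-01-01 up to (but not including) 2000-12-31 is sampled every 10 seconds. Whether the generator includes the end point is an assumption here, so treat the total as approximate.

```python
from datetime import date, timedelta

start, end = date(2000, 1, 1), date(2000, 12, 31)

# Whole days between the two dates (2000 is a leap year, and the end date is excluded)
days = (end - start).days

# One record every 10 seconds
records_per_day = timedelta(days=1) // timedelta(seconds=10)

total = days * records_per_day
print(days, records_per_day, total)
```

That is roughly three million rows spread across monthly partitions, which is small enough to experiment with on a laptop.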

In [ ]:

import dask.dataframe as dd
df = dd.demo.make_timeseries('2000-01-01', '2000-12-31', freq='10s', partition_freq='1M',
                             dtypes={'name': str, 'id': int, 'x': float, 'y': float})
df

In [ ]:

df.head(3)

In [ ]:

df.dtypes

Use Standard Pandas Operations¶

Most common Pandas operations work identically on Dask dataframes.

In [ ]:

df2 = df[df.y > 0]
df3 = df2.groupby('name').x.std()
df3

Call .compute() when you want your result as a Pandas dataframe or series. Until then, Dask operations are lazy and only describe the work to be done.

If you started Client() above then you may want to watch the status page during computation.

In [ ]:

df3.compute()

Persist data in memory¶

If you have the available RAM for your dataset then you can persist data in memory.

This allows future computations to be much faster.

In [ ]:

df = df.persist()

Time Series Operations¶

Because we have a datetime index, time-series operations work efficiently.

In [ ]:

%matplotlib inline

In [ ]:

df[['x', 'y']].resample('1W').mean().head()

In [ ]:

df[['x', 'y']].resample('1W').mean().compute().plot()

In [ ]:

df[['x', 'y']].rolling(window='7d').mean().head()
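
To see what these two operations compute, here is a plain Pandas sketch on a tiny daily series. The data is made up, and we assume the Dask calls above follow the same semantics as their Pandas counterparts.

```python
import pandas as pd

# Fourteen daily values starting on Monday, 2000-01-03 (illustrative data only)
index = pd.date_range("2000-01-03", periods=14, freq="D")
s = pd.Series(range(14), index=index, dtype="float64")

# Weekly bins end on Sunday by default, so each bin holds Monday through Sunday
weekly = s.resample("1W").mean()

# A time-based rolling window averages each row with the rows from the 3 days up to it
rolled = s.rolling(window="3D").mean()

print(weekly)
print(rolled.head())
```

Resampling produces one row per bin, while rolling keeps one row per original record.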

Random access is cheap along the index.

In [ ]:

df.loc['2000-05-05']

In [ ]:

%time df.loc['2000-05-05'].compute()

Set Index¶

Data is sorted by the index column. This allows for faster access, joins, groupby-apply operations, etc. However, sorting data can be costly to do in parallel, so setting the index is important but should be done only infrequently.

In [ ]:

df = df.set_index('name')
df

Because computing this dataset is expensive and we can fit it in our available RAM, we persist the dataset to memory.

In [ ]:

df = df.persist()

Dask now knows where all data lives, indexed cleanly by name. As a result, operations like random access are cheap and efficient.

In [ ]:

%time df.loc['Alice'].compute()

Groupby Apply with Scikit-Learn¶

Now that our data is sorted by name we can easily do operations like random access on name, or groupby-apply with custom functions.

Here we train a different Scikit-Learn linear regression model on each name.

In [ ]:

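from sklearn.linear_model import LinearRegression

def train(partition):
    # Fit y as a linear function of x for the rows of a single group
    est = LinearRegression()
    est.fit(partition[['x']].values, partition.y.values)
    # Return the fitted estimator object itself
    return est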

In [ ]:

df.groupby('name').apply(train, meta=object).compute()
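
The result above is a series of fitted estimator objects, one per name. The same per-group pattern can be sketched in plain Pandas, where each group's function returns a single number instead. Here NumPy's least-squares `polyfit` stands in for Scikit-Learn to keep the example light. The toy data and the helper name `fit_slope` are hypothetical.

```python
import numpy as np
import pandas as pd

# Two groups with known linear relationships (illustrative data only)
pdf = pd.DataFrame({
    "name": ["Alice"] * 4 + ["Bob"] * 4,
    "x": [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0],
})
pdf["y"] = np.where(pdf["name"] == "Alice", 2 * pdf["x"] + 1, -pdf["x"])

def fit_slope(group):
    # Degree-1 least-squares fit; the first coefficient is the slope
    return np.polyfit(group["x"].to_numpy(), group["y"].to_numpy(), 1)[0]

# One fit per name, selecting only the columns the function needs
slopes = pdf.groupby("name")[["x", "y"]].apply(fit_slope)
print(slopes)
```

Returning a scalar per group gives an ordinary numeric series, which is often easier to inspect than a series of model objects.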