First we create some artificial data to work with. This creates a couple of CSV files in a directory, a common situation.
from utils import accounts_csvs
accounts_csvs(3, 1000000, 500)
import os, glob
filenames = os.path.join('data', 'accounts.*.csv')
print('Files created:\n%s' % '\n'.join(glob.glob(filenames)))
Dask dataframes can be created in many different ways. For more information, see the Dask DataFrame documentation.
Here we'll be using dd.read_csv. This looks almost exactly like the pandas.read_csv function, with a few differences:
- It accepts a glob string (accounts.*.csv) instead of just a single filename
- It returns a lazy Dask DataFrame rather than loading the data immediately
import dask.dataframe as dd
df = dd.read_csv(filenames)
df
df.head()
%time len(df)
# Pretty print out the Dataframe methods
import textwrap
print('\n'.join(textwrap.wrap(str([f for f in dir(df) if not f.startswith('_')]))))
df.dtypes
df.divisions
df.npartitions
df.visualize()
df.amount.mean().compute()
df.groupby(df.names).amount.mean().compute()
In Pandas, the index associates a value with each record/row of your data. Operations that align with the index, like loc, can be a bit faster as a result.
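As a small illustration (with a toy frame, not the accounts data; the names here are made up), an index lets pandas select rows by value with loc rather than by filtering a column:

```python
import pandas as pd

# A tiny frame indexed by name; loc looks rows up by index label
frame = pd.DataFrame({'amount': [10, 20, 30]},
                     index=['Alice', 'Bob', 'Alice'])

# Selects both 'Alice' rows directly via the index
print(frame.loc['Alice'].amount.sum())  # 40
```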
In dask.dataframe this index becomes even more important. A Dask DataFrame consists of several Pandas DataFrames. These dataframes are separated along the index by value. For example, when working with time series we may partition our large dataset by month.
By partitioning our data semantically (e.g. by month) rather than into fixed-size blocks (as in dask.array), we can be more efficient in operations that select along the index. For example, loc along a partitioned index only needs to look at the single partition that contains the requested data, since Dask can infer which partition holds a value from the divisions. Without divisions, every partition must be inspected, because Dask has no idea which one contains the value.
If the divisions are unknown, all the values in .divisions will be None.
df.divisions
df.known_divisions
However, if we set the index to some new column, then Dask will divide our data roughly evenly along that column and create new divisions for us. Warning: set_index triggers immediate computation.
%time df2 = df.set_index('names')
df2.divisions
df2.known_divisions
Operations like loc then only need to load the relevant partitions.
df2.loc['Edith'].amount.mean().compute()