Apache Parquet is a great default choice for a data serialization format in data processing and machine learning pipelines, but just because it's available in many environments doesn't mean it has the same behavior everywhere. In the remaining discussion, we'll look at some potential interoperability headaches and show how to work around them.
We'll start by looking at a Parquet file generated by Apache Spark with the output of an ETL job.
from pyspark.sql import SparkSession
session = SparkSession.builder.getOrCreate()
We can look at the schema for this file and inspect a few rows:
spark_df = session.read.parquet("berlin_payments.parquet")
spark_df.limit(10).toPandas()
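We can also ask Spark to print the schema it's using for these data (printSchema is the standard Spark DataFrame method for this):
spark_df.printSchema()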
The "file" we're reading from (berlin_payments.parquet
) is a partitioned Parquet file, so it's really a directory. We can inspect the Parquet metadata for each column using the parquet-tools
utility; this shows us that many of our columns are strings and that string columns are universally dictionary-encoded, which means that each string is stored as an index into a dictionary rather than as a literal value. By storing values that may be repeated many times in this way, we save space and compute time. (Parquet defaults to dictionary-encoding small-cardinality string columns, and we can assume that many of these will be treated as categoricals later in a data pipeline.)
!parquet-tools meta berlin_payments.parquet 2>&1 | head -30 | grep SNAPPY
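If the idea of dictionary encoding is unfamiliar, here's a minimal sketch of what it looks like in PyArrow's in-memory representation (the neighborhood values here are invented for illustration):
import pyarrow as pa
# dictionary-encode a small string array: each value becomes an index into a dictionary
plain = pa.array(["Mitte", "Pankow", "Mitte", "Spandau", "Mitte"])
encoded = plain.dictionary_encode()
print(encoded.dictionary)   # the distinct values: ["Mitte", "Pankow", "Spandau"]
print(encoded.indices)      # compact integer indices: [0, 1, 0, 2, 0]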
So far, so good. But what happens when we read these data into pandas? We can load Parquet files into pandas if we have PyArrow installed; let's try it out.
import pandas as pd
pandas_df = pd.read_parquet("berlin_payments.parquet")
pandas_df
The data look about like we'd expect them to. However, when we look at how pandas is representing our data, we're in for a nasty surprise: pandas has taken our efficiently dictionary-encoded strings and represented them with arbitrary Python objects!
pandas_df.dtypes
We could convert each column to strings and then to categoricals, but this would be tedious, and it would also waste time and memory. (Note that if we'd created a pandas table with string or categorical dtypes and saved that to Parquet, the types would survive a round-trip to disk because they'd be stored in pandas-specific Parquet metadata.)
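To see that behavior in action, here's a quick sketch with a toy frame (the column values and file name are invented for illustration):
# a toy frame with an explicit categorical column
toy_df = pd.DataFrame({"neighborhood": ["Mitte", "Pankow", "Mitte"]}).astype({"neighborhood": "category"})
toy_df.to_parquet("toy.parquet")
# the categorical dtype survives because it's recorded in pandas-specific Parquet metadata
pd.read_parquet("toy.parquet").dtypes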
In this case, pandas is using the PyArrow Parquet backend; interestingly, if we use PyArrow directly and read into a pyarrow.Table, the string types are preserved:
import pyarrow.parquet as pq
arrow_table = pq.read_table("berlin_payments.parquet/")
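We can confirm this by inspecting the table's Arrow schema, which should list these columns with a string type:
arrow_table.schema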
...but once we convert that table to pandas, we've lost the type information.
arrow_table.to_pandas().dtypes
However, we can force PyArrow to preserve the dictionary encoding even through the pandas conversion if we specify the read_dictionary option with a list of appropriate columns:
dict_arrow_table = pq.read_table("berlin_payments.parquet/", read_dictionary=['neighborhood'])
dict_arrow_table
dict_arrow_table.to_pandas().dtypes
If we don't know a priori what columns are dictionary-encoded (and might hold categoricals), we can find out by programmatically inspecting the Parquet metadata:
dictionary_cols = set()
# get metadata for each partition
# (newer versions of pyarrow expose these per-file objects as .fragments rather than .pieces)
for piece in pq.ParquetDataset("berlin_payments.parquet", use_legacy_dataset=False).pieces:
    meta = piece.metadata
    # get column names (materialize the list so we can reuse it for every row group)
    cols = list(enumerate(meta.schema.names))
    # get column metadata for each row group
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        for col, colname in cols:
            # files written with the newer Parquet format report RLE_DICTIONARY instead of PLAIN_DICTIONARY
            if {"PLAIN_DICTIONARY", "RLE_DICTIONARY"} & set(rg.column(col).encodings):
                dictionary_cols.add(colname)
dictionary_cols
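Once we have that set, we can feed it straight back into the reader; a quick sketch (the auto_dict_table name is just for illustration) shows how to get categoricals in pandas without hard-coding any column names:
# re-read the data, preserving dictionary encoding for every column we discovered above
auto_dict_table = pq.read_table("berlin_payments.parquet", read_dictionary=list(dictionary_cols))
auto_dict_table.to_pandas().dtypes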
Preserving column types when transferring data from a JVM-based ETL pipeline to a Python-based machine learning pipeline can save a lot of human and computational effort -- and eliminate an entire class of performance regressions and bugs as well. Fortunately, it just takes a little bit of care to ensure that our entire pipeline preserves the efficiency advantages of Parquet.