Beyond pandas:
The great Python dataframe showdown

Juan Luis Cano Rodríguez <juanlu@orchest.io>
2022-10-02 @ PyConES 2022 Granada

## Abstract

The pandas library is one of the key factors that enabled the growth of Python in the Data Science industry and continues to help data scientists thrive almost 15 years after its creation. Because of this success, nowadays there are several open-source projects that claim to improve pandas in various ways, either by bringing it to a distributed computing setting (Dask), accelerating its performance with minimal changes (Modin), or offering a slightly different API that solves some of its shortcomings (Polars).

The outline of the talk goes as follows:

- Short introduction to the importance of pandas, and brief recollection of its main pain points (5 minutes)
- Enumeration of some alternatives, description of our classification (pandas-like vs bespoke, single-node vs distributed) (5 minutes)
- Presentation of the libraries using brief code snippets, visualization of the dependency relationships between them (20 minutes)
- Recommendations and conclusions (5 minutes)

After the talk, you will have more information on how some of the modern alternatives to pandas fit into the ecosystem, understand which ones provide the easiest migration path for an existing codebase, and be better prepared to judge which one to use for your next project. Prior exposure to pandas will help you make the most of the presentation.

# Outline

1. Intro
2. pandas: success and limitations
3. The alternatives
4. Demo
5. Conclusions

# Who is this guy?

- Aerospace Engineer on a mission to accelerate the **Solidarity Economy** through technology ♻️
- **Data Scientist Advocate** at Orchest, an open source pipeline orchestrator 🥑
- Organizer of the **PyData Madrid** monthly meetup (ex Python España, ex PyCon Spain) 🐍
- Contributor to the SciPy and PyData ecosystem
- Hard Rock lover 🎸

Follow me!
https://github.com/astrojuanlu/, https://astrojuanlu.substack.com

![Me!](img/juanlu-orchest.jpg)

# Data pipelines and dataframes

- At **Orchest** we want to _empower data scientists_ by developing an easy-to-use, scalable orchestrator for data pipelines
- Extract + Load is being commoditized, Transform is highly customized and potentially complex
- **"Am I using the best tools available?"**
- Focus: dataframe libraries

![Pipeline](img/pipeline.png)

# pandas: success and limitations

pandas is everywhere!

![pandas growth](img/pandas-growth.png)

https://stackoverflow.blog/2017/09/14/python-growing-quickly/

# pandas: success and limitations

"Apache Arrow and the 10 things I hate about pandas" https://wesmckinney.com/blog/apache-arrow-pandas-internals/

tl;dr:

1. Many pandas operations don't take advantage of multiple cores or query planning
   - Eager evaluation
   - Intermediate objects
   - Mixed success with GIL release
2. Lousy memory management
   - Handling of missing data is inconsistent
   - No memory-mapping
   - Strings and categories are inefficient

# The alternatives

![pandas alternatives](img/dataframes-charming-quadrangle.png)

## Apache Arrow

![Apache Arrow logo](img/apache-arrow.png)

https://arrow.apache.org/

"A language-independent columnar memory format"

- Designed for ephemeral, or transient, in-memory storage
- Two formats: **Streaming** and **Random Access**
- To serialize the latter to disk: **Apache Parquet** and **Feather**
- **Immutable**!
- Memory-mapping for fast data processing
- Python bindings (on top of C++), many others
- Not exactly an alternative, but rather a foundation to create them

## Vaex

![Vaex logo](img/vaex.png)

https://vaex.io/

Out-of-core dataframes "to visualize and explore big tabular datasets"

- Can process **files larger than RAM** thanks to **out-of-core** capabilities
- Initially built around HDF5, now has Parquet support as well
- Rich **visualization** functionality for point clouds (similar to datashader)
- API similar to pandas, with some deviations
- Expression system delays computation (arguably less powerful than Polars')

## Polars

![Polars logo](img/polars.png)

http://pola.rs/

"Lightning-fast", in-memory dataframes for Rust and Python

- Uses the Arrow memory format for its columnar storage
- **Eager** and **lazy** modes, even for I/O ("scanning")
- **Expressions** ($\mathcal{F}(\text{Series}) \rightarrow \text{Series}$) are decoupled from the computation itself
- Chains of expressions build an optimized **query plan**
- Powerful row-wise and list-column capabilities
- Young, but very promising

## High performance pandas

- **cuDF** https://github.com/rapidsai/cudf pandas on GPU
- **Dask** https://dask.org/ distributed pandas (also: Mars, Ray)
- **Modin** https://github.com/modin-project/modin/ transparent pandas acceleration
- **PySpark** https://spark.apache.org/docs/latest/api/python/index.html they've been there this whole time

# Demo time!

![Hacker](img/hacker.gif)

# What about the index?

From the Polars docs:

> Indexes are not needed! Not having them makes things easier - convince us otherwise!
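
To make the question concrete, here is a rough sketch in plain Python of what an index buys and what an index-free table does instead. It is a toy model, not pandas or Polars code: the table layout, the sample values, and the helper names (`build_index`, `lookup_with_index`, `lookup_by_scan`) are all made up for illustration. The assumption is that an index behaves roughly like a precomputed mapping from label to row position, while an index-free design treats the key as an ordinary column and filters on it.

```python
# Toy model in plain Python; this is NOT pandas or Polars code.
# The table layout, values and helper names are hypothetical.

# A tiny column-oriented table: every column is a plain list.
table = {
    "label": ["x", "y", "z", "w"],
    "value": [10, 20, 30, 40],
}


def build_index(labels):
    """Map each label to its row position, roughly what an index stores."""
    return {label: position for position, label in enumerate(labels)}


def lookup_with_index(index, column, label):
    """Indexed lookup: one dictionary access, then one list access."""
    return column[index[label]]


def lookup_by_scan(labels, column, label):
    """Index-free lookup: filter on an ordinary column, row by row."""
    for candidate, value in zip(labels, column):
        if candidate == label:
            return value
    raise KeyError(label)


label_index = build_index(table["label"])

indexed = lookup_with_index(label_index, table["value"], "z")
scanned = lookup_by_scan(table["label"], table["value"], "z")

# Both strategies return the same value; only the amount of work differs.
print(indexed, scanned)
```

In this toy model the dictionary has to be built up front and rebuilt whenever the labels change. That is the bookkeeping an index-free design avoids, in exchange for scanning the key column on each lookup.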

From Twitter:

![James Powell on pandas](img/james-powell-pandas.png)

([Original thread](https://mobile.twitter.com/dontusethiscode/status/1530182008336424963))

# Conclusions

- pandas has some limitations and inconsistencies, and some projects are offering alternatives
- Drop-in replacements are bound to imitate pandas API inconsistencies as well
- Theoretically, certain kinds of operations would be less efficient without an Index - do we care?
- Alternatives look promising, but they are not made for distributed computing (yet?)

# Thanks!

- 📬 juanlu@orchest.io
- 🐦 [@juanluisback](https://twitter.com/juanluisback)

(Thanks to Kevin Kho, James Powell, Cameron Riddell, Ritchie Vink for inspiration and discussion)

![pandas alternatives](img/dataframes-charming-quadrangle-scaled.png)