(Didn't know this was a thing before I googled "Panda vs Polar" accidentally. Credits: Panda & Polar Bear)
from datetime import datetime
import numpy as np
import polars as pl
from pathlib import Path
from polars import col, lit
import pandas as pd
from pandas.io.common import get_handle
pl.__version__
'0.7.18'
pd.__version__
'1.2.4'
# Download a huge csv as a test. Takes a while and only needed once...
big_csv = Path("./big.csv")
csv_url = "http://sdm.lbl.gov/fastbit/data/star2002-full.csv.gz"
if not big_csv.exists():
    with get_handle(csv_url, compression="gzip", mode="r") as fh_in, open(big_csv, "bw") as fh_out:
        fh_out.write(fh_in.handle.buffer.read())
edf = pl.read_csv(str(big_csv), has_headers=False)
edf.filter(col("column_1") == 1).select(["column_9"]).head()
column_9 |
---|
i64 |
654 |
61 |
7 |
27 |
1 |
edf[edf["column_1"] == 1][["column_9"]].head()
column_9 |
---|
i64 |
654 |
61 |
7 |
27 |
1 |
Why shouldn't I use the Pandas style? Because you have to repeat the dataframe's variable name, e.g. `df`, all the time. Check out this filtering example: `agg_metric_df[agg_metric_df["metric_1"] < 0.9]`. Using `col` to refer to a column of the current dataframe is much cleaner.

Just switching `read_csv` to `scan_csv` is all it takes to go from eager to lazy in this example. `collect` or `fetch` is then used to trigger the execution.
ldf = pl.scan_csv(str(big_csv), has_headers=False)
ldf = ldf.filter(col("column_1") == 1)
ldf.select(["column_9"]).collect().head()
column_9 |
---|
i64 |
654 |
61 |
7 |
27 |
1 |
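While `collect` executes the full query, `fetch` is handy for quick experiments because it runs the same query plan on only the first rows of the file instead of everything. A minimal sketch, assuming the row-count argument of `fetch` in this Polars version:

ldf = pl.scan_csv(str(big_csv), has_headers=False)
# evaluate the lazy query on roughly the first 500 source rows only (row count is an assumption)
ldf.filter(col("column_1") == 1).select(["column_9"]).fetch(500)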
Pandas style fails in lazy mode:
ldf = pl.scan_csv(str(big_csv), has_headers=False)
ldf[ldf["column_1"] == 1][["column_9"]].head()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-aaa4e676d17c> in <module>
      1 ldf = pl.scan_csv(str(big_csv), has_headers=False)
----> 2 ldf[ldf["column_1"] == 1][["column_9"]].head()

TypeError: 'LazyFrame' object is not subscriptable
Slicing and indexing in Polars works with subscript syntax similar to NumPy, i.e. `df[1]` or `df[1, 2]`. Some simple rules apply:

* indexing by a single dimension: `df[42]`, `df[42:]`, `df["my_col"]`, `df[["col1", "col2"]]`
* indexing by two dimensions: `df[69, 42]` or `df[69, "col_42"]`

In the case of integers, slices such as `1:` are also possible.
edf[1] # this is a bug right now
Series: 'column_2' [i64] [ 1613423 1613423 1613423 1613423 1613423 1613423 1613423 1613423 1613423 1613423 ]
edf[[1]]
column_1 | column_2 | column_3 | column_4 | column_5 | column_6 | column_7 | column_8 | column_9 | column_10 | column_11 | column_12 | column_13 | column_14 | column_15 | column_16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 | f64 | i64 | f64 | f64 | f64 |
1 | 1613423 | 808 | 2.0011015223e7 | 1613424 | 886 | 0 | 0 | 61 | 371 | 2.0011204115e7 | 23.326 | 2288071 | -2.47e-1 | 0.456 | 57.811 |
edf[1, 3] # index by (row, column)
20011015.222604
edf[1, "column_4"] # or as string
20011015.222604
edf[1, [2, 3]] # index by (row, column) but returns data frame
column_3 | column_4 |
---|---|
i64 | f64 |
808 | 2.0011015223e7 |
edf[1:4, "column_4":] # slice by row and column's name
column_4 | column_5 | column_6 | column_7 | column_8 | column_9 | column_10 | column_11 | column_12 | column_13 | column_14 | column_15 | column_16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
f64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 | f64 | i64 | f64 | f64 | f64 |
2.0011015223e7 | 1613424 | 886 | 0 | 0 | 61 | 371 | 2.0011204115e7 | 23.326 | 2288071 | -2.47e-1 | 0.456 | 57.811 |
2.0011015223e7 | 1613424 | 638 | 0 | 0 | 7 | 121 | 2.0011204115e7 | 2.444 | 2288071 | -3.91e-1 | 0.59 | 167.757 |
2.0011015223e7 | 1613424 | 4259 | 0 | 0 | 1024 | 1302 | 2.0011204115e7 | 9.522 | 2288071 | -2.9e-1 | 0.446 | 8.644 |
Since Pandas has an explicit index that can be of any type, not just integer, and column labels that can be of any immutable datatype, it has to work around several ambiguities with special accessors like `iloc`, `loc`, `at`, `iat`, etc.
pdf = edf.to_pandas()
pdf.iloc[1, 3]
20011015.222604
# when mixing indexing by integer and by string it gets less comprehensible in Pandas
pdf["column_4"].iloc[1]
20011015.222604
pdf.iloc[1, [2, 3]]
column_3 8.080000e+02 column_4 2.001102e+07 Name: 1, dtype: float64
pdf[["column_4"]].iloc[1:4]
column_4 | |
---|---|
1 | 2.001102e+07 |
2 | 2.001102e+07 |
3 | 2.001102e+07 |
# for slicing with column names and guaranteed indexing by integer we have to write:
pdf.loc[:, "column_4":].iloc[1:4]
# `pdf.loc[1:4, "column_4":]` works only as long as the index is set correctly.
column_4 | column_5 | column_6 | column_7 | column_8 | column_9 | column_10 | column_11 | column_12 | column_13 | column_14 | column_15 | column_16 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2.001102e+07 | 1613424 | 886 | 0 | 0 | 61 | 371 | 2.001120e+07 | 23.326479 | 2288071 | -0.247330 | 0.455916 | 57.810596 |
2 | 2.001102e+07 | 1613424 | 638 | 0 | 0 | 7 | 121 | 2.001120e+07 | 2.444299 | 2288071 | -0.390961 | 0.589534 | 167.757140 |
3 | 2.001102e+07 | 1613424 | 4259 | 0 | 0 | 1024 | 1302 | 2.001120e+07 | 9.521868 | 2288071 | -0.290154 | 0.446027 | 8.644362 |
left_df = pl.DataFrame({"a": [1, 2, 3], "b": [None, "b", "c"]})
right_df = pl.DataFrame({"a": [1, 2], "c": [42, 69]})
df = left_df.join(right_df, on="a", how="left")
df
a | b | c |
---|---|---|
i64 | str | i64 |
1 | null | 42 |
2 | "b" | 69 |
3 | "c" | null |
Note that the last element of the `c` column is `null`, not `NaN` as in Pandas, and that the datatype is still int, not automatically converted to float as Pandas does.
df.filter(col("c").is_null())
a | b | c |
---|---|---|
i64 | str | i64 |
3 | "c" | null |
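A quick check on the joined `df` confirms this: the missing value is tracked as a proper null and the integer dtype survives the join (a small sketch; method availability may differ slightly across Polars versions):

df["c"].dtype  # still an integer dtype after the left join
df["c"].null_count()  # exactly one missing value, stored as null rather than NaN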
Pandas does something pretty scary here:
left_pdf = left_df.to_pandas()
right_pdf = right_df.to_pandas()
Note that "c"-column has type int:
right_pdf.dtypes
a int64 c int64 dtype: object
pdf = pd.merge(left_pdf, right_pdf, on="a", how="left")
pdf
a | b | c | |
---|---|---|---|
0 | 1 | None | 42.0 |
1 | 2 | b | 69.0 |
2 | 3 | c | NaN |
Depending on the datatype, Pandas shows `None` or `NaN`. Also note that the column `c` was silently converted from int to float without our consent!
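For completeness: Pandas' nullable `Int64` extension dtype is an opt-in way around this silent cast. A sketch, not part of the original comparison:

# casting "c" to the nullable Int64 dtype before the merge keeps it integer
# and represents the missing value as pd.NA instead of NaN
pd.merge(left_pdf, right_pdf.astype({"c": "Int64"}), on="a", how="left")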
df.with_column((lit(3)*col("c")).alias("3*c"))
a | b | c | 3*c |
---|---|---|---|
i64 | str | i64 | i64 |
1 | null | 42 | 126 |
2 | "b" | 69 | 207 |
3 | "c" | null | null |
The same is possible in Pandas, but note that we have to type the variable name `pdf` again just to access a column!
pdf.assign(**{"3*c": 3*pdf["c"]})
a | b | c | 3*c | |
---|---|---|---|---|
0 | 1 | None | 42.0 | 126.0 |
1 | 2 | b | 69.0 | 207.0 |
2 | 3 | c | NaN | NaN |
df = pl.DataFrame(
{
"nrs": [1, 2, 3, None, 5],
"names": ["foo", "ham", "spam", "egg", None],
"random": np.random.rand(5),
"groups": ["A", "A", "B", "C", "B"],
}
)
df
nrs | names | random | groups |
---|---|---|---|
i64 | str | f64 | str |
1 | "foo" | 0.27 | "A" |
2 | "ham" | 0.521 | "A" |
3 | "spam" | 0.825 | "B" |
null | "egg" | 0.621 | "C" |
5 | null | 0.891 | "B" |
# and in Pandas
pdf = df.to_pandas()
df.select(
[
pl.sum("nrs"), # or equivalently col("nrs").sum()
col("names").sort(),
col("names").n_unique().alias("unique_names_1"),
]
)
nrs | names | unique_names_1 |
---|---|---|
i64 | str | u32 |
11 | null | 5 |
11 | "egg" | 5 |
11 | "foo" | 5 |
11 | "ham" | 5 |
11 | "spam" | 5 |
In Pandas we create a new DataFrame and reference `pdf` several times:
pd.DataFrame(
{
"nrs": pdf["nrs"].sum(),
"names": pdf["names"].sort_values(),
"unique_names_1": pdf["names"].nunique(dropna=False),
}
)
nrs | names | unique_names_1 | |
---|---|---|---|
3 | 11.0 | egg | 5 |
0 | 11.0 | foo | 5 |
1 | 11.0 | ham | 5 |
2 | 11.0 | spam | 5 |
4 | 11.0 | None | 5 |
df.select(col("names").filter(col("random") > 0.4))
names |
---|
str |
"ham" |
"spam" |
"egg" |
null |
The syntax in Pandas is way less readable:
pdf.loc[pdf["random"] > 0.4][["names"]]
names | |
---|---|
1 | ham |
2 | spam |
3 | egg |
4 | None |
Another way in Pandas is to use the query style:
pdf.query("random > 0.4")[["names"]]
names | |
---|---|
1 | ham |
2 | spam |
3 | egg |
4 | None |
The problem with Pandas' query style is that it is basically string hacking, with no checks whatsoever until the code is executed. Also, reusing expressions is highly limited if all you have are strings.
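Polars expressions, by contrast, are ordinary Python objects, so they can be assigned to variables and reused across queries. A small sketch using the `df` defined above:

# build the predicate once ...
high_random = col("random") > 0.4
# ... and reuse it in different contexts
df.select(col("names").filter(high_random))
df.filter(high_random)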
All expressions in Polars are embarrassingly parallel by design and are thus automatically parallelized:
df.select(
[
(pl.sum("nrs") * pl.when(col("random") > 0.5)
.then(0)
.otherwise(col("random"))
).alias("result")
]
)
result |
---|
f64 |
2.973 |
0.0 |
0.0 |
0.0 |
0.0 |
SQL-like `when`/`then`/`otherwise` statements are not possible in Pandas, so we have to use `np.where`:
pd.Series(
np.where(pdf["random"] > 0.5, 0, pdf["random"] * pdf["nrs"].sum()), name="result"
).to_frame()
result | |
---|---|
0 | 2.973213 |
1 | 0.000000 |
2 | 0.000000 |
3 | 0.000000 |
4 | 0.000000 |
# or easier to read but much slower since not vectorized at all
pdf.random.apply(lambda x: 0 if x > 0.5 else x * pdf["nrs"].sum()).to_frame()
random | |
---|---|
0 | 2.973213 |
1 | 0.000000 |
2 | 0.000000 |
3 | 0.000000 |
4 | 0.000000 |
df.select(
[
col("*"), # select all
col("random").sum().over("groups").alias("sum[random]/groups"),
col("random").list().over("names").alias("random/name"),
]
)
nrs | names | random | groups | sum[random]/groups | random/name |
---|---|---|---|---|---|
i64 | str | f64 | str | f64 | list |
1 | "foo" | 0.27 | "A" | 0.791 | "[0.2702920925095984]" |
2 | "ham" | 0.521 | "A" | 0.791 | "[0.5210243165448514]" |
3 | "spam" | 0.825 | "B" | 1.716 | "[0.8248294839356618]" |
null | "egg" | 0.621 | "C" | 0.621 | "[0.6213192138749672]" |
5 | null | 0.891 | "B" | 1.716 | "[0.8909539423562371]" |
Doing the same in Pandas is a bit more complex. Also note that there is an unexpected `NaN` in the last row. This happens because, when assigning the result of `pdf.groupby(['names'], dropna=False)['random'].apply(list)`, we compare `NaN` to `NaN`, which is false by definition. This is just another subtle problem caused by the fact that Pandas uses `NaN` to express `NA`.

Also note that Polars needs no explicit index to do operations like this, just like Spark, which has no way to set an index explicitly.
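The root cause is easy to demonstrate with a one-liner (a sketch, not part of the original notebook):

# NaN never compares equal to anything, not even to itself, so a NaN key
# cannot match when Pandas aligns the group result on the "names" values
np.nan == np.nan  # False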
(pdf.set_index("groups")
.assign(**{"sum[random]/groups": pdf.groupby(['groups'])['random'].sum()})
.set_index("names")
.assign(**{"random/name": pdf.groupby(['names'], dropna=False)['random'].apply(list)})
.reset_index()
)
names | nrs | random | sum[random]/groups | random/name | |
---|---|---|---|---|---|
0 | foo | 1.0 | 0.270292 | 0.791316 | [0.2702920925095984] |
1 | ham | 2.0 | 0.521024 | 0.791316 | [0.5210243165448514] |
2 | spam | 3.0 | 0.824829 | 1.715783 | [0.8248294839356618] |
3 | egg | NaN | 0.621319 | 0.621319 | [0.6213192138749672] |
4 | None | 5.0 | 0.890954 | 1.715783 | NaN |
# or alternatively using `join`, which also avoids the NaN problem
pdf.join(
pdf.groupby('groups').random.sum().rename("sum[random]/groups"), on="groups"
).join(
pdf.groupby('names', dropna=False).random.apply(list).rename("random/name"), on="names"
)
nrs | names | random | groups | sum[random]/groups | random/name | |
---|---|---|---|---|---|---|
0 | 1.0 | foo | 0.270292 | A | 0.791316 | [0.2702920925095984] |
1 | 2.0 | ham | 0.521024 | A | 0.791316 | [0.5210243165448514] |
2 | 3.0 | spam | 0.824829 | B | 1.715783 | [0.8248294839356618] |
3 | NaN | egg | 0.621319 | C | 0.621319 | [0.6213192138749672] |
4 | 5.0 | None | 0.890954 | B | 1.715783 | [0.8909539423562371] |
df = pl.read_csv("https://theunitedstates.io/congress-legislators/legislators-current.csv")
pdf = df.to_pandas()
(df.lazy() # allows for working only on a subset using limit
.groupby("first_name")
.agg(
[
col("party").count().alias("n_party"), # renaming an aggregated column is a bliss
col("gender").list(),
col("last_name").first(),
]
)
.sort("n_party", reverse=True)
.limit(5)
.collect()
)
first_name | n_party | gender_agg_list | last_name_first |
---|---|---|---|
str | u32 | list | str |
"John" | 19 | "[M, M, ... M]" | "Barrasso" |
"Mike" | 12 | "[M, M, ... M]" | "Kelly" |
"Michael" | 11 | "[M, M, ... M]" | "Bennet" |
"David" | 11 | "[M, M, ... M]" | "Cicilline" |
"James" | 9 | "[M, M, ... M]" | "Inhofe" |
Note how easily we can deal with lists of strings by aggregating over `gender` using `list()`.

In Pandas the same operation feels more like string hacking, and renaming happens as a separate step with unnecessary repetition of the column names. Everything is, of course, eagerly evaluated.
(pdf.groupby("first_name")
.agg({"party": "count",
"gender": lambda grp: grp.to_list(),
"last_name": "first"})
.rename(columns={"party": "n_party",
"gender": "gender_agg_list",
"last_name": "last_name_first"})
.sort_values(by="n_party", ascending=False)
.reset_index()
.head(5))
first_name | n_party | gender_agg_list | last_name_first | |
---|---|---|---|---|
0 | John | 19 | [M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, ... | Barrasso |
1 | Mike | 12 | [M, M, M, M, M, M, M, M, M, M, M, M] | Kelly |
2 | Michael | 11 | [M, M, M, M, M, M, M, M, M, M, M] | Bennet |
3 | David | 11 | [M, M, M, M, M, M, M, M, M, M, M] | Cicilline |
4 | James | 9 | [M, M, M, M, M, M, M, M, M] | Inhofe |
(df.lazy()
.groupby("state")
.agg(
[
(col("party") == "Democrat").sum().alias("demo"),
(col("party") == "Republican").sum().alias("rep"),
]
)
.sort("demo", reverse=True)
.limit(5)
.collect()
)
state | demo | rep |
---|---|---|
str | u32 | u32 |
"CA" | 44 | 11 |
"NY" | 21 | 8 |
"IL" | 15 | 5 |
"TX" | 13 | 24 |
"NJ" | 12 | 2 |
The translation to Pandas is "somewhat" more complicated... kudos to everyone able to solve this without looking it up on StackOverflow :-)
(pdf.groupby("state")
.agg({"party": [("demo", lambda grp: np.sum(grp == "Democrat")),
("rep", lambda grp: np.sum(grp == "Republican"))]})
.droplevel(0, axis=1)
.reset_index()
.sort_values(by="demo", ascending=False)
.head(5)
)
state | demo | rep | |
---|---|---|---|
5 | CA | 44 | 11 |
37 | NY | 21 | 8 |
16 | IL | 15 | 5 |
47 | TX | 13 | 24 |
34 | NJ | 12 | 2 |
def compute_age() -> pl.Expr:
    # Date64 is time in ms
    ms_to_year = 1e3 * 3600 * 24 * 365
    return (
        lit(datetime(2021, 1, 1)) - col("birthday")
    ) / (ms_to_year)

def avg_age(gender: str) -> pl.Expr:
    return (
        compute_age()
        .filter(col("gender") == gender)
        .mean()
        .alias(f"avg {gender} age")
    )
(df.lazy()
.groupby(["state"])
.agg(
[
avg_age("M"),
avg_age("F"),
]
)
.sort("state")
.limit(5)
.collect()
)
state | avg M age | avg F age |
---|---|---|
str | f64 | f64 |
"AK" | 71.899 | 63.657 |
"AL" | 65.167 | 56.038 |
"AR" | 58.325 | null |
"AS" | null | 73.06 |
"AZ" | 60.004 | 59.168 |
Translating this to Pandas is really hard since we have no equivalent of `col` to refer to a column. Also, with Pandas' `agg` we only have access to the aggregation column and cannot filter by another, so we have to use `apply`.
def p_compute_age(grp: pd.DataFrame):
    # Date64 is time in ms
    s_to_year = 3600 * 24 * 365
    return (
        (datetime(2021, 1, 1) - grp["birthday"]).dt.total_seconds()
    ) / (s_to_year)

def p_avg_age(grp: pd.DataFrame, gender: str):
    age = p_compute_age(grp)
    mean_age = age[grp["gender"] == gender].mean()
    return pd.Series([mean_age], index=[f"avg {gender} age"])
(pdf.groupby("state")
.apply(
lambda grp: pd.concat(
[p_avg_age(grp, gender="M"),
p_avg_age(grp, gender="F")]
)
)
.reset_index()
.sort_values(by="state")
.head(5)
)
state | avg M age | avg F age | |
---|---|---|---|
0 | AK | 71.898630 | 63.657534 |
1 | AL | 65.167466 | 56.038356 |
2 | AR | 58.324658 | NaN |
3 | AS | NaN | 73.060274 |
4 | AZ | 60.003767 | 59.168037 |
The same code in Pandas just doesn't feel as clean as in Polars, nicely showing the power that comes with Polars' composable expressions.
df = pl.DataFrame({"foo": np.arange(10),
"bar": np.random.rand(10),
"cls": np.random.randint(2, size=10)})
pdf = df.to_pandas()
df
foo | bar | cls |
---|---|---|
i64 | f64 | i64 |
0 | 0.232 | 0 |
1 | 0.359 | 0 |
2 | 0.404 | 1 |
3 | 0.855 | 0 |
4 | 0.794 | 1 |
5 | 0.598 | 1 |
6 | 0.747 | 1 |
7 | 0.656 | 0 |
8 | 0.999 | 0 |
9 | 0.769 | 1 |
Use `map` for vector operations on a whole column, i.e. series -> series:
def my_custom_func(s: pl.Series) -> pl.Series:
    return np.exp(s) / np.log(s)
df.filter(col("bar").map(my_custom_func, return_dtype=pl.Float64) > -1)
foo | bar | cls |
---|---|---|
i64 | f64 | i64 |
0 | 0.232 | 0 |
Use `apply` for operations on a cell or group level that return a scalar:
df.select(col("bar").apply(lambda x: 3*x))
bar |
---|
f64 |
0.696 |
1.078 |
1.211 |
2.566 |
2.383 |
1.794 |
2.24 |
1.967 |
2.996 |
2.308 |
In Pandas, you have `apply` and `applymap`, which work quite similarly:
pdf[pdf["bar"].apply(my_custom_func) > -1]
foo | bar | cls | |
---|---|---|---|
0 | 0 | 0.232159 | 0 |
pdf[["bar"]].applymap(lambda x: 3*x)
bar | |
---|---|
0 | 0.696478 |
1 | 1.078346 |
2 | 1.211216 |
3 | 2.566302 |
4 | 2.382630 |
5 | 1.793900 |
6 | 2.239719 |
7 | 1.966602 |
8 | 2.995891 |
9 | 2.307976 |
df.groupby(["cls"]).agg([col("bar").apply(lambda grp: 3*grp.sum())])
cls | |
---|---|
i64 | f64 |
0 | 9.304 |
1 | 9.935 |
Quite analogous in Pandas, but of course you need to fight the multi-index:
pdf.groupby("cls", as_index=False).agg({"bar": lambda grp: 3 * grp.sum()})
cls | bar | |
---|---|---|
0 | 0 | 9.303619 |
1 | 1 | 9.935442 |