This is a short introduction to Polars, meant to get you started with the basic concepts of data wrangling. It is very much influenced by 10 minutes to pandas.
We start by importing Polars. If you run this for the first time, get a coffee; compiling will take a while.
use polars::prelude::*;
#[macro_use]
extern crate polars;
Creating a Series
by passing a slice of nullable values. Note that we use Option<T>
to describe missing values.
Series::new(
"some_values with ones",
&[Some(1), Some(3), Some(5), None, Some(6), Some(8)]
)
Series: i32 [ 1 3 5 null 6 8 ]
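Missing data stays explicit throughout: every nullable value behaves like an Option<T>. As a std-only sketch (no Polars involved), this is how an aggregation can skip nulls the way Polars aggregations do:

```rust
fn main() {
    // A nullable column modeled as a slice of Option<i32>.
    let values: &[Option<i32>] = &[Some(1), Some(3), Some(5), None, Some(6), Some(8)];

    // Sum while skipping missing values.
    let sum: i32 = values.iter().flatten().sum();
    // Count only the non-null values.
    let valid = values.iter().flatten().count();

    println!("sum = {}, valid = {}", sum, valid);
}
```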
If we don't have any missing values, we can just pass a slice of T
.
Series::new(
"some_non_null_values",
&[1, 3, 5, 7, 6, 8]
)
Series: i32 [ 1 3 5 7 6 8 ]
A Series
is actually an enum
wrapping the differently typed variants of a ChunkedArray
.
You can think of a ChunkedArray
as an array with a known type. Every ChunkedArray
has a type alias that makes them more convenient to use.
Some examples are:
Type | Alias |
---|---|
ChunkedArray<Float64Type> | Float64Chunked |
ChunkedArray<UInt32Type> | UInt32Chunked |
ChunkedArray<BooleanType> | BooleanChunked |
ChunkedArray<Utf8Type> | Utf8Chunked |
See all available data types here.
Create a ChunkedArray
with null values:
Int64Chunked::new_from_opt_slice("nullable", &[None, Some(1), Some(2)])
[PrimitiveArray<Int64> [ null, 1, 2, ]]
Or create a ChunkedArray
without null values.
Int64Chunked::new_from_slice("non-nullable", &[1, 2, 3])
[PrimitiveArray<Int64> [ 1, 2, 3, ]]
Converting from Series
to a ChunkedArray
can be done by specifying their type.
let s = Series::new("values", &[1, 2, 3]);
s.i32()
Ok([PrimitiveArray<Int32> [ 1, 2, 3, ]])
This will return an Err
if you specify the wrong type.
s.i64()
Err(DataTypeMisMatch)
But we can cast a Series
to the proper type and then unpack.
s.cast::<Int64Type>().unwrap().i64()
Ok([PrimitiveArray<Int64> [ 1, 2, 3, ]])
Below we use pattern matching to check if the cast was successful. Note that the clones on a ChunkedArray
and a Series
are very cheap, as the underlying data is wrapped by an Arc
.
let ca = match s.i64() {
Err(_) => {
s.cast::<Int64Type>()
.unwrap()
.i64()
.map(|ca| ca.clone())
.unwrap()
},
Ok(ca) => ca.clone()
};
ca
[PrimitiveArray<Int64> [ 1, 2, 3, ]]
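The cheap clone mentioned above is plain Arc semantics: cloning only increments a reference count, while the underlying buffer stays shared. A std-only illustration:

```rust
use std::sync::Arc;

fn main() {
    // A buffer wrapped in an Arc, like the data behind a ChunkedArray.
    let data: Arc<Vec<i64>> = Arc::new(vec![1, 2, 3]);

    // Cloning only bumps the reference count; the buffer itself is shared.
    let cheap = Arc::clone(&data);

    assert_eq!(Arc::strong_count(&data), 2);
    // Both handles point at the same allocation.
    assert!(Arc::ptr_eq(&data, &cheap));
    println!("refcount = {}", Arc::strong_count(&data));
}
```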
Converting from a ChunkedArray
to a Series
.
ca.into_series()
Series: i64 [ 1 2 3 ]
A DataFrame
is created from a Vec
of Series
.
let dates = &[
"2020-08-21",
"2020-08-21",
"2020-08-22",
"2020-08-23",
"2020-08-22",
];
let fmt = "%Y-%m-%d";
let s0 = Date32Chunked::parse_from_str_slice("dates", dates, fmt).into();
let s1 = Series::new("n", &[1, 2, 3, 4, 5]);
let s2 = Utf8Chunked::full("foos", "foo", 5).into();
let df = DataFrame::new(vec![s0, s1, s2]).expect("something went wrong");
df
+------------+-----+-------+ | dates | n | foos | | --- | --- | --- | | date32 | i32 | str | +============+=====+=======+ | 2020-08-21 | 1 | "foo" | +------------+-----+-------+ | 2020-08-21 | 2 | "foo" | +------------+-----+-------+ | 2020-08-22 | 3 | "foo" | +------------+-----+-------+ | 2020-08-23 | 4 | "foo" | +------------+-----+-------+ | 2020-08-22 | 5 | "foo" | +------------+-----+-------+
The columns of the resulting DataFrame
have different data types.
df.dtypes()
.iter()
.zip(df.columns().iter())
.for_each(|(dtype, name)|
println!("Column: '{}',\t dtype: {:?}", name, dtype))
Column: 'dates', dtype: Date32(Day) Column: 'n', dtype: Int32 Column: 'foos', dtype: Utf8
We can list the column names:
df.columns()
["dates", "n", "foos"]
Here is how to view the top and bottom rows of a DataFrame.
df.head(Some(3))
+------------+-----+-------+ | dates | n | foos | | --- | --- | --- | | date32 | i32 | str | +============+=====+=======+ | 2020-08-21 | 1 | "foo" | +------------+-----+-------+ | 2020-08-21 | 2 | "foo" | +------------+-----+-------+ | 2020-08-22 | 3 | "foo" | +------------+-----+-------+
df.tail(Some(3))
+------------+-----+-------+ | dates | n | foos | | --- | --- | --- | | date32 | i32 | str | +============+=====+=======+ | 2020-08-22 | 3 | "foo" | +------------+-----+-------+ | 2020-08-23 | 4 | "foo" | +------------+-----+-------+ | 2020-08-22 | 5 | "foo" | +------------+-----+-------+
Sorting by a column:
let reverse = true;
df.sort("dates", reverse).expect("column not sortable")
+------------+-----+-------+ | dates | n | foos | | --- | --- | --- | | date32 | i32 | str | +============+=====+=======+ | 2020-08-23 | 4 | "foo" | +------------+-----+-------+ | 2020-08-22 | 3 | "foo" | +------------+-----+-------+ | 2020-08-22 | 5 | "foo" | +------------+-----+-------+ | 2020-08-21 | 1 | "foo" | +------------+-----+-------+ | 2020-08-21 | 2 | "foo" | +------------+-----+-------+
Selecting a single column, which yields a Result<Series>
:
df.column("dates")
.expect("column doesn't exist")
Series: date32(day) [ 2020-08-21 2020-08-21 2020-08-22 2020-08-23 2020-08-22 ]
Selecting one or multiple columns, which yields a Result<DataFrame>
:
df.select("dates")
.expect("column does not exist")
+------------+ | dates | | --- | | date32 | +============+ | 2020-08-21 | +------------+ | 2020-08-21 | +------------+ | 2020-08-22 | +------------+ | 2020-08-23 | +------------+ | 2020-08-22 | +------------+
df.select(&["dates", "n"])
.expect("column does not exist")
+------------+-----+ | dates | n | | --- | --- | | date32 | i32 | +============+=====+ | 2020-08-21 | 1 | +------------+-----+ | 2020-08-21 | 2 | +------------+-----+ | 2020-08-22 | 3 | +------------+-----+ | 2020-08-23 | 4 | +------------+-----+ | 2020-08-22 | 5 | +------------+-----+
A DataFrame
can also be sliced into a contiguous subset of its rows.
let offset = 2;
let length = 2;
df.slice(offset, length)
.expect("slice was not within bounds")
+------------+-----+-------+ | dates | n | foos | | --- | --- | --- | | date32 | i32 | str | +============+=====+=======+ | 2020-08-22 | 3 | "foo" | +------------+-----+-------+ | 2020-08-23 | 4 | "foo" | +------------+-----+-------+
Select a column by index:
df.select_at_idx(1)
.expect("column was not within bounds")
Series: i32 [ 1 2 3 4 5 ]
Boolean masks can be used to filter data. Note that this also works on Series
and ChunkedArray
. We also use the as_result!
macro. This utility expects a block that returns a Result<T, PolarsError>
. This makes it convenient to use the ?
operator.
as_result!({
// select the n column
let n_s = df.column("n")?;
let mask = n_s.gt(2);
// filter values > 2
df.filter(&mask)
}).unwrap()
+------------+-----+-------+ | dates | n | foos | | --- | --- | --- | | date32 | i32 | str | +============+=====+=======+ | 2020-08-22 | 3 | "foo" | +------------+-----+-------+ | 2020-08-23 | 4 | "foo" | +------------+-----+-------+ | 2020-08-22 | 5 | "foo" | +------------+-----+-------+
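The as_result! macro ships with Polars itself. As a hypothetical std-only sketch, such a macro can be written by wrapping the block in an immediately invoked closure, which gives the ? operator a Result-returning scope to propagate into (the String error type here is a stand-in for PolarsError):

```rust
// Sketch of an `as_result!`-style macro: the block is wrapped in a closure
// that is called immediately, so `?` can short-circuit out of it.
macro_rules! as_result {
    ($block:block) => {
        (|| $block)()
    };
}

fn main() {
    let ok: Result<i32, String> = as_result!({
        let x: i32 = "21".parse().map_err(|_| "parse failed".to_string())?;
        Ok(x * 2)
    });
    assert_eq!(ok, Ok(42));

    // A failing `?` makes the whole block evaluate to Err.
    let err: Result<i32, String> = as_result!({
        let x: i32 = "oops".parse().map_err(|_| "parse failed".to_string())?;
        Ok(x * 2)
    });
    assert!(err.is_err());
    println!("{:?} {:?}", ok, err);
}
```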
Filter all values in the "n" column greater than 2 and smaller than 5:
as_result!({
// select the n column
let n_s = df.column("n")?;
// create the boolean mask
let mask = (n_s.gt(2) & n_s.lt(5))?;
// filter 2 < values < 5
df.filter(&mask)
}).unwrap()
+------------+-----+-------+ | dates | n | foos | | --- | --- | --- | | date32 | i32 | str | +============+=====+=======+ | 2020-08-22 | 3 | "foo" | +------------+-----+-------+ | 2020-08-23 | 4 | "foo" | +------------+-----+-------+
For all the comparison methods available on Series
and ChunkedArrays
check the ChunkCompare trait.
Setting a new column can be done with the hstack
operation. This operation adds new columns to the existing DataFrame
.
let mut df = df;
let s = Series::new("days", &["mo", "tue", "wed", "thu", "fri"]);
df.hstack(&[s]).unwrap()
+------------+-----+-------+-------+ | dates | n | foos | days | | --- | --- | --- | --- | | date32 | i32 | str | str | +============+=====+=======+=======+ | 2020-08-21 | 1 | "foo" | "mo" | +------------+-----+-------+-------+ | 2020-08-21 | 2 | "foo" | "tue" | +------------+-----+-------+-------+ | 2020-08-22 | 3 | "foo" | "wed" | +------------+-----+-------+-------+ | 2020-08-23 | 4 | "foo" | "thu" | +------------+-----+-------+-------+ | 2020-08-22 | 5 | "foo" | "fri" | +------------+-----+-------+-------+
It isn't possible to get mutable access to the columns of a DataFrame
, because this would make it possible to invalidate the DataFrame
(for instance by replacing the column with a Series
with a different length).
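To see why the restriction exists, here is a hypothetical, std-only frame that upholds the "all columns share one length" invariant by validating inside an hstack-style setter instead of exposing the columns mutably (all names here are made up for illustration):

```rust
// Hypothetical frame upholding the invariant that all columns share one length.
struct Frame {
    height: usize,
    columns: Vec<Vec<i32>>,
}

impl Frame {
    // Adding a column can fail, so the API returns Result instead of
    // handing out mutable access to `columns` directly.
    fn hstack(&mut self, col: Vec<i32>) -> Result<(), String> {
        if col.len() != self.height {
            return Err(format!(
                "column length {} does not match frame height {}",
                col.len(),
                self.height
            ));
        }
        self.columns.push(col);
        Ok(())
    }
}

fn main() {
    let mut df = Frame { height: 3, columns: vec![vec![1, 2, 3]] };
    assert!(df.hstack(vec![4, 5, 6]).is_ok());
    assert!(df.hstack(vec![7, 8]).is_err()); // wrong length is rejected
    println!("columns: {}", df.columns.len());
}
```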
Luckily, there are other ways to mutate a DataFrame. We could, for instance, replace a column in the DataFrame
:
let s = Utf8Chunked::full("bars", "bar", 5);
df.replace("foos", s).unwrap()
+------------+-----+-------+-------+ | dates | n | foos | days | | --- | --- | --- | --- | | date32 | i32 | str | str | +============+=====+=======+=======+ | 2020-08-21 | 1 | "bar" | "mo" | +------------+-----+-------+-------+ | 2020-08-21 | 2 | "bar" | "tue" | +------------+-----+-------+-------+ | 2020-08-22 | 3 | "bar" | "wed" | +------------+-----+-------+-------+ | 2020-08-23 | 4 | "bar" | "thu" | +------------+-----+-------+-------+ | 2020-08-22 | 5 | "bar" | "fri" | +------------+-----+-------+-------+
Or if we want to use the column we're replacing to determine the new column's values, we can use the apply
method and use a closure to create the new column.
Below we use this to determine n + 1
:
df.apply("n", |s| s + 1).unwrap()
+------------+-----+-------+-------+ | dates | n | foos | days | | --- | --- | --- | --- | | date32 | i32 | str | str | +============+=====+=======+=======+ | 2020-08-21 | 2 | "bar" | "mo" | +------------+-----+-------+-------+ | 2020-08-21 | 3 | "bar" | "tue" | +------------+-----+-------+-------+ | 2020-08-22 | 4 | "bar" | "wed" | +------------+-----+-------+-------+ | 2020-08-23 | 5 | "bar" | "thu" | +------------+-----+-------+-------+ | 2020-08-22 | 6 | "bar" | "fri" | +------------+-----+-------+-------+
Both the replace
and the apply
methods also exist for selection by index:
replace_at_idx
apply_at_idx
df.apply_at_idx(1, |s| s * 2)
.unwrap()
+------------+-----+-------+-------+ | dates | n | foos | days | | --- | --- | --- | --- | | date32 | i32 | str | str | +============+=====+=======+=======+ | 2020-08-21 | 4 | "bar" | "mo" | +------------+-----+-------+-------+ | 2020-08-21 | 6 | "bar" | "tue" | +------------+-----+-------+-------+ | 2020-08-22 | 8 | "bar" | "wed" | +------------+-----+-------+-------+ | 2020-08-23 | 10 | "bar" | "thu" | +------------+-----+-------+-------+ | 2020-08-22 | 12 | "bar" | "fri" | +------------+-----+-------+-------+
Or we can conditionally apply a closure, using a boolean mask to select which values to change:
as_result!({
    let mask = (df.column("n")?.gt(4) & df.column("n")?.lt(10))?;
    df.may_apply("foos", |s| {
        s.utf8()?
            .set(&!mask, Some("not_within_bounds"))
    })
}).unwrap()
+------------+-----+---------------------+-------+ | dates | n | foos | days | | --- | --- | --- | --- | | date32 | i32 | str | str | +============+=====+=====================+=======+ | 2020-08-21 | 4 | "not_within_bounds" | "mo" | +------------+-----+---------------------+-------+ | 2020-08-21 | 6 | "bar" | "tue" | +------------+-----+---------------------+-------+ | 2020-08-22 | 8 | "bar" | "wed" | +------------+-----+---------------------+-------+ | 2020-08-23 | 10 | "not_within_bounds" | "thu" | +------------+-----+---------------------+-------+ | 2020-08-22 | 12 | "not_within_bounds" | "fri" | +------------+-----+---------------------+-------+
Every ChunkedArray
implements the IntoIterator trait which gives us all the powerful trait methods available for iterators.
as_result!({
let s = Series::new("a", &[1, 2, 3, 4, 5]);
let v = s.i32()?
.into_iter()
.sum::<Option<i32>>();
println!("{:?}", v);
Ok(())
})
Some(15)
Ok(())
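The sum::<Option<i32>>() call is standard iterator behavior, not Polars-specific: it returns None as soon as any element is None, and Some(total) otherwise. A std-only illustration:

```rust
fn main() {
    // All values present: the sum is Some(total).
    let full: Option<i32> = [Some(1), Some(2), Some(3), Some(4), Some(5)]
        .into_iter()
        .sum();
    assert_eq!(full, Some(15));

    // A single missing value makes the whole sum None.
    let with_null: Option<i32> = [Some(1), None, Some(3)].into_iter().sum();
    assert_eq!(with_null, None);

    println!("{:?} {:?}", full, with_null);
}
```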
as_result!({
// adds "ay" to every word.
fn to_pig_latin(opt_val: Option<&str>) -> Option<String> {
opt_val.map(|val| format!("{}_ay", val))
}
// may_apply takes a closure that may fail.
df.may_apply("days", |s| {
let ca: Utf8Chunked = s.utf8()?
.into_iter()
.map(to_pig_latin)
.collect();
Ok(ca)
})?;
Ok(df.clone())
}).unwrap()
+------------+-----+---------------------+----------+ | dates | n | foos | days | | --- | --- | --- | --- | | date32 | i32 | str | str | +============+=====+=====================+==========+ | 2020-08-21 | 4 | "not_within_bounds" | "mo_ay" | +------------+-----+---------------------+----------+ | 2020-08-21 | 6 | "bar" | "tue_ay" | +------------+-----+---------------------+----------+ | 2020-08-22 | 8 | "bar" | "wed_ay" | +------------+-----+---------------------+----------+ | 2020-08-23 | 10 | "not_within_bounds" | "thu_ay" | +------------+-----+---------------------+----------+ | 2020-08-22 | 12 | "not_within_bounds" | "fri_ay" | +------------+-----+---------------------+----------+
Polars provides various facilities for easily combining DataFrames
and Series
.
We can concatenate a DataFrame
with hstack
:
{
let mut df1 = df.clone();
df1.hstack(df.get_columns()).unwrap();
println!("{:?}", df1);
};
+------------+-----+---------------------+----------+------------+-----+---------------------+----------+ | dates | n | foos | days | dates | n | foos | days | | --- | --- | --- | --- | --- | --- | --- | --- | | date32 | i32 | str | str | date32 | i32 | str | str | +============+=====+=====================+==========+============+=====+=====================+==========+ | 2020-08-21 | 4 | "not_within_bounds" | "mo_ay" | 2020-08-21 | 4 | "not_within_bounds" | "mo_ay" | +------------+-----+---------------------+----------+------------+-----+---------------------+----------+ | 2020-08-21 | 6 | "bar" | "tue_ay" | 2020-08-21 | 6 | "bar" | "tue_ay" | +------------+-----+---------------------+----------+------------+-----+---------------------+----------+ | 2020-08-22 | 8 | "bar" | "wed_ay" | 2020-08-22 | 8 | "bar" | "wed_ay" | +------------+-----+---------------------+----------+------------+-----+---------------------+----------+ | 2020-08-23 | 10 | "not_within_bounds" | "thu_ay" | 2020-08-23 | 10 | "not_within_bounds" | "thu_ay" | +------------+-----+---------------------+----------+------------+-----+---------------------+----------+ | 2020-08-22 | 12 | "not_within_bounds" | "fri_ay" | 2020-08-22 | 12 | "not_within_bounds" | "fri_ay" | +------------+-----+---------------------+----------+------------+-----+---------------------+----------+
Or append the rows of a second DataFrame:
{
let mut df1 = df.clone();
df1.vstack(&df).unwrap();
println!("{:?}", df1);
};
+------------+-----+---------------------+----------+ | dates | n | foos | days | | --- | --- | --- | --- | | date32 | i32 | str | str | +============+=====+=====================+==========+ | 2020-08-21 | 4 | "not_within_bounds" | "mo_ay" | +------------+-----+---------------------+----------+ | 2020-08-21 | 6 | "bar" | "tue_ay" | +------------+-----+---------------------+----------+ | 2020-08-22 | 8 | "bar" | "wed_ay" | +------------+-----+---------------------+----------+ | 2020-08-23 | 10 | "not_within_bounds" | "thu_ay" | +------------+-----+---------------------+----------+ | 2020-08-22 | 12 | "not_within_bounds" | "fri_ay" | +------------+-----+---------------------+----------+ | 2020-08-21 | 4 | "not_within_bounds" | "mo_ay" | +------------+-----+---------------------+----------+ | 2020-08-21 | 6 | "bar" | "tue_ay" | +------------+-----+---------------------+----------+ | 2020-08-22 | 8 | "bar" | "wed_ay" | +------------+-----+---------------------+----------+ | 2020-08-23 | 10 | "not_within_bounds" | "thu_ay" | +------------+-----+---------------------+----------+ | 2020-08-22 | 12 | "not_within_bounds" | "fri_ay" | +------------+-----+---------------------+----------+
SQL-style joins.
as_result!({
let left = DataFrame::new(vec![
Series::new("key", &["foo", "foo"]),
Series::new("lval", &[1, 2]),
])?;
let right = DataFrame::new(vec![
Series::new("key", &["foo", "foo"]),
Series::new("rval", &[4, 5]),
])?;
println!("{:?}", left);
println!("{:?}", right);
left.inner_join(&right, "key", "key")
}).unwrap()
+-------+------+ | key | lval | | --- | --- | | str | i32 | +=======+======+ | "foo" | 1 | +-------+------+ | "foo" | 2 | +-------+------+ +-------+------+ | key | rval | | --- | --- | | str | i32 | +=======+======+ | "foo" | 4 | +-------+------+ | "foo" | 5 | +-------+------+
+-------+------+------+ | key | lval | rval | | --- | --- | --- | | str | i32 | i32 | +=======+======+======+ | "foo" | 1 | 4 | +-------+------+------+ | "foo" | 2 | 4 | +-------+------+------+ | "foo" | 1 | 5 | +-------+------+------+ | "foo" | 2 | 5 | +-------+------+------+
Another example, this time with unique keys:
as_result!({
let left = DataFrame::new(vec![
Series::new("key", &["foo", "bar"]),
Series::new("lval", &[1, 2]),
])?;
let right = DataFrame::new(vec![
Series::new("key", &["foo", "bar"]),
Series::new("rval", &[4, 5]),
])?;
println!("{:?}", left);
println!("{:?}", right);
left.inner_join(&right, "key", "key")
}).unwrap()
+-------+------+ | key | lval | | --- | --- | | str | i32 | +=======+======+ | "foo" | 1 | +-------+------+ | "bar" | 2 | +-------+------+ +-------+------+ | key | rval | | --- | --- | | str | i32 | +=======+======+ | "foo" | 4 | +-------+------+ | "bar" | 5 | +-------+------+
+-------+------+------+ | key | lval | rval | | --- | --- | --- | | str | i32 | i32 | +=======+======+======+ | "foo" | 1 | 4 | +-------+------+------+ | "bar" | 2 | 5 | +-------+------+------+
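Conceptually, an inner join is a hash join: build a hash table over one side's key column, then probe it with the other side's keys. A std-only sketch over plain vectors, using the column names from the example above:

```rust
use std::collections::HashMap;

fn main() {
    let left_key = ["foo", "bar"];
    let lval = [1, 2];
    let right_key = ["foo", "bar"];
    let rval = [4, 5];

    // Build phase: hash the right side's keys to their row indices.
    let mut table: HashMap<&str, Vec<usize>> = HashMap::new();
    for (i, &k) in right_key.iter().enumerate() {
        table.entry(k).or_default().push(i);
    }

    // Probe phase: for every left row, emit one output row per matching key.
    let mut joined: Vec<(&str, i32, i32)> = Vec::new();
    for (i, &k) in left_key.iter().enumerate() {
        if let Some(rows) = table.get(k) {
            for &r in rows {
                joined.push((k, lval[i], rval[r]));
            }
        }
    }

    assert_eq!(joined, vec![("foo", 1, 4), ("bar", 2, 5)]);
    println!("{:?}", joined);
}
```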
By "group by" we are referring to a process involving one or more of the following steps:
- splitting the data into groups based on some criteria,
- applying a function to each group independently,
- combining the results into a data structure.
let df = DataFrame::new(vec![
Series::new("A", &["foo", "bar", "foo", "bar",
"foo", "bar", "foo", "foo"]),
Series::new("B", &["one", "one", "two", "three",
"two", "two", "one", "three"]),
Int32Chunked::full("C", 1, 8).into(),
Series::new("D", &[1, 2, 3, 4,
5, 6, 7, 8])
]).unwrap();
df
+-------+---------+-----+-----+ | A | B | C | D | | --- | --- | --- | --- | | str | str | i32 | i32 | +=======+=========+=====+=====+ | "foo" | "one" | 1 | 1 | +-------+---------+-----+-----+ | "bar" | "one" | 1 | 2 | +-------+---------+-----+-----+ | "foo" | "two" | 1 | 3 | +-------+---------+-----+-----+ | "bar" | "three" | 1 | 4 | +-------+---------+-----+-----+ | "foo" | "two" | 1 | 5 | +-------+---------+-----+-----+ | "bar" | "two" | 1 | 6 | +-------+---------+-----+-----+ | "foo" | "one" | 1 | 7 | +-------+---------+-----+-----+ | "foo" | "three" | 1 | 8 | +-------+---------+-----+-----+
Grouping and then applying the sum()
method to the resulting groups:
as_result!({
(&df).groupby("A")?.select("C").sum()
}).unwrap()
+-------+-------+ | A | C_sum | | --- | --- | | str | i32 | +=======+=======+ | "foo" | 5 | +-------+-------+ | "bar" | 3 | +-------+-------+
as_result!({
(&df).groupby(&["A", "B"])?.select("C").sum()
}).unwrap()
+-------+---------+-------+ | A | B | C_sum | | --- | --- | --- | | str | str | i32 | +=======+=========+=======+ | "bar" | "three" | 1 | +-------+---------+-------+ | "foo" | "one" | 2 | +-------+---------+-------+ | "bar" | "two" | 1 | +-------+---------+-------+ | "foo" | "three" | 1 | +-------+---------+-------+ | "bar" | "one" | 1 | +-------+---------+-------+ | "foo" | "two" | 2 | +-------+---------+-------+
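The split-apply-combine sequence above can be sketched with a HashMap in std Rust, using the "A" and "C" columns from the example (the result matches the first groupby output: foo sums to 5, bar to 3):

```rust
use std::collections::HashMap;

fn main() {
    // The "A" and "C" columns from the example above.
    let a = ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"];
    let c = [1, 1, 1, 1, 1, 1, 1, 1];

    // Split on "A", apply a sum over "C", combine into a map.
    let mut sums: HashMap<&str, i32> = HashMap::new();
    for (&key, &val) in a.iter().zip(c.iter()) {
        *sums.entry(key).or_insert(0) += val;
    }

    assert_eq!(sums["foo"], 5);
    assert_eq!(sums["bar"], 3);
    println!("{:?}", sums);
}
```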
Pivots create a summary table by applying a groupby and defining a pivot column and the values to aggregate.
let s0 = Series::new("A", &["one", "one", "two", "three",
"one", "one", "two", "three",
"one", "one", "two", "three"
]);
let s1 = Series::new("B", &["A", "B", "C",
"A", "B", "C",
"A", "B", "C",
"A", "B", "C",
]);
let s2 = Series::new("C", &["foo", "foo", "foo", "bar", "bar", "bar",
"foo", "foo", "foo", "bar", "bar", "bar"
]);
let s3 = Series::new("E", &((0..12).collect::<Vec<_>>()));
let df = DataFrame::new(vec![s0, s1, s2, s3]).unwrap();
df
+---------+-----+-------+-----+ | A | B | C | E | | --- | --- | --- | --- | | str | str | str | i32 | +=========+=====+=======+=====+ | "one" | "A" | "foo" | 0 | +---------+-----+-------+-----+ | "one" | "B" | "foo" | 1 | +---------+-----+-------+-----+ | "two" | "C" | "foo" | 2 | +---------+-----+-------+-----+ | "three" | "A" | "bar" | 3 | +---------+-----+-------+-----+ | "one" | "B" | "bar" | 4 | +---------+-----+-------+-----+ | "one" | "C" | "bar" | 5 | +---------+-----+-------+-----+ | "two" | "A" | "foo" | 6 | +---------+-----+-------+-----+ | "three" | "B" | "foo" | 7 | +---------+-----+-------+-----+ | "one" | "C" | "foo" | 8 | +---------+-----+-------+-----+ | "one" | "A" | "bar" | 9 | +---------+-----+-------+-----+
as_result!({
(&df).groupby(&["A"])?.pivot("C", "E").sum()
}).unwrap()
+---------+-----+-----+ | A | foo | bar | | --- | --- | --- | | str | i32 | i32 | +=========+=====+=====+ | "one" | 9 | 18 | +---------+-----+-----+ | "three" | 7 | 14 | +---------+-----+-----+ | "two" | 8 | 10 | +---------+-----+-----+
as_result!({
(&df).groupby(&["A", "B"])?.pivot("C", "E").sum()
}).unwrap()
+---------+-----+------+------+ | A | B | foo | bar | | --- | --- | --- | --- | | str | str | i32 | i32 | +=========+=====+======+======+ | "three" | "A" | null | 3 | +---------+-----+------+------+ | "one" | "C" | 8 | 5 | +---------+-----+------+------+ | "two" | "A" | 6 | null | +---------+-----+------+------+ | "one" | "A" | 0 | 9 | +---------+-----+------+------+ | "two" | "C" | 2 | null | +---------+-----+------+------+ | "three" | "B" | 7 | null | +---------+-----+------+------+ | "one" | "B" | 1 | 4 | +---------+-----+------+------+ | "two" | "B" | null | 10 | +---------+-----+------+------+ | "three" | "C" | null | 11 | +---------+-----+------+------+
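Conceptually, a pivot is a group-by in which the pivot column's values become the output columns. A std-only sketch over the first six rows of the frame above (index "A", pivot "C", values "E"), summing "E" into one cell per (index, pivot) pair:

```rust
use std::collections::HashMap;

fn main() {
    // First six rows of the columns above: index "A", pivot "C", values "E".
    let a = ["one", "one", "two", "three", "one", "one"];
    let c = ["foo", "foo", "foo", "bar", "bar", "bar"];
    let e = [0, 1, 2, 3, 4, 5];

    // Each (index value, pivot value) pair becomes one output cell.
    let mut cells: HashMap<(&str, &str), i32> = HashMap::new();
    for i in 0..a.len() {
        *cells.entry((a[i], c[i])).or_insert(0) += e[i];
    }

    assert_eq!(cells[&("one", "foo")], 1); // rows 0 and 1
    assert_eq!(cells[&("one", "bar")], 9); // rows 4 and 5
    println!("{:?}", cells);
}
```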