import pandas as pd
import janitor
from numpy import nan
df = pd.DataFrame({'col1': [2.0, 1.0, 3.0, 1.0, nan],
                   'col2': ['a', 'b', 'c', 'd', 'a'],
                   'col3': ['2020-01-01',
                            '2020-01-02',
                            '2020-01-03',
                            '2020-01-04',
                            '2020-01-05']})
df
| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 2.0 | a | 2020-01-01 |
| 1 | 1.0 | b | 2020-01-02 |
| 2 | 3.0 | c | 2020-01-03 |
| 3 | 1.0 | d | 2020-01-04 |
| 4 | NaN | a | 2020-01-05 |
df.dtypes
col1    float64
col2     object
col3     object
dtype: object
Specific columns can be converted to category type:
cat = df.encode_categorical(column_names=['col1', 'col2', 'col3'])
cat.dtypes
col1    category
col2    category
col3    category
dtype: object
Note that in the code above, the categories were inferred from the columns and are unordered:
cat['col3']
0    2020-01-01
1    2020-01-02
2    2020-01-03
3    2020-01-04
4    2020-01-05
Name: col3, dtype: category
Categories (5, object): ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05']
Explicit categories can be provided, and ordered, via the `kwargs` parameter. Each keyword is a column name, and its value is a `(categories, order)` tuple:
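For comparison, a minimal sketch of the same inferred, unordered conversion using plain pandas only. This is not pyjanitor's implementation; it only illustrates what "inferred and unordered" means for a single column.

```python
import pandas as pd

# Same values as col2 in the example frame.
df = pd.DataFrame({"col2": ["a", "b", "c", "d", "a"]})

# Plain pandas: categories are inferred from the unique values in the column.
plain = df.astype({"col2": "category"})

inferred_categories = plain["col2"].cat.categories.tolist()
is_ordered = plain["col2"].cat.ordered

print(inferred_categories)
print(is_ordered)
```

With inferred categories no ordering is attached, so comparisons such as `<` between values of the column are not defined.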
cat = df.encode_categorical(
col1 = ([3, 2, 1, 4], "appearance"),
col2 = (['a','d','c','b'], "sort")
)
cat['col1']
0      2
1      1
2      3
3      1
4    NaN
Name: col1, dtype: category
Categories (4, int64): [3 < 2 < 1 < 4]
cat['col2']
0    a
1    b
2    c
3    d
4    a
Name: col2, dtype: category
Categories (4, object): ['a' < 'd' < 'c' < 'b']
When the `order` parameter is `appearance`, the `categories` argument is used as-is; if `order` is `sort`, the `categories` argument is sorted in ascending order; if `order` is `None`, then the `categories` argument is applied unordered.
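The rule described above can be sketched with `pd.CategoricalDtype`. The helper `build_dtype` is hypothetical and written only to illustrate the three `order` modes as stated in the text; it is not pyjanitor's internal code.

```python
import pandas as pd


def build_dtype(categories, order=None):
    """Hypothetical helper: map (categories, order) to a CategoricalDtype."""
    if order is None:
        # Categories are applied as given, with no ordering.
        return pd.CategoricalDtype(categories=list(categories), ordered=False)
    if order == "appearance":
        # Categories are used as-is, and that sequence defines the order.
        return pd.CategoricalDtype(categories=list(categories), ordered=True)
    if order == "sort":
        # Categories are sorted ascending before being used as the order.
        return pd.CategoricalDtype(categories=sorted(categories), ordered=True)
    raise ValueError("order must be None, 'appearance' or 'sort'")


as_given = build_dtype([3, 2, 1, 4], "appearance")
ascending = build_dtype(["a", "d", "c", "b"], "sort")
unordered = build_dtype([3, 2, 1, 4])

# Apply one of the dtypes to a column of values.
encoded = pd.Series([2, 1, 3, 1]).astype(as_given)
print(encoded.dtype)
```

An ordered dtype makes comparisons and `min`/`max` meaningful for the column, which is the main reason to pass `order` at all.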
A `UserWarning` is raised if some or all of the unique values in the column are not present in the provided `categories` argument; values that are missing from `categories` become null in the new categorical column.
cat = df.encode_categorical(col1 = ([4, 5, 6], "appearance"))
/workspaces/pyjanitor/janitor/functions/encode_categorical.py:131: UserWarning: None of the values in col1 are in [4, 5, 6]; this might create nulls for all values in the new categorical column.
  categories_dict = _as_categorical_checks(df, **kwargs)
cat['col1']
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: col1, dtype: category
Categories (3, int64): [4 < 5 < 6]
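One way to catch this situation before encoding is to compare the column's unique values against the intended categories. The sketch below uses plain pandas only; the variable names are illustrative, and the check is an assumption about how one might guard against the problem, not a description of what pyjanitor does internally.

```python
import pandas as pd

# Same values as col1 in the example frame (None becomes NaN).
values = pd.Series([2.0, 1.0, 3.0, 1.0, None], name="col1")
categories = [4, 5, 6]

# Non-null values in the column that the categories do not cover.
missing = sorted(set(values.dropna().unique()) - set(categories))

# Casting to the categorical dtype turns every uncovered value into NaN.
dtype = pd.CategoricalDtype(categories=categories, ordered=True)
encoded = values.astype(dtype)
all_null = bool(encoded.isna().all())

print(missing)
print(all_null)
```

If `missing` is non-empty, either extend `categories` or accept that those rows will be null after encoding.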