pyjanitor Usage Walkthrough¶

pyjanitor is a Python-based API on top of pandas inspired by the janitor R package. It aims to provide a clean, understandable interface based on method chaining for common and less-common tasks involving data cleaning and DataFrame manipulation.

The core philosophy and augmentations on pandas' approach to data cleaning and DataFrame manipulation include:

A method-chaining paradigm for coding efficiency & clarity of code through greatly improved readability.
Implementation of common, useful DataFrame manipulation tasks that saves on repetitive code.
Focus on active tense / verb approaches to function naming to provide at-a-glance understanding of a data manipulation pipeline.

Why `pyjanitor`?¶

Originally a simple port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method-chaining paradigm.

Data preprocessing is best expressed as a directed acyclic graph (DAG) of actions taken on data. We take a base data file as the starting point and perform actions on it such as removing null/empty rows, replacing them with other values, adding/renaming/removing columns of data, filtering rows, and more.

The pandas API has been invaluable for the Python data science ecosystem and implements method chaining for a subset of methods as part of the API. For example, resetting indexes (.reset_index()), dropping null values (.dropna()), and more are accomplished via the appropriate pd.DataFrame method calls.

Inspired by the R statistical language ecosystem, where consistent and good API design in the dplyr package enables end-users, who are not necessarily developers, to concisely express data processing code, pyjanitor has evolved into a language for expressing the data processing DAG for pandas users.

What is method chaining?¶

To accomplish these goals, actions for which we would need to invoke imperative-style statements can be replaced with method chains that allow the user to read off the logical order of actions taken. Note the annotated example below. First, we introduce the textual description of a sample data cleaning pipeline:

Create DataFrame.
Delete one column.
Drop rows with empty values in two particular columns.
Rename another two columns.
Add a new column.
Reset index to account for the missing row we removed above

In pandas code, this would look as such:

df = pd.DataFrame(...)  # create a pandas DataFrame somehow.
del df['column1']  # delete a column from the dataframe.
df = df.dropna(subset=['column2', 'column3'])  # drop rows that have empty values in column 2 and 3.
df = df.rename({'column2': 'unicorns', 'column3': 'dragons'})  # rename column2 and column3
df['new_column'] = ['iterable', 'of', 'items']  # add a new column.
df.reset_index(inplace=True, drop=True)  # reset index to account for the missing row we removed above

The `pyjanitor` approach¶

With pyjanitor, we enable method chaining with method names that are verbs which describe the action taken:

df = (
    pd.DataFrame(...)
    .remove_columns(['column1'])
    .dropna(subset=['column2', 'column3'])
    .rename_column('column2', 'unicorns')
    .rename_column('column3', 'dragons')
    .add_column('new_column', ['iterable', 'of', 'items'])
    .reset_index(drop=True)
)

We believe the pyjanitor chaining-based approach leads to much cleaner code where the intent of a series of DataFrame manipulations is much more immediately clear.

pyjanitor’s etymology has a two-fold relationship to “cleanliness”. Firstly, it’s about extending pandas with convenient data cleaning routines. Secondly, it’s about providing a cleaner, method-chaining, verb-based API for common pandas routines.

A survey of `pyjanitor` functions¶

Cleaning column names (multi-indexes are possible!)
Removing empty rows and columns
Identifying duplicate entries
Encoding columns as categorical
Splitting your data into features and targets (for machine learning)
Adding, removing, and renaming columns
Coalesce multiple columns into a single column
Convert excel date (serial format) into a Python datetime format
Expand a single column that has delimited, categorical values into dummy-encoded variables

A full list of functionality that pyjanitor implements can be found in the API docs.

Some things that are different¶

Some pyjanitor methods are DataFrame-mutating operations, i.e., in place. Given that in a method-chaining paradigm, DataFrames that would be created at each step of the chain cannot be accessed anyway, duplication of data at each step would lead to unnecessary, potential considerable slowdowns and increased memory usage due to data-copying operations. The severity of such copying scales with DataFrame size. Take care to understand which functions change the original DataFrame you are chaining on if it is necessary to preserve that data. If it is, you can simply .copy() it as the first step in a df.copy().operation1().operation2()... chain.

How it works¶

pyjanitor relies on the Pandas Flavor package to register new functions as object methods that can be called directly on DataFrames. For example:

import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def remove_column(df, column_name: str):
    del df[column_name]
    return df

df = (
    pd.read_csv('my_data.csv')
    .remove_column('my_column_name')
    .operation2(...)
)

Importing the janitor package immediately registers these functions. The fact that each DataFrame method pyjanitor registers returns the DataFrame is what gives it the capability to method chain.

Note that pyjanitor explicitly does not modify any existing pandas methods / functionality.

Demo of various `DataFrame` manipulation tasks using `pyjanitor`¶

Here, we'll walk through some useful pyjanitor-based approaches to cleaning and manipulating DataFrames.

Code preamble:¶

In [1]:

import numpy as np
import pandas as pd

Let's take a look at our dataset:¶

In [2]:

df = pd.read_excel('dirty_data.xlsx', engine='openpyxl')
df

Out[2]:

	First Name	Last Name	Employee Status	Subject	Hire Date	% Allocated	Full time?	do not edit! --->	Certification	Certification.1	Certification.2	Unnamed: 11
0	Jason	Bourne	Teacher	PE	39690.0	0.75	Yes	NaN	Physical ed	Theater	NaN	NaN
1	Jason	Bourne	Teacher	Drafting	39690.0	0.25	Yes	NaN	Physical ed	Theater	NaN	NaN
2	Alicia	Keys	Teacher	Music	37118.0	1.00	Yes	NaN	Instr. music	Vocal music	NaN	NaN
3	Ada	Lovelace	Teacher	NaN	27515.0	1.00	Yes	NaN	PENDING	Computers	NaN	NaN
4	Desus	Nice	Administration	Dean	41431.0	1.00	Yes	NaN	PENDING	NaN	NaN	NaN
5	Chien-Shiung	Wu	Teacher	Physics	11037.0	0.50	Yes	NaN	Science 6-12	Physics	NaN	NaN
6	Chien-Shiung	Wu	Teacher	Chemistry	11037.0	0.50	Yes	NaN	Science 6-12	Physics	NaN	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	James	Joyce	Teacher	English	32994.0	0.50	No	NaN	NaN	English 6-12	NaN	NaN
9	Hedy	Lamarr	Teacher	Science	27919.0	0.50	No	NaN	PENDING	NaN	NaN	NaN
10	Carlos	Boozer	Coach	Basketball	42221.0	NaN	No	NaN	Physical ed	NaN	NaN	NaN
11	Young	Boozer	Coach	NaN	34700.0	NaN	No	NaN	NaN	Political sci.	NaN	NaN
12	Micheal	Larsen	Teacher	English	40071.0	0.80	No	NaN	Vocal music	English	NaN	NaN
13	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
16	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

We can see that this dataset is dirty in a number of ways, including the following:

Column names contain spaces, punctuation marks, and inconsistent casing
One row (7) with completely missing data
One column (do not edit! --->) with completely missing data

Clean up our data using a `pyjanitor` method-chaining pipeline¶

Let's run through a demo DataFrame cleaning procedure:

In [3]:

cleaned_df = (
    pd.read_excel('dirty_data.xlsx', engine='openpyxl')
    .clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
    .coalesce(["certification", "certification_1"], "certification")
    .encode_categorical(["subject", "employee_status", "full_time"])
    .convert_excel_date("hire_date")
    .reset_index(drop=True)
)

cleaned_df

Out[3]:

	first_name	last_name	employee_status	subject	hire_date	percent_allocated	full_time	certification
0	Jason	Bourne	Teacher	PE	2008-08-30	0.75	Yes	Physical ed
1	Jason	Bourne	Teacher	Drafting	2008-08-30	0.25	Yes	Physical ed
2	Alicia	Keys	Teacher	Music	2001-08-15	1.00	Yes	Instr. music
3	Ada	Lovelace	Teacher	NaN	1975-05-01	1.00	Yes	PENDING
4	Desus	Nice	Administration	Dean	2013-06-06	1.00	Yes	PENDING
5	Chien-Shiung	Wu	Teacher	Physics	1930-03-20	0.50	Yes	Science 6-12
6	Chien-Shiung	Wu	Teacher	Chemistry	1930-03-20	0.50	Yes	Science 6-12
7	James	Joyce	Teacher	English	1990-05-01	0.50	No	English 6-12
8	Hedy	Lamarr	Teacher	Science	1976-06-08	0.50	No	PENDING
9	Carlos	Boozer	Coach	Basketball	2015-08-05	NaN	No	Physical ed
10	Young	Boozer	Coach	NaN	1995-01-01	NaN	No	Political sci.
11	Micheal	Larsen	Teacher	English	2009-09-15	0.80	No	Vocal music

The cleaned DataFrame looks much better and quite a bit more usable for our downstream tasks.

Step-by-step walkthrough of `pyjanitor` `DataFrame` manipulations¶

Just for clearer understanding of the above, let's see how pyjanitor progressively modified the data.

Loading data in:

In [4]:

df = pd.read_excel('dirty_data.xlsx', engine='openpyxl')
df

Out[4]:

	First Name	Last Name	Employee Status	Subject	Hire Date	% Allocated	Full time?	do not edit! --->	Certification	Certification.1	Certification.2	Unnamed: 11
0	Jason	Bourne	Teacher	PE	39690.0	0.75	Yes	NaN	Physical ed	Theater	NaN	NaN
1	Jason	Bourne	Teacher	Drafting	39690.0	0.25	Yes	NaN	Physical ed	Theater	NaN	NaN
2	Alicia	Keys	Teacher	Music	37118.0	1.00	Yes	NaN	Instr. music	Vocal music	NaN	NaN
3	Ada	Lovelace	Teacher	NaN	27515.0	1.00	Yes	NaN	PENDING	Computers	NaN	NaN
4	Desus	Nice	Administration	Dean	41431.0	1.00	Yes	NaN	PENDING	NaN	NaN	NaN
5	Chien-Shiung	Wu	Teacher	Physics	11037.0	0.50	Yes	NaN	Science 6-12	Physics	NaN	NaN
6	Chien-Shiung	Wu	Teacher	Chemistry	11037.0	0.50	Yes	NaN	Science 6-12	Physics	NaN	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	James	Joyce	Teacher	English	32994.0	0.50	No	NaN	NaN	English 6-12	NaN	NaN
9	Hedy	Lamarr	Teacher	Science	27919.0	0.50	No	NaN	PENDING	NaN	NaN	NaN
10	Carlos	Boozer	Coach	Basketball	42221.0	NaN	No	NaN	Physical ed	NaN	NaN	NaN
11	Young	Boozer	Coach	NaN	34700.0	NaN	No	NaN	NaN	Political sci.	NaN	NaN
12	Micheal	Larsen	Teacher	English	40071.0	0.80	No	NaN	Vocal music	English	NaN	NaN
13	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
16	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Clean up names by removing whitespace, punctuation / symbols, capitalization:

In [5]:

df = df.clean_names()
df

Out[5]:

	first_name	last_name	employee_status	subject	hire_date	%_allocated	full_time_	do_not_edit!_>	certification	certification_1	certification_2	unnamed_11
0	Jason	Bourne	Teacher	PE	39690.0	0.75	Yes	NaN	Physical ed	Theater	NaN	NaN
1	Jason	Bourne	Teacher	Drafting	39690.0	0.25	Yes	NaN	Physical ed	Theater	NaN	NaN
2	Alicia	Keys	Teacher	Music	37118.0	1.00	Yes	NaN	Instr. music	Vocal music	NaN	NaN
3	Ada	Lovelace	Teacher	NaN	27515.0	1.00	Yes	NaN	PENDING	Computers	NaN	NaN
4	Desus	Nice	Administration	Dean	41431.0	1.00	Yes	NaN	PENDING	NaN	NaN	NaN
5	Chien-Shiung	Wu	Teacher	Physics	11037.0	0.50	Yes	NaN	Science 6-12	Physics	NaN	NaN
6	Chien-Shiung	Wu	Teacher	Chemistry	11037.0	0.50	Yes	NaN	Science 6-12	Physics	NaN	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	James	Joyce	Teacher	English	32994.0	0.50	No	NaN	NaN	English 6-12	NaN	NaN
9	Hedy	Lamarr	Teacher	Science	27919.0	0.50	No	NaN	PENDING	NaN	NaN	NaN
10	Carlos	Boozer	Coach	Basketball	42221.0	NaN	No	NaN	Physical ed	NaN	NaN	NaN
11	Young	Boozer	Coach	NaN	34700.0	NaN	No	NaN	NaN	Political sci.	NaN	NaN
12	Micheal	Larsen	Teacher	English	40071.0	0.80	No	NaN	Vocal music	English	NaN	NaN
13	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
16	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Remove entirely empty rows / columns:

In [6]:

df = df.remove_empty()
df

Out[6]:

	first_name	last_name	employee_status	subject	hire_date	%_allocated	full_time_	certification	certification_1
0	Jason	Bourne	Teacher	PE	39690.0	0.75	Yes	Physical ed	Theater
1	Jason	Bourne	Teacher	Drafting	39690.0	0.25	Yes	Physical ed	Theater
2	Alicia	Keys	Teacher	Music	37118.0	1.00	Yes	Instr. music	Vocal music
3	Ada	Lovelace	Teacher	NaN	27515.0	1.00	Yes	PENDING	Computers
4	Desus	Nice	Administration	Dean	41431.0	1.00	Yes	PENDING	NaN
5	Chien-Shiung	Wu	Teacher	Physics	11037.0	0.50	Yes	Science 6-12	Physics
6	Chien-Shiung	Wu	Teacher	Chemistry	11037.0	0.50	Yes	Science 6-12	Physics
7	James	Joyce	Teacher	English	32994.0	0.50	No	NaN	English 6-12
8	Hedy	Lamarr	Teacher	Science	27919.0	0.50	No	PENDING	NaN
9	Carlos	Boozer	Coach	Basketball	42221.0	NaN	No	Physical ed	NaN
10	Young	Boozer	Coach	NaN	34700.0	NaN	No	NaN	Political sci.
11	Micheal	Larsen	Teacher	English	40071.0	0.80	No	Vocal music	English

Rename particular columns:

In [7]:

df = (
    df.rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
)
df

Out[7]:

	first_name	last_name	employee_status	subject	hire_date	percent_allocated	full_time	certification	certification_1
0	Jason	Bourne	Teacher	PE	39690.0	0.75	Yes	Physical ed	Theater
1	Jason	Bourne	Teacher	Drafting	39690.0	0.25	Yes	Physical ed	Theater
2	Alicia	Keys	Teacher	Music	37118.0	1.00	Yes	Instr. music	Vocal music
3	Ada	Lovelace	Teacher	NaN	27515.0	1.00	Yes	PENDING	Computers
4	Desus	Nice	Administration	Dean	41431.0	1.00	Yes	PENDING	NaN
5	Chien-Shiung	Wu	Teacher	Physics	11037.0	0.50	Yes	Science 6-12	Physics
6	Chien-Shiung	Wu	Teacher	Chemistry	11037.0	0.50	Yes	Science 6-12	Physics
7	James	Joyce	Teacher	English	32994.0	0.50	No	NaN	English 6-12
8	Hedy	Lamarr	Teacher	Science	27919.0	0.50	No	PENDING	NaN
9	Carlos	Boozer	Coach	Basketball	42221.0	NaN	No	Physical ed	NaN
10	Young	Boozer	Coach	NaN	34700.0	NaN	No	NaN	Political sci.
11	Micheal	Larsen	Teacher	English	40071.0	0.80	No	Vocal music	English

Take first non-NaN row value in two columns:

In [8]:

df = df.coalesce(["certification", "certification_1"], "certification")
df

Out[8]:

	first_name	last_name	employee_status	subject	hire_date	percent_allocated	full_time	certification
0	Jason	Bourne	Teacher	PE	39690.0	0.75	Yes	Physical ed
1	Jason	Bourne	Teacher	Drafting	39690.0	0.25	Yes	Physical ed
2	Alicia	Keys	Teacher	Music	37118.0	1.00	Yes	Instr. music
3	Ada	Lovelace	Teacher	NaN	27515.0	1.00	Yes	PENDING
4	Desus	Nice	Administration	Dean	41431.0	1.00	Yes	PENDING
5	Chien-Shiung	Wu	Teacher	Physics	11037.0	0.50	Yes	Science 6-12
6	Chien-Shiung	Wu	Teacher	Chemistry	11037.0	0.50	Yes	Science 6-12
7	James	Joyce	Teacher	English	32994.0	0.50	No	English 6-12
8	Hedy	Lamarr	Teacher	Science	27919.0	0.50	No	PENDING
9	Carlos	Boozer	Coach	Basketball	42221.0	NaN	No	Physical ed
10	Young	Boozer	Coach	NaN	34700.0	NaN	No	Political sci.
11	Micheal	Larsen	Teacher	English	40071.0	0.80	No	Vocal music

Convert string object rows to categorical to save on memory consumption and speed up access:

In [9]:

df.dtypes

Out[9]:

first_name            object
last_name             object
employee_status       object
subject               object
hire_date            float64
percent_allocated    float64
full_time             object
certification         object
dtype: object

In [10]:

df.encode_categorical(["subject", "employee_status", "full_time"])
df.dtypes

Out[10]:

first_name             object
last_name              object
employee_status      category
subject              category
hire_date             float64
percent_allocated     float64
full_time            category
certification          object
dtype: object

Convert Excel date-formatted column to a more interpretable format:

In [11]:

df.convert_excel_date("hire_date")
df

Out[11]:

	first_name	last_name	employee_status	subject	hire_date	percent_allocated	full_time	certification
0	Jason	Bourne	Teacher	PE	2008-08-30	0.75	Yes	Physical ed
1	Jason	Bourne	Teacher	Drafting	2008-08-30	0.25	Yes	Physical ed
2	Alicia	Keys	Teacher	Music	2001-08-15	1.00	Yes	Instr. music
3	Ada	Lovelace	Teacher	NaN	1975-05-01	1.00	Yes	PENDING
4	Desus	Nice	Administration	Dean	2013-06-06	1.00	Yes	PENDING
5	Chien-Shiung	Wu	Teacher	Physics	1930-03-20	0.50	Yes	Science 6-12
6	Chien-Shiung	Wu	Teacher	Chemistry	1930-03-20	0.50	Yes	Science 6-12
7	James	Joyce	Teacher	English	1990-05-01	0.50	No	English 6-12
8	Hedy	Lamarr	Teacher	Science	1976-06-08	0.50	No	PENDING
9	Carlos	Boozer	Coach	Basketball	2015-08-05	NaN	No	Physical ed
10	Young	Boozer	Coach	NaN	1995-01-01	NaN	No	Political sci.
11	Micheal	Larsen	Teacher	English	2009-09-15	0.80	No	Vocal music

Example analysis of the data¶

Let's perform analysis on the above, cleaned DataFrame. First we add some additional, randomly-generated data. Note that we .copy() the original to preserve it, given that the following would otherwise modify it:

In [12]:

data_df = (
    cleaned_df
    .copy()
    .add_columns(
        lucky_number=np.random.randint(0, 10, len(cleaned_df)),
        age=np.random.randint(10, 100, len(cleaned_df)),
        employee_of_month_count=np.random.randint(0, 5, len(cleaned_df))
    )
)

data_df

Out[12]:

	first_name	last_name	employee_status	subject	hire_date	percent_allocated	full_time	certification	lucky_number	age	employee_of_month_count
0	Jason	Bourne	Teacher	PE	2008-08-30	0.75	Yes	Physical ed	6	74	0
1	Jason	Bourne	Teacher	Drafting	2008-08-30	0.25	Yes	Physical ed	6	90	4
2	Alicia	Keys	Teacher	Music	2001-08-15	1.00	Yes	Instr. music	2	47	0
3	Ada	Lovelace	Teacher	NaN	1975-05-01	1.00	Yes	PENDING	9	76	0
4	Desus	Nice	Administration	Dean	2013-06-06	1.00	Yes	PENDING	4	69	2
5	Chien-Shiung	Wu	Teacher	Physics	1930-03-20	0.50	Yes	Science 6-12	7	61	3
6	Chien-Shiung	Wu	Teacher	Chemistry	1930-03-20	0.50	Yes	Science 6-12	4	47	0
7	James	Joyce	Teacher	English	1990-05-01	0.50	No	English 6-12	9	34	3
8	Hedy	Lamarr	Teacher	Science	1976-06-08	0.50	No	PENDING	6	23	0
9	Carlos	Boozer	Coach	Basketball	2015-08-05	NaN	No	Physical ed	1	21	3
10	Young	Boozer	Coach	NaN	1995-01-01	NaN	No	Political sci.	1	83	1
11	Micheal	Larsen	Teacher	English	2009-09-15	0.80	No	Vocal music	5	81	2

Calculate mean, median of all numerical columns after grouping by employee status. Use .collapse_levels(), a pyjanitor convenience function, to convert the DataFrame returned by .agg() from having multi-level columns (because we supplied a list of aggregation operations) to single-level by concatenating the level names with an underscore:

In [13]:

stats_df = (
    data_df.groupby('employee_status')
    .agg(['mean', 'median'])
    .collapse_levels()
    .reset_index()
)

stats_df

Out[13]:

	employee_status	percent_allocated_mean	percent_allocated_median	lucky_number_mean	lucky_number_median	age_mean	age_median	employee_of_month_count_mean	employee_of_month_count_median
0	Administration	1.000000	1.0	4	4	69.000000	69	2.000000	2
1	Coach	NaN	NaN	1	1	52.000000	52	2.000000	2
2	Teacher	0.644444	0.5	6	6	59.222222	61	1.333333	0

pyjanitor Usage Walkthrough¶

Why pyjanitor?¶