pyjanitor
is a Python-based API on top of pandas
inspired by the janitor R package. It aims to provide a clean, understandable interface based on method chaining for common and less-common tasks involving data cleaning and DataFrame
manipulation.
The core philosophy and augmentations on pandas
' approach to data cleaning and DataFrame
manipulation include:
DataFrame
manipulation tasks that saves on repetitive code.pyjanitor
?¶Originally a simple port of the R package, pyjanitor
has evolved from a set of convenient data cleaning routines into an experiment with the method-chaining paradigm.
Data preprocessing is best expressed as a directed acyclic graph (DAG) of actions taken on data. We take a base data file as the starting point and perform actions on it such as removing null/empty rows, replacing them with other values, adding/renaming/removing columns of data, filtering rows, and more.
The pandas
API has been invaluable for the Python data science ecosystem and implements method chaining for a subset of methods as part of the API. For example, resetting indexes (.reset_index()
), dropping null values (.dropna()
), and more are accomplished via the appropriate pd.DataFrame
method calls.
Inspired by the R statistical language ecosystem, where consistent and good API design in the dplyr
package enables end-users, who are not necessarily developers, to concisely express data processing code, pyjanitor
has evolved into a language for expressing the data processing DAG for pandas
users.
To accomplish these goals, actions for which we would need to invoke imperative-style statements can be replaced with method chains that allow the user to read off the logical order of actions taken. Note the annotated example below. First, we introduce the textual description of a sample data cleaning pipeline:
DataFrame
.In pandas
code, this would look as such:
df = pd.DataFrame(...) # create a pandas DataFrame somehow.
del df['column1'] # delete a column from the dataframe.
df = df.dropna(subset=['column2', 'column3']) # drop rows that have empty values in column 2 and 3.
df = df.rename({'column2': 'unicorns', 'column3': 'dragons'}) # rename column2 and column3
df['new_column'] = ['iterable', 'of', 'items'] # add a new column.
df.reset_index(inplace=True, drop=True) # reset index to account for the missing row we removed above
pyjanitor
approach¶With pyjanitor
, we enable method chaining with method names that are verbs which describe the action taken:
df = (
pd.DataFrame(...)
.remove_columns(['column1'])
.dropna(subset=['column2', 'column3'])
.rename_column('column2', 'unicorns')
.rename_column('column3', 'dragons')
.add_column('new_column', ['iterable', 'of', 'items'])
.reset_index(drop=True)
)
We believe the pyjanitor
chaining-based approach leads to much cleaner code where the intent of a series of DataFrame
manipulations is much more immediately clear.
pyjanitor
’s etymology has a two-fold relationship to “cleanliness”. Firstly, it’s about extending pandas
with convenient data cleaning routines. Secondly, it’s about providing a cleaner, method-chaining, verb-based API for common pandas
routines.
pyjanitor
functions¶A full list of functionality that pyjanitor
implements can be found in the API docs.
Some pyjanitor
methods are DataFrame
-mutating operations, i.e., in place. Given that in a method-chaining paradigm, DataFrame
s that would be created at each step of the chain cannot be accessed anyway, duplication of data at each step would lead to unnecessary, potential considerable slowdowns and increased memory usage due to data-copying operations. The severity of such copying scales with DataFrame
size. Take care to understand which functions change the original DataFrame
you are chaining on if it is necessary to preserve that data. If it is, you can simply .copy()
it as the first step in a df.copy().operation1().operation2()...
chain.
pyjanitor
relies on the Pandas Flavor package to register new functions as object methods that can be called directly on DataFrame
s. For example:
import pandas as pd
import pandas_flavor as pf
@pf.register_dataframe_method
def remove_column(df, column_name: str):
del df[column_name]
return df
df = (
pd.read_csv('my_data.csv')
.remove_column('my_column_name')
.operation2(...)
)
Importing the janitor
package immediately registers these functions. The fact that each DataFrame
method pyjanitor
registers returns the DataFrame
is what gives it the capability to method chain.
Note that pyjanitor
explicitly does not modify any existing pandas
methods / functionality.
DataFrame
manipulation tasks using pyjanitor
¶Here, we'll walk through some useful pyjanitor
-based approaches to cleaning and manipulating DataFrame
s.
import numpy as np
import pandas as pd
df = pd.read_excel('dirty_data.xlsx', engine='openpyxl')
df
First Name | Last Name | Employee Status | Subject | Hire Date | % Allocated | Full time? | do not edit! ---> | Certification | Certification.1 | Certification.2 | Unnamed: 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 39690.0 | 0.75 | Yes | NaN | Physical ed | Theater | NaN | NaN |
1 | Jason | Bourne | Teacher | Drafting | 39690.0 | 0.25 | Yes | NaN | Physical ed | Theater | NaN | NaN |
2 | Alicia | Keys | Teacher | Music | 37118.0 | 1.00 | Yes | NaN | Instr. music | Vocal music | NaN | NaN |
3 | Ada | Lovelace | Teacher | NaN | 27515.0 | 1.00 | Yes | NaN | PENDING | Computers | NaN | NaN |
4 | Desus | Nice | Administration | Dean | 41431.0 | 1.00 | Yes | NaN | PENDING | NaN | NaN | NaN |
5 | Chien-Shiung | Wu | Teacher | Physics | 11037.0 | 0.50 | Yes | NaN | Science 6-12 | Physics | NaN | NaN |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 11037.0 | 0.50 | Yes | NaN | Science 6-12 | Physics | NaN | NaN |
7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | James | Joyce | Teacher | English | 32994.0 | 0.50 | No | NaN | NaN | English 6-12 | NaN | NaN |
9 | Hedy | Lamarr | Teacher | Science | 27919.0 | 0.50 | No | NaN | PENDING | NaN | NaN | NaN |
10 | Carlos | Boozer | Coach | Basketball | 42221.0 | NaN | No | NaN | Physical ed | NaN | NaN | NaN |
11 | Young | Boozer | Coach | NaN | 34700.0 | NaN | No | NaN | NaN | Political sci. | NaN | NaN |
12 | Micheal | Larsen | Teacher | English | 40071.0 | 0.80 | No | NaN | Vocal music | English | NaN | NaN |
13 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
We can see that this dataset is dirty in a number of ways, including the following:
7
) with completely missing datado not edit! --->
) with completely missing datapyjanitor
method-chaining pipeline¶Let's run through a demo DataFrame
cleaning procedure:
cleaned_df = (
pd.read_excel('dirty_data.xlsx', engine='openpyxl')
.clean_names()
.remove_empty()
.rename_column("%_allocated", "percent_allocated")
.rename_column("full_time_", "full_time")
.coalesce(["certification", "certification_1"], "certification")
.encode_categorical(["subject", "employee_status", "full_time"])
.convert_excel_date("hire_date")
.reset_index(drop=True)
)
cleaned_df
first_name | last_name | employee_status | subject | hire_date | percent_allocated | full_time | certification | |
---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 2008-08-30 | 0.75 | Yes | Physical ed |
1 | Jason | Bourne | Teacher | Drafting | 2008-08-30 | 0.25 | Yes | Physical ed |
2 | Alicia | Keys | Teacher | Music | 2001-08-15 | 1.00 | Yes | Instr. music |
3 | Ada | Lovelace | Teacher | NaN | 1975-05-01 | 1.00 | Yes | PENDING |
4 | Desus | Nice | Administration | Dean | 2013-06-06 | 1.00 | Yes | PENDING |
5 | Chien-Shiung | Wu | Teacher | Physics | 1930-03-20 | 0.50 | Yes | Science 6-12 |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 1930-03-20 | 0.50 | Yes | Science 6-12 |
7 | James | Joyce | Teacher | English | 1990-05-01 | 0.50 | No | English 6-12 |
8 | Hedy | Lamarr | Teacher | Science | 1976-06-08 | 0.50 | No | PENDING |
9 | Carlos | Boozer | Coach | Basketball | 2015-08-05 | NaN | No | Physical ed |
10 | Young | Boozer | Coach | NaN | 1995-01-01 | NaN | No | Political sci. |
11 | Micheal | Larsen | Teacher | English | 2009-09-15 | 0.80 | No | Vocal music |
The cleaned DataFrame
looks much better and quite a bit more usable for our downstream tasks.
pyjanitor
DataFrame
manipulations¶Just for clearer understanding of the above, let's see how pyjanitor
progressively modified the data.
Loading data in:
df = pd.read_excel('dirty_data.xlsx', engine='openpyxl')
df
First Name | Last Name | Employee Status | Subject | Hire Date | % Allocated | Full time? | do not edit! ---> | Certification | Certification.1 | Certification.2 | Unnamed: 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 39690.0 | 0.75 | Yes | NaN | Physical ed | Theater | NaN | NaN |
1 | Jason | Bourne | Teacher | Drafting | 39690.0 | 0.25 | Yes | NaN | Physical ed | Theater | NaN | NaN |
2 | Alicia | Keys | Teacher | Music | 37118.0 | 1.00 | Yes | NaN | Instr. music | Vocal music | NaN | NaN |
3 | Ada | Lovelace | Teacher | NaN | 27515.0 | 1.00 | Yes | NaN | PENDING | Computers | NaN | NaN |
4 | Desus | Nice | Administration | Dean | 41431.0 | 1.00 | Yes | NaN | PENDING | NaN | NaN | NaN |
5 | Chien-Shiung | Wu | Teacher | Physics | 11037.0 | 0.50 | Yes | NaN | Science 6-12 | Physics | NaN | NaN |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 11037.0 | 0.50 | Yes | NaN | Science 6-12 | Physics | NaN | NaN |
7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | James | Joyce | Teacher | English | 32994.0 | 0.50 | No | NaN | NaN | English 6-12 | NaN | NaN |
9 | Hedy | Lamarr | Teacher | Science | 27919.0 | 0.50 | No | NaN | PENDING | NaN | NaN | NaN |
10 | Carlos | Boozer | Coach | Basketball | 42221.0 | NaN | No | NaN | Physical ed | NaN | NaN | NaN |
11 | Young | Boozer | Coach | NaN | 34700.0 | NaN | No | NaN | NaN | Political sci. | NaN | NaN |
12 | Micheal | Larsen | Teacher | English | 40071.0 | 0.80 | No | NaN | Vocal music | English | NaN | NaN |
13 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Clean up names by removing whitespace, punctuation / symbols, capitalization:
df = df.clean_names()
df
first_name | last_name | employee_status | subject | hire_date | %_allocated | full_time_ | do_not_edit!_> | certification | certification_1 | certification_2 | unnamed_11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 39690.0 | 0.75 | Yes | NaN | Physical ed | Theater | NaN | NaN |
1 | Jason | Bourne | Teacher | Drafting | 39690.0 | 0.25 | Yes | NaN | Physical ed | Theater | NaN | NaN |
2 | Alicia | Keys | Teacher | Music | 37118.0 | 1.00 | Yes | NaN | Instr. music | Vocal music | NaN | NaN |
3 | Ada | Lovelace | Teacher | NaN | 27515.0 | 1.00 | Yes | NaN | PENDING | Computers | NaN | NaN |
4 | Desus | Nice | Administration | Dean | 41431.0 | 1.00 | Yes | NaN | PENDING | NaN | NaN | NaN |
5 | Chien-Shiung | Wu | Teacher | Physics | 11037.0 | 0.50 | Yes | NaN | Science 6-12 | Physics | NaN | NaN |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 11037.0 | 0.50 | Yes | NaN | Science 6-12 | Physics | NaN | NaN |
7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | James | Joyce | Teacher | English | 32994.0 | 0.50 | No | NaN | NaN | English 6-12 | NaN | NaN |
9 | Hedy | Lamarr | Teacher | Science | 27919.0 | 0.50 | No | NaN | PENDING | NaN | NaN | NaN |
10 | Carlos | Boozer | Coach | Basketball | 42221.0 | NaN | No | NaN | Physical ed | NaN | NaN | NaN |
11 | Young | Boozer | Coach | NaN | 34700.0 | NaN | No | NaN | NaN | Political sci. | NaN | NaN |
12 | Micheal | Larsen | Teacher | English | 40071.0 | 0.80 | No | NaN | Vocal music | English | NaN | NaN |
13 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Remove entirely empty rows / columns:
df = df.remove_empty()
df
first_name | last_name | employee_status | subject | hire_date | %_allocated | full_time_ | certification | certification_1 | |
---|---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 39690.0 | 0.75 | Yes | Physical ed | Theater |
1 | Jason | Bourne | Teacher | Drafting | 39690.0 | 0.25 | Yes | Physical ed | Theater |
2 | Alicia | Keys | Teacher | Music | 37118.0 | 1.00 | Yes | Instr. music | Vocal music |
3 | Ada | Lovelace | Teacher | NaN | 27515.0 | 1.00 | Yes | PENDING | Computers |
4 | Desus | Nice | Administration | Dean | 41431.0 | 1.00 | Yes | PENDING | NaN |
5 | Chien-Shiung | Wu | Teacher | Physics | 11037.0 | 0.50 | Yes | Science 6-12 | Physics |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 11037.0 | 0.50 | Yes | Science 6-12 | Physics |
7 | James | Joyce | Teacher | English | 32994.0 | 0.50 | No | NaN | English 6-12 |
8 | Hedy | Lamarr | Teacher | Science | 27919.0 | 0.50 | No | PENDING | NaN |
9 | Carlos | Boozer | Coach | Basketball | 42221.0 | NaN | No | Physical ed | NaN |
10 | Young | Boozer | Coach | NaN | 34700.0 | NaN | No | NaN | Political sci. |
11 | Micheal | Larsen | Teacher | English | 40071.0 | 0.80 | No | Vocal music | English |
Rename particular columns:
df = (
df.rename_column("%_allocated", "percent_allocated")
.rename_column("full_time_", "full_time")
)
df
first_name | last_name | employee_status | subject | hire_date | percent_allocated | full_time | certification | certification_1 | |
---|---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 39690.0 | 0.75 | Yes | Physical ed | Theater |
1 | Jason | Bourne | Teacher | Drafting | 39690.0 | 0.25 | Yes | Physical ed | Theater |
2 | Alicia | Keys | Teacher | Music | 37118.0 | 1.00 | Yes | Instr. music | Vocal music |
3 | Ada | Lovelace | Teacher | NaN | 27515.0 | 1.00 | Yes | PENDING | Computers |
4 | Desus | Nice | Administration | Dean | 41431.0 | 1.00 | Yes | PENDING | NaN |
5 | Chien-Shiung | Wu | Teacher | Physics | 11037.0 | 0.50 | Yes | Science 6-12 | Physics |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 11037.0 | 0.50 | Yes | Science 6-12 | Physics |
7 | James | Joyce | Teacher | English | 32994.0 | 0.50 | No | NaN | English 6-12 |
8 | Hedy | Lamarr | Teacher | Science | 27919.0 | 0.50 | No | PENDING | NaN |
9 | Carlos | Boozer | Coach | Basketball | 42221.0 | NaN | No | Physical ed | NaN |
10 | Young | Boozer | Coach | NaN | 34700.0 | NaN | No | NaN | Political sci. |
11 | Micheal | Larsen | Teacher | English | 40071.0 | 0.80 | No | Vocal music | English |
Take first non-NaN
row value in two columns:
df = df.coalesce(["certification", "certification_1"], "certification")
df
first_name | last_name | employee_status | subject | hire_date | percent_allocated | full_time | certification | |
---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 39690.0 | 0.75 | Yes | Physical ed |
1 | Jason | Bourne | Teacher | Drafting | 39690.0 | 0.25 | Yes | Physical ed |
2 | Alicia | Keys | Teacher | Music | 37118.0 | 1.00 | Yes | Instr. music |
3 | Ada | Lovelace | Teacher | NaN | 27515.0 | 1.00 | Yes | PENDING |
4 | Desus | Nice | Administration | Dean | 41431.0 | 1.00 | Yes | PENDING |
5 | Chien-Shiung | Wu | Teacher | Physics | 11037.0 | 0.50 | Yes | Science 6-12 |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 11037.0 | 0.50 | Yes | Science 6-12 |
7 | James | Joyce | Teacher | English | 32994.0 | 0.50 | No | English 6-12 |
8 | Hedy | Lamarr | Teacher | Science | 27919.0 | 0.50 | No | PENDING |
9 | Carlos | Boozer | Coach | Basketball | 42221.0 | NaN | No | Physical ed |
10 | Young | Boozer | Coach | NaN | 34700.0 | NaN | No | Political sci. |
11 | Micheal | Larsen | Teacher | English | 40071.0 | 0.80 | No | Vocal music |
Convert string object rows to categorical to save on memory consumption and speed up access:
df.dtypes
first_name object last_name object employee_status object subject object hire_date float64 percent_allocated float64 full_time object certification object dtype: object
df.encode_categorical(["subject", "employee_status", "full_time"])
df.dtypes
first_name object last_name object employee_status category subject category hire_date float64 percent_allocated float64 full_time category certification object dtype: object
Convert Excel date-formatted column to a more interpretable format:
df.convert_excel_date("hire_date")
df
first_name | last_name | employee_status | subject | hire_date | percent_allocated | full_time | certification | |
---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 2008-08-30 | 0.75 | Yes | Physical ed |
1 | Jason | Bourne | Teacher | Drafting | 2008-08-30 | 0.25 | Yes | Physical ed |
2 | Alicia | Keys | Teacher | Music | 2001-08-15 | 1.00 | Yes | Instr. music |
3 | Ada | Lovelace | Teacher | NaN | 1975-05-01 | 1.00 | Yes | PENDING |
4 | Desus | Nice | Administration | Dean | 2013-06-06 | 1.00 | Yes | PENDING |
5 | Chien-Shiung | Wu | Teacher | Physics | 1930-03-20 | 0.50 | Yes | Science 6-12 |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 1930-03-20 | 0.50 | Yes | Science 6-12 |
7 | James | Joyce | Teacher | English | 1990-05-01 | 0.50 | No | English 6-12 |
8 | Hedy | Lamarr | Teacher | Science | 1976-06-08 | 0.50 | No | PENDING |
9 | Carlos | Boozer | Coach | Basketball | 2015-08-05 | NaN | No | Physical ed |
10 | Young | Boozer | Coach | NaN | 1995-01-01 | NaN | No | Political sci. |
11 | Micheal | Larsen | Teacher | English | 2009-09-15 | 0.80 | No | Vocal music |
Let's perform analysis on the above, cleaned DataFrame
. First we add some additional, randomly-generated data. Note that we .copy()
the original to preserve it, given that the following would otherwise modify it:
data_df = (
cleaned_df
.copy()
.add_columns(
lucky_number=np.random.randint(0, 10, len(cleaned_df)),
age=np.random.randint(10, 100, len(cleaned_df)),
employee_of_month_count=np.random.randint(0, 5, len(cleaned_df))
)
)
data_df
first_name | last_name | employee_status | subject | hire_date | percent_allocated | full_time | certification | lucky_number | age | employee_of_month_count | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 2008-08-30 | 0.75 | Yes | Physical ed | 6 | 74 | 0 |
1 | Jason | Bourne | Teacher | Drafting | 2008-08-30 | 0.25 | Yes | Physical ed | 6 | 90 | 4 |
2 | Alicia | Keys | Teacher | Music | 2001-08-15 | 1.00 | Yes | Instr. music | 2 | 47 | 0 |
3 | Ada | Lovelace | Teacher | NaN | 1975-05-01 | 1.00 | Yes | PENDING | 9 | 76 | 0 |
4 | Desus | Nice | Administration | Dean | 2013-06-06 | 1.00 | Yes | PENDING | 4 | 69 | 2 |
5 | Chien-Shiung | Wu | Teacher | Physics | 1930-03-20 | 0.50 | Yes | Science 6-12 | 7 | 61 | 3 |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 1930-03-20 | 0.50 | Yes | Science 6-12 | 4 | 47 | 0 |
7 | James | Joyce | Teacher | English | 1990-05-01 | 0.50 | No | English 6-12 | 9 | 34 | 3 |
8 | Hedy | Lamarr | Teacher | Science | 1976-06-08 | 0.50 | No | PENDING | 6 | 23 | 0 |
9 | Carlos | Boozer | Coach | Basketball | 2015-08-05 | NaN | No | Physical ed | 1 | 21 | 3 |
10 | Young | Boozer | Coach | NaN | 1995-01-01 | NaN | No | Political sci. | 1 | 83 | 1 |
11 | Micheal | Larsen | Teacher | English | 2009-09-15 | 0.80 | No | Vocal music | 5 | 81 | 2 |
Calculate mean, median of all numerical columns after grouping by employee status. Use .collapse_levels()
, a pyjanitor
convenience function, to convert the DataFrame
returned by .agg()
from having multi-level columns (because we supplied a list of aggregation operations) to single-level by concatenating the level names with an underscore:
stats_df = (
data_df.groupby('employee_status')
.agg(['mean', 'median'])
.collapse_levels()
.reset_index()
)
stats_df
employee_status | percent_allocated_mean | percent_allocated_median | lucky_number_mean | lucky_number_median | age_mean | age_median | employee_of_month_count_mean | employee_of_month_count_median | |
---|---|---|---|---|---|---|---|---|---|
0 | Administration | 1.000000 | 1.0 | 4 | 4 | 69.000000 | 69 | 2.000000 | 2 |
1 | Coach | NaN | NaN | 1 | 1 | 52.000000 | 52 | 2.000000 | 2 |
2 | Teacher | 0.644444 | 0.5 | 6 | 6 | 59.222222 | 61 | 1.333333 | 0 |