This is fake data generated to demonstrate the capabilities of pyjanitor
. It contains a bunch of common problems that we regularly encounter when working with data. Let's go fix it!
Importing pyjanitor
is all that's needed to give Pandas Dataframes extra methods to work with your data.
import pandas as pd
import janitor
df = pd.read_excel('dirty_data.xlsx')
df
First Name | Last Name | Employee Status | Subject | Hire Date | % Allocated | Full time? | do not edit! ---> | Certification | Certification.1 | Certification.2 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 39690.0 | 0.75 | Yes | NaN | Physical ed | Theater | NaN |
1 | Jason | Bourne | Teacher | Drafting | 39690.0 | 0.25 | Yes | NaN | Physical ed | Theater | NaN |
2 | Alicia | Keys | Teacher | Music | 37118.0 | 1.00 | Yes | NaN | Instr. music | Vocal music | NaN |
3 | Ada | Lovelace | Teacher | NaN | 27515.0 | 1.00 | Yes | NaN | PENDING | Computers | NaN |
4 | Desus | Nice | Administration | Dean | 41431.0 | 1.00 | Yes | NaN | PENDING | NaN | NaN |
5 | Chien-Shiung | Wu | Teacher | Physics | 11037.0 | 0.50 | Yes | NaN | Science 6-12 | Physics | NaN |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 11037.0 | 0.50 | Yes | NaN | Science 6-12 | Physics | NaN |
7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | James | Joyce | Teacher | English | 32994.0 | 0.50 | No | NaN | NaN | English 6-12 | NaN |
9 | Hedy | Lamarr | Teacher | Science | 27919.0 | 0.50 | No | NaN | PENDING | NaN | NaN |
10 | Carlos | Boozer | Coach | Basketball | 42221.0 | NaN | No | NaN | Physical ed | NaN | NaN |
11 | Young | Boozer | Coach | NaN | 34700.0 | NaN | No | NaN | NaN | Political sci. | NaN |
12 | Micheal | Larsen | Teacher | English | 40071.0 | 0.80 | No | NaN | Vocal music | English | NaN |
There are a bunch of problems with this data. Firstly, the column names are not lowercase, and they have spaces. This will make it cumbersome to use in a programmatic function. To solve this, we can use the clean_names()
method.
df_clean = df.clean_names()
df_clean.head(2)
first_name | last_name | employee_status | subject | hire_date | %_allocated | full_time_ | do_not_edit!_> | certification | certification_1 | certification_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 39690.0 | 0.75 | Yes | NaN | Physical ed | Theater | NaN |
1 | Jason | Bourne | Teacher | Drafting | 39690.0 | 0.25 | Yes | NaN | Physical ed | Theater | NaN |
Notice now how the column names have been made better.
If you squint at the unclean dataset, you'll notice one row and one column of data that are missing. We can also fix this! Building on top of the code block from above, let's now remove those empty columns using the remove_empty()
method:
df_clean = df.clean_names().remove_empty()
df_clean.head(9).tail(4)
first_name | last_name | employee_status | subject | hire_date | %_allocated | full_time_ | certification | certification_1 | |
---|---|---|---|---|---|---|---|---|---|
5 | Chien-Shiung | Wu | Teacher | Physics | 11037.0 | 0.5 | Yes | Science 6-12 | Physics |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 11037.0 | 0.5 | Yes | Science 6-12 | Physics |
7 | James | Joyce | Teacher | English | 32994.0 | 0.5 | No | NaN | English 6-12 |
8 | Hedy | Lamarr | Teacher | Science | 27919.0 | 0.5 | No | PENDING | NaN |
Now this is starting to shape up well!
Next, let's rename some of the columns. %_allocated
and full_time?
contain non-alphanumeric characters, so they make it a bit harder to use. We can rename them using the :py:meth:rename_column()
method:
df_clean = (
df
.clean_names()
.remove_empty()
.rename_column("%_allocated", "percent_allocated")
.rename_column("full_time_", "full_time")
)
df_clean.head(5)
first_name | last_name | employee_status | subject | hire_date | percent_allocated | full_time | certification | certification_1 | |
---|---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 39690.0 | 0.75 | Yes | Physical ed | Theater |
1 | Jason | Bourne | Teacher | Drafting | 39690.0 | 0.25 | Yes | Physical ed | Theater |
2 | Alicia | Keys | Teacher | Music | 37118.0 | 1.00 | Yes | Instr. music | Vocal music |
3 | Ada | Lovelace | Teacher | NaN | 27515.0 | 1.00 | Yes | PENDING | Computers |
4 | Desus | Nice | Administration | Dean | 41431.0 | 1.00 | Yes | PENDING | NaN |
Note how now we have really nice column names! You might be wondering why I'm not modifying the two certifiation columns -- that is the next thing we'll tackle.
If we look more closely at the two certification
columns, we'll see that they look like this:
df_clean[['certification', 'certification_1']]
certification | certification_1 | |
---|---|---|
0 | Physical ed | Theater |
1 | Physical ed | Theater |
2 | Instr. music | Vocal music |
3 | PENDING | Computers |
4 | PENDING | NaN |
5 | Science 6-12 | Physics |
6 | Science 6-12 | Physics |
7 | NaN | English 6-12 |
8 | PENDING | NaN |
9 | Physical ed | NaN |
10 | NaN | Political sci. |
11 | Vocal music | English |
Rows 8 and 11 have NaN in the left certification column, but have a value in the right certification column. Let's assume for a moment that the left certification column is intended to record the first certification that a teacher had obtained. In this case, the values in the right certification column on rows 8 and 11 should be moved to the first column. Let's do that with Janitor, using the coalesce()
method, which does the following:
df_clean = (
df
.clean_names()
.remove_empty()
.rename_column("%_allocated", "percent_allocated")
.rename_column("full_time_", "full_time")
.coalesce(column_names=['certification', 'certification_1'], new_column_name='certification')
)
df_clean
first_name | last_name | employee_status | subject | hire_date | percent_allocated | full_time | certification | |
---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 39690.0 | 0.75 | Yes | Physical ed |
1 | Jason | Bourne | Teacher | Drafting | 39690.0 | 0.25 | Yes | Physical ed |
2 | Alicia | Keys | Teacher | Music | 37118.0 | 1.00 | Yes | Instr. music |
3 | Ada | Lovelace | Teacher | NaN | 27515.0 | 1.00 | Yes | PENDING |
4 | Desus | Nice | Administration | Dean | 41431.0 | 1.00 | Yes | PENDING |
5 | Chien-Shiung | Wu | Teacher | Physics | 11037.0 | 0.50 | Yes | Science 6-12 |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 11037.0 | 0.50 | Yes | Science 6-12 |
7 | James | Joyce | Teacher | English | 32994.0 | 0.50 | No | English 6-12 |
8 | Hedy | Lamarr | Teacher | Science | 27919.0 | 0.50 | No | PENDING |
9 | Carlos | Boozer | Coach | Basketball | 42221.0 | NaN | No | Physical ed |
10 | Young | Boozer | Coach | NaN | 34700.0 | NaN | No | Political sci. |
11 | Micheal | Larsen | Teacher | English | 40071.0 | 0.80 | No | Vocal music |
Awesome stuff! Now we don't have two columns of scattered data, we have one column of densely populated data.`
Finally, notice how the hire_date
column isn't date formatted. It's got this weird Excel serialization.
To clean up this data, we can use the :py:meth:convert_excel_date
method.
df_clean = (
df
.clean_names()
.remove_empty()
.rename_column('%_allocated', 'percent_allocated')
.rename_column('full_time_', 'full_time')
.coalesce(['certification', 'certification_1'], 'certification')
.convert_excel_date('hire_date')
)
df_clean
first_name | last_name | employee_status | subject | hire_date | percent_allocated | full_time | certification | |
---|---|---|---|---|---|---|---|---|
0 | Jason | Bourne | Teacher | PE | 2008-08-30 | 0.75 | Yes | Physical ed |
1 | Jason | Bourne | Teacher | Drafting | 2008-08-30 | 0.25 | Yes | Physical ed |
2 | Alicia | Keys | Teacher | Music | 2001-08-15 | 1.00 | Yes | Instr. music |
3 | Ada | Lovelace | Teacher | NaN | 1975-05-01 | 1.00 | Yes | PENDING |
4 | Desus | Nice | Administration | Dean | 2013-06-06 | 1.00 | Yes | PENDING |
5 | Chien-Shiung | Wu | Teacher | Physics | 1930-03-20 | 0.50 | Yes | Science 6-12 |
6 | Chien-Shiung | Wu | Teacher | Chemistry | 1930-03-20 | 0.50 | Yes | Science 6-12 |
7 | James | Joyce | Teacher | English | 1990-05-01 | 0.50 | No | English 6-12 |
8 | Hedy | Lamarr | Teacher | Science | 1976-06-08 | 0.50 | No | PENDING |
9 | Carlos | Boozer | Coach | Basketball | 2015-08-05 | NaN | No | Physical ed |
10 | Young | Boozer | Coach | NaN | 1995-01-01 | NaN | No | Political sci. |
11 | Micheal | Larsen | Teacher | English | 2009-09-15 | 0.80 | No | Vocal music |
We have a cleaned dataframe!