This notebook explains how to identify data in columns with the wrong data type with pandas
.
import pandas as pd
import datetime
import numpy as np
We will create a dataframe that contains multiple occurances of duplication for this example.
df = pd.DataFrame({'A': ['text']*20,
'B': [1, 2.2]*10,
'C': [True, False]*10,
'D': pd.to_datetime('2020-01-01')
})
Next, add some mistyped data to the dataframe.
df.iloc[0,0] = 1
df.iloc[1,0] = -2
df.iloc[10,0] = pd.to_datetime('2021-01-01')
df.iloc[5,1] = '2.2'
df.iloc[7,1] = 'A+B'
df.iloc[4,2] = 1
df.iloc[5,2] = 'False'
df.iloc[9,2] = -12.6
df.iloc[12,2] = 'text'
df.iloc[2,3] = 12
df.iloc[12,3] = '2020-01-01'
df
A | B | C | D | |
---|---|---|---|---|
0 | 1 | 1.0 | True | 2020-01-01 00:00:00 |
1 | -2 | 2.2 | False | 2020-01-01 00:00:00 |
2 | text | 1.0 | True | 12 |
3 | text | 2.2 | False | 2020-01-01 00:00:00 |
4 | text | 1.0 | 1 | 2020-01-01 00:00:00 |
5 | text | 2.2 | False | 2020-01-01 00:00:00 |
6 | text | 1.0 | True | 2020-01-01 00:00:00 |
7 | text | A+B | False | 2020-01-01 00:00:00 |
8 | text | 1.0 | True | 2020-01-01 00:00:00 |
9 | text | 2.2 | -12.6 | 2020-01-01 00:00:00 |
10 | 2021-01-01 00:00:00 | 1.0 | True | 2020-01-01 00:00:00 |
11 | text | 2.2 | False | 2020-01-01 00:00:00 |
12 | text | 1.0 | text | 2020-01-01 |
13 | text | 2.2 | False | 2020-01-01 00:00:00 |
14 | text | 1.0 | True | 2020-01-01 00:00:00 |
15 | text | 2.2 | False | 2020-01-01 00:00:00 |
16 | text | 1.0 | True | 2020-01-01 00:00:00 |
17 | text | 2.2 | False | 2020-01-01 00:00:00 |
18 | text | 1.0 | True | 2020-01-01 00:00:00 |
19 | text | 2.2 | False | 2020-01-01 00:00:00 |
The function applymap
and isinstance
will return a Boolean dataframe with True when the data type matches and False when the data type does not match.
numeric = df.applymap(lambda x: isinstance(x, (int, float)))
numeric
A | B | C | D | |
---|---|---|---|---|
0 | True | True | True | False |
1 | True | True | True | False |
2 | False | True | True | True |
3 | False | True | True | False |
4 | False | True | True | False |
5 | False | False | False | False |
6 | False | True | True | False |
7 | False | False | True | False |
8 | False | True | True | False |
9 | False | True | True | False |
10 | False | True | True | False |
11 | False | True | True | False |
12 | False | True | False | False |
13 | False | True | True | False |
14 | False | True | True | False |
15 | False | True | True | False |
16 | False | True | True | False |
17 | False | True | True | False |
18 | False | True | True | False |
19 | False | True | True | False |
Since only column B is supposed to be numeric, this can be made more specific by running applymap
only on column B.
numeric = df.applymap(lambda x: isinstance(x, (int, float)))['B']
numeric
0 True 1 True 2 True 3 True 4 True 5 False 6 True 7 False 8 True 9 True 10 True 11 True 12 True 13 True 14 True 15 True 16 True 17 True 18 True 19 True Name: B, dtype: bool
Using this Boolean series to return the non-numeric data
df[~numeric]
A | B | C | D | |
---|---|---|---|---|
5 | text | 2.2 | False | 2020-01-01 00:00:00 |
7 | text | A+B | False | 2020-01-01 00:00:00 |
dt = df.applymap(lambda x: isinstance(x, (datetime.datetime)))['D']
dt
0 True 1 True 2 False 3 True 4 True 5 True 6 True 7 True 8 True 9 True 10 True 11 True 12 False 13 True 14 True 15 True 16 True 17 True 18 True 19 True Name: D, dtype: bool
Using this Boolean series to return the non-numeric data
df[~dt]
A | B | C | D | |
---|---|---|---|---|
2 | text | 1.0 | True | 12 |
12 | text | 1.0 | text | 2020-01-01 |
strings = df.applymap(lambda x: isinstance(x, (str)))['A']
strings
0 False 1 False 2 True 3 True 4 True 5 True 6 True 7 True 8 True 9 True 10 False 11 True 12 True 13 True 14 True 15 True 16 True 17 True 18 True 19 True Name: A, dtype: bool
Using this Boolean series to return the non-numeric data
df[~strings]
A | B | C | D | |
---|---|---|---|---|
0 | 1 | 1.0 | True | 2020-01-01 00:00:00 |
1 | -2 | 2.2 | False | 2020-01-01 00:00:00 |
10 | 2021-01-01 00:00:00 | 1.0 | True | 2020-01-01 00:00:00 |
torf = df.applymap(lambda x: isinstance(x, (bool)))['C']
torf
0 True 1 True 2 True 3 True 4 False 5 False 6 True 7 True 8 True 9 False 10 True 11 True 12 False 13 True 14 True 15 True 16 True 17 True 18 True 19 True Name: C, dtype: bool
Using this Boolean series to return the non-numeric data
df[~torf]
A | B | C | D | |
---|---|---|---|---|
4 | text | 1.0 | 1 | 2020-01-01 00:00:00 |
5 | text | 2.2 | False | 2020-01-01 00:00:00 |
9 | text | 2.2 | -12.6 | 2020-01-01 00:00:00 |
12 | text | 1.0 | text | 2020-01-01 |