Deletes rows with missing values.
DropMissingData works with both numerical and categorical variables.
For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock. The version of the dataset used in this notebook can be obtained from Kaggle.
# Make sure you are using this
# Feature-engine version.
import feature_engine
feature_engine.__version__
'1.2.0'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import DropMissingData
# Download the data from Kaggle and store it
# in the same folder as this notebook.
data = pd.read_csv('houseprice.csv')
data.head()
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
# Separate the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
data.drop(['Id', 'SalePrice'], axis=1),
data['SalePrice'],
test_size=0.3,
random_state=0,
)
X_train.shape, X_test.shape
((1022, 79), (438, 79))
We can drop observations that show NA in any of a subset of variables.
# Drop data when there are NA in any of the indicated variables
imputer = DropMissingData(
variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'],
missing_only=False,
)
imputer.fit(X_train)
DropMissingData(missing_only=False, variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])
# variables from which observations with NA will be deleted
imputer.variables_
['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']
# Number of observations with NA before the transformation
X_train[imputer.variables].isna().sum()
Alley          960
MasVnrType       5
LotFrontage    189
MasVnrArea       5
dtype: int64
# After the transformation, the rows with NA values are
# deleted from the dataframe.
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Number of observations with NA after transformation
train_t[imputer.variables].isna().sum()
Alley          0
MasVnrType     0
LotFrontage    0
MasVnrArea     0
dtype: int64
# Shape of dataframe before transformation
X_train.shape
(1022, 79)
# Shape of dataframe after transformation
train_t.shape
(59, 79)
# The return_na_data() method returns a dataframe that contains
# the observations with NA.
# That is, the portion of the data that is dropped when
# we apply the transform() method.
tmp = imputer.return_na_data(X_train)
tmp.shape
(963, 79)
# total obs - obs with NA = final dataframe shape
# after the transformation
1022-963
59
Sometimes it is useful to retain the observations with NA in the production environment, for example, to log which observations will not be scored by the model.
We can drop observations if they contain less than a required percentage of values in a subset of variables.
# Drop rows with values in fewer than 50% of the
# 4 indicated variables (threshold=0.5), that is,
# rows with NA in 3 or more of them.
imputer = DropMissingData(
variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'],
missing_only=False,
threshold=0.5,
)
imputer.fit(X_train)
DropMissingData(missing_only=False, threshold=0.5, variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])
# After the transformation, the rows with NA values are
# deleted from the dataframe.
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Number of observations with NA after transformation
train_t[imputer.variables].isna().sum()
Alley          955
MasVnrType       0
LotFrontage    188
MasVnrArea       0
dtype: int64
We see that not all observations with missing data were dropped: with threshold=0.5, a row is removed only when it has values in fewer than half of the indicated variables, that is, NA in 3 or more of the 4.
We can drop observations if they show NA in any variable in the dataset.
When the parameter variables is left as None and the parameter missing_only is left as True, the imputer will evaluate observations based on all variables with missing data.
When the parameter variables is left as None and the parameter missing_only is set to False, the imputer will evaluate observations based on all variables.
It is good practice to use missing_only=True when we set variables=None, so that the transformer handles the imputation automatically in a meaningful way.
# Find variables with NA
imputer = DropMissingData(missing_only=True)
imputer.fit(X_train)
DropMissingData()
# variables with NA in the train set
imputer.variables_
['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']
# Number of observations with NA
X_train[imputer.variables_].isna().sum()
LotFrontage     189
Alley           960
MasVnrType        5
MasVnrArea        5
BsmtQual         24
BsmtCond         24
BsmtExposure     24
BsmtFinType1     24
BsmtFinType2     25
Electrical        1
FireplaceQu     478
GarageType       54
GarageYrBlt      54
GarageFinish     54
GarageQual       54
GarageCond       54
PoolQC         1019
Fence           831
MiscFeature     978
dtype: int64
# After the transformation, the rows with NA are deleted from the dataframe
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Number of observations with NA after the transformation
train_t[imputer.variables_].isna().sum()
LotFrontage     0.0
Alley           0.0
MasVnrType      0.0
MasVnrArea      0.0
BsmtQual        0.0
BsmtCond        0.0
BsmtExposure    0.0
BsmtFinType1    0.0
BsmtFinType2    0.0
Electrical      0.0
FireplaceQu     0.0
GarageType      0.0
GarageYrBlt     0.0
GarageFinish    0.0
GarageQual      0.0
GarageCond      0.0
PoolQC          0.0
Fence           0.0
MiscFeature     0.0
dtype: float64
# In this case, all observations are dropped, because
# every one of them shows NA in at least 1 variable.
train_t.shape
(0, 79)
To avoid ending up with an empty dataframe, let's drop only the rows that have values in less than 75% of the variables.
# Find variables with NA
imputer = DropMissingData(
missing_only=True,
threshold=0.75,
)
imputer.fit(X_train)
DropMissingData(threshold=0.75)
# After the transformation, the rows with NA are deleted from the dataframe
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
train_t.shape
(1022, 79)
Now all observations are retained, because every row in the train set has values in at least 75% of the variables.