Deletes rows with missing values.

DropMissingData works with both numerical and categorical variables.

**For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock.**

The version of the dataset used in this notebook can be obtained from Kaggle.

In [1]:

```
# Make sure you are using this
# Feature-engine version.
import feature_engine
feature_engine.__version__
```

Out[1]:

'1.2.0'

In [2]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import DropMissingData
```

In [3]:

```
# Download the data from Kaggle and store it
# in the same folder as this notebook.
data = pd.read_csv('houseprice.csv')
data.head()
```

Out[3]:

| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |

5 rows × 81 columns

In [4]:

```
# Separate the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)
X_train.shape, X_test.shape
```

Out[4]:

((1022, 79), (438, 79))

We can drop observations that show NA in any of a subset of variables.

In [5]:

```
# Drop rows that show NA in any of the indicated variables.
imputer = DropMissingData(
    variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'],
    missing_only=False,
)
```

In [6]:

```
imputer.fit(X_train)
```

Out[6]:

DropMissingData(missing_only=False, variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])

In [7]:

```
# variables from which observations with NA will be deleted
imputer.variables_
```

Out[7]:

['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']

In [8]:

```
# Number of observations with NA before the transformation
X_train[imputer.variables].isna().sum()
```

Out[8]:

```
Alley          960
MasVnrType       5
LotFrontage    189
MasVnrArea       5
dtype: int64
```

In [9]:

```
# After the transformation, the rows with NA values are
# deleted from the dataframe.
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
```

In [10]:

```
# Number of observations with NA after transformation
train_t[imputer.variables].isna().sum()
```

Out[10]:

```
Alley          0
MasVnrType     0
LotFrontage    0
MasVnrArea     0
dtype: int64
```

In [11]:

```
# Shape of dataframe before transformation
X_train.shape
```

Out[11]:

(1022, 79)

In [12]:

```
# Shape of dataframe after transformation
train_t.shape
```

Out[12]:

(59, 79)

In [13]:

```
# The "return_na_data()" method returns a dataframe that contains
# the observations with NA.
# That is, the portion of the data that is dropped when
# we apply the transform() method.
tmp = imputer.return_na_data(X_train)
tmp.shape
```

Out[13]:

(963, 79)

In [14]:

```
# total obs - obs with NA = final dataframe shape
# after the transformation
1022-963
```

Out[14]:

59
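With `missing_only=False` and an explicit variable list, the transformer behaves much like pandas' `dropna` restricted to those columns. A minimal sketch with toy data (the values below are made up; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the house-prices data (hypothetical values).
df = pd.DataFrame({
    "Alley": [np.nan, "Pave", np.nan, "Grvl"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Rows with NA in either of the two listed columns are removed,
# mirroring DropMissingData(variables=[...], missing_only=False).
kept = df.dropna(subset=["Alley", "LotFrontage"])
print(kept.shape)  # (1, 3): only the last row is complete in both columns
```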

We can also drop observations if they contain less than a required percentage of values in a subset of variables.

In [15]:

```
# Drop rows that show NA in more than half of the 4 indicated
# variables; threshold=0.5 keeps rows with values in at least
# 50% of them (2 of the 4).
imputer = DropMissingData(
    variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'],
    missing_only=False,
    threshold=0.5,
)
```

In [16]:

```
imputer.fit(X_train)
```

Out[16]:

DropMissingData(missing_only=False, threshold=0.5, variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])

In [17]:

```
# After the transformation, the rows with NA values are
# deleted from the dataframe.
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
```

In [18]:

```
# Number of observations with NA after transformation
train_t[imputer.variables].isna().sum()
```

Out[18]:

```
Alley          955
MasVnrType       0
LotFrontage    188
MasVnrArea       0
dtype: int64
```
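Rows missing only `Alley`, or `Alley` plus `LotFrontage`, survive because they still have values in at least 50% of the four variables. As a sketch, this matches pandas' count-based `thresh` argument (toy data, hypothetical values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Alley": [np.nan, np.nan, "Pave", np.nan],
    "MasVnrType": ["BrkFace", np.nan, "None", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "MasVnrArea": [196.0, 1.0, 0.0, np.nan],
})
cols = ["Alley", "MasVnrType", "LotFrontage", "MasVnrArea"]

# threshold=0.5 over 4 variables ~ keep rows with at least
# 2 non-NA values in those columns.
kept = df.dropna(subset=cols, thresh=2)
print(list(kept.index))  # [0, 2]: rows 1 and 3 have fewer than 2 values
```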

We can also drop observations if they show NA in any variable in the dataset.

When the parameter `variables` is left as None and the parameter `missing_only` is left as True, the imputer will evaluate observations based on all variables with missing data in the train set.

When the parameter `variables` is left as None and the parameter `missing_only` is set to False, the imputer will evaluate observations based on all variables.

It is good practice to use `missing_only=True` when we set `variables=None`, so that the transformer handles the imputation automatically and in a meaningful way.
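This default behaviour can be approximated with plain pandas: find the columns that show NA in the train set, then drop rows with NA in any of them. A minimal sketch on toy data (column names `a`, `b`, `c` are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
    "c": [10, 20, 30],  # complete column, ignored when missing_only=True
})

# Columns that show NA in the "training" data.
cols_with_na = [col for col in df.columns if df[col].isna().any()]
print(cols_with_na)  # ['a', 'b']

# Drop rows with NA in any of those columns.
kept = df.dropna(subset=cols_with_na)
print(kept.shape)  # (1, 3): only the first row is complete in 'a' and 'b'
```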

In [19]:

```
# Find variables with NA
imputer = DropMissingData(missing_only=True)
imputer.fit(X_train)
```

Out[19]:

DropMissingData()

In [20]:

```
# variables with NA in the train set
imputer.variables_
```

Out[20]:

['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

In [21]:

```
# Number of observations with NA
X_train[imputer.variables_].isna().sum()
```

Out[21]:

```
LotFrontage      189
Alley            960
MasVnrType         5
MasVnrArea         5
BsmtQual          24
BsmtCond          24
BsmtExposure      24
BsmtFinType1      24
BsmtFinType2      25
Electrical         1
FireplaceQu      478
GarageType        54
GarageYrBlt       54
GarageFinish      54
GarageQual        54
GarageCond        54
PoolQC          1019
Fence            831
MiscFeature      978
dtype: int64
```

In [22]:

```
# After the transformation, the rows with NA are deleted from the dataframe.
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
```

In [23]:

```
# Number of observations with NA after the transformation
train_t[imputer.variables_].isna().sum()
```

Out[23]:

```
LotFrontage     0.0
Alley           0.0
MasVnrType      0.0
MasVnrArea      0.0
BsmtQual        0.0
BsmtCond        0.0
BsmtExposure    0.0
BsmtFinType1    0.0
BsmtFinType2    0.0
Electrical      0.0
FireplaceQu     0.0
GarageType      0.0
GarageYrBlt     0.0
GarageFinish    0.0
GarageQual      0.0
GarageCond      0.0
PoolQC          0.0
Fence           0.0
MiscFeature     0.0
dtype: float64
```

In [24]:

```
# In this case, all observations are dropped,
# because every one of them shows NA in at least 1 variable.
train_t.shape
```

Out[24]:

(0, 79)

To avoid ending up with an empty dataframe, let's instead drop only those rows that have values in less than 75% of the variables.
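Assuming `threshold` is the minimum fraction of the evaluated variables that must have values for a row to be kept, this corresponds to pandas' count-based `thresh`. A sketch on toy data (hypothetical column names and values):

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1.0, 2.0, np.nan],
    "d": [1.0, 2.0, 3.0],
})

# threshold=0.75 over 4 columns ~ a row needs at least
# ceil(0.75 * 4) = 3 non-NA values to survive.
kept = df.dropna(thresh=math.ceil(0.75 * df.shape[1]))
print(list(kept.index))  # [0, 1]: the last row has only 1 value
```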

In [25]:

```
# Find variables with NA
imputer = DropMissingData(
    missing_only=True,
    threshold=0.75,
)
imputer.fit(X_train)
```

Out[25]:

DropMissingData(threshold=0.75)

In [26]:

```
# After the transformation, the rows with NA are deleted from the dataframe.
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
```

In [27]:

```
train_t.shape
```

Out[27]:

(1022, 79)

Now no rows are dropped at all: every row in the train set has values in at least 75% of the variables.
