AddMissingIndicator adds additional binary variables indicating missing data (thus, called missing indicators). The binary variables take the value 1 if the observation's value is missing, or 0 otherwise. AddMissingIndicator adds 1 binary variable per variable.
For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:
The version of the dataset used in this notebook can be obtained from Kaggle
# Make sure you are using this
# Feature-engine version; the API shown
# in this notebook matches release 1.2.0.
import feature_engine
feature_engine.__version__
'1.2.0'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from feature_engine.imputation import (
AddMissingIndicator,
MeanMedianImputer,
CategoricalImputer,
)
# Download the data from Kaggle and store it
# in the same folder as this notebook.
# (Ames House Prices dataset, 81 columns.)
data = pd.read_csv('houseprice.csv')
data.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
# Separate the data into train (70%) and test (30%) sets,
# leaving the identifier and the target out of the predictors.
features = data.drop(columns=['Id', 'SalePrice'])
target = data['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(
    features,
    target,
    test_size=0.3,
    random_state=0,
)

X_train.shape, X_test.shape
((1022, 79), (438, 79))
We will add indicators to 4 variables with missing data.
# Inspect the fraction of missing values in the
# four variables we want to flag.
na_check = ['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']
X_train[na_check].isna().mean()
Alley 0.939335 MasVnrType 0.004892 LotFrontage 0.184932 MasVnrArea 0.004892 dtype: float64
# Set up the imputer, listing the variables
# that should receive a missing indicator.
indicator_vars = ['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']

imputer = AddMissingIndicator(variables=indicator_vars)
imputer.fit(X_train)
AddMissingIndicator(variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])
# The attribute variables_ stores the variables for
# which missing indicators will be added.
imputer.variables_
['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']
# Add the indicators; each new column takes the name
# of its source variable plus the suffix '_na'.
test_t = imputer.transform(X_test)
train_t = imputer.transform(X_train)

indicator_cols = ['Alley_na', 'MasVnrType_na', 'LotFrontage_na', 'MasVnrArea_na']
train_t[indicator_cols].head()
Alley_na | MasVnrType_na | LotFrontage_na | MasVnrArea_na | |
---|---|---|---|---|
64 | 1 | 0 | 1 | 0 |
682 | 1 | 0 | 1 | 0 |
960 | 1 | 0 | 0 | 0 |
1384 | 1 | 0 | 0 | 0 |
1100 | 1 | 0 | 0 | 0 |
# AddMissingIndicator only flags missing values; the original
# variables still contain NaN. The indicator means below equal
# the missing-data fractions computed earlier, confirming the
# flags line up with the NaNs.
train_t[['Alley_na', 'MasVnrType_na', 'LotFrontage_na', 'MasVnrArea_na']].mean()
Alley_na 0.939335 MasVnrType_na 0.004892 LotFrontage_na 0.184932 MasVnrArea_na 0.004892 dtype: float64
We normally add missing indicators and impute the original variables with the mean or median if the variable is numerical, or with the mode if the variable is categorical. So let's do that.
# Check variable types: the imputation method must match the
# type (median for numerical, mode for categorical variables).
X_train[['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']].dtypes
Alley object MasVnrType object LotFrontage float64 MasVnrArea float64 dtype: object
The first 2 variables are categorical, so I will impute them with the most frequent category. The last 2 variables are numerical, so I will impute those with the median.
# Imputation pipeline: first flag the missing values, then fill
# the numerical variables with the median and the categorical
# ones with the most frequent category.
to_flag = ['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']

pipe = Pipeline([
    ('indicators', AddMissingIndicator(variables=to_flag)),
    ('imputer_num', MeanMedianImputer(
        imputation_method='median',
        variables=['LotFrontage', 'MasVnrArea'],
    )),
    ('imputer_cat', CategoricalImputer(
        imputation_method='frequent',
        variables=['Alley', 'MasVnrType'],
    )),
])
# With fit() each transformer in the pipeline learns its
# required parameters (medians, modes, variable lists).
pipe.fit(X_train)
Pipeline(steps=[('indicators', AddMissingIndicator(variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])), ('imputer_num', MeanMedianImputer(variables=['LotFrontage', 'MasVnrArea'])), ('imputer_cat', CategoricalImputer(imputation_method='frequent', variables=['Alley', 'MasVnrType']))])
# We can look into the attributes of the
# different transformers.
# variables_ holds the variables that will take indicators.
pipe.named_steps['indicators'].variables_
['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']
# Check the median values learned for the numerical imputation.
pipe.named_steps['imputer_num'].imputer_dict_
{'LotFrontage': 69.0, 'MasVnrArea': 0.0}
# Check the mode values learned for the categorical imputation.
pipe.named_steps['imputer_cat'].imputer_dict_
{'Alley': 'Pave', 'MasVnrType': 'None'}
# Now, we transform the data.
train_t = pipe.transform(X_train)
test_t = pipe.transform(X_test)

# Let's look at the transformed variables:
# the originals plus their missing indicators.
originals = ['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']
vars_ = originals + [v + '_na' for v in originals]
train_t[vars_].head()
Alley | MasVnrType | LotFrontage | MasVnrArea | Alley_na | MasVnrType_na | LotFrontage_na | MasVnrArea_na | |
---|---|---|---|---|---|---|---|---|
64 | Pave | BrkFace | 69.0 | 573.0 | 1 | 0 | 1 | 0 |
682 | Pave | None | 69.0 | 0.0 | 1 | 0 | 1 | 0 |
960 | Pave | None | 50.0 | 0.0 | 1 | 0 | 0 | 0 |
1384 | Pave | None | 60.0 | 0.0 | 1 | 0 | 0 | 0 |
1100 | Pave | None | 60.0 | 0.0 | 1 | 0 | 0 | 0 |
# After the transformation, none of these variables
# show missing data any more.
train_t[vars_].isnull().sum()
Alley 0 MasVnrType 0 LotFrontage 0 MasVnrArea 0 Alley_na 0 MasVnrType_na 0 LotFrontage_na 0 MasVnrArea_na 0 dtype: int64
We have the option to add indicators to all variables in the dataset, or to all variables with missing data. AddMissingIndicator can select which variables to transform automatically.
When the parameter variables
is left to None and the parameter missing_only
is left as True, the imputer adds indicators to all variables with missing data.
When the parameter variables
is left to None and the parameter missing_only
is switched to False, the imputer adds indicators to all variables.
It is good practice to use missing_only=True
when we set variables=None
, so that the transformer handles the imputation automatically in a meaningful way.
# With missing_only=True (and variables=None), missing indicators
# are added only to the variables that show missing data in the
# train set when fit() is called.
imputer = AddMissingIndicator(missing_only=True, variables=None)

# fit() scans the train set for variables with missing data.
imputer.fit(X_train)
AddMissingIndicator()
# The variables parameter passed by the user was None,
# so nothing is displayed here.
imputer.variables
# The learned attribute variables_ lists the variables
# found to contain NA in the train set.
imputer.variables_
['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']
len(imputer.variables_)
19
We've got 19 variables with NA in the train set.
# Transforming the dataset appends one indicator column
# per variable with missing data (79 + 19 = 98 columns).
test_t = imputer.transform(X_test)
train_t = imputer.transform(X_train)

X_train.shape, train_t.shape
((1022, 79), (1022, 98))
# Towards the right of the dataframe, we find
# the newly added missing indicators.
train_t.head()
MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | Electrical_na | FireplaceQu_na | GarageType_na | GarageYrBlt_na | GarageFinish_na | GarageQual_na | GarageCond_na | PoolQC_na | Fence_na | MiscFeature_na | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
64 | 60 | RL | NaN | 9375 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
682 | 120 | RL | NaN | 2887 | Pave | NaN | Reg | HLS | AllPub | Inside | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
960 | 20 | RL | 50.0 | 7207 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1384 | 50 | RL | 60.0 | 9060 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
1100 | 30 | RL | 60.0 | 8400 | Pave | NaN | Reg | Bnk | AllPub | Inside | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
5 rows × 98 columns
# Alternatively, we can flag every variable in the dataset,
# whether or not it shows missing data in the train set.
imputer = AddMissingIndicator(missing_only=False, variables=None)
imputer.fit(X_train)
AddMissingIndicator(missing_only=False)
# The attribute variables_ now holds every variable
# in the train set (all 79 of them).
len(imputer.variables_)
79
# Transforming now doubles the number of columns:
# one indicator per original variable.
test_t = imputer.transform(X_test)
train_t = imputer.transform(X_train)

X_train.shape, train_t.shape
((1022, 79), (1022, 158))
We can automatically impute missing data in numerical and categorical variables, letting the imputers find out which variables to impute.
We need to set the parameter variables to None in all imputers. None is the default value, so we can simply omit the parameter when initialising the transformers.
# Fully automatic imputation pipeline: every transformer is left
# with variables=None (the default), so each one selects which
# variables to work on by itself.
steps = [
    # flag the variables that show NA in the train set
    ('indicators', AddMissingIndicator(missing_only=True)),
    # fill every numerical variable with its median
    ('imputer_num', MeanMedianImputer(imputation_method='median')),
    # fill every categorical variable with its mode
    ('imputer_cat', CategoricalImputer(imputation_method='frequent')),
]

pipe = Pipeline(steps)
# With fit() the transformers learn the required
# parameters (variable lists, medians, modes).
pipe.fit(X_train)
Pipeline(steps=[('indicators', AddMissingIndicator()), ('imputer_num', MeanMedianImputer()), ('imputer_cat', CategoricalImputer(imputation_method='frequent'))])
# We can look into the attributes of the
# different transformers.
# Check the variables that will take indicators.
pipe.named_steps['indicators'].variables_
['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']
# Check the median values learned for the numerical imputation
# (note the indicator columns themselves appear here, because
# they exist by the time the numerical imputer is fitted).
pipe.named_steps['imputer_num'].imputer_dict_
{'MSSubClass': 50.0, 'LotFrontage': 69.0, 'LotArea': 9536.0, 'OverallQual': 6.0, 'OverallCond': 5.0, 'YearBuilt': 1972.0, 'YearRemodAdd': 1993.0, 'MasVnrArea': 0.0, 'BsmtFinSF1': 386.0, 'BsmtFinSF2': 0.0, 'BsmtUnfSF': 486.5, 'TotalBsmtSF': 992.0, '1stFlrSF': 1095.0, '2ndFlrSF': 0.0, 'LowQualFinSF': 0.0, 'GrLivArea': 1479.0, 'BsmtFullBath': 0.0, 'BsmtHalfBath': 0.0, 'FullBath': 2.0, 'HalfBath': 0.0, 'BedroomAbvGr': 3.0, 'KitchenAbvGr': 1.0, 'TotRmsAbvGrd': 6.0, 'Fireplaces': 1.0, 'GarageYrBlt': 1979.0, 'GarageCars': 2.0, 'GarageArea': 477.0, 'WoodDeckSF': 0.0, 'OpenPorchSF': 25.0, 'EnclosedPorch': 0.0, '3SsnPorch': 0.0, 'ScreenPorch': 0.0, 'PoolArea': 0.0, 'MiscVal': 0.0, 'MoSold': 6.0, 'YrSold': 2008.0, 'LotFrontage_na': 0.0, 'Alley_na': 1.0, 'MasVnrType_na': 0.0, 'MasVnrArea_na': 0.0, 'BsmtQual_na': 0.0, 'BsmtCond_na': 0.0, 'BsmtExposure_na': 0.0, 'BsmtFinType1_na': 0.0, 'BsmtFinType2_na': 0.0, 'Electrical_na': 0.0, 'FireplaceQu_na': 0.0, 'GarageType_na': 0.0, 'GarageYrBlt_na': 0.0, 'GarageFinish_na': 0.0, 'GarageQual_na': 0.0, 'GarageCond_na': 0.0, 'PoolQC_na': 1.0, 'Fence_na': 1.0, 'MiscFeature_na': 1.0}
# Check the mode values learned for the categorical imputation.
pipe.named_steps['imputer_cat'].imputer_dict_
{'MSZoning': 'RL', 'Street': 'Pave', 'Alley': 'Pave', 'LotShape': 'Reg', 'LandContour': 'Lvl', 'Utilities': 'AllPub', 'LotConfig': 'Inside', 'LandSlope': 'Gtl', 'Neighborhood': 'NAmes', 'Condition1': 'Norm', 'Condition2': 'Norm', 'BldgType': '1Fam', 'HouseStyle': '1Story', 'RoofStyle': 'Gable', 'RoofMatl': 'CompShg', 'Exterior1st': 'VinylSd', 'Exterior2nd': 'VinylSd', 'MasVnrType': 'None', 'ExterQual': 'TA', 'ExterCond': 'TA', 'Foundation': 'PConc', 'BsmtQual': 'TA', 'BsmtCond': 'TA', 'BsmtExposure': 'No', 'BsmtFinType1': 'Unf', 'BsmtFinType2': 'Unf', 'Heating': 'GasA', 'HeatingQC': 'Ex', 'CentralAir': 'Y', 'Electrical': 'SBrkr', 'KitchenQual': 'TA', 'Functional': 'Typ', 'FireplaceQu': 'Gd', 'GarageType': 'Attchd', 'GarageFinish': 'Unf', 'GarageQual': 'TA', 'GarageCond': 'TA', 'PavedDrive': 'Y', 'PoolQC': 'Gd', 'Fence': 'MnPrv', 'MiscFeature': 'Shed', 'SaleType': 'WD', 'SaleCondition': 'Normal'}
# Now, we transform the data.
test_t = pipe.transform(X_test)
train_t = pipe.transform(X_train)

# After imputation we should be left with a complete-case dataset.
train_t.isnull().sum()
MSSubClass 0 MSZoning 0 LotFrontage 0 LotArea 0 Street 0 .. GarageQual_na 0 GarageCond_na 0 PoolQC_na 0 Fence_na 0 MiscFeature_na 0 Length: 98, dtype: int64
# Sanity check: list any column that still contains missing values.
# The comparison must be > 0, not > 1: with > 1, a column holding
# exactly one remaining NaN would slip through the check unnoticed.
[v for v in train_t.columns if train_t[v].isnull().sum() > 0]
[]