Missing value imputation: CategoricalImputer

CategoricalImputer performs imputation of categorical variables. It replaces missing values with an arbitrary label, "Missing" by default, or with any other label entered by the user. Alternatively, it imputes missing data with the most frequent category.
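
The three imputation modes can be configured as follows (a minimal sketch; the notebook below demonstrates each one in detail):

from feature_engine.imputation import CategoricalImputer

# Replace NA with the default string 'Missing'
CategoricalImputer(imputation_method='missing')

# Replace NA with a label chosen by the user
CategoricalImputer(imputation_method='missing', fill_value='Not available')

# Replace NA with the most frequent category of each variable
CategoricalImputer(imputation_method='frequent')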

For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:

Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3

The version of the dataset used in this notebook can be obtained from Kaggle.

Version

In [1]:
# Make sure you are using this 
# Feature-engine version.

import feature_engine

feature_engine.__version__
Out[1]:
'1.2.0'
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from feature_engine.imputation import CategoricalImputer

Load data

In [3]:
# Download the data from Kaggle and store it
# in the same folder as this notebook.

data = pd.read_csv('houseprice.csv')

data.head()
Out[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [4]:
# Separate the data into train and test sets.

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)

X_train.shape, X_test.shape
Out[4]:
((1022, 79), (438, 79))

Check missing data

In [5]:
# These are categorical variables with missing data

X_train[['Alley', 'MasVnrType']].isnull().mean()
Out[5]:
Alley         0.939335
MasVnrType    0.004892
dtype: float64
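
We could also list every categorical variable that contains missing data, together with its fraction of missing values (a quick sketch using pandas):

# Fraction of NA in each categorical variable
# that contains at least one missing value.

cat_na = X_train.select_dtypes(include='O').isnull().mean()

cat_na[cat_na > 0].sort_values(ascending=False)
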
In [6]:
# Number of observations per category

X_train['MasVnrType'].value_counts().plot.bar()
plt.ylabel('Number of observations')
plt.title('MasVnrType')
Out[6]:
Text(0.5, 1.0, 'MasVnrType')

Impute with the string 'Missing'

We replace missing data with the string "Missing".

In [7]:
imputer = CategoricalImputer(
    imputation_method='missing',
    variables=['Alley', 'MasVnrType'])

imputer.fit(X_train)
Out[7]:
CategoricalImputer(variables=['Alley', 'MasVnrType'])
In [8]:
# We impute all variables with the
# string 'Missing'

imputer.imputer_dict_
Out[8]:
{'Alley': 'Missing', 'MasVnrType': 'Missing'}
In [9]:
# Perform imputation.

train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
In [10]:
# Observe the new category 'Missing'

test_t['MasVnrType'].value_counts().plot.bar()

plt.ylabel('Number of observations')
plt.title('Imputed MasVnrType')
Out[10]:
Text(0.5, 1.0, 'Imputed MasVnrType')
In [11]:
test_t['Alley'].value_counts().plot.bar()

plt.ylabel('Number of observations')
plt.title('Imputed Alley')
Out[11]:
Text(0.5, 1.0, 'Imputed Alley')
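
We can also confirm that the imputed variables no longer contain missing values (a quick check):

# After the imputation there should be no NA left
# in the imputed variables.

train_t[['Alley', 'MasVnrType']].isnull().sum()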

Impute with another string

We can also enter a specific string for the imputation instead of the default 'Missing'.

In [12]:
imputer = CategoricalImputer(
    variables='MasVnrType',
    fill_value="this_is_missing",
)
In [13]:
# We can also fit and transform the train set
# in one line of code
train_t = imputer.fit_transform(X_train)
In [14]:
# and then transform the test set
test_t = imputer.transform(X_test)
In [15]:
# let's check the current imputation
# dictionary

imputer.imputer_dict_
Out[15]:
{'MasVnrType': 'this_is_missing'}
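
Note that only MasVnrType was imputed this time; Alley was not passed to the transformer and retains its missing values (a quick check):

# Alley was not included in variables, so its NA remain.

test_t[['Alley', 'MasVnrType']].isnull().sum()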
In [16]:
# After the imputation we see the new category

test_t['MasVnrType'].value_counts().plot.bar()

plt.ylabel('Number of observations')
plt.title('Imputed MasVnrType')
Out[16]:
Text(0.5, 1.0, 'Imputed MasVnrType')

Frequent Category Imputation

We can also replace missing values with the most frequent category.

In [17]:
imputer = CategoricalImputer(
    imputation_method='frequent',
    variables=['Alley', 'MasVnrType'],
)
In [18]:
# Find most frequent category

imputer.fit(X_train)
Out[18]:
CategoricalImputer(imputation_method='frequent',
                   variables=['Alley', 'MasVnrType'])
In [19]:
# In this attribute we find the most frequent category
# per variable to impute.

imputer.imputer_dict_
Out[19]:
{'Alley': 'Pave', 'MasVnrType': 'None'}
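
These values should match the mode of each variable in the train set, which we can verify directly with pandas (a quick sketch):

# The most frequent category per variable, obtained with pandas,
# should coincide with the entries in imputer.imputer_dict_.

X_train['Alley'].mode()[0], X_train['MasVnrType'].mode()[0]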
In [20]:
# Impute variables
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
In [21]:
# Let's count the number of observations per category
# in the original variable.

X_train['MasVnrType'].value_counts()
Out[21]:
None       609
BrkFace    301
Stone       97
BrkCmn      10
Name: MasVnrType, dtype: int64
In [22]:
# Note that, after the transformation, there are a few
# more observations in the most frequent category,
# which for this variable is 'None'.

train_t['MasVnrType'].value_counts()
Out[22]:
None       614
BrkFace    301
Stone       97
BrkCmn      10
Name: MasVnrType, dtype: int64

The number of observations for 'None' in MasVnrType increased from 609 to 614, because the missing values were replaced with this label.
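
The difference of 5 observations corresponds to the number of missing values in the train set, which we can confirm directly (a quick check):

# Number of NA in the original train set; this should equal
# the increase in the 'None' count after imputation.

X_train['MasVnrType'].isnull().sum()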

Automatically select categorical variables

We can impute all categorical variables automatically, either with a string or with the most frequent category.

To do so, we leave the parameter variables set to None (the default).

In [23]:
# Impute all categorical variables with 
# the most frequent category

imputer = CategoricalImputer(imputation_method='frequent')
In [24]:
# With fit, the transformer identifies the categorical variables
# in the train set and their most frequent category.
imputer.fit(X_train)

# Here we find the imputation values for each
# categorical variable.

imputer.imputer_dict_
Out[24]:
{'MSZoning': 'RL',
 'Street': 'Pave',
 'Alley': 'Pave',
 'LotShape': 'Reg',
 'LandContour': 'Lvl',
 'Utilities': 'AllPub',
 'LotConfig': 'Inside',
 'LandSlope': 'Gtl',
 'Neighborhood': 'NAmes',
 'Condition1': 'Norm',
 'Condition2': 'Norm',
 'BldgType': '1Fam',
 'HouseStyle': '1Story',
 'RoofStyle': 'Gable',
 'RoofMatl': 'CompShg',
 'Exterior1st': 'VinylSd',
 'Exterior2nd': 'VinylSd',
 'MasVnrType': 'None',
 'ExterQual': 'TA',
 'ExterCond': 'TA',
 'Foundation': 'PConc',
 'BsmtQual': 'TA',
 'BsmtCond': 'TA',
 'BsmtExposure': 'No',
 'BsmtFinType1': 'Unf',
 'BsmtFinType2': 'Unf',
 'Heating': 'GasA',
 'HeatingQC': 'Ex',
 'CentralAir': 'Y',
 'Electrical': 'SBrkr',
 'KitchenQual': 'TA',
 'Functional': 'Typ',
 'FireplaceQu': 'Gd',
 'GarageType': 'Attchd',
 'GarageFinish': 'Unf',
 'GarageQual': 'TA',
 'GarageCond': 'TA',
 'PavedDrive': 'Y',
 'PoolQC': 'Gd',
 'Fence': 'MnPrv',
 'MiscFeature': 'Shed',
 'SaleType': 'WD',
 'SaleCondition': 'Normal'}
In [25]:
# With transform we replace missing data.

train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
In [26]:
# Sanity check:

# No categorical variable with NA is left in the
# transformed data.

[v for v in train_t.columns
 if train_t[v].dtypes == 'O' and train_t[v].isnull().sum() > 0]
Out[26]:
[]
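
The variables that the transformer selected automatically during fit are stored in the variables_ attribute (a quick check):

# The categorical variables that were selected automatically
# during fit are stored in this fitted attribute.

imputer.variables_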
In [27]:
# We can also return the name of the final features in
# the transformed data
imputer.get_feature_names_out()
Out[27]:
['MSSubClass',
 'MSZoning',
 'LotFrontage',
 'LotArea',
 'Street',
 'Alley',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'MasVnrArea',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinSF1',
 'BsmtFinType2',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'KitchenQual',
 'TotRmsAbvGrd',
 'Functional',
 'Fireplaces',
 'FireplaceQu',
 'GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageCars',
 'GarageArea',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'MiscVal',
 'MoSold',
 'YrSold',
 'SaleType',
 'SaleCondition']
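
Finally, because CategoricalImputer is compatible with scikit-learn, it can be placed inside a Pipeline together with other transformers and an estimator. Below is a minimal sketch; the numerical imputer, encoder, and model are illustrative choices, not part of this demonstration:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso

from feature_engine.imputation import CategoricalImputer, MeanMedianImputer
from feature_engine.encoding import OrdinalEncoder

pipe = Pipeline([
    # impute all categorical variables with the string 'Missing'
    ('imputer_cat', CategoricalImputer(imputation_method='missing')),
    # impute all numerical variables with the median (illustrative choice)
    ('imputer_num', MeanMedianImputer(imputation_method='median')),
    # encode categories as arbitrary integers (illustrative choice)
    ('encoder', OrdinalEncoder(encoding_method='arbitrary')),
    # any scikit-learn estimator
    ('model', Lasso(alpha=10, random_state=0)),
])

pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)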