ArbitraryNumberImputer replaces missing values (NA) with an arbitrary number. It works only with numerical variables, and the user needs to define the number to use.
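As a rough illustration of the idea, independent of Feature-engine, the sketch below fills the missing values of a small made-up DataFrame with a fixed number using pandas. The column names `area` and `rooms` and the value -999 are placeholders invented for this example; they are not part of the dataset used in the rest of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this illustration only.
toy = pd.DataFrame({
    "area": [50.0, np.nan, 80.0],
    "rooms": [2.0, 3.0, np.nan],
})

# Replace every NA in the selected numerical columns with one fixed number.
arbitrary_number = -999
toy_imputed = toy.copy()
toy_imputed[["area", "rooms"]] = toy[["area", "rooms"]].fillna(arbitrary_number)

print(toy_imputed)
```

The original frame is left untouched, which mirrors how a transformer returns a new, imputed copy of the data.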
For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock.
The version of the dataset used in this notebook can be obtained from Kaggle.
# Make sure you are using this
# Feature-engine version.
import feature_engine
feature_engine.__version__
'1.2.0'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import ArbitraryNumberImputer
# Download the data from Kaggle and store it
# in the same folder as this notebook.
data = pd.read_csv('houseprice.csv')
data.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
# Separate the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)
X_train.shape, X_test.shape
((1022, 79), (438, 79))
We will impute 2 numerical variables with the number -999.
# Check missing data
X_train[['LotFrontage', 'MasVnrArea']].isnull().mean()
LotFrontage    0.184932
MasVnrArea     0.004892
dtype: float64
# Let's create an instance of the imputer where we impute
# 2 variables with the same arbitrary number.
imputer = ArbitraryNumberImputer(
    arbitrary_number=-999,
    variables=['LotFrontage', 'MasVnrArea'],
)
imputer.fit(X_train)
ArbitraryNumberImputer(arbitrary_number=-999, variables=['LotFrontage', 'MasVnrArea'])
# The number to use in the imputation
# is stored as a parameter.
imputer.arbitrary_number
-999
# The imputer will use the same value to impute
# all indicated variables.
imputer.imputer_dict_
{'LotFrontage': -999, 'MasVnrArea': -999}
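The mapping above can be read as the single number repeated for each selected variable. A minimal sketch of that idea in plain Python follows; it is an illustration of the resulting structure, not Feature-engine's actual source code.

```python
# Values taken from the example above.
arbitrary_number = -999
variables = ["LotFrontage", "MasVnrArea"]

# One entry per variable, all pointing to the same number.
imputer_dict = {var: arbitrary_number for var in variables}
print(imputer_dict)
```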
# Impute variables
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Sanity check: the min value is the one used for
# the imputation
train_t[['LotFrontage', 'MasVnrArea']].min()
LotFrontage   -999.0
MasVnrArea    -999.0
dtype: float64
# The distribution of the variable
# changed with the transformation.
fig = plt.figure()
ax = fig.add_subplot(111)
# Label each curve so the legend tells them apart.
X_train['LotFrontage'].plot(kind='kde', ax=ax, label='original')
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red', label='imputed')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
<matplotlib.legend.Legend at 0xadfa9689a0>
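The change in distribution can also be seen numerically. The sketch below uses a small invented series (not the Ames data) and compares the mean, standard deviation and minimum before and after filling NA with -999; the name `toy_frontage` and the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy series invented for this illustration; 1 in 5 values is missing.
s = pd.Series([60.0, 65.0, np.nan, 70.0, 80.0], name="toy_frontage")
s_imputed = s.fillna(-999)

# pandas skips NA when computing statistics, so the "original"
# column summarizes the observed values only.
stats = ["mean", "std", "min"]
summary = pd.DataFrame({
    "original": s.agg(stats),
    "imputed": s_imputed.agg(stats),
})
print(summary)
```

Because the arbitrary number sits far from the observed values, the mean shifts towards it and the spread grows, which is the same effect the density plot shows.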
We can also impute different variables with different values. In this case, we need to initialize the transformer with a dictionary of variable-to-value pairs.
# Impute different variables with different values
imputer = ArbitraryNumberImputer(
    imputer_dict={"LotFrontage": -678, "MasVnrArea": -789}
)
imputer.fit(X_train)
ArbitraryNumberImputer(imputer_dict={'LotFrontage': -678, 'MasVnrArea': -789})
# In this case, the imputer_dict_ matches the
# entered dictionary.
imputer.imputer_dict_
{'LotFrontage': -678, 'MasVnrArea': -789}
# Now we impute the missing data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Sanity check: check minimum values
train_t[['LotFrontage', 'MasVnrArea']].min()
LotFrontage   -678.0
MasVnrArea    -789.0
dtype: float64
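For comparison, pandas alone can apply a per-column mapping of this kind, because `DataFrame.fillna` accepts a dictionary of column-to-value pairs. The sketch below runs on a tiny invented frame that reuses the two column names; it is shown only to illustrate the dictionary idea and is not how the transformer is necessarily implemented.

```python
import numpy as np
import pandas as pd

# Toy frame invented for this illustration, reusing the two column names.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "MasVnrArea": [196.0, 0.0, np.nan],
})

# Each column gets its own replacement value.
imputer_dict = {"LotFrontage": -678, "MasVnrArea": -789}
toy_imputed = toy.fillna(value=imputer_dict)

print(toy_imputed.min())
```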
# The distribution of the variable changed
# after the transformation.
fig = plt.figure()
ax = fig.add_subplot(111)
# Label each curve so the legend tells them apart.
X_train['LotFrontage'].plot(kind='kde', ax=ax, label='original')
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red', label='imputed')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
<matplotlib.legend.Legend at 0xadfcaec1c0>
We can impute all numerical variables with the same value automatically with this transformer. To do so, we leave the parameter variables set to None.
# Let's create an instance of the imputer where we impute
# all numerical variables with the same arbitrary number.
imputer = ArbitraryNumberImputer(
    arbitrary_number=-1,
)
imputer.fit(X_train)
ArbitraryNumberImputer(arbitrary_number=-1)
# The imputer finds all numerical variables
# automatically.
imputer.variables_
['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
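A similar list of numerical columns can be obtained directly from pandas with `select_dtypes`. The sketch below uses a small invented frame; whether Feature-engine applies exactly this rule internally is an assumption, so treat it as an approximation of the selection step.

```python
import numpy as np
import pandas as pd

# Toy frame invented for this illustration: two numerical columns, one categorical.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "Street": ["Pave", "Pave", None],
})

# Keep only the columns with a numerical dtype (integers and floats).
numerical_vars = toy.select_dtypes(include="number").columns.tolist()
print(numerical_vars)
```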
# We find the imputation value in the dictionary
imputer.imputer_dict_
{'MSSubClass': -1, 'LotFrontage': -1, 'LotArea': -1, 'OverallQual': -1, 'OverallCond': -1, 'YearBuilt': -1, 'YearRemodAdd': -1, 'MasVnrArea': -1, 'BsmtFinSF1': -1, 'BsmtFinSF2': -1, 'BsmtUnfSF': -1, 'TotalBsmtSF': -1, '1stFlrSF': -1, '2ndFlrSF': -1, 'LowQualFinSF': -1, 'GrLivArea': -1, 'BsmtFullBath': -1, 'BsmtHalfBath': -1, 'FullBath': -1, 'HalfBath': -1, 'BedroomAbvGr': -1, 'KitchenAbvGr': -1, 'TotRmsAbvGrd': -1, 'Fireplaces': -1, 'GarageYrBlt': -1, 'GarageCars': -1, 'GarageArea': -1, 'WoodDeckSF': -1, 'OpenPorchSF': -1, 'EnclosedPorch': -1, '3SsnPorch': -1, 'ScreenPorch': -1, 'PoolArea': -1, 'MiscVal': -1, 'MoSold': -1, 'YrSold': -1}
# Now we impute the missing data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Sanity check:
# No numerical variable with NA is left in the
# transformed data.
[v for v in train_t.columns
 if train_t[v].dtypes != 'O' and train_t[v].isnull().sum() > 0]
[]
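The check above only looks at numerical columns, so categorical variables keep their missing values. A small pandas-only sketch on invented data makes that visible; the columns and the value -1 are placeholders that mimic the example.

```python
import numpy as np
import pandas as pd

# Toy frame invented for this illustration: one numerical, one categorical column.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "Alley": [np.nan, "Grvl", np.nan],
})

# Fill NA in the numerical columns only, leaving the rest as is.
num_cols = toy.select_dtypes(include="number").columns
toy_imputed = toy.copy()
toy_imputed[num_cols] = toy[num_cols].fillna(-1)

# Count the NA left in each column after the imputation.
na_counts = toy_imputed.isnull().sum()
print(na_counts)
```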
# New: we can get the names of the features in the final output
imputer.get_feature_names_out()
['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']