The EndTailImputer() replaces missing data with a value at either tail of the variable's distribution. It determines that value automatically, using the mean plus or minus a factor of the standard deviation, the inter-quartile range proximity rule, or a factor of the maximum value.
The EndTailImputer() is, in essence, very similar to the ArbitraryNumberImputer(), but it selects the value to use for the imputation automatically, instead of having the user pre-define it.
It works only with numerical variables.
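Concretely, here is a minimal sketch of how each method derives the imputation value. It uses plain pandas on a made-up series, for illustration only; it is not Feature-engine's internal code:

import pandas as pd

# a hypothetical variable, for illustration only
x = pd.Series([2.0, 3.5, 4.0, 5.5, 100.0])
fold = 3

# Gaussian approximation: mean plus (right tail) or
# minus (left tail) a factor of the standard deviation
right_gaussian = x.mean() + fold * x.std()
left_gaussian = x.mean() - fold * x.std()

# IQR proximity rule: quartile plus or minus
# a factor of the inter-quartile range
iqr = x.quantile(0.75) - x.quantile(0.25)
right_iqr = x.quantile(0.75) + fold * iqr
left_iqr = x.quantile(0.25) - fold * iqr

# factor of the maximum value
beyond_max = x.max() * fold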
For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock.
The version of the dataset used in this notebook can be obtained from Kaggle.
# Make sure you are using this
# Feature-engine version.
import feature_engine
feature_engine.__version__
'1.2.0'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import EndTailImputer
# Download the data from Kaggle and store it
# in the same folder as this notebook.
data = pd.read_csv('houseprice.csv')
data.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
# Separate the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
data.drop(['Id', 'SalePrice'], axis=1),
data['SalePrice'],
test_size=0.3,
random_state=0,
)
X_train.shape, X_test.shape
((1022, 79), (438, 79))
# numerical variables with missing data
X_train[['LotFrontage', 'MasVnrArea']].isnull().mean()
LotFrontage    0.184932
MasVnrArea     0.004892
dtype: float64
The EndTailImputer() can replace NA with a value at the left or right tail of the distribution, and it offers 3 different methods to identify that value.
In the following cells, we show how to use each method.
Let's begin by finding the imputation values at the right tail, using the mean and the standard deviation.
imputer = EndTailImputer(
# uses mean and standard deviation to determine the value
imputation_method='gaussian',
# value at right tail of distribution
tail='right',
# multiply the std by 3
fold=3,
# the variables to impute
variables=['LotFrontage', 'MasVnrArea'],
)
# find the imputation values
imputer.fit(X_train)
EndTailImputer(variables=['LotFrontage', 'MasVnrArea'])
# The values for the imputation
imputer.imputer_dict_
{'LotFrontage': 138.9022201686726, 'MasVnrArea': 648.3947111415165}
Note that the imputer learned a different imputation value for each variable.
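As a quick, purely illustrative sanity check, we can reproduce the value for LotFrontage by hand:

# mean plus 3 standard deviations; should match
# imputer.imputer_dict_['LotFrontage']
X_train['LotFrontage'].mean() + 3 * X_train['LotFrontage'].std()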
# impute the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# check we no longer have NA
train_t['LotFrontage'].isnull().sum()
0
# The variable distribution changed slightly with
# more values accumulating towards the right tail
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
Now, we will impute the variables with values at the left tail of the distribution, identified with the inter-quartile range (IQR) proximity rule.
Because quantiles are robust to outliers, the IQR proximity rule is better suited for skewed variables than the Gaussian approximation.
imputer = EndTailImputer(
# uses the inter-quartile range proximity rule
imputation_method='iqr',
# determines values at the left tail of the distribution
tail='left',
# multiplies the IQR by 3
fold=3,
# the variables to impute
variables=['LotFrontage', 'MasVnrArea'],
)
# finds the imputation values
imputer.fit(X_train)
EndTailImputer(imputation_method='iqr', tail='left', variables=['LotFrontage', 'MasVnrArea'])
# imputation values per variable
imputer.imputer_dict_
{'LotFrontage': -8.0, 'MasVnrArea': -510.0}
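Again, just as an illustrative check, the LotFrontage value follows from the quartiles of the variable:

# 25th percentile minus 3 times the IQR; should match
# imputer.imputer_dict_['LotFrontage']
q1 = X_train['LotFrontage'].quantile(0.25)
q3 = X_train['LotFrontage'].quantile(0.75)
q1 - 3 * (q3 - q1)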
# transform the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Check we have no NA after the transformation
train_t[['LotFrontage', 'MasVnrArea']].isnull().sum()
LotFrontage    0
MasVnrArea     0
dtype: int64
# The variable distribution changed with the
# transformation, with more values
# accumulating towards the left tail.
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
Finally, we can also determine the imputation values as a factor of the maximum value of each variable.
imputer = EndTailImputer(
# imputes beyond the maximum value
imputation_method='max',
# multiplies the maximum value by 3
fold=3,
# the variables to impute
variables=['LotFrontage', 'MasVnrArea'],
)
# find imputation values
imputer.fit(X_train)
EndTailImputer(imputation_method='max', variables=['LotFrontage', 'MasVnrArea'])
# The imputation values.
imputer.imputer_dict_
{'LotFrontage': 939.0, 'MasVnrArea': 4800.0}
# the maximum values of the variables,
# note how the imputer multiplied them by 3
# to determine the imputation values.
X_train[imputer.variables_].max()
LotFrontage     313.0
MasVnrArea     1600.0
dtype: float64
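Multiplying these maxima by the fold factor reproduces the imputation values:

# 3 times the maximum of each variable; should match
# imputer.imputer_dict_
X_train[['LotFrontage', 'MasVnrArea']].max() * 3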
# impute the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Check we have no NA in the imputed data
train_t[['LotFrontage', 'MasVnrArea']].isnull().sum()
LotFrontage    0
MasVnrArea     0
dtype: int64
# The variable distribution changed with the
# transformation, with now more values
# beyond the maximum.
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
As with all Feature-engine transformers, if we don't indicate the variables, the EndTailImputer() will automatically find and impute all numerical variables in the data.
# Set up the imputer
imputer = EndTailImputer()
# Check the default parameters
# how to find the imputation value
imputer.imputation_method
'gaussian'
# which tail to use
imputer.tail
'right'
# how far out
imputer.fold
3
# Find variables and imputation values
imputer.fit(X_train)
EndTailImputer()
# The variables to impute
imputer.variables_
['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
# The imputation values
imputer.imputer_dict_
{'MSSubClass': 183.0960051903714, 'LotFrontage': 138.9022201686726, 'LotArea': 41441.796589850215, 'OverallQual': 10.152919665538322, 'OverallCond': 8.918356149675976, 'YearBuilt': 2061.66731604675, 'YearRemodAdd': 2046.2161089423614, 'MasVnrArea': 648.3947111415165, 'BsmtFinSF1': 1732.0016007094835, 'BsmtFinSF2': 520.9882766560984, 'BsmtUnfSF': 1865.113698435333, 'TotalBsmtSF': 2286.0497168767233, '1stFlrSF': 2283.6805173062803, '2ndFlrSF': 1677.2392305771546, 'LowQualFinSF': 149.0787736885176, 'GrLivArea': 3075.569310556133, 'BsmtFullBath': 1.9636856192070633, 'BsmtHalfBath': 0.7637721815299992, 'FullBath': 3.2012740993879882, 'HalfBath': 1.877166869732324, 'BedroomAbvGr': 5.303758597292265, 'KitchenAbvGr': 1.7084277213255645, 'TotRmsAbvGrd': 11.395793721778118, 'Fireplaces': 2.519529226227064, 'GarageYrBlt': 2052.9707419772235, 'GarageCars': 3.966386813249906, 'GarageArea': 1095.8302008827814, 'WoodDeckSF': 480.04361090824267, 'OpenPorchSF': 250.26561495660084, 'EnclosedPorch': 216.43485488519244, '3SsnPorch': 89.5229867716376, 'ScreenPorch': 184.35773738383577, 'PoolArea': 101.82445982535369, 'MiscVal': 1817.7712851835915, 'MoSold': 14.42955308807171, 'YrSold': 2011.8643245428148}
# impute the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Sanity check:
# no numerical variable with NA is left in the
# transformed data.
[v for v in train_t.columns
 if train_t[v].dtype != 'O' and train_t[v].isnull().sum() > 0]
[]
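Finally, because the EndTailImputer() follows the scikit-learn API, it can be placed inside a Pipeline. Below is a minimal sketch; the choice of Lasso as the estimator and of the two variables is just for illustration:

from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('imputer', EndTailImputer(variables=['LotFrontage', 'MasVnrArea'])),
    ('regressor', Lasso(random_state=0)),
])

# the imputer learns the imputation values from the train set,
# then the regression model trains on the imputed data
pipe.fit(X_train[['LotFrontage', 'MasVnrArea']], y_train)
preds = pipe.predict(X_test[['LotFrontage', 'MasVnrArea']])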