The EndTailImputer() replaces missing data with a value at either tail of the variable's distribution. It determines that value automatically, using the mean plus or minus a factor of the standard deviation, the inter-quartile range proximity rule, or a factor of the maximum value.
The EndTailImputer() is, in essence, very similar to the ArbitraryNumberImputer(), but it selects the value to use for the imputation automatically, instead of having the user pre-define it.
It works only with numerical variables.
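Concretely, here is a minimal sketch of how each method derives the imputation value. It uses plain pandas on a made-up series, for illustration only; it is not Feature-engine's internal code:

import pandas as pd

# a hypothetical variable, for illustration only
x = pd.Series([2.0, 3.5, 4.0, 5.5, 100.0])
fold = 3

# Gaussian approximation: mean plus (right tail) or
# minus (left tail) a factor of the standard deviation
right_gaussian = x.mean() + fold * x.std()
left_gaussian = x.mean() - fold * x.std()

# IQR proximity rule: quartile plus or minus
# a factor of the inter-quartile range
iqr = x.quantile(0.75) - x.quantile(0.25)
right_iqr = x.quantile(0.75) + fold * iqr
left_iqr = x.quantile(0.25) - fold * iqr

# factor of the maximum value
beyond_max = x.max() * fold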
For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock.
The version of the dataset used in this notebook can be obtained from Kaggle.
# Make sure you are using this
# Feature-engine version.
import feature_engine
feature_engine.__version__
'1.2.0'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import EndTailImputer
# Download the data from Kaggle and store it
# in the same folder as this notebook.
data = pd.read_csv('houseprice.csv')
data.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
# Separate the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
data.drop(['Id', 'SalePrice'], axis=1),
data['SalePrice'],
test_size=0.3,
random_state=0,
)
X_train.shape, X_test.shape
((1022, 79), (438, 79))
# numerical variables with missing data
X_train[['LotFrontage', 'MasVnrArea']].isnull().mean()
LotFrontage    0.184932
MasVnrArea     0.004892
dtype: float64
The EndTailImputer() can replace NA with a value at the left or right tail of the distribution, and it offers 3 different methods to identify that value.
In the following cells, we show how to use each method.
Let's begin by finding the imputation values at the right tail, using the mean and the standard deviation.
imputer = EndTailImputer(
# uses mean and standard deviation to determine the value
imputation_method='gaussian',
# value at right tail of distribution
tail='right',
# multiply the std by 3
fold=3,
# the variables to impute
variables=['LotFrontage', 'MasVnrArea'],
)
# find the imputation values
imputer.fit(X_train)
EndTailImputer(variables=['LotFrontage', 'MasVnrArea'])
# The values for the imputation
imputer.imputer_dict_
{'LotFrontage': 138.9022201686726, 'MasVnrArea': 648.3947111415165}
Note that the imputer learned a different imputation value for each variable.
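As a quick, purely illustrative sanity check, we can reproduce the value for LotFrontage by hand:

# mean plus 3 standard deviations; should match
# imputer.imputer_dict_['LotFrontage']
X_train['LotFrontage'].mean() + 3 * X_train['LotFrontage'].std()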
# impute the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# check we no longer have NA
train_t['LotFrontage'].isnull().sum()
0
# The variable distribution changed slightly with
# more values accumulating towards the right tail
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
Now, we will impute the variables with values at the left tail of the distribution, identified with the inter-quartile range (IQR) proximity rule.
Because quantiles are robust to outliers, the IQR proximity rule is better suited for skewed variables than the Gaussian approximation.
imputer = EndTailImputer(
# uses the inter-quartile range proximity rule
imputation_method='iqr',
# determines values at the left tail of the distribution
tail='left',
# multiplies the IQR by 3
fold=3,
# the variables to impute
variables=['LotFrontage', 'MasVnrArea'],
)
# finds the imputation values
imputer.fit(X_train)
EndTailImputer(imputation_method='iqr', tail='left', variables=['LotFrontage', 'MasVnrArea'])
# imputation values per variable
imputer.imputer_dict_
{'LotFrontage': -8.0, 'MasVnrArea': -510.0}
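Again, just as an illustrative check, the LotFrontage value follows from the quartiles of the variable:

# 25th percentile minus 3 times the IQR; should match
# imputer.imputer_dict_['LotFrontage']
q1 = X_train['LotFrontage'].quantile(0.25)
q3 = X_train['LotFrontage'].quantile(0.75)
q1 - 3 * (q3 - q1)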
# transform the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Check we have no NA after the transformation
train_t[['LotFrontage', 'MasVnrArea']].isnull().sum()
LotFrontage    0
MasVnrArea     0
dtype: int64
# The variable distribution changed with the
# transformation, with more values
# accumulating towards the left tail.
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
Finally, we can also determine the imputation values as a factor of the maximum value of each variable.
imputer = EndTailImputer(
# imputes beyond the maximum value
imputation_method='max',
# multiplies the maximum value by 3
fold=3,
# the variables to impute
variables=['LotFrontage', 'MasVnrArea'],
)
# find imputation values
imputer.fit(X_train)
EndTailImputer(imputation_method='max', variables=['LotFrontage', 'MasVnrArea'])
# The imputation values.
imputer.imputer_dict_
{'LotFrontage': 939.0, 'MasVnrArea': 4800.0}
# the maximum values of the variables,
# note how the imputer multiplied them by 3
# to determine the imputation values.
X_train[imputer.variables_].max()
LotFrontage     313.0
MasVnrArea     1600.0
dtype: float64
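Multiplying these maxima by the fold factor reproduces the imputation values:

# 3 times the maximum of each variable; should match
# imputer.imputer_dict_
X_train[['LotFrontage', 'MasVnrArea']].max() * 3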
# impute the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Check we have no NA in the imputed data
train_t[['LotFrontage', 'MasVnrArea']].isnull().sum()
LotFrontage    0
MasVnrArea     0
dtype: int64
# The variable distribution changed with the
# transformation, with now more values
# beyond the maximum.
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
As with all Feature-engine transformers, if we don't indicate the variables, the EndTailImputer() will automatically find and impute all numerical variables in the data.
# Set up the imputer
imputer = EndTailImputer()
# Check the default parameters
# how to find the imputation value
imputer.imputation_method
'gaussian'
# which tail to use
imputer.tail
'right'
# how far out
imputer.fold
3
# Find variables and imputation values
imputer.fit(X_train)
EndTailImputer()
# The variables to impute
imputer.variables_
['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
# The imputation values
imputer.imputer_dict_
{'MSSubClass': 183.0960051903714, 'LotFrontage': 138.9022201686726, 'LotArea': 41441.796589850215, 'OverallQual': 10.152919665538322, 'OverallCond': 8.918356149675976, 'YearBuilt': 2061.66731604675, 'YearRemodAdd': 2046.2161089423614, 'MasVnrArea': 648.3947111415165, 'BsmtFinSF1': 1732.0016007094835, 'BsmtFinSF2': 520.9882766560984, 'BsmtUnfSF': 1865.113698435333, 'TotalBsmtSF': 2286.0497168767233, '1stFlrSF': 2283.6805173062803, '2ndFlrSF': 1677.2392305771546, 'LowQualFinSF': 149.0787736885176, 'GrLivArea': 3075.569310556133, 'BsmtFullBath': 1.9636856192070633, 'BsmtHalfBath': 0.7637721815299992, 'FullBath': 3.2012740993879882, 'HalfBath': 1.877166869732324, 'BedroomAbvGr': 5.303758597292265, 'KitchenAbvGr': 1.7084277213255645, 'TotRmsAbvGrd': 11.395793721778118, 'Fireplaces': 2.519529226227064, 'GarageYrBlt': 2052.9707419772235, 'GarageCars': 3.966386813249906, 'GarageArea': 1095.8302008827814, 'WoodDeckSF': 480.04361090824267, 'OpenPorchSF': 250.26561495660084, 'EnclosedPorch': 216.43485488519244, '3SsnPorch': 89.5229867716376, 'ScreenPorch': 184.35773738383577, 'PoolArea': 101.82445982535369, 'MiscVal': 1817.7712851835915, 'MoSold': 14.42955308807171, 'YrSold': 2011.8643245428148}
# impute the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Sanity check:
# no numerical variable with NA is left in the
# transformed data.
[v for v in train_t.columns
 if train_t[v].dtype != 'O' and train_t[v].isnull().sum() > 0]
[]
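Finally, because the EndTailImputer() follows the scikit-learn API, it can be placed inside a Pipeline. Below is a minimal sketch; the choice of Lasso as the estimator and of the two variables is just for illustration:

from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('imputer', EndTailImputer(variables=['LotFrontage', 'MasVnrArea'])),
    ('regressor', Lasso(random_state=0)),
])

# the imputer learns the imputation values from the train set,
# then the regression model trains on the imputed data
pipe.fit(X_train[['LotFrontage', 'MasVnrArea']], y_train)
preds = pipe.predict(X_test[['LotFrontage', 'MasVnrArea']])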