The RandomSampleImputer extracts a random sample of observations from the variables where data is available, and uses those values to replace the NA. It is suitable for both numerical and categorical variables.
To control the random sample extraction, the imputer offers various ways of setting a seed, to ensure or maximize reproducibility.
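Conceptually, this is the idea behind random sample imputation. Below is a minimal sketch with plain pandas on a toy Series; it only illustrates the mechanism, it is not Feature-engine's internal code.
# A toy variable with missing values (illustration only).
import numpy as np
import pandas as pd
s = pd.Series([2.5, np.nan, 3.1, np.nan, 4.0, 2.9])
# Draw as many random values from the observed data
# as there are NA, then align them with the missing slots.
na_index = s[s.isnull()].index
random_values = s.dropna().sample(len(na_index), replace=True, random_state=0)
random_values.index = na_index
# Replace the NA with the randomly extracted values.
s.fillna(random_values)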
For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock.
The version of the dataset used in this notebook can be obtained from Kaggle.
# Make sure you are using this
# Feature-engine version.
import feature_engine
feature_engine.__version__
'1.2.0'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import RandomSampleImputer
# Download the data from Kaggle and store it
# in the same folder as this notebook.
data = pd.read_csv('houseprice.csv')
data.head()
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
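Before setting up the imputer, we can check which variables contain missing data; the variables we impute below are among those with NA.
# Proportion of missing data per variable,
# showing only variables with at least 1 NA.
na_fraction = data.isnull().mean()
na_fraction[na_fraction > 0].sort_values(ascending=False)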
# Separate the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
data.drop(['Id', 'SalePrice'], axis=1),
data['SalePrice'],
test_size=0.3,
random_state=0,
)
X_train.shape, X_test.shape
((1022, 79), (438, 79))
We can set the imputer to impute several observations in a batch with a single seed. This is equivalent to setting the random_state to an integer in pandas.sample().
# Start the imputer
imputer = RandomSampleImputer(
# the variables to impute
variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'],
# the random state for reproducibility
random_state=10,
# equivalent to setting random_state in
# pandas.sample()
seed='general',
)
# Stores a copy of the train set variables
imputer.fit(X_train)
RandomSampleImputer(random_state=10, variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])
# the imputer saves a copy of the variables
# from the training set to impute new data.
imputer.X_.head()
 | Alley | MasVnrType | LotFrontage | MasVnrArea |
---|---|---|---|---|
64 | NaN | BrkFace | NaN | 573.0 |
682 | NaN | None | NaN | 0.0 |
960 | NaN | None | 50.0 | 0.0 |
1384 | NaN | None | 60.0 | 0.0 |
1100 | NaN | None | 60.0 | 0.0 |
# Check missing data in train set
X_train[['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']].isnull().mean()
Alley          0.939335
MasVnrType     0.004892
LotFrontage    0.184932
MasVnrArea     0.004892
dtype: float64
# impute data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# Check missing data after the transformation
train_t[['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea']].isnull().mean()
Alley          0.0
MasVnrType     0.0
LotFrontage    0.0
MasVnrArea     0.0
dtype: float64
# when using the random sample imputer,
# the distribution of the variable does not change.
# This imputation method is useful for models that
# are sensitive to changes in the variable distributions.
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
<matplotlib.legend.Legend at 0x6ac7886a0>
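We can also check this numerically. Below is a quick comparison of the summary statistics of LotFrontage and the category frequencies of Alley before and after the imputation; with random sample imputation the differences should be small.
# Compare the summary statistics of LotFrontage
# before and after the imputation.
pd.concat(
    [X_train['LotFrontage'].describe(), train_t['LotFrontage'].describe()],
    axis=1,
    keys=['original', 'imputed'],
)
# For categorical variables, the frequency of the observed
# categories should also be roughly preserved.
pd.concat(
    [
        X_train['Alley'].value_counts(normalize=True),
        train_t['Alley'].value_counts(normalize=True),
    ],
    axis=1,
    keys=['original', 'imputed'],
)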
Sometimes, we want to guarantee that the same observation is imputed with the same value, run after run.
To achieve this, we need to always use the same seed for each particular observation.
We can do this by using the values of neighboring variables as the seed.
In this case, the seed is calculated observation by observation, either by adding or multiplying the values of the seeding variables, and is passed to the random_state of pandas.sample(), which is used under the hood by the imputer. A value is then extracted from the train set using that seed and used to replace the NaN in that particular observation.
To learn more about how the seed per observation is used, check this notebook.
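The sketch below shows, roughly, this logic for a single observation and a single variable, using plain pandas; it illustrates the mechanism only and is not the imputer's actual internal code.
# Rough sketch of the per-observation logic for one variable
# (LotFrontage) and one row with NA, assuming the seeding
# variables MSSubClass and YrSold are combined by addition.
row = X_train[X_train['LotFrontage'].isnull()].iloc[0]
# The seed is derived from the values of the seeding
# variables in that same observation.
obs_seed = int(row['MSSubClass'] + row['YrSold'])
# A single value is extracted from the observed data in the
# train set using that seed, and used to replace the NA.
X_train['LotFrontage'].dropna().sample(1, random_state=obs_seed).iloc[0]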
imputer = RandomSampleImputer(
# the values of these variables will be used as seed
random_state=['MSSubClass', 'YrSold'],
# 1 seed per observation
seed='observation',
# how to combine the values of the seeding variables
seeding_method='add',
# impute all variables, numerical and categorical
variables=None,
)
# Stores a copy of the train set.
imputer.fit(X_train)
RandomSampleImputer(random_state=['MSSubClass', 'YrSold'], seed='observation')
# takes a copy of the entire train set
imputer.X_
 | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
64 | 60 | RL | NaN | 9375 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | GdPrv | NaN | 0 | 2 | 2009 | WD | Normal |
682 | 120 | RL | NaN | 2887 | Pave | NaN | Reg | HLS | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 11 | 2008 | WD | Normal |
960 | 20 | RL | 50.0 | 7207 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2010 | WD | Normal |
1384 | 50 | RL | 60.0 | 9060 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 10 | 2009 | WD | Normal |
1100 | 30 | RL | 60.0 | 8400 | Pave | NaN | Reg | Bnk | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 1 | 2009 | WD | Normal |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
763 | 60 | RL | 82.0 | 9430 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 180 | 0 | NaN | NaN | NaN | 0 | 7 | 2009 | WD | Normal |
835 | 20 | RL | 60.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2010 | WD | Normal |
1216 | 90 | RM | 68.0 | 8930 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal |
559 | 120 | RL | NaN | 3196 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 10 | 2006 | WD | Normal |
684 | 60 | RL | 58.0 | 16770 | Pave | NaN | IR2 | Lvl | AllPub | CulDSac | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
1022 rows × 79 columns
# imputes all variables.
# this procedure takes a while because it is
# done observation by observation.
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
# No missing data in any variable
# after the imputation.
test_t.isnull().sum()
MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
                ..
MiscVal          0
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
Length: 79, dtype: int64
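Because the seed is derived from the observations themselves, repeating the transformation returns exactly the same imputed values. A quick check, which should return True:
# Transform the same data again and verify that the
# imputed values are identical, run after run.
test_t_repeat = imputer.transform(X_test)
test_t.equals(test_t_repeat)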
# when using the random sample imputer,
# the distribution of the variable does not change
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
<matplotlib.legend.Legend at 0x6ae8f2340>