Variable transformers: PowerTransformer

The PowerTransformer() applies power or exponential transformations to numerical variables.

The PowerTransformer() works only with numerical variables.
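
As a rough illustration of what a power transformation does, the sketch below raises the values of two columns to a chosen exponent using plain pandas. The toy DataFrame and its column names are invented for this example, and the code is not meant to reproduce the internals of PowerTransformer(); it only shows the arithmetic.

```python
import pandas as pd

# Toy data; the column names are hypothetical and not part of the dataset.
df = pd.DataFrame({"area": [4.0, 9.0, 16.0], "rooms": [1.0, 4.0, 9.0]})

exp = 0.5  # an exponent of 0.5 is the square root

# Raise every value in the selected columns to the chosen exponent,
# working on a copy so the original DataFrame is left untouched.
transformed = df.copy()
transformed[["area", "rooms"]] = df[["area", "rooms"]] ** exp

print(transformed)
```

With an exponent below 1 large values shrink more than small ones, which is why a square root is often tried on right-skewed variables.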

For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:

Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3

http://jse.amstat.org/v19n3/decock.pdf

https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627

The version of the dataset used in this notebook can be obtained from Kaggle.

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine.imputation import ArbitraryNumberImputer
from feature_engine.transformation import PowerTransformer

In [2]:

# load data

data = pd.read_csv('houseprice.csv')
data.head()

Out[2]:

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	12	2008	WD	Normal	250000

5 rows × 81 columns

In [3]:

# let's separate the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)

X_train.shape, X_test.shape

Out[3]:

((1022, 79), (438, 79))

In [4]:

# initialize the transformer with exponent 1/2,
# which is equivalent to the square root;
# we will transform only 2 variables

et_transformer = PowerTransformer(variables=['LotArea', 'GrLivArea'], exp=0.5)

et_transformer.fit(X_train)

Out[4]:

PowerTransformer(variables=['LotArea', 'GrLivArea'])

In [5]:

# transform variables

train_t = et_transformer.transform(X_train)
test_t = et_transformer.transform(X_test)

In [6]:

# variable before transformation
X_train['GrLivArea'].hist(bins=50)
plt.title('Variable before transformation')
plt.xlabel('GrLivArea')

Out[6]:

Text(0.5, 0, 'GrLivArea')

In [7]:

# transformed variable
train_t['GrLivArea'].hist(bins=50)
plt.title('Transformed variable')
plt.xlabel('GrLivArea')

Out[7]:

Text(0.5, 0, 'GrLivArea')

In [8]:

# variable before transformation
X_train['LotArea'].hist(bins=50)
plt.title('Variable before transformation')
plt.xlabel('LotArea')

Out[8]:

Text(0.5, 0, 'LotArea')

In [9]:

# transformed variable
train_t['LotArea'].hist(bins=50)
plt.title('Transformed variable')
plt.xlabel('LotArea')

Out[9]:

Text(0.5, 0, 'LotArea')

In [10]:

# return variables to original representation

train_orig = et_transformer.inverse_transform(train_t)
test_orig = et_transformer.inverse_transform(test_t)

In [11]:

# inverse transformed variable distribution

train_orig['LotArea'].hist(bins=50)

Out[11]:

<AxesSubplot:>

In [12]:

# inverse transformed variable distribution

train_orig['GrLivArea'].hist(bins=50)

Out[12]:

<AxesSubplot:>
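
The histograms give a visual check that the inverse transformation restores the original distributions. A numeric check could be sketched as follows, assuming the inverse of raising to exp is raising to 1 / exp. The sketch uses plain pandas and NumPy on three LotArea values from the first rows shown earlier rather than the fitted transformer, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Three LotArea values from the first rows of the dataset, used as toy data.
original = pd.DataFrame({"LotArea": [8450.0, 9600.0, 11250.0]})

exp = 0.5

# Forward transformation: raise to the exponent.
transformed = original ** exp

# Assumed inverse: raise to the reciprocal of the exponent.
recovered = transformed ** (1 / exp)

# Compare with a tolerance, because floating point powers are not exact.
round_trip_ok = np.allclose(recovered, original)
print(round_trip_ok)
```

A tolerance-based comparison is used instead of strict equality because the round trip can introduce small floating point errors.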

Automatically select numerical variables

To use the PowerTransformer(), we need to ensure that the numerical variables don't have missing data.
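
Before imputing, it can help to see which numerical variables actually contain missing values. The following is a minimal sketch with plain pandas on a toy DataFrame; the column names are invented for illustration, and selecting columns with select_dtypes is an assumption about how one might pick out numerical variables, not a description of what the transformer does internally.

```python
import numpy as np
import pandas as pd

# Toy data with one missing numerical value and one missing categorical value.
df = pd.DataFrame({
    "lot_frontage": [65.0, np.nan, 68.0],
    "lot_area": [8450, 9600, 11250],
    "street": ["Pave", "Pave", None],
})

# Keep only the numerical columns.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Count missing values per numerical column and keep those with at least one.
missing_counts = df[numeric_cols].isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(cols_with_missing)
```

Any column listed in cols_with_missing would need to be imputed or dropped before applying a power transformation.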

In [13]:

# replace missing data in numerical variables with an arbitrary number

arbitrary_imputer = ArbitraryNumberImputer()

arbitrary_imputer.fit(X_train)

# impute variables
train_t = arbitrary_imputer.transform(X_train)
test_t = arbitrary_imputer.transform(X_test)

In [14]:

# initialize transformer with exp=2 (square);
# variables=None selects all numerical variables

et = PowerTransformer(exp=2, variables=None)

et.fit(train_t)

Out[14]:

PowerTransformer(exp=2)

In [15]:

# variables to transform
et.variables_

Out[15]:

['MSSubClass',
 'LotFrontage',
 'LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal',
 'MoSold',
 'YrSold']

In [16]:

# before transformation

train_t['GrLivArea'].hist(bins=50)

Out[16]:

<AxesSubplot:>

In [17]:

# transform variables

train_t = et.transform(train_t)
test_t = et.transform(test_t)

In [18]:

# transformed variable
train_t['GrLivArea'].hist(bins=50)

Out[18]:

<AxesSubplot:>

In [19]:

# return variables to original representation

# use the transformer fitted with exp=2 (et), not the earlier square root one
train_orig = et.inverse_transform(train_t)
test_orig = et.inverse_transform(test_t)

In [20]:

# inverse transformed variable distribution

train_orig['LotArea'].hist(bins=50)

Out[20]:

<AxesSubplot:>

In [21]:

# inverse transformed variable distribution

train_orig['GrLivArea'].hist(bins=50)

Out[21]:

<AxesSubplot:>