The PowerTransformer() applies power or exponential transformations to numerical variables: each value is raised to the exponent set with the exp parameter.
It works only with numerical variables.
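As a rough illustration of the underlying arithmetic, the sketch below raises each value in a list to a fixed exponent. It is plain Python rather than the library's implementation, and the helper name `power_transform` is hypothetical.

```python
def power_transform(values, exp=0.5):
    """Raise every value to the power `exp` (hypothetical helper, illustration only)."""
    return [v ** exp for v in values]


areas = [400.0, 900.0, 1600.0]  # made-up values, not taken from the dataset

print(power_transform(areas, exp=0.5))  # exponent 1/2: square root
print(power_transform(areas, exp=2))    # exponent 2: square
```

Exponents below 1 compress large values more than small ones, while exponents above 1 stretch them.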
For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:
Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project, Journal of Statistics Education, Vol. 19, No. 3
http://jse.amstat.org/v19n3/decock.pdf
https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627
The version of the dataset used in this notebook can be obtained from Kaggle.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import ArbitraryNumberImputer
from feature_engine.transformation import PowerTransformer
# load data
data = pd.read_csv('houseprice.csv')
data.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
# separate the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)
X_train.shape, X_test.shape
((1022, 79), (438, 79))
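The row counts in the output follow from the 30% test fraction. The sketch below reproduces them, assuming that a fractional test_size is rounded up to a whole number of rows and that the training set receives the remainder; the variable names are illustrative.

```python
import math

n_samples = 1022 + 438  # total rows, taken from the two shapes printed above
test_size = 0.3

# Assumption: the test set size is the fraction of rows, rounded up
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```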
# initialize the transformer with exponent 1/2,
# which is equivalent to taking the square root;
# we will transform only 2 variables
et_transformer = PowerTransformer(variables=['LotArea', 'GrLivArea'], exp=0.5)
et_transformer.fit(X_train)
PowerTransformer(variables=['LotArea', 'GrLivArea'])
# transform variables
train_t = et_transformer.transform(X_train)
test_t = et_transformer.transform(X_test)
# variable before transformation
X_train['GrLivArea'].hist(bins=50)
plt.title('Variable before transformation')
plt.xlabel('GrLivArea')
Text(0.5, 0, 'GrLivArea')
# transformed variable
train_t['GrLivArea'].hist(bins=50)
plt.title('Transformed variable')
plt.xlabel('GrLivArea')
Text(0.5, 0, 'GrLivArea')
# variable before transformation
X_train['LotArea'].hist(bins=50)
plt.title('Variable before transformation')
plt.xlabel('LotArea')
Text(0.5, 0, 'LotArea')
# transformed variable
train_t['LotArea'].hist(bins=50)
plt.title('Transformed variable')
plt.xlabel('LotArea')
Text(0.5, 0, 'LotArea')
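The histograms compare the shape of each variable by eye. A numerical check is also possible: the sketch below computes the sample skewness before and after a square root transformation. It uses synthetic right-skewed data as a stand-in for an area-like variable, not the Ames data, and the `skewness` helper is hypothetical.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: mean of the cubed z-scores (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])


random.seed(0)
# Synthetic right-skewed values (log-normal); NOT the Ames data
raw = [random.lognormvariate(7, 0.5) for _ in range(1000)]
transformed = [v ** 0.5 for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(transformed):.2f}")
```

A skewness closer to zero after the transformation indicates a more symmetric distribution.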
# return variables to original representation
train_orig = et_transformer.inverse_transform(train_t)
test_orig = et_transformer.inverse_transform(test_t)
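The round trip can be sketched in plain Python, assuming the inverse simply raises the transformed values to the power 1 / exp. The three values are the first LotArea entries from the table at the top of the notebook.

```python
import math

exp = 0.5
original = [8450.0, 9600.0, 11250.0]  # first LotArea values shown in data.head()

transformed = [v ** exp for v in original]
# Assumption: the inverse raises the transformed values to 1 / exp
recovered = [t ** (1 / exp) for t in transformed]

# Floating-point arithmetic makes the round trip approximate, so compare with a tolerance
print(all(math.isclose(o, r) for o, r in zip(original, recovered)))
```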
# inverse transformed variable distribution
train_orig['LotArea'].hist(bins=50)
<AxesSubplot:>
# inverse transformed variable distribution
train_orig['GrLivArea'].hist(bins=50)
<AxesSubplot:>
When variables=None, the PowerTransformer() selects all numerical variables automatically, so we first need to ensure that none of them contain missing data.
# replace missing data with an arbitrary number
arbitrary_imputer = ArbitraryNumberImputer()
arbitrary_imputer.fit(X_train)
# impute variables
train_t = arbitrary_imputer.transform(X_train)
test_t = arbitrary_imputer.transform(X_test)
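The sketch below shows the idea behind arbitrary number imputation in plain Python. The fill value of 999 is assumed to match the imputer's default and should be checked against its documentation; the helper `impute_arbitrary` and the sample values are hypothetical.

```python
import math

ARBITRARY_NUMBER = 999  # assumed default; check the imputer's documentation


def impute_arbitrary(values, fill=ARBITRARY_NUMBER):
    """Replace NaN entries with a fixed number (hypothetical helper)."""
    return [fill if math.isnan(v) else v for v in values]


lot_frontage = [65.0, float("nan"), 68.0]  # illustrative values with one gap
imputed = impute_arbitrary(lot_frontage)
squared = [v ** 2 for v in imputed]

print(imputed)
print(squared)
```

The placeholder is transformed along with the observed values, so under exp=2 an imputed entry ends up far from the rest of the data. This is worth keeping in mind when reading the histograms of imputed variables.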
# initialize transformer with exp as 2
et = PowerTransformer(exp=2, variables=None)
et.fit(train_t)
PowerTransformer(exp=2)
# variables selected for transformation
et.variables_
['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
# before transformation
train_t['GrLivArea'].hist(bins=50)
<AxesSubplot:>
# transform variables
train_t = et.transform(train_t)
test_t = et.transform(test_t)
# transformed variable
train_t['GrLivArea'].hist(bins=50)
<AxesSubplot:>
# return variables to original representation
train_orig = et.inverse_transform(train_t)
test_orig = et.inverse_transform(test_t)
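One caveat when inverting a square transformation (exp=2): squaring maps a value and its negative to the same result, so the sign cannot be recovered. The sketch below assumes the inverse is a power of 1/2. The numerical variables in this dataset appear to be non-negative (areas, counts, years), so the issue should not arise here.

```python
values = [-3.0, 3.0]

squared = [v ** 2 for v in values]
# Assumption: the inverse of exp=2 is raising to the power 1/2
recovered = [s ** 0.5 for s in squared]

print(squared)    # both inputs map to the same squared value
print(recovered)  # the sign of the negative input is not recovered
```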
# inverse transformed variable distribution
train_orig['LotArea'].hist(bins=50)
<AxesSubplot:>
# inverse transformed variable distribution
train_orig['GrLivArea'].hist(bins=50)
<AxesSubplot:>