The LogTransformer() applies the natural logarithm or the base-10 logarithm to numerical variables. The natural logarithm is the logarithm in base e.
The LogTransformer() works only with strictly positive numerical values. If a variable contains zero or a negative value, the transformer will raise an error.
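A quick NumPy-only illustration (independent of feature_engine) of why non-positive values are rejected: the logarithm is undefined at zero and for negative numbers, and NumPy returns -inf or nan rather than a usable value.

```python
import numpy as np

# log() is mathematically undefined at 0 and for negative numbers;
# NumPy returns -inf / nan (with a runtime warning) instead of raising,
# which is why the transformer validates the data up front.
with np.errstate(divide="ignore", invalid="ignore"):
    print(np.log(0.0))   # -inf
    print(np.log(-1.0))  # nan
print(np.log(np.e))      # 1.0 (natural log, base e)
```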
For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:
Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3
http://jse.amstat.org/v19n3/decock.pdf
https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627
The version of the dataset used in this notebook can be obtained from Kaggle.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import ArbitraryNumberImputer
from feature_engine.transformation import LogTransformer
# load data
data = pd.read_csv('houseprice.csv')
data.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |

5 rows × 81 columns
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'],
    test_size=0.3, random_state=0)
X_train.shape, X_test.shape
((1022, 79), (438, 79))
# plot distributions before transformation
X_train['LotArea'].hist(bins=50)
<AxesSubplot:>
# plot distributions before transformation
X_train['GrLivArea'].hist(bins=50)
<AxesSubplot:>
# initialize the transformer with the natural log (base e)
lt = LogTransformer(variables=['LotArea', 'GrLivArea'], base='e')
lt.fit(X_train)
LogTransformer(variables=['LotArea', 'GrLivArea'])
# variables that will be transformed
lt.variables_
['LotArea', 'GrLivArea']
# apply the log transform
train_t = lt.transform(X_train)
test_t = lt.transform(X_test)
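Under the hood, the natural-log transform is simply np.log applied element-wise to the selected columns. A minimal sketch outside the pipeline (the `area` column here is made up for illustration, not taken from the Ames dataset):

```python
import numpy as np
import pandas as pd

# toy column of strictly positive values
df = pd.DataFrame({"area": [100.0, 1000.0, 10000.0]})

# natural log compresses large values far more than small ones,
# which is why it reduces right skew
logged = np.log(df["area"])
print(logged.tolist())  # roughly [4.61, 6.91, 9.21]
```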
# transformed variable distribution
train_t['LotArea'].hist(bins=50)
<AxesSubplot:>
# transformed variable distribution
train_t['GrLivArea'].hist(bins=50)
<AxesSubplot:>
# return variables to original representation
train_orig = lt.inverse_transform(train_t)
test_orig = lt.inverse_transform(test_t)
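Because the forward transform is the log in base e, the inverse transform applies exp(). A NumPy round trip on arbitrary positive sample values shows the original data is recovered up to floating-point precision:

```python
import numpy as np

# arbitrary positive sample values (not taken from the dataset)
x = np.array([65.0, 8450.0, 1710.0])

# exp() undoes log(): the round trip recovers x
roundtrip = np.exp(np.log(x))
print(roundtrip)  # [  65. 8450. 1710.]
```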
# inverse transformed variable distribution
train_orig['LotArea'].hist(bins=50)
<AxesSubplot:>
# inverse transformed variable distribution
train_orig['GrLivArea'].hist(bins=50)
<AxesSubplot:>
If no variables are specified, the transformer will apply the log transformation to all numerical variables in the dataset.
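A pandas-only sketch of how numerical columns can be picked out automatically (feature_engine performs its own variable checks internally; this just illustrates the idea on a made-up frame):

```python
import pandas as pd

# toy frame with two numerical columns and one categorical column
df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})

# select only the numerical columns, as a variables=None setting would
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['a', 'b']
```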
# load numerical variables only
variables = ['LotFrontage', 'LotArea',
'1stFlrSF', 'GrLivArea',
'TotRmsAbvGrd', 'SalePrice']
data = pd.read_csv('houseprice.csv', usecols=variables)
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['SalePrice'], axis=1), data['SalePrice'],
    test_size=0.3, random_state=0)
X_train.shape, X_test.shape
((1022, 5), (438, 5))
# Impute missing values
arbitrary_imputer = ArbitraryNumberImputer(arbitrary_number=2)
arbitrary_imputer.fit(X_train)
# impute variables
train_t = arbitrary_imputer.transform(X_train)
test_t = arbitrary_imputer.transform(X_test)
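Imputing with an arbitrary number amounts to replacing NaN with a fixed constant. A pandas-only sketch with made-up values:

```python
import numpy as np
import pandas as pd

# toy series with a missing value
s = pd.Series([65.0, np.nan, 80.0])

# fill NaN with the arbitrary number 2, as the imputer above does
filled = s.fillna(2)
print(filled.tolist())  # [65.0, 2.0, 80.0]
```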
# transform all numerical variables with base 10
lt = LogTransformer(base='10', variables=None)
lt.fit(train_t)
LogTransformer(base='10')
# variables that will be transformed
lt.variables_
['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'TotRmsAbvGrd']
# before transformation
train_t['GrLivArea'].hist(bins=50)
plt.title('GrLivArea')
Text(0.5, 1.0, 'GrLivArea')
# Before transformation
train_t['LotArea'].hist(bins=50)
plt.title('LotArea')
Text(0.5, 1.0, 'LotArea')
# transform the data
train_t = lt.transform(train_t)
test_t = lt.transform(test_t)
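With base='10', the forward transform is log10 and the inverse is 10**x. A quick NumPy check on arbitrary positive values:

```python
import numpy as np

# arbitrary positive sample values (not taken from the dataset)
x = np.array([60.0, 8450.0])

# base-10 log and its inverse
t = np.log10(x)
recovered = 10 ** t
print(recovered)  # [  60. 8450.]
```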
# transformed variable
train_t['GrLivArea'].hist(bins=50)
plt.title('GrLivArea')
Text(0.5, 1.0, 'GrLivArea')
# transformed variable
train_t['LotArea'].hist(bins=50)
plt.title('LotArea')
Text(0.5, 1.0, 'LotArea')
# return variables to original representation
train_orig = lt.inverse_transform(train_t)
test_orig = lt.inverse_transform(test_t)
# inverse transformed variable distribution
train_orig['LotArea'].hist(bins=50)
<AxesSubplot:>
# inverse transformed variable distribution
train_orig['GrLivArea'].hist(bins=50)
<AxesSubplot:>