EqualFrequencyDiscretiser + WoEEncoder

Combining discretisation with a monotonic encoding is very useful for linear models: it creates variables with a monotonic relationship to the target out of variables that originally had none, which tends to improve the performance of the linear model.

EqualFrequencyDiscretiser

The EqualFrequencyDiscretiser() divides continuous numerical variables into contiguous equal frequency intervals, that is, intervals that contain approximately the same proportion of observations.

The interval limits are determined by the quantiles of the variable. The number of intervals, i.e., the number of quantiles into which the variable should be divided, is determined by the user.
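For intuition, here is a minimal sketch (not part of the original notebook) showing that, with q=4 for example, the inner interval limits are simply the quartiles of the variable:

import numpy as np

values = np.random.exponential(scale=10, size=1000)

# the inner limits of 4 equal-frequency intervals are the quartiles;
# each interval then contains roughly 250 of the 1000 observations
limits = np.quantile(values, [0.25, 0.50, 0.75])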

Note: Check out the EqualFrequencyDiscretiser notebook to learn more about this transformer.

WoEEncoder

This encoder replaces each category with the weight of evidence (WoE).

It only works for binary classification.

Note: Check out the WoEEncoder notebook to learn more about this transformer.
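
For reference, the weight of evidence of a category is the natural logarithm of the ratio between the proportion of positive and the proportion of negative observations that fall into that category. A minimal sketch of that computation (the actual encoder also guards against edge cases, such as categories containing only one of the two classes):

import numpy as np

def woe(categories, y):
    # categories, y: aligned pandas Series; y is binary (0/1)
    pos = categories[y == 1].value_counts() / (y == 1).sum()
    neg = categories[y == 0].value_counts() / (y == 0).sum()
    return np.log(pos / neg)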

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import WoEEncoder

plt.rcParams["figure.figsize"] = [15,5]
In [2]:
# Load titanic dataset from OpenML

def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['age'] = data['age'].astype('float').fillna(data.age.median())
    data['fare'] = data['fare'].astype('float').fillna(data.fare.median())
    data['embarked'] = data['embarked'].fillna('C')
    data.drop(labels=['boat', 'body', 'home.dest', 'name', 'ticket'], axis=1, inplace=True)
    return data
In [3]:
data = load_titanic()
data.head()
Out[3]:
pclass survived sex age sibsp parch fare cabin embarked
0 1 1 female 29.0000 0 0 211.3375 B S
1 1 1 male 0.9167 1 2 151.5500 C S
2 1 0 female 2.0000 1 2 151.5500 C S
3 1 0 male 30.0000 1 2 151.5500 C S
4 1 0 female 25.0000 1 2 151.5500 C S
In [4]:
# let's separate into training and testing set
X = data.drop(['survived'], axis=1)
y = data.survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("X_train :" ,X_train.shape)
print("X_test :" ,X_test.shape)
X_train : (916, 8)
X_test : (393, 8)
In [5]:
# we will use two continuous variables for the transformations
X_train[["age", 'fare']].hist(bins=30)
plt.show()
In [8]:
# set up the discretiser

efd = EqualFrequencyDiscretiser(
    q=4,
    variables=['age', 'fare'],
    # important: return the bins as object (categorical)
    # so that the encoder can process them
    return_object=True)

# set up the encoder
woe = WoEEncoder(variables=['age', 'fare'])

# pipeline
transformer = Pipeline(steps=[('EqualFrequencyDiscretiser', efd),
                              ('WoEEncoder', woe),
                              ])

transformer.fit(X_train, y_train)
Out[8]:
Pipeline(memory=None,
         steps=[('EqualFrequencyDiscretiser',
                 EqualFrequencyDiscretiser(q=4, return_boundaries=False,
                                           return_object=True,
                                           variables=['age', 'fare'])),
                ('WoEEncoder', WoEEncoder(variables=['age', 'fare']))],
         verbose=False)
In [9]:
transformer.named_steps['EqualFrequencyDiscretiser'].binner_dict_
Out[9]:
{'age': [-inf, 23.0, 28.0, 35.0, inf],
 'fare': [-inf, 7.8958, 14.4542, 31.275, inf]}
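As a quick sanity check (not in the original notebook), the inner limits stored in binner_dict_ should match the train-set quartiles:

X_train[['age', 'fare']].quantile([0.25, 0.5, 0.75])
# fare: approximately 7.8958, 14.4542, 31.275, the inner limits shown above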
In [10]:
transformer.named_steps['WoEEncoder'].encoder_dict_
Out[10]:
{'age': {0: 0.07533270507296917,
  1: -0.260402163917158,
  2: 0.3237107275657203,
  3: 0.05769015189511875},
 'fare': {0: -0.5990108946387251,
  1: -0.41504696424627724,
  2: 0.142571903020815,
  3: 0.7852653023249282}}
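These mappings can be reproduced by hand; a sketch using the fitted steps from the pipeline above:

binned = transformer.named_steps['EqualFrequencyDiscretiser'].transform(X_train)

# fraction of all survivors and of all non-survivors in each fare bin
pos = binned.loc[y_train == 1, 'fare'].value_counts() / (y_train == 1).sum()
neg = binned.loc[y_train == 0, 'fare'].value_counts() / (y_train == 0).sum()

np.log(pos / neg)  # should reproduce encoder_dict_['fare']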
In [11]:
train_t = transformer.transform(X_train)
test_t = transformer.transform(X_test)

test_t.head()
Out[11]:
pclass sex age sibsp parch fare cabin embarked
1139 3 male 0.057690 0 0 -0.599011 n S
533 2 female 0.075333 0 1 0.142572 n S
459 2 male 0.057690 1 0 0.142572 n S
1150 3 male -0.260402 0 0 0.142572 n S
393 2 male -0.260402 0 0 0.785265 n S
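Note that after the pipeline, age and fare each take only q=4 distinct values, the WoE numbers stored in encoder_dict_:

test_t[['age', 'fare']].nunique()  # should show 4 distinct WoE values each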
In [15]:
# let's explore the monotonic relationship
plt.figure(figsize=(7,5))
pd.concat([test_t,y_test], axis=1).groupby("fare")["survived"].mean().plot()
plt.title("Relationship between fare and target")
plt.xlabel("fare")
plt.ylabel("Mean of target")
plt.show()

Note how the intervals are now monotonically ordered with respect to the target.
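
The same check can be run for age (a parallel plot, not shown in the original output):

plt.figure(figsize=(7, 5))
pd.concat([test_t, y_test], axis=1).groupby("age")["survived"].mean().plot()
plt.title("Relationship between age and target")
plt.xlabel("age")
plt.ylabel("Mean of target")
plt.show()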
