This is very useful for linear models, because by using discretisation + a monotonic encoding, we create monotonic variables with the target, from those that before were not originally. And this tends to help improve the performance of the linear model.
The EqualWidthDiscretiser() divides continuous numerical variables into intervals of the same width, that is, equidistant intervals. Note that the proportion of observations per interval may vary.
The number of intervals in which the variable should be divided must be indicated by the user.
Note: Check out the EqualWidthDiscretiser notebook to learn more about this transformer.
The OrdinalEncoder() will replace the variable labels by digits, from 1 to the number of different labels.
If we select "arbitrary", then the encoder will assign numbers as the labels appear in the variable (first come first served).
If we select "ordered", the encoder will assign numbers following the mean of the target value for that label. So labels for which the mean of the target is higher will get the number 1, and those where the mean of the target is smallest will get the number n.
Note: Check out the OrdinalEncoder notebook to know more about this transformer.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from feature_engine.discretisation import EqualWidthDiscretiser
from feature_engine.encoding import OrdinalEncoder
plt.rcParams["figure.figsize"] = [15,5]
# Load titanic dataset from OpenML
def load_titanic():
data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
data = data.replace('?', np.nan)
data['cabin'] = data['cabin'].astype(str).str[0]
data['pclass'] = data['pclass'].astype('O')
data['age'] = data['age'].astype('float').fillna(data.age.median())
data['fare'] = data['fare'].astype('float').fillna(data.fare.median())
data['embarked'].fillna('C', inplace=True)
data.drop(labels=['boat', 'body', 'home.dest', 'name', 'ticket'], axis=1, inplace=True)
return data
data = load_titanic()
data.head()
pclass | survived | sex | age | sibsp | parch | fare | cabin | embarked | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | female | 29.0000 | 0 | 0 | 211.3375 | B | S |
1 | 1 | 1 | male | 0.9167 | 1 | 2 | 151.5500 | C | S |
2 | 1 | 0 | female | 2.0000 | 1 | 2 | 151.5500 | C | S |
3 | 1 | 0 | male | 30.0000 | 1 | 2 | 151.5500 | C | S |
4 | 1 | 0 | female | 25.0000 | 1 | 2 | 151.5500 | C | S |
# let's separate into training and testing set
X = data.drop(['survived'], axis=1)
y = data.survived
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0)
print("X_train :", X_train.shape)
print("X_test :", X_test.shape)
X_train : (916, 8) X_test : (393, 8)
# we will use two continuous variables for the transformations
X_train[["age", 'fare']].hist(bins=30)
plt.show()
# set up the discretiser
ewd = EqualWidthDiscretiser(
bins=5,
variables=['age', 'fare'],
# important: return values as categorical
return_object=True)
# set up the encoder
oe = OrdinalEncoder(variables=['age', 'fare'])
# pipeline
transformer = Pipeline(steps=[('EqualWidthDiscretiser', ewd),
('OrdinalEncoder', oe),
])
transformer.fit(X_train, y_train)
Pipeline(memory=None, steps=[('EqualWidthDiscretiser', EqualWidthDiscretiser(bins=5, return_boundaries=False, return_object=True, variables=['age', 'fare'])), ('OrdinalEncoder', OrdinalEncoder(encoding_method='ordered', variables=['age', 'fare']))], verbose=False)
transformer.named_steps['EqualWidthDiscretiser'].binner_dict_
{'age': [-inf, 14.933359999999999, 29.700019999999995, 44.46667999999999, 59.23333999999999, inf], 'fare': [-inf, 102.46584, 204.93168, 307.39752, 409.86336, inf]}
transformer.named_steps['OrdinalEncoder'].encoder_dict_
{'age': {4: 0, 1: 1, 2: 2, 3: 3, 0: 4}, 'fare': {0: 0, 2: 1, 1: 2, 4: 3}}
train_t = transformer.transform(X_train)
test_t = transformer.transform(X_test)
test_t.head()
pclass | sex | age | sibsp | parch | fare | cabin | embarked | |
---|---|---|---|---|---|---|---|---|
1139 | 3 | male | 2 | 0 | 0 | 0 | n | S |
533 | 2 | female | 1 | 0 | 1 | 0 | n | S |
459 | 2 | male | 2 | 1 | 0 | 0 | n | S |
1150 | 3 | male | 1 | 0 | 0 | 0 | n | S |
393 | 2 | male | 1 | 0 | 0 | 0 | n | S |
# let's explore the monotonic relationship
plt.figure(figsize=(7,5))
pd.concat([test_t,y_test], axis=1).groupby("fare")["survived"].mean().plot()
plt.title("Relationship between fare and target")
plt.xlabel("fare")
plt.ylabel("Mean of target")
plt.show()
Note how the bins are monotonically ordered with the target.