# EqualWidthDiscretiser + OrdinalEncoder¶

This combination is very useful for linear models: by pairing discretisation with a monotonic encoding, we create variables that have a monotonic relationship with the target out of variables that originally had none. This tends to improve the performance of linear models.

## EqualWidthDiscretiser¶

The EqualWidthDiscretiser() divides continuous numerical variables into intervals of the same width, that is, equidistant intervals. Note that the proportion of observations per interval may vary.

The number of intervals in which the variable should be divided must be indicated by the user.
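The equal-width logic can be sketched with plain pandas. This is a minimal illustration of the idea, not the transformer's actual implementation; `pd.cut` is used here as a stand-in:

```python
import numpy as np
import pandas as pd

# Toy variable: 1000 values drawn uniformly between 0 and 100.
rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(0, 100, size=1000), name="x")

# pd.cut with an integer number of bins produces equidistant intervals,
# i.e. each bin spans the same width of the variable's range.
binned = pd.cut(x, bins=5, labels=False)

# All intervals have the same width, but the number of
# observations per interval may vary.
print(binned.value_counts().sort_index())
```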

## OrdinalEncoder¶

The OrdinalEncoder() will replace the variable labels by digits, from 0 to k-1, where k is the number of different labels.

If we select "arbitrary", then the encoder will assign numbers as the labels appear in the variable (first come first served).

If we select "ordered", the encoder will assign numbers following the mean of the target value for each label: the label with the smallest target mean gets 0, and the label with the highest target mean gets k-1.
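The "ordered" logic can be sketched in a few lines of pandas. This is an illustrative re-implementation on toy data, not feature-engine's code:

```python
import pandas as pd

# Toy data: a categorical variable and a binary target.
df = pd.DataFrame({
    "colour": ["blue", "red", "blue", "green", "red", "green", "green"],
    "target": [0, 1, 0, 1, 1, 1, 0],
})

# "ordered" logic: rank the labels by the mean of the target, ascending,
# so the label with the smallest target mean gets 0.
means = df.groupby("colour")["target"].mean().sort_values()
mapping = {label: i for i, label in enumerate(means.index)}

df["colour_enc"] = df["colour"].map(mapping)
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
```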

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from feature_engine.discretisation import EqualWidthDiscretiser
from feature_engine.encoding import OrdinalEncoder

plt.rcParams["figure.figsize"] = [15,5]

In [3]:
def load_titanic():
    # Load the titanic dataset from OpenML
    data = pd.read_csv(
        'https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['age'] = data['age'].astype('float').fillna(data.age.median())
    data['fare'] = data['fare'].astype('float').fillna(data.fare.median())
    data['embarked'].fillna('C', inplace=True)
    data.drop(labels=['boat', 'body', 'home.dest', 'name', 'ticket'],
              axis=1, inplace=True)
    return data

In [4]:
data = load_titanic()
data.head()

Out[4]:
   pclass  survived     sex      age  sibsp  parch      fare cabin embarked
0       1         1  female  29.0000      0      0  211.3375     B        S
1       1         1    male   0.9167      1      2  151.5500     C        S
2       1         0  female   2.0000      1      2  151.5500     C        S
3       1         0    male  30.0000      1      2  151.5500     C        S
4       1         0  female  25.0000      1      2  151.5500     C        S
In [5]:
# let's separate into training and testing set
X = data.drop(['survived'], axis=1)
y = data.survived

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print("X_train :", X_train.shape)
print("X_test :", X_test.shape)

X_train : (916, 8)
X_test : (393, 8)

In [6]:
# we will use two continuous variables for the transformations

X_train[["age", 'fare']].hist(bins=30)
plt.show()

In [7]:
# set up the discretiser
ewd = EqualWidthDiscretiser(
    bins=5,
    variables=['age', 'fare'],
    # important: return the bins as object, so the encoder
    # treats them as categorical
    return_object=True)

# set up the encoder
oe = OrdinalEncoder(variables=['age', 'fare'])

# pipeline
transformer = Pipeline(steps=[
    ('EqualWidthDiscretiser', ewd),
    ('OrdinalEncoder', oe),
])

transformer.fit(X_train, y_train)

Out[7]:
Pipeline(memory=None,
steps=[('EqualWidthDiscretiser',
EqualWidthDiscretiser(bins=5, return_boundaries=False,
return_object=True,
variables=['age', 'fare'])),
('OrdinalEncoder',
OrdinalEncoder(encoding_method='ordered',
variables=['age', 'fare']))],
verbose=False)
In [8]:
transformer.named_steps['EqualWidthDiscretiser'].binner_dict_

Out[8]:
{'age': [-inf,
14.933359999999999,
29.700019999999995,
44.46667999999999,
59.23333999999999,
inf],
'fare': [-inf, 102.46584, 204.93168, 307.39752, 409.86336, inf]}
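The equal-width boundaries can be reproduced by hand: they are `np.linspace` over the training range of the variable, with the outer edges replaced by infinity. The min/max values below are approximations inferred from the printed boundaries, not read from the dataset:

```python
import numpy as np

# Equal-width edges: linspace over the variable's training range,
# with the outer edges replaced by -inf/inf so unseen values outside
# that range still fall into the first or last bin.
# NOTE: 0.1667 and 74.0 are approximate min/max of 'age', inferred
# from the boundaries above rather than computed from the data here.
age_min, age_max = 0.1667, 74.0
edges = np.linspace(age_min, age_max, num=5 + 1)
edges[0], edges[-1] = -np.inf, np.inf
print(edges)  # interior edges match binner_dict_['age'] closely
```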
In [9]:
transformer.named_steps['OrdinalEncoder'].encoder_dict_

Out[9]:
{'age': {4: 0, 1: 1, 2: 2, 3: 3, 0: 4}, 'fare': {0: 0, 2: 1, 1: 2, 4: 3}}
In [10]:
train_t = transformer.transform(X_train)
test_t = transformer.transform(X_test)

test_t.head()

Out[10]:
     pclass     sex  age  sibsp  parch  fare cabin embarked
1139      3    male    2      0      0     0     n        S
533       2  female    1      0      1     0     n        S
459       2    male    2      1      0     0     n        S
1150      3    male    1      0      0     0     n        S
393       2    male    1      0      0     0     n        S
In [12]:
# let's explore the monotonic relationship
plt.figure(figsize=(7,5))
pd.concat([test_t, y_test], axis=1).groupby("fare")["survived"].mean().plot()
plt.title("Relationship between fare and target")
plt.xlabel("fare")
plt.ylabel("Mean of target")
plt.show()


Note how the bins are monotonically ordered with the target.
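The visual check can be backed by a programmatic one. This sketch uses toy stand-ins for `test_t["fare"]` and `y_test` (the data values here are made up for illustration); with the real notebook variables the same `groupby` applies:

```python
import pandas as pd

# Toy stand-ins for the encoded fare bins and the target.
fare_enc = pd.Series([0, 0, 1, 1, 2, 2, 3, 3], name="fare")
survived = pd.Series([0, 0, 0, 1, 0, 1, 1, 1], name="survived")

# The mean of the target per bin should not decrease as the
# encoded bin number increases, if the encoding is monotonic.
bin_means = (pd.concat([fare_enc, survived], axis=1)
             .groupby("fare")["survived"]
             .mean())
print(bin_means.is_monotonic_increasing)  # True
```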
