The DecisionTreeDiscretiser() divides continuous numerical variables into discrete, finite values, which are the predictions of a decision tree trained on each variable.
The method is inspired by the following article from the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf
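In essence, the discretiser fits a shallow decision tree using a single variable as input; the tree's predictions, one finite value per leaf, become the discrete values. A minimal sketch of the idea with plain scikit-learn and made-up data (illustrative only, not feature_engine's actual code):
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.RandomState(0)
x = rng.uniform(0, 100, 500).reshape(-1, 1)   # one continuous variable
y = 3 * x.ravel() + rng.normal(0, 10, 500)    # a continuous target
# fit a shallow tree on the single variable
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x, y)
# each leaf maps an interval of x to one prediction, so replacing the
# values with the predictions yields at most 2**2 = 4 distinct values
x_discretised = tree.predict(x)
print(np.unique(x_discretised))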
Note
For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:
Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3
http://jse.amstat.org/v19n3/decock.pdf
https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627
The version of the dataset used in this notebook can be obtained from Kaggle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import DecisionTreeDiscretiser
plt.rcParams["figure.figsize"] = [15,5]
data = pd.read_csv('housing.csv')
data.head()
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
# let's separate into training and testing set
X = data.drop(["Id", "SalePrice"], axis=1)
y = data.SalePrice
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0)
print("X_train :", X_train.shape)
print("X_test :", X_test.shape)
X_train : (1022, 79) X_test : (438, 79)
# we will discretise two continuous variables
X_train[["LotArea", 'GrLivArea']].hist(bins=50)
plt.show()
The DecisionTreeDiscretiser() works only with numerical variables. A list of variables to discretise can be passed as an argument; alternatively, the discretiser will automatically select and transform all numerical variables.
With fit, the DecisionTreeDiscretiser() trains one decision tree per variable, using that variable as the only input.
With transform, it replaces the values of each variable with the predictions of its trained tree.
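Conceptually, fit and transform are roughly equivalent to the following for each variable (a sketch of the described behaviour, not feature_engine's actual internals):
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
# fit: cross-validated search for the best shallow tree on a single variable
tree = GridSearchCV(DecisionTreeRegressor(random_state=29),
                    param_grid={'max_depth': [1, 2, 3, 4]},
                    cv=3,
                    scoring='neg_mean_squared_error')
tree.fit(X_train[['GrLivArea']], y_train)
# transform: replace the raw values with the tree's predictions
grlivarea_discretised = tree.predict(X_train[['GrLivArea']])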
'''
Parameters
----------
cv : int, default=3
    Desired number of cross-validation folds to be used to fit the decision
    tree.
scoring : str, default='neg_mean_squared_error'
    Desired metric to optimise the performance of the tree. It comes from
    sklearn.metrics. See the DecisionTreeRegressor or DecisionTreeClassifier
    model evaluation documentation for more options:
    https://scikit-learn.org/stable/modules/model_evaluation.html
variables : list, default=None
    The list of numerical variables that will be transformed. If None, the
    discretiser will automatically select all variables of numerical type.
regression : boolean, default=True
    Indicates whether the discretiser should train a regression or a
    classification decision tree.
param_grid : dictionary, default=None
    The grid of parameters over which the decision tree is optimised during
    the grid search. The param_grid can contain any of the permitted
    parameters of Scikit-learn's DecisionTreeRegressor() or
    DecisionTreeClassifier().
    If None, then param_grid = {'max_depth': [1, 2, 3, 4]}.
random_state : int, default=None
    The random_state to initialise the training of the decision tree. It is
    one of the parameters of Scikit-learn's DecisionTreeRegressor() or
    DecisionTreeClassifier(). For reproducibility it is recommended to set
    random_state to an integer.
'''
treeDisc = DecisionTreeDiscretiser(cv=3,
scoring='neg_mean_squared_error',
variables=['LotArea', 'GrLivArea'],
regression=True,
random_state=29)
# the DecisionTreeDiscretiser needs the target for fitting
treeDisc.fit(X_train, y_train)
DecisionTreeDiscretiser(cv=3, param_grid={'max_depth': [1, 2, 3, 4]}, random_state=29, regression=True, scoring='neg_mean_squared_error', variables=['LotArea', 'GrLivArea'])
# the binner_dict_ contains the best decision tree for each variable
treeDisc.binner_dict_
{'LotArea': GridSearchCV(cv=3, error_score=nan, estimator=DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=29, splitter='best'), iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='neg_mean_squared_error', verbose=0), 'GrLivArea': GridSearchCV(cv=3, error_score=nan, estimator=DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=29, splitter='best'), iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='neg_mean_squared_error', verbose=0)}
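Each entry in binner_dict_ is a fitted GridSearchCV object, so we can use the standard scikit-learn attributes to inspect the winning tree for a variable, for example:
# best_params_ and best_estimator_ are standard GridSearchCV attributes
print(treeDisc.binner_dict_['LotArea'].best_params_)
best_tree = treeDisc.binner_dict_['LotArea'].best_estimator_
print(best_tree.get_depth())  # depth of the selected tree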
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)
# the unique transformed values are the bins, i.e., the tree predictions
train_t['GrLivArea'].unique()
array([246372.77165354, 149540.32663317, 122286.38839286, 88631.59375 , 165174.20895522, 198837.68608414, 312260.5 , 509937.5 ])
# the unique transformed values are the bins, i.e., the tree predictions
train_t['LotArea'].unique()
array([181711.59622642, 145405.30751708, 213802.86363636, 251997.13333333])
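The number of distinct values equals the number of leaves in the selected tree, and the bin boundaries are the tree's internal split thresholds. A quick way to recover them (in scikit-learn's tree representation, leaf nodes carry a threshold of -2):
# recover the LotArea bin edges from the winning tree's internal splits
tree_ = treeDisc.binner_dict_['LotArea'].best_estimator_.tree_
edges = np.sort(tree_.threshold[tree_.threshold != -2])
print(edges)  # 3 split points => 4 bins, matching the 4 unique values above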
# here I put side by side the original variable and the transformed variable
tmp = pd.concat([X_train[["LotArea", 'GrLivArea']],
train_t[["LotArea", 'GrLivArea']]], axis=1)
tmp.columns = ["LotArea", 'GrLivArea', "LotArea_binned", 'GrLivArea_binned']
tmp.head()
 | LotArea | GrLivArea | LotArea_binned | GrLivArea_binned
---|---|---|---|---|
64 | 9375 | 2034 | 181711.596226 | 246372.771654 |
682 | 2887 | 1291 | 145405.307517 | 149540.326633 |
960 | 7207 | 858 | 145405.307517 | 122286.388393 |
1384 | 9060 | 1258 | 181711.596226 | 149540.326633 |
1100 | 8400 | 438 | 145405.307517 | 88631.593750 |
# unlike in equal-frequency discretisation, the tree-derived bins do not
# necessarily contain the same number of observations.
plt.subplot(1,2,1)
tmp.groupby('GrLivArea_binned')['GrLivArea'].count().plot.bar()
plt.ylabel('Number of houses')
plt.title('Number of houses per discrete value')
plt.subplot(1,2,2)
tmp.groupby('LotArea_binned')['LotArea'].count().plot.bar()
plt.ylabel('Number of houses')
plt.title('Number of houses per discrete value')
plt.show()
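Because each discrete value is the tree's average prediction for its bin, the transformed variable should relate monotonically to the target. We can check this on the test set, mirroring the plots used for the titanic data below:
# mean SalePrice per discretised GrLivArea value
pd.concat([test_t, y_test], axis=1).groupby(
    'GrLivArea')['SalePrice'].mean().plot(figsize=(6, 4))
plt.ylabel("Mean of target")
plt.title("Relationship between GrLivArea and target")
plt.show()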
# Load the titanic dataset from OpenML
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)  # '?' marks missing values in this file
    data['cabin'] = data['cabin'].astype(str).str[0]  # keep the deck letter only
    data['pclass'] = data['pclass'].astype('O')  # treat class as categorical
    data['embarked'].fillna('C', inplace=True)
    data['fare'] = data['fare'].astype('float').fillna(0)
    data['age'] = data['age'].astype('float').fillna(0)
    data.drop(['name', 'ticket', 'boat', 'home.dest'], axis=1, inplace=True)
    return data
# load data
data = load_titanic()
data.head()
 | pclass | survived | sex | age | sibsp | parch | fare | cabin | embarked | body
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | female | 29.0000 | 0 | 0 | 211.3375 | B | S | NaN |
1 | 1 | 1 | male | 0.9167 | 1 | 2 | 151.5500 | C | S | NaN |
2 | 1 | 0 | female | 2.0000 | 1 | 2 | 151.5500 | C | S | NaN |
3 | 1 | 0 | male | 30.0000 | 1 | 2 | 151.5500 | C | S | 135 |
4 | 1 | 0 | female | 25.0000 | 1 | 2 | 151.5500 | C | S | NaN |
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(data.drop(['survived'], axis=1),
data['survived'],
test_size=0.3,
random_state=0)
print(X_train.shape)
print(X_test.shape)
(916, 9) (393, 9)
# the discretiser transforms numerical variables; let's check their types
X_train[['fare', 'age']].dtypes
fare float64 age float64 dtype: object
treeDisc = DecisionTreeDiscretiser(cv=3,
scoring='roc_auc',
variables=['fare', 'age'],
regression=False,
param_grid={'max_depth': [1, 2]},
random_state=29,
)
treeDisc.fit(X_train, y_train)
DecisionTreeDiscretiser(cv=3, param_grid={'max_depth': [1, 2]}, random_state=29, regression=False, scoring='roc_auc', variables=['fare', 'age'])
treeDisc.binner_dict_
{'fare': GridSearchCV(cv=3, error_score=nan, estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=29, splitter='best'), iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='roc_auc', verbose=0), 'age': GridSearchCV(cv=3, error_score=nan, estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=29, splitter='best'), iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='roc_auc', verbose=0)}
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)
# the unique transformed values are the bins
# in this case, the tree found that 2 bins are enough for age
train_t['age'].unique()
array([0.41295547, 0.26857143])
# the unique transformed values are the bins
# in this case, the tree found that 4 bins are enough for fare
train_t['fare'].unique()
array([0.42379182, 0.26778243, 0.52307692, 0.74038462])
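For classification, the transformed values appear to be the tree's predicted probabilities of the positive class. We can check this by confirming that each discrete fare value matches the observed survival rate of its bin in the training data (a quick sanity check):
# each discrete fare value should equal the training survival rate in its bin
check = pd.concat([train_t['fare'], y_train], axis=1)
print(check.groupby('fare')['survived'].mean())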
# here I put side by side the original variable and the transformed variable
tmp = pd.concat([X_train[["fare", 'age']], train_t[["fare", 'age']]], axis=1)
tmp.columns = ["fare", 'age', "fare_binned", 'age_binned']
tmp.head()
 | fare | age | fare_binned | age_binned
---|---|---|---|---|
501 | 19.5000 | 13.0 | 0.423792 | 0.412955 |
588 | 23.0000 | 4.0 | 0.423792 | 0.412955 |
402 | 13.8583 | 30.0 | 0.267782 | 0.412955 |
1193 | 7.7250 | 0.0 | 0.267782 | 0.268571 |
686 | 7.7250 | 22.0 | 0.267782 | 0.412955 |
plt.subplot(1,2,1)
tmp.groupby('fare_binned')['fare'].count().plot.bar()
plt.ylabel('Number of passengers')
plt.title('Number of passengers per discrete value')
plt.subplot(1,2,2)
tmp.groupby('age_binned')['age'].count().plot.bar()
plt.ylabel('Number of passengers')
plt.title('Number of passengers per discrete value')
plt.show()
# The DecisionTreeDiscretiser() returns values which show
# a monotonic relationship with target
pd.concat([test_t, y_test], axis=1).groupby(
'age')['survived'].mean().plot(figsize=(6, 4))
plt.ylabel("Mean of target")
plt.title("Relationship between fare and target")
plt.show()
# The DecisionTreeDiscretiser() returns values which show
# a monotonic relationship with target
pd.concat([test_t, y_test], axis=1).groupby(
'fare')['survived'].mean().plot(figsize=(6, 4))
plt.ylabel("Mean of target")
plt.title("Relationship between fare and target")
plt.show()
# Load iris dataset from sklearn
from sklearn.datasets import load_iris
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names).join(
    pd.Series(iris.target, name='type'))
data.head()
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | type
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
data.type.unique() # 3 - class classification
array([0, 1, 2])
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(data.drop('type', axis=1),
data['type'],
test_size=0.3,
random_state=0)
print(X_train.shape)
print(X_test.shape)
(105, 4) (45, 4)
# we select two numerical variables; let's check their types
X_train[['sepal length (cm)', 'sepal width (cm)']].dtypes
sepal length (cm) float64 sepal width (cm) float64 dtype: object
treeDisc = DecisionTreeDiscretiser(cv=3,
scoring='accuracy',
variables=[
'sepal length (cm)', 'sepal width (cm)'],
regression=False,
random_state=29,
)
treeDisc.fit(X_train, y_train)
DecisionTreeDiscretiser(cv=3, param_grid={'max_depth': [1, 2, 3, 4]}, random_state=29, regression=False, scoring='accuracy', variables=['sepal length (cm)', 'sepal width (cm)'])
treeDisc.binner_dict_
{'sepal length (cm)': GridSearchCV(cv=3, error_score=nan, estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=29, splitter='best'), iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='accuracy', verbose=0), 'sepal width (cm)': GridSearchCV(cv=3, error_score=nan, estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=29, splitter='best'), iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='accuracy', verbose=0)}
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)
# here I put side by side the original variable and the transformed variable
tmp = pd.concat([X_train[['sepal length (cm)', 'sepal width (cm)']],
train_t[['sepal length (cm)', 'sepal width (cm)']]], axis=1)
tmp.columns = ['sepal length (cm)', 'sepal width (cm)',
'sepalLen_binned', 'sepalWid_binned']
tmp.head()
 | sepal length (cm) | sepal width (cm) | sepalLen_binned | sepalWid_binned
---|---|---|---|---|
60 | 5.0 | 2.0 | 0.125000 | 1.000000 |
116 | 6.5 | 3.0 | 0.296296 | 0.250000 |
144 | 6.7 | 3.3 | 0.296296 | 0.200000 |
119 | 6.0 | 2.2 | 0.296296 | 0.500000 |
108 | 6.7 | 2.5 | 0.296296 | 0.434783 |
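With three classes, a single number per bin cannot convey the full class distribution. To see what each discrete value corresponds to, we can inspect the per-bin class frequencies in the training set (a diagnostic sketch):
# class distribution within each discretised sepal width bin
check = pd.concat([train_t['sepal width (cm)'], y_train], axis=1)
print(pd.crosstab(check['sepal width (cm)'], check['type'], normalize='index'))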
plt.subplot(1, 2, 1)
tmp.groupby('sepalLen_binned')['sepal length (cm)'].count().plot.bar()
plt.ylabel('Number of observations')
plt.title('Number of observations per discrete value')
plt.subplot(1, 2, 2)
tmp.groupby('sepalWid_binned')['sepal width (cm)'].count().plot.bar()
plt.ylabel('Number of observations')
plt.title('Number of observations per discrete value')
plt.show()
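Finally, since feature_engine transformers follow the scikit-learn API, the discretiser can be placed inside a Pipeline together with an estimator. A sketch (the LogisticRegression here is just an illustrative choice):
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('disc', DecisionTreeDiscretiser(cv=3,
                                     scoring='accuracy',
                                     variables=['sepal length (cm)',
                                                'sepal width (cm)'],
                                     regression=False,
                                     random_state=29)),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))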