The DecisionTreeDiscretiser() divides continuous numerical variables into discrete, finite values, which are the predictions of a decision tree trained on each variable.
The method is inspired by the following article from the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf
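In essence, the discretiser fits a shallow decision tree using a single variable as input; the tree's predictions, one finite value per leaf, become the discrete values. A minimal sketch of the idea with plain scikit-learn and made-up data (illustrative only, not feature_engine's actual code):
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.RandomState(0)
x = rng.uniform(0, 100, 500).reshape(-1, 1)   # one continuous variable
y = 3 * x.ravel() + rng.normal(0, 10, 500)    # a continuous target
# fit a shallow tree on the single variable
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x, y)
# each leaf maps an interval of x to one prediction, so replacing the
# values with the predictions yields at most 2**2 = 4 distinct values
x_discretised = tree.predict(x)
print(np.unique(x_discretised))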
Note
For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:
Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3
http://jse.amstat.org/v19n3/decock.pdf
https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627
The version of the dataset used in this notebook can be obtained from Kaggle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import DecisionTreeDiscretiser
plt.rcParams["figure.figsize"] = [15,5]
data = pd.read_csv('housing.csv')
data.head()
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
# let's separate into training and testing set
X = data.drop(["Id", "SalePrice"], axis=1)
y = data.SalePrice
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0)
print("X_train :", X_train.shape)
print("X_test :", X_test.shape)
X_train : (1022, 79) X_test : (438, 79)
# we will discretise two continuous variables
X_train[["LotArea", 'GrLivArea']].hist(bins=50)
plt.show()
The DecisionTreeDiscretiser() works only with numerical variables. A list of variables to discretise can be passed as an argument; alternatively, the discretiser will automatically select and transform all numerical variables.
With fit, the DecisionTreeDiscretiser() trains one decision tree per variable, using that variable as the only input.
With transform, it replaces the values of each variable with the predictions of its trained tree.
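Conceptually, fit and transform are roughly equivalent to the following for each variable (a sketch of the described behaviour, not feature_engine's actual internals):
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
# fit: cross-validated search for the best shallow tree on a single variable
tree = GridSearchCV(DecisionTreeRegressor(random_state=29),
                    param_grid={'max_depth': [1, 2, 3, 4]},
                    cv=3,
                    scoring='neg_mean_squared_error')
tree.fit(X_train[['GrLivArea']], y_train)
# transform: replace the raw values with the tree's predictions
grlivarea_discretised = tree.predict(X_train[['GrLivArea']])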
'''
Parameters
----------
cv : int, default=3
    Desired number of cross-validation folds to be used to fit the decision
    tree.
scoring : str, default='neg_mean_squared_error'
    Desired metric to optimise the performance of the tree. It comes from
    sklearn.metrics. See the DecisionTreeRegressor or DecisionTreeClassifier
    model evaluation documentation for more options:
    https://scikit-learn.org/stable/modules/model_evaluation.html
variables : list, default=None
    The list of numerical variables that will be transformed. If None, the
    discretiser will automatically select all variables of numerical type.
regression : boolean, default=True
    Indicates whether the discretiser should train a regression or a
    classification decision tree.
param_grid : dictionary, default=None
    The grid of parameters over which the decision tree is optimised during
    the grid search. The param_grid can contain any of the permitted
    parameters of Scikit-learn's DecisionTreeRegressor() or
    DecisionTreeClassifier().
    If None, then param_grid = {'max_depth': [1, 2, 3, 4]}.
random_state : int, default=None
    The random_state to initialise the training of the decision tree. It is
    one of the parameters of Scikit-learn's DecisionTreeRegressor() or
    DecisionTreeClassifier(). For reproducibility it is recommended to set
    random_state to an integer.
'''
treeDisc = DecisionTreeDiscretiser(cv=3,
scoring='neg_mean_squared_error',
variables=['LotArea', 'GrLivArea'],
regression=True,
random_state=29)
# the DecisionTreeDiscretiser needs the target for fitting
treeDisc.fit(X_train, y_train)
DecisionTreeDiscretiser(cv=3, param_grid={'max_depth': [1, 2, 3, 4]}, random_state=29, regression=True, scoring='neg_mean_squared_error', variables=['LotArea', 'GrLivArea'])
# the binner_dict_ contains the best decision tree for each variable
treeDisc.binner_dict_
{'LotArea': GridSearchCV(cv=3, error_score=nan, estimator=DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=29, splitter='best'), iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='neg_mean_squared_error', verbose=0), 'GrLivArea': GridSearchCV(cv=3, error_score=nan, estimator=DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=29, splitter='best'), iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='neg_mean_squared_error', verbose=0)}
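Each entry in binner_dict_ is a fitted GridSearchCV object, so we can use the standard scikit-learn attributes to inspect the winning tree for a variable, for example:
# best_params_ and best_estimator_ are standard GridSearchCV attributes
print(treeDisc.binner_dict_['LotArea'].best_params_)
best_tree = treeDisc.binner_dict_['LotArea'].best_estimator_
print(best_tree.get_depth())  # depth of the selected tree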
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)
# the unique transformed values are the bins, i.e., the tree predictions
train_t['GrLivArea'].unique()
array([246372.77165354, 149540.32663317, 122286.38839286, 88631.59375 , 165174.20895522, 198837.68608414, 312260.5 , 509937.5 ])
# the unique transformed values are the bins, i.e., the tree predictions
train_t['LotArea'].unique()
array([181711.59622642, 145405.30751708, 213802.86363636, 251997.13333333])
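The number of distinct values equals the number of leaves in the selected tree, and the bin boundaries are the tree's internal split thresholds. A quick way to recover them (in scikit-learn's tree representation, leaf nodes carry a threshold of -2):
# recover the LotArea bin edges from the winning tree's internal splits
tree_ = treeDisc.binner_dict_['LotArea'].best_estimator_.tree_
edges = np.sort(tree_.threshold[tree_.threshold != -2])
print(edges)  # 3 split points => 4 bins, matching the 4 unique values above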
# here I put side by side the original variable and the transformed variable
tmp = pd.concat([X_train[["LotArea", 'GrLivArea']],
train_t[["LotArea", 'GrLivArea']]], axis=1)
tmp.columns = ["LotArea", 'GrLivArea', "LotArea_binned", 'GrLivArea_binned']
tmp.head()
 | LotArea | GrLivArea | LotArea_binned | GrLivArea_binned
---|---|---|---|---|
64 | 9375 | 2034 | 181711.596226 | 246372.771654 |
682 | 2887 | 1291 | 145405.307517 | 149540.326633 |
960 | 7207 | 858 | 145405.307517 | 122286.388393 |
1384 | 9060 | 1258 | 181711.596226 | 149540.326633 |
1100 | 8400 | 438 | 145405.307517 | 88631.593750 |
# unlike in equal-frequency discretisation, the tree-derived bins do not
# necessarily contain the same number of observations.
plt.subplot(1,2,1)
tmp.groupby('GrLivArea_binned')['GrLivArea'].count().plot.bar()
plt.ylabel('Number of houses')
plt.title('Number of houses per discrete value')
plt.subplot(1,2,2)
tmp.groupby('LotArea_binned')['LotArea'].count().plot.bar()
plt.ylabel('Number of houses')
plt.title('Number of houses per discrete value')
plt.show()
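Because each discrete value is the tree's average prediction for its bin, the transformed variable should relate monotonically to the target. We can check this on the test set, mirroring the plots used for the titanic data below:
# mean SalePrice per discretised GrLivArea value
pd.concat([test_t, y_test], axis=1).groupby(
    'GrLivArea')['SalePrice'].mean().plot(figsize=(6, 4))
plt.ylabel("Mean of target")
plt.title("Relationship between GrLivArea and target")
plt.show()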
# Load the titanic dataset from OpenML
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)  # '?' marks missing values in this file
    data['cabin'] = data['cabin'].astype(str).str[0]  # keep the deck letter only
    data['pclass'] = data['pclass'].astype('O')  # treat class as categorical
    data['embarked'].fillna('C', inplace=True)
    data['fare'] = data['fare'].astype('float').fillna(0)
    data['age'] = data['age'].astype('float').fillna(0)
    data.drop(['name', 'ticket', 'boat', 'home.dest'], axis=1, inplace=True)
    return data
# load data
data = load_titanic()
data.head()
 | pclass | survived | sex | age | sibsp | parch | fare | cabin | embarked | body
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | female | 29.0000 | 0 | 0 | 211.3375 | B | S | NaN |
1 | 1 | 1 | male | 0.9167 | 1 | 2 | 151.5500 | C | S | NaN |
2 | 1 | 0 | female | 2.0000 | 1 | 2 | 151.5500 | C | S | NaN |
3 | 1 | 0 | male | 30.0000 | 1 | 2 | 151.5500 | C | S | 135 |
4 | 1 | 0 | female | 25.0000 | 1 | 2 | 151.5500 | C | S | NaN |
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(data.drop(['survived'], axis=1),
data['survived'],
test_size=0.3,
random_state=0)
print(X_train.shape)
print(X_test.shape)
(916, 9) (393, 9)
# the discretiser transforms numerical variables; let's check their types
X_train[['fare', 'age']].dtypes
fare float64 age float64 dtype: object
treeDisc = DecisionTreeDiscretiser(cv=3,
scoring='roc_auc',
variables=['fare', 'age'],
regression=False,
param_grid={'max_depth': [1, 2]},
random_state=29,
)
treeDisc.fit(X_train, y_train)
DecisionTreeDiscretiser(cv=3, param_grid={'max_depth': [1, 2]}, random_state=29, regression=False, scoring='roc_auc', variables=['fare', 'age'])
treeDisc.binner_dict_
{'fare': GridSearchCV(cv=3, error_score=nan, estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=29, splitter='best'), iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='roc_auc', verbose=0), 'age': GridSearchCV(cv=3, error_score=nan, estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=29, splitter='best'), iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='roc_auc', verbose=0)}
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)
# the unique transformed values are the bins
# in this case, the tree found that 2 bins are enough for age
train_t['age'].unique()
array([0.41295547, 0.26857143])
# the unique transformed values are the bins
# in this case, the tree found that 4 bins are enough for fare
train_t['fare'].unique()
array([0.42379182, 0.26778243, 0.52307692, 0.74038462])
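For classification, the transformed values appear to be the tree's predicted probabilities of the positive class. We can check this by confirming that each discrete fare value matches the observed survival rate of its bin in the training data (a quick sanity check):
# each discrete fare value should equal the training survival rate in its bin
check = pd.concat([train_t['fare'], y_train], axis=1)
print(check.groupby('fare')['survived'].mean())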
# here I put side by side the original variable and the transformed variable
tmp = pd.concat([X_train[["fare", 'age']], train_t[["fare", 'age']]], axis=1)
tmp.columns = ["fare", 'age', "fare_binned", 'age_binned']
tmp.head()
 | fare | age | fare_binned | age_binned
---|---|---|---|---|
501 | 19.5000 | 13.0 | 0.423792 | 0.412955 |
588 | 23.0000 | 4.0 | 0.423792 | 0.412955 |
402 | 13.8583 | 30.0 | 0.267782 | 0.412955 |
1193 | 7.7250 | 0.0 | 0.267782 | 0.268571 |
686 | 7.7250 | 22.0 | 0.267782 | 0.412955 |
plt.subplot(1,2,1)
tmp.groupby('fare_binned')['fare'].count().plot.bar()
plt.ylabel('Number of passengers')
plt.title('Number of passengers per discrete value')
plt.subplot(1,2,2)
tmp.groupby('age_binned')['age'].count().plot.bar()
plt.ylabel('Number of passengers')
plt.title('Number of passengers per discrete value')
plt.show()
# The DecisionTreeDiscretiser() returns values which show
# a monotonic relationship with target
pd.concat([test_t, y_test], axis=1).groupby(
'age')['survived'].mean().plot(figsize=(6, 4))
plt.ylabel("Mean of target")
plt.title("Relationship between fare and target")
plt.show()
# The DecisionTreeDiscretiser() returns values which show
# a monotonic relationship with target
pd.concat([test_t, y_test], axis=1).groupby(
'fare')['survived'].mean().plot(figsize=(6, 4))
plt.ylabel("Mean of target")
plt.title("Relationship between fare and target")
plt.show()
# Load iris dataset from sklearn
from sklearn.datasets import load_iris
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names).join(
    pd.Series(iris.target, name='type'))
data.head()
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | type
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
data.type.unique() # 3 - class classification
array([0, 1, 2])
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(data.drop('type', axis=1),
data['type'],
test_size=0.3,
random_state=0)
print(X_train.shape)
print(X_test.shape)
(105, 4) (45, 4)
# we select two numerical variables; let's check their types
X_train[['sepal length (cm)', 'sepal width (cm)']].dtypes
sepal length (cm) float64 sepal width (cm) float64 dtype: object
treeDisc = DecisionTreeDiscretiser(cv=3,
scoring='accuracy',
variables=[
'sepal length (cm)', 'sepal width (cm)'],
regression=False,
random_state=29,
)
treeDisc.fit(X_train, y_train)
DecisionTreeDiscretiser(cv=3, param_grid={'max_depth': [1, 2, 3, 4]}, random_state=29, regression=False, scoring='accuracy', variables=['sepal length (cm)', 'sepal width (cm)'])
treeDisc.binner_dict_
{'sepal length (cm)': GridSearchCV(cv=3, error_score=nan, estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=29, splitter='best'), iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='accuracy', verbose=0), 'sepal width (cm)': GridSearchCV(cv=3, error_score=nan, estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=29, splitter='best'), iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='accuracy', verbose=0)}
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)
# here I put side by side the original variable and the transformed variable
tmp = pd.concat([X_train[['sepal length (cm)', 'sepal width (cm)']],
train_t[['sepal length (cm)', 'sepal width (cm)']]], axis=1)
tmp.columns = ['sepal length (cm)', 'sepal width (cm)',
'sepalLen_binned', 'sepalWid_binned']
tmp.head()
 | sepal length (cm) | sepal width (cm) | sepalLen_binned | sepalWid_binned
---|---|---|---|---|
60 | 5.0 | 2.0 | 0.125000 | 1.000000 |
116 | 6.5 | 3.0 | 0.296296 | 0.250000 |
144 | 6.7 | 3.3 | 0.296296 | 0.200000 |
119 | 6.0 | 2.2 | 0.296296 | 0.500000 |
108 | 6.7 | 2.5 | 0.296296 | 0.434783 |
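With three classes, a single number per bin cannot convey the full class distribution. To see what each discrete value corresponds to, we can inspect the per-bin class frequencies in the training set (a diagnostic sketch):
# class distribution within each discretised sepal width bin
check = pd.concat([train_t['sepal width (cm)'], y_train], axis=1)
print(pd.crosstab(check['sepal width (cm)'], check['type'], normalize='index'))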
plt.subplot(1, 2, 1)
tmp.groupby('sepalLen_binned')['sepal length (cm)'].count().plot.bar()
plt.ylabel('Number of observations')
plt.title('Number of observations per discrete value')
plt.subplot(1, 2, 2)
tmp.groupby('sepalWid_binned')['sepal width (cm)'].count().plot.bar()
plt.ylabel('Number of observations')
plt.title('Number of observations per discrete value')
plt.show()
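Finally, since feature_engine transformers follow the scikit-learn API, the discretiser can be placed inside a Pipeline together with an estimator. A sketch (the LogisticRegression here is just an illustrative choice):
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('disc', DecisionTreeDiscretiser(cv=3,
                                     scoring='accuracy',
                                     variables=['sepal length (cm)',
                                                'sepal width (cm)'],
                                     regression=False,
                                     random_state=29)),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))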