DecisionTreeDiscretiser

The DecisionTreeDiscretiser() divides continuous numerical variables into discrete, finite values estimated by a decision tree.

The method is inspired by the following article from the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf
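
In essence, for each variable the discretiser fits a shallow decision tree using only that variable to predict the target, and then replaces the variable's values with the tree's predictions, so each leaf becomes one discrete value. Below is a minimal sketch of the idea with plain scikit-learn (synthetic data, illustrative only; this is not feature-engine's implementation):

In [ ]:
# minimal sketch: a shallow tree fit on a single variable turns it into
# a handful of discrete values, one per leaf (synthetic data, illustrative)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x = rng.uniform(0, 100, size=(500, 1))      # one continuous variable
y = 3 * x.ravel() + rng.normal(0, 10, 500)  # target

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x, y)
print(np.unique(tree.predict(x)))           # at most 2**2 = 4 distinct values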

Note

For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:

Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3

http://jse.amstat.org/v19n3/decock.pdf

https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627

The version of the dataset used in this notebook can be obtained from Kaggle.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from feature_engine.discretisation import DecisionTreeDiscretiser

plt.rcParams["figure.figsize"] = [15,5]

DecisionTreeDiscretiser with Regression

In [2]:
data = pd.read_csv('housing.csv')
data.head()
Out[2]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [3]:
# let's separate the data into training and testing sets
X = data.drop(["Id", "SalePrice"], axis=1)
y = data.SalePrice

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print("X_train :", X_train.shape)
print("X_test :", X_test.shape)
X_train : (1022, 79)
X_test : (438, 79)
In [4]:
# we will discretise two continuous variables

X_train[["LotArea", 'GrLivArea']].hist(bins=50)
plt.show()

The DecisionTreeDiscretiser() works only with numerical variables. A list of variables can be passed as an argument. Alternatively, the discretiser will automatically select and transform all numerical variables.

The DecisionTreeDiscretiser() first trains a decision tree for each variable; this happens when we call fit().

The DecisionTreeDiscretiser() then transforms the variables, that is, it replaces the original values with the predictions of the trained trees; this happens when we call transform().
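
Conceptually, fit() runs a small grid search per variable and transform() calls predict() with the winning tree. A hedged sketch of the equivalent manual steps for a single variable (the actual feature-engine internals may differ):

In [ ]:
# rough manual equivalent for one variable; feature-engine's internals
# may differ in the details
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

search = GridSearchCV(DecisionTreeRegressor(random_state=29),
                      param_grid={'max_depth': [1, 2, 3, 4]},
                      cv=3,
                      scoring='neg_mean_squared_error')

search.fit(X_train[['LotArea']], y_train)               # what fit() does
lot_area_binned = search.predict(X_train[['LotArea']])  # what transform() does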

In [5]:
'''
Parameters
----------

cv : int, default=3
    Desired number of cross-validation folds to be used to fit the decision
    tree.

scoring: str, default='neg_mean_squared_error'
    Desired metric to optimise the performance of the tree. Comes from
    sklearn metrics. See DecisionTreeRegressor or DecisionTreeClassifier
    model evaluation documentation for more options:
    https://scikit-learn.org/stable/modules/model_evaluation.html

variables : list
    The list of numerical variables that will be transformed. If None, the
    discretiser will automatically select all numerical type variables.

regression : boolean, default=True
    Indicates whether the discretiser should train a regression or a classification
    decision tree.

param_grid : dictionary, default=None
    The grid of parameters over which the decision tree should be optimised
    during the grid search. The param_grid can contain any of the permitted
    parameters for Scikit-learn's DecisionTreeRegressor() or
    DecisionTreeClassifier().

    If None, then param_grid = {'max_depth': [1, 2, 3, 4]}

random_state : int, default=None
    The random_state to initialise the training of the decision tree. It is one
    of the parameters of the Scikit-learn's DecisionTreeRegressor() or
    DecisionTreeClassifier(). For reproducibility it is recommended to set
    the random_state to an integer.
'''

treeDisc = DecisionTreeDiscretiser(cv=3,
                                   scoring='neg_mean_squared_error',
                                   variables=['LotArea', 'GrLivArea'],
                                   regression=True,
                                   random_state=29)

# the DecisionTreeDiscretiser needs the target for fitting
treeDisc.fit(X_train, y_train)
Out[5]:
DecisionTreeDiscretiser(cv=3, param_grid={'max_depth': [1, 2, 3, 4]},
                        random_state=29, regression=True,
                        scoring='neg_mean_squared_error',
                        variables=['LotArea', 'GrLivArea'])
In [6]:
# the binner_dict_ contains the best decision tree for each variable
treeDisc.binner_dict_
Out[6]:
{'LotArea': GridSearchCV(cv=3, error_score=nan,
              estimator=DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse',
                                              max_depth=None, max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=29, splitter='best'),
              iid='deprecated', n_jobs=None,
              param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs',
              refit=True, return_train_score=False,
              scoring='neg_mean_squared_error', verbose=0),
 'GrLivArea': GridSearchCV(cv=3, error_score=nan,
              estimator=DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse',
                                              max_depth=None, max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=29, splitter='best'),
              iid='deprecated', n_jobs=None,
              param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs',
              refit=True, return_train_score=False,
              scoring='neg_mean_squared_error', verbose=0)}
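Because each entry in binner_dict_ is a fitted GridSearchCV, the standard scikit-learn attributes can be used to inspect the winning tree, for example:

In [ ]:
# each value in binner_dict_ is a fitted GridSearchCV, so we can query
# the selected hyperparameters and retrieve the best tree itself
print(treeDisc.binner_dict_['LotArea'].best_params_)
best_tree = treeDisc.binner_dict_['LotArea'].best_estimator_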
In [7]:
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)
In [8]:
# the values below correspond to the learned bins, i.e., the tree predictions

train_t['GrLivArea'].unique()
Out[8]:
array([246372.77165354, 149540.32663317, 122286.38839286,  88631.59375   ,
       165174.20895522, 198837.68608414, 312260.5       , 509937.5       ])
In [9]:
# the values below correspond to the learned bins, i.e., the tree predictions

train_t['LotArea'].unique()
Out[9]:
array([181711.59622642, 145405.30751708, 213802.86363636, 251997.13333333])
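The number of distinct values per variable should match the number of leaves in the winning tree; we can verify this with the tree's get_n_leaves() method (a small sanity check, assuming a reasonably recent scikit-learn):

In [ ]:
# sanity check: one discrete value per leaf of the best tree
for var in ['LotArea', 'GrLivArea']:
    n_leaves = treeDisc.binner_dict_[var].best_estimator_.get_n_leaves()
    print(var, '->', train_t[var].nunique(), 'values,', n_leaves, 'leaves')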
In [10]:
# here I put side by side the original variable and the transformed variable

tmp = pd.concat([X_train[["LotArea", 'GrLivArea']],
                 train_t[["LotArea", 'GrLivArea']]], axis=1)

tmp.columns = ["LotArea", 'GrLivArea', "LotArea_binned", 'GrLivArea_binned']

tmp.head()
Out[10]:
LotArea GrLivArea LotArea_binned GrLivArea_binned
64 9375 2034 181711.596226 246372.771654
682 2887 1291 145405.307517 149540.326633
960 7207 858 145405.307517 122286.388393
1384 9060 1258 181711.596226 149540.326633
1100 8400 438 145405.307517 88631.593750
In [11]:
# unlike equal frequency discretisation, tree-based discretisation does not
# necessarily return the same number of observations in each bin

plt.subplot(1,2,1)
tmp.groupby('GrLivArea_binned')['GrLivArea'].count().plot.bar()
plt.ylabel('Number of houses')
plt.title('Number of houses per discrete value')

plt.subplot(1,2,2)
tmp.groupby('LotArea_binned')['LotArea'].count().plot.bar()
plt.ylabel('Number of houses')
plt.title('Number of houses per discrete value')

plt.show()

DecisionTreeDiscretiser with binary classification

In [12]:
# Load titanic dataset from OpenML

def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    data['fare'] = data['fare'].astype('float').fillna(0)
    data['age'] = data['age'].astype('float').fillna(0)
    data.drop(['name', 'ticket', 'boat', 'home.dest'], axis=1, inplace=True)
    return data
In [13]:
# load data
data = load_titanic()
data.head()
Out[13]:
pclass survived sex age sibsp parch fare cabin embarked body
0 1 1 female 29.0000 0 0 211.3375 B S NaN
1 1 1 male 0.9167 1 2 151.5500 C S NaN
2 1 0 female 2.0000 1 2 151.5500 C S NaN
3 1 0 male 30.0000 1 2 151.5500 C S 135
4 1 0 female 25.0000 1 2 151.5500 C S NaN
In [14]:
# let's separate the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(data.drop(['survived'], axis=1),
                                                    data['survived'],
                                                    test_size=0.3, 
                                                    random_state=0)

print(X_train.shape)
print(X_test.shape)
(916, 9)
(393, 9)
In [15]:
# this discretiser transforms only numerical variables
X_train[['fare', 'age']].dtypes
Out[15]:
fare    float64
age     float64
dtype: object
In [16]:
treeDisc = DecisionTreeDiscretiser(cv=3,
                                   scoring='roc_auc',
                                   variables=['fare', 'age'],
                                   regression=False,
                                   param_grid={'max_depth': [1, 2]},
                                   random_state=29,
                                   )

treeDisc.fit(X_train, y_train)
Out[16]:
DecisionTreeDiscretiser(cv=3, param_grid={'max_depth': [1, 2]}, random_state=29,
                        regression=False, scoring='roc_auc',
                        variables=['fare', 'age'])
In [17]:
treeDisc.binner_dict_
Out[17]:
{'fare': GridSearchCV(cv=3, error_score=nan,
              estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                               criterion='gini', max_depth=None,
                                               max_features=None,
                                               max_leaf_nodes=None,
                                               min_impurity_decrease=0.0,
                                               min_impurity_split=None,
                                               min_samples_leaf=1,
                                               min_samples_split=2,
                                               min_weight_fraction_leaf=0.0,
                                               presort='deprecated',
                                               random_state=29,
                                               splitter='best'),
              iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2]},
              pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
              scoring='roc_auc', verbose=0),
 'age': GridSearchCV(cv=3, error_score=nan,
              estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                               criterion='gini', max_depth=None,
                                               max_features=None,
                                               max_leaf_nodes=None,
                                               min_impurity_decrease=0.0,
                                               min_impurity_split=None,
                                               min_samples_leaf=1,
                                               min_samples_split=2,
                                               min_weight_fraction_leaf=0.0,
                                               presort='deprecated',
                                               random_state=29,
                                               splitter='best'),
              iid='deprecated', n_jobs=None, param_grid={'max_depth': [1, 2]},
              pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
              scoring='roc_auc', verbose=0)}
In [18]:
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)
In [19]:
# the values below correspond to the learned bins
# in this case, the tree found that dividing the data into 2 bins is enough
train_t['age'].unique()
Out[19]:
array([0.41295547, 0.26857143])
In [20]:
# the values below correspond to the learned bins
# in this case, the tree found that dividing the data into 4 bins is enough
train_t['fare'].unique()
Out[20]:
array([0.42379182, 0.26778243, 0.52307692, 0.74038462])
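With regression=False, the discrete values are no longer mean target values: they are the trees' predicted probabilities of the positive class. A quick check using the fitted GridSearchCV (this assumes the discretiser outputs the class-1 probability, which the values above suggest):

In [ ]:
# the transformed values should match the predicted probability of
# survival (class 1) returned by the best tree, one value per leaf
proba = treeDisc.binner_dict_['fare'].predict_proba(X_train[['fare']])[:, 1]
print(np.unique(proba))  # compare with train_t['fare'].unique()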
In [21]:
# here I put side by side the original variable and the transformed variable

tmp = pd.concat([X_train[["fare", 'age']], train_t[["fare", 'age']]], axis=1)

tmp.columns = ["fare", 'age', "fare_binned", 'age_binned']

tmp.head()
Out[21]:
fare age fare_binned age_binned
501 19.5000 13.0 0.423792 0.412955
588 23.0000 4.0 0.423792 0.412955
402 13.8583 30.0 0.267782 0.412955
1193 7.7250 0.0 0.267782 0.268571
686 7.7250 22.0 0.267782 0.412955
In [22]:
plt.subplot(1,2,1)
tmp.groupby('fare_binned')['fare'].count().plot.bar()
plt.ylabel('Number of passengers')
plt.title('Number of passengers per discrete value')

plt.subplot(1,2,2)
tmp.groupby('age_binned')['age'].count().plot.bar()
plt.ylabel('Number of passengers')
plt.title('Number of passengers per discrete value')

plt.show()
In [23]:
# The DecisionTreeDiscretiser() returns values which show
# a monotonic relationship with target

pd.concat([test_t, y_test], axis=1).groupby(
    'age')['survived'].mean().plot(figsize=(6, 4))

plt.ylabel("Mean of target")
plt.title("Relationship between fare and target")
plt.show()
In [24]:
# The DecisionTreeDiscretiser() returns values which show
# a monotonic relationship with target

pd.concat([test_t, y_test], axis=1).groupby(
    'fare')['survived'].mean().plot(figsize=(6, 4))

plt.ylabel("Mean of target")
plt.title("Relationship between fare and target")
plt.show()
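
We can also check the monotonic relationship programmatically (a small check using pandas' is_monotonic_increasing property):

In [ ]:
# programmatic check: is the mean of the target monotonic across the
# discretised values of each variable?
for var in ['age', 'fare']:
    means = pd.concat([test_t, y_test], axis=1).groupby(var)['survived'].mean()
    print(var, 'monotonic:', means.is_monotonic_increasing)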

DecisionTreeDiscretiser for Multi-class classification

In [25]:
# Load iris dataset from sklearn
from sklearn.datasets import load_iris

iris = load_iris()
data = pd.DataFrame(iris.data,
                    columns=iris.feature_names).join(
    pd.Series(iris.target, name='type'))

data.head()
Out[25]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) type
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
In [26]:
data.type.unique() # 3 - class classification
Out[26]:
array([0, 1, 2])
In [27]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data.drop('type', axis=1),
                                                    data['type'],
                                                    test_size=0.3,
                                                    random_state=0)

print(X_train.shape)
print(X_test.shape)
(105, 4)
(45, 4)
In [28]:
# we will discretise two numerical variables
X_train[['sepal length (cm)', 'sepal width (cm)']].dtypes
Out[28]:
sepal length (cm)    float64
sepal width (cm)     float64
dtype: object
In [29]:
treeDisc = DecisionTreeDiscretiser(cv=3,
                                   scoring='accuracy',
                                   variables=[
                                       'sepal length (cm)', 'sepal width (cm)'],
                                   regression=False,
                                   random_state=29,
                                   )

treeDisc.fit(X_train, y_train)
Out[29]:
DecisionTreeDiscretiser(cv=3, param_grid={'max_depth': [1, 2, 3, 4]},
                        random_state=29, regression=False, scoring='accuracy',
                        variables=['sepal length (cm)', 'sepal width (cm)'])
In [30]:
treeDisc.binner_dict_
Out[30]:
{'sepal length (cm)': GridSearchCV(cv=3, error_score=nan,
              estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                               criterion='gini', max_depth=None,
                                               max_features=None,
                                               max_leaf_nodes=None,
                                               min_impurity_decrease=0.0,
                                               min_impurity_split=None,
                                               min_samples_leaf=1,
                                               min_samples_split=2,
                                               min_weight_fraction_leaf=0.0,
                                               presort='deprecated',
                                               random_state=29,
                                               splitter='best'),
              iid='deprecated', n_jobs=None,
              param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs',
              refit=True, return_train_score=False, scoring='accuracy',
              verbose=0),
 'sepal width (cm)': GridSearchCV(cv=3, error_score=nan,
              estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                               criterion='gini', max_depth=None,
                                               max_features=None,
                                               max_leaf_nodes=None,
                                               min_impurity_decrease=0.0,
                                               min_impurity_split=None,
                                               min_samples_leaf=1,
                                               min_samples_split=2,
                                               min_weight_fraction_leaf=0.0,
                                               presort='deprecated',
                                               random_state=29,
                                               splitter='best'),
              iid='deprecated', n_jobs=None,
              param_grid={'max_depth': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs',
              refit=True, return_train_score=False, scoring='accuracy',
              verbose=0)}
In [31]:
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)
In [32]:
# here I put side by side the original variable and the transformed variable
tmp = pd.concat([X_train[['sepal length (cm)', 'sepal width (cm)']],
                 train_t[['sepal length (cm)', 'sepal width (cm)']]], axis=1)

tmp.columns = ['sepal length (cm)', 'sepal width (cm)',
               'sepalLen_binned', 'sepalWid_binned']

tmp.head()
Out[32]:
sepal length (cm) sepal width (cm) sepalLen_binned sepalWid_binned
60 5.0 2.0 0.125000 1.000000
116 6.5 3.0 0.296296 0.250000
144 6.7 3.3 0.296296 0.200000
119 6.0 2.2 0.296296 0.500000
108 6.7 2.5 0.296296 0.434783
In [33]:
plt.subplot(1, 2, 1)
tmp.groupby('sepalLen_binned')['sepal length (cm)'].count().plot.bar()
plt.ylabel('Number of observations')
plt.title('Number of observations per discrete value')

plt.subplot(1, 2, 2)
tmp.groupby('sepalWid_binned')['sepal width (cm)'].count().plot.bar()
plt.ylabel('Number of observations')
plt.title('Number of observations per discrete value')

plt.show()
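
If the actual interval limits are needed rather than the tree predictions, they can be recovered from the winning tree's split thresholds. A sketch that relies on scikit-learn's tree internals (split nodes have feature >= 0, leaves are marked with a negative value; this detail is an assumption about the internals, not part of feature-engine's API):

In [ ]:
# sketch: recover the cut points learned for a variable from the best
# tree's internal structure; split (non-leaf) nodes have feature >= 0
best_tree = treeDisc.binner_dict_['sepal length (cm)'].best_estimator_
is_split = best_tree.tree_.feature >= 0
print(sorted(best_tree.tree_.threshold[is_split]))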