In [189]:

```
from sklearn.datasets import load_boston, load_iris
from sklearn.externals.six import StringIO
import pydot
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
from sklearn.cross_validation import train_test_split, cross_val_score
%matplotlib inline
matplotlib.style.use('ggplot')
```

In this notebook I'm reviewing heuristics for ranking features in decision trees and random forests, and their implementation in sklearn. In the text I sometimes use variable as a synonym of feature. Feature selection is a step embedded in the process of growing decision trees. An attractive side effect is that once model is built, we can retrive the relative importance of each feature in the decision making procees. This not only increases the general interpretability of the model, but can help both in exploratory data analysis as well as in feature engineering piepelines. A nice example of how tree based models can be used to improve feature engineering can be found in the winner recap notes of the Criteo Kaggle competition (http://machine-learning-notes.blogspot.nl/2014/12/kaggle-criteo-winner-method-recap.html).

Tree based models, and ensambles of trees, are very powerful and well understood methods for supervised learning. Ensamles such as Random Forests are roboust and stable methods for both classification and regression, while decision trees allow for conceptually simple and easilly interpretable models. For an overview of tree models in classification and regression in sklean, there is an excellent talk from Gilles Louppe at PyData Paris 2015 (http://www.slideshare.net/glouppe/slides-46767187). A well regarded paper from the same author that provides a thorough analysis of the mathematics of feature selection in tree ensamles is "Understanding variable importances in forests of randomized trees" (Louppe et al. 2014). With this notebook I'm attempting to fill some gaps and bridge literature review to implementation (sklearn).

In [190]:

```
boston = load_boston()
iris = load_iris()
```

*M* regions, a *regression tree* would predict a response $Y$ given inputs $X = \{x_1, x_2, .. x_n\}$ as $$f(x_i) = \sum\limits_{m=1}^M c_mI(x_i\in R_m)$$
where $I$ is the indicator function and $c_m$ a constant that models $Y$ in region $R_m$. This model can be represented by a tree that has $R_1..R_m$ as its terminal nodes. In both the regression and classification cases, the algorithm decides the splitting variables and split points, and tree topology. Essentially, at each split point, the algorithm performs a feature selection step using an heuristic to estimate information gain; this is represented by a numerical value known as the *Gini importance* of a feature (not to be confused with the Gini index!).
We can exploit this very same process to rank features (features engineering) and to explain their imporance with regards to the models we want to learn from data.

*mean decreased impurity* is defined as the total decrease in node impurity $i(s, n)$, at split $s$ for some impurity function $i(n)$ . That is
$$ \Delta i(s, n) = i(n) - p_L i(t_L) - p_R i(t_R) $$
Where $p_L$ and $p_R$ are the proportion of $N_L$ and $N_R$ samples in the left and right splits, over the total number of samples $N_n$ for node $n$.

Under the hood, sklearn calculates MDI as follows:

```
cdef class Tree:
[...]
cpdef compute_feature_importances(self, normalize=True):
[...]
while node != end_node:
if node.left_child != _TREE_LEAF:
# ... and node.right_child != _TREE_LEAF:
left = &nodes[node.left_child]
right = &nodes[node.right_child]
importance_data[node.feature] += (
node.weighted_n_node_samples * node.impurity -
left.weighted_n_node_samples * left.impurity -
right.weighted_n_node_samples * right.impurity)
node += 1
importances /= nodes[0].weighted_n_node_samples
[...]
```

*node.feature* objects) and the associated importance score is kept in *importance_data*. The node impurity is weighted by the probability of reaching that node, approximated by the proportion of samples (*weighted_n_node_samples*) reaching that node.

```
node.weighted_n_node_samples * node.impurity -
left.weighted_n_node_samples * left.impurity -
right.weighted_n_node_samples * right.impurity
```

*impurity* criteria are defined by implementations of the *Criterion* interface. For classification trees, impurity criteria are *Entropy* - cross entropy - and *Gini*, the Gini index. For regression trees the impurity criteria is *MSE* (mean squared error).

*sklearn.tree.DecisionTreeRegressor* uses mean squared error (MSE) to determine the quality of a split; that is, an estimate $\hat{c}_m$ for $c_m$ is calcuated so to minimize the sum of squares $\sum (y_i - \hat{f}(x_i))^2$ with $y_i$ being the target variable and $\hat{f}(x_i)$ its predicted value. The best (*proof!*) estimator is obtained by taking $\hat{f}({x_i}) = avg(y_i | x_i \in R_m)$. By bias variance decompostion of the squared error , we have $Var[\hat{f}(x_i)] = E[(\hat{f}(x_i) - E[\hat{f}(x_i)])^2]$. So, for a split node, the mean squared error can then be calculated as the sum of variance of the left and right split: $MSE = Var_{left} + Var_{right}$

In [92]:

```
# To keep the diagram tractable, restrict the tree to at most 5 leaf nodes
reg = DecisionTreeRegressor(max_leaf_nodes=5)
reg.fit(boston.data, boston.target)
dot_data = StringIO()
export_graphviz(reg, out_file="/tmp/tree.dot",
feature_names=boston.feature_names)
# pydot.graph_from_dot_data is broken in my env
!dot -Tpng /tmp/tree.dot -o /tmp/tree.png
from IPython.display import Image
Image(filename='/tmp/tree.png')
```

Out[92]:

*feature_importances_* property.

In [179]:

```
reg = DecisionTreeRegressor(max_leaf_nodes=5)
reg.fit(boston.data, boston.target)
plt = pd.DataFrame(zip(boston.feature_names, reg.feature_importances_),
columns=['feature', 'Gini importance']).plot(kind='bar', x='feature')
_ = plt.set_title('Regression tree on the Boston dataset (5 leaf nodes)')
```

Things become a bit more interesting when we grow larger trees.

In [180]:

```
reg = DecisionTreeRegressor()
reg.fit(boston.data, boston.target)
plt = pd.DataFrame(zip(boston.feature_names, reg.feature_importances_),
columns=['feature', 'Gini importance']).plot(kind='bar', x='feature')
_ = plt.set_title('Regression tree on the Boston dataset')
```

Assume a classification task where the target $k$ classes take values in $0,1,...,K−1$. If node $m$ represents a region $R_m$ with $N_m$ observations, then let $p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$ be the proportion of class $k$ observations in node $m$. $sklearn.DecisionTreeClassifier$ allows two impurity criteria for determining splits in such a setting. On the Iris dataset the two criteria agree on which features are important. Experimental results suggest that for tree induction purposes both impurity measures generally lead to similar results (Tan et. al, Introduction to Data Mining). This is not entirely surprising given that both mesaures are particular cases of Tsallis entropy $H_\beta = \frac{1}{\beta-1}(1 - \sum_{k=0}^{K-1} p^{\beta}_{mk})$. For Gini index $\beta=2$, while cross entroy is recovered by taking $\beta \rightarrow 1$.

Cross *entropy* might give higher scores to balanced partitions when there are many classes though. See "Technical Note: Some Properties of Splitting Criteria" (Breiman , 1996). On the other hand working in log scale *might* introduce computational overhead.

In [187]:

```
cls = DecisionTreeClassifier(criterion='gini', splitter='best')
est = cls.fit(iris.data, iris.target)
zip(iris.feature_names, cls.feature_importances_)
plt = pd.DataFrame(zip(iris.feature_names, cls.feature_importances_),
columns=['feature', 'Gini importance']).plot(kind='bar', x='feature')
_ = plt.set_title('Classification tree on the Iris dataset (criterion=gini)')
```

In [185]:

```
cls = DecisionTreeClassifier(criterion='entropy', splitter='best')
est = cls.fit(iris.data, iris.target)
zip(iris.feature_names, cls.feature_importances_)
plt = pd.DataFrame(zip(iris.feature_names, cls.feature_importances_),
columns=['feature', 'Gini importance']).plot(kind='bar', x='feature')
_ = plt.set_title('Classification tree on the Iris dataset (criterion=entropy)')
```

*random_state*, that controls the randomization of sampling at each split. If we set this parameter to a constant, we are guaranteed to produce the same splits for each run of the algorithm.

*sklearn.ensemble.RandomForestRegressor* and *sklearn.ensemble.RandomForestClassifier* are interaces for building ensembles of DecisionTreeRegressor and DecisionTreeClassifier respectively. The criteria for determining node impurity are the same as we've seen before: *mse* for regression trees and *gini* or *entropy* for classification trees. The logic for growing and combining the ensemble output is implemented in *sklearn.ensemble.ForestRegressor* and *sklearn.ensemble.ForstClassifier*

Random Forest is known to be a robust classifier that does not overfit; the authors claim no need for cross-validation or even using a test set for model selection (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#features). An estimate of the generalization error can be obtains as part of the algorithm run, by leaving out a portion of the bootstrap dataset used for fitting each tree and use it to build a test dataset across the $N$ trees with all data points left out by all trees. That is, but building an out-of-bag test set on the go. This behaviour is controlled by the boolean *oob_score* parameter. On a fitted model, the *oob_score* argument will then report the error estimate for the ensemble.

In sklearn the out-of-bag error is calculated in *_set_oob_score* in ForestRegressor and ForestClassifier. For regression trees, the oob score is expressed as the sum of the $R^2$ of each predictor on the oob dataset, averaged over the number of predictors. For classification trees, it is the vote given by each predictors *oob_decision_function* that for a forest of $k$ trees is a list of

```
decision = (predictions[k] /
predictions[k].sum(axis=1)[:, np.newaxis])
```

*predictions[k]* is the sum of class probabilities predicted by the *predict_proba(X)* method, for some test input $X$, that for a singe tree $k$ is the fraction of samples of the same class in a leaf.

*feature_importances_* property. This is computed as

```
sum(all_importances) / len(self.estimators_)
```

In the toy examples above, I've dealt only with numerical features. Fitting models with categorical variables might be a bit confusing if you come from evirnoment like R, where the most common implementations are either able to handle categorical variables natively, or take care of encoding them. In sklearn, a bit more work is necessary to encode categorical variables using eg. Hashing Trick (sklearn.feature_extraction.FeatureHasher) and One-Hot-Encoding (). Alternatively, Pandas provides the convenient DataFrame.get_dummies() method.

Suprisingly, to me at least, when growing trees on large datasets sklearn behaves well even with integer enconding of categorical features. See https://stackoverflow.com/questions/15821751/how-to-use-dummy-variable-to-represent-categorical-data-in-python-scikit-learn-r and https://github.com/szilard/benchm-ml/issues/1