In [1]:
# %load /Users/facai/Study/book_notes/preconfig.py
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
sns.set(font='SimHei')
plt.rcParams['axes.grid'] = False

from IPython.display import SVG

def show_image(filename, figsize=None):
    if figsize:
        plt.figure(figsize=figsize)
    # display the image file inline (body restored; indentation was lost in export)
    plt.imshow(plt.imread(filename))



# An Introduction to TreeBoost: Theory and Implementation (sklearn)

### 0. Preface

sklearn code version:

~/W/s/sklearn ❯❯❯ git log -n 1
commit d161bfaa1a42da75f4940464f7f1c524ef53484f
Author: John B Nelson <[email protected]>
Date:   Thu May 26 18:36:37 2016 -0400



The GBDT module lives in the sklearn/ensemble/gradient_boosting.py file; the class relationships are fairly simple.

In [2]:
SVG("./res/Main.svg")

Out[2]:

### 1. GBDT Optimization

1. Take the partial derivative of the loss function;
2. train a decision tree to fit this (negative) gradient;
3. search for the optimal model weight;
4. add the trained model to the additive ensemble.

• $F_0(x) = \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L(y_i, \rho)$

• For $m=1$ to $M$ do:

1. $\tilde{y}_i = - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)}, \quad i = 1, 2, \dotsc, N$

2. $\mathbf{a}_m = \operatorname{arg \, min}_{\mathbf{a}, \beta} \displaystyle \sum_{i=1}^N \left [ \tilde{y}_i - \beta h(x_i; \mathbf{a}) \right ]^2$

3. $\rho_m = \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \right)$
4. $F_m(x) = F_{m-1}(x) + l_r \rho_m h(x; \mathbf{a}_m)$
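The loop above can be sketched in a few lines. Below is a minimal sketch (not sklearn's implementation, function names made up for illustration), using the squared loss $L = \frac{1}{2}(y - F)^2$, for which step 1 yields plain residuals and the line search in step 3 gives $\rho_m = 1$:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_stages=20, learning_rate=0.1, max_depth=3):
    # F_0(x) = arg min_rho sum_i L(y_i, rho) -> the mean for squared loss
    F0 = y.mean()
    F = np.full(len(y), F0)
    trees = []
    for _ in range(n_stages):
        residual = y - F                                   # step 1: negative gradient
        tree = DecisionTreeRegressor(max_depth=max_depth)  # step 2: fit h(x; a_m)
        tree.fit(X, residual)
        rho = 1.0                                          # step 3: closed form for squared loss
        F = F + learning_rate * rho * tree.predict(X)      # step 4: additive update
        trees.append(tree)
    return F0, trees

def gbdt_predict(F0, trees, X, learning_rate=0.1):
    return F0 + learning_rate * sum(t.predict(X) for t in trees)

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 2 * X[:, 0] + np.sin(6 * X[:, 1])
F0, trees = gbdt_fit(X, y)
```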

748     def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
749                    random_state, X_idx_sorted, X_csc=None, X_csr=None):
750         """Fit another stage of n_classes_ trees to the boosting model. """
751 #+--  8 lines: assert sample_mask.dtype == np.bool----------
759
760             residual = loss.negative_gradient(y, y_pred, k=k,
761                                               sample_weight=sample_weight)
762
763             # induce regression tree on residuals
764             tree = DecisionTreeRegressor(
765                 criterion='friedman_mse',
766                 splitter='best',
767 #+---  7 lines: max_depth=self.max_depth,------------------
774                 presort=self.presort)
775
776 #+---  8 lines: if self.subsample < 1.0:------------------
784                 tree.fit(X, residual, sample_weight=sample_weight,
785                          check_input=False, X_idx_sorted=X_idx_sorted)
786
787             # update tree leaves
788 #+--  5 lines: if X_csr is not None:---------------------
793                 loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
795                                              self.learning_rate, k=k)
796
797             # add tree to ensemble
798             self.estimators_[i, k] = tree
799
800         return y_pred


• Line 760 computes the negative gradient.
• Line 784 fits the regression tree. Note that for MSE, $\beta = 1$. At line 765 the split criterion is friedman_mse, a variant of MSE; I suspect $\beta$ stays consistent there too, something to dig into later.
• Line 793 is the TreeBoost improvement: the global search for $\rho$ in step 3 is moved inside the tree. This is the main topic of what follows.
• Line 798 adds the tree back into the additive model.

[1]: Friedman - Greedy function approximation: A gradient boosting machine

#### 1.0 Least squares regression

\begin{align} \tilde{y}_i &= - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)} \\ &= y_i - F_{m-1}(x_i), \quad i = 1, 2, \dotsc, N \\ \end{align}

\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \right) \\ &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N \frac{1}{2} \left ( y_i - F_{m-1}(x_i) - \rho h(x_i; \mathbf{a}_m) \right)^2 \\ &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N \frac{1}{2} \left ( \tilde{y}_i - \rho h(x_i; \mathbf{a}_m) \right)^2 \\ &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N \left ( \tilde{y}_i - \rho h(x_i; \mathbf{a}_m) \right)^2 \\ &= \operatorname{arg \, min}_\beta \displaystyle \sum_{i=1}^N \left ( \tilde{y}_i - \beta h(x_i; \mathbf{a}_m) \right)^2 \quad \text{rename } \rho \to \beta \\ &= \beta_m \end{align}

• $F_0(x) = \bar{y}$

• For $m=1$ to $M$ do:

1. $\tilde{y}_i = y_i - F_{m-1}(x_i), \quad i=1, N$
2. $(\rho_m, \mathbf{a}_m) = \operatorname{arg \, min}_{\mathbf{a}, \rho} \displaystyle \sum_{i=1}^N \left [ \tilde{y}_i - \rho h(x_i; \mathbf{a}) \right ]^2$
3. $F_m(x) = F_{m-1}(x) + l_r \rho_m h(x; \mathbf{a}_m)$

The corresponding sklearn code:

274 class LeastSquaresError(RegressionLossFunction):
275     """Loss function for least squares (LS) estimation.
276     Terminal regions need not to be updated for least squares. """
277     def init_estimator(self):
278         return MeanEstimator()
279
280     def __call__(self, y, pred, sample_weight=None):
281         if sample_weight is None:
282             return np.mean((y - pred.ravel()) ** 2.0)
283         else:
284             return (1.0 / sample_weight.sum() *
285                     np.sum(sample_weight * ((y - pred.ravel()) ** 2.0)))
286
287     def negative_gradient(self, y, pred, **kargs):
288         return y - pred.ravel()
289
290     def update_terminal_regions(self, tree, X, y, residual, y_pred,
292                                 learning_rate=1.0, k=0):
293         """Least squares does not need to update terminal regions.
294
295         But it has to update the predictions.
296         """
297         # update predictions
298         y_pred[:, k] += learning_rate * tree.predict(X).ravel()
299
300     def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
301                                 residual, pred, sample_weight):
302         pass


#### 1.1 Least absolute deviation regression (LAD)

\begin{align} \tilde{y}_i &= - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)} \\ &= \operatorname{sign}(y_i - F_{m-1}(x_i))\\ \end{align}

$L(y, F) = |y - F|$ is a piecewise function: the negative gradient is constantly $1$ where $y - F > 0$ and constantly $-1$ where $y - F < 0$; at the break point $y - F = 0$ the loss is not differentiable and the value is set to $0$ by convention, so it can be described with the sign function:

$$\operatorname{sign}(x) := { \begin{cases} -1 & {\text{if }} x<0, \\ 0 & {\text{if }} x=0, \\ 1 & {\text{if }} x>0. \end{cases} }$$

\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \right) \\ &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \big | y_i - F_{m-1}(x_i) - \rho h(x_i; \mathbf{a}_m) \big | \\ &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \big | h(x_i; \mathbf{a}_m) \big | \cdot \left | \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} - \rho \right | \\ &= \operatorname{median}_W \left \{ \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} \right \}_1^N, \quad w_i = \big | h(x_i; \mathbf{a}_m) \big | \end{align}

#### Derivation details

\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \big | h(x_i; \mathbf{a}_m) \big | \cdot \left | \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} - \rho \right | \\ &= \operatorname{median}_W \left \{ \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} \right \}_1^N, \quad w_i = \big | h(x_i; \mathbf{a}_m) \big | \end{align}

\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \big | h(x_i; \mathbf{a}_m) \big | \cdot \left | \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} - \rho \right | \\ &= \operatorname{arg \, min}_\rho \sum_{i=1}^N w_i \cdot | t_i - \rho | \\ \end{align}
##### Unweighted case

In [3]:
show_image("./res/rho_m.png")


1. For the leftmost region $\rho < t_1$, we have $f(\rho) = \sum_{i=1}^N | t_i - \rho | = \sum_{i=1}^N (t_i - \rho)$. Clearly every term satisfies $t_i - \rho > 0$, and as $\rho \to t_1$ each term $(t_i - \rho)$ decreases monotonically. In other words, as $\rho$ moves right towards $t_1$, every term shrinks, so the sum $f(\rho)$ is monotonically decreasing.

2. Pick any interval $t_k \leq \rho \leq \rho + d \leq t_{k+1}$; then:

\begin{align} f(\rho + d) &= \displaystyle \sum_{i=1}^N | t_i - (\rho + d) | \\ &= \sum_{i=1}^k (\rho + d - t_i) + \sum_{i=k+1}^N \big ( t_i - (\rho + d) \big ) \\ &= \sum_{i=1}^k (\rho - t_i) + k d + \sum_{i=k+1}^N (t_i - \rho) - (N - k) d \\ &= \sum_{i=1}^N | t_i - \rho | + k d - (N - k) d \\ &= f(\rho) + d \, (2k - N) \end{align}

$$f(\rho + d) \; \begin{cases} < f(\rho) & \text{when } k < \frac{N}{2} \\ = f(\rho) & \text{when } k = \frac{N}{2} \\ > f(\rho) & \text{when } k > \frac{N}{2} \end{cases}$$
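As a quick numerical sanity check of the case analysis (a throwaway sketch, not part of the original derivation): a grid scan of $f(\rho) = \sum_i |t_i - \rho|$ should bottom out at the sample median.

```python
import numpy as np

rng = np.random.RandomState(0)
t = np.sort(rng.randn(11))  # odd N, so the median is one of the t_i

def f(rho):
    return np.sum(np.abs(t - rho))

# Scan a fine grid over [t_1 - 1, t_N + 1]; the minimizer should sit at the median.
grid = np.linspace(t[0] - 1.0, t[-1] + 1.0, 20001)
rho_star = grid[np.argmin([f(r) for r in grid])]
```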

##### Weighted case
$$\rho_m = \operatorname{arg \, min}_\rho \sum_{i=1}^N w_i \cdot | t_i - \rho |$$

\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \sum_{i=1}^N w_i \cdot | t_i - \rho | \\ &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \sum_{k=1}^{w_i} | t_i - \rho | \\ &= \operatorname{arg \, min}_\rho \sum_{t_i \in T'} | t_i - \rho | \quad \text{$T'$ is the multiset with each $t_i$ repeated $w_i$ times (integer $w_i$)} \end{align}

1. Sort the $t_i$ in ascending order.
2. Reorder the $w_i$ accordingly and compute the cumulative sums $c_k = \sum_{i=1}^k w_i$.
3. The $t_m$ at the median point $c_m = \frac{1}{2} c_N$ is the weighted median $\operatorname{median}_W \{t_i\}$.

51 def _weighted_percentile(array, sample_weight, percentile=50):
52     """Compute the weighted percentile of array with sample_weight. """
53     sorted_idx = np.argsort(array)
54
55     # Find index of median prediction for each sample
56     weight_cdf = sample_weight[sorted_idx].cumsum()
57     percentile_idx = np.searchsorted(
58         weight_cdf, (percentile / 100.) * weight_cdf[-1])
59     return array[sorted_idx[percentile_idx]]
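The three steps above can be written directly in NumPy. The sketch below mirrors the logic of `_weighted_percentile` at the 50th percentile (a re-implementation for illustration, not the sklearn function itself):

```python
import numpy as np

def weighted_median(values, weights):
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)                 # step 1: sort t_i ascending
    cdf = np.cumsum(weights[order])            # step 2: cumulative weights c_k
    idx = np.searchsorted(cdf, 0.5 * cdf[-1])  # step 3: first c_k >= c_N / 2
    return values[order][idx]
```

For integer weights this agrees with the plain median of the expanded multiset $T'$.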


### 2. From GBDT to TreeBoost

#### 2.0 Regression decision trees

• For $m=1$ to $M$ do:

1. $\tilde{y}_i = - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)}, \quad i = 1, 2, \dotsc, N$

2. $\mathbf{a}_m = \operatorname{arg \, min}_{\mathbf{a}, \beta} \displaystyle \sum_{i=1}^N \left [ \tilde{y}_i - \beta h(x_i; \mathbf{a}) \right ]^2$

3. $\rho_m = \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \right)$
4. $F_m(x) = F_{m-1}(x) + \rho_m h(x; \mathbf{a}_m)$
Note: for ease of exposition, I have dropped the learning-rate multiplier here.

TreeBoost internalizes the external weight search into the decision tree itself, which both makes the model finer-grained and speeds up training. To push the external parameter into the decision tree $h(x_i; \mathbf{a}_m)$, the first problem is to open up this function, i.e. to describe it analytically. In other words, we need a mathematical model of the decision tree.

$$h(x; \{b_j, R_j\}_1^J) = \displaystyle \sum_{j=1}^J b_j \, \mathbf{1}(x \in R_j)$$

\begin{align} F_m(x) &= F_{m-1}(x) + \rho_m h(x; \mathbf{a}_m) \\ &= F_{m-1}(x) + \rho_m \displaystyle \sum_{j=1}^J b_{jm} \, \mathbf{1}(x \in R_{jm}) \\ &= F_{m-1}(x) + \sum_{j=1}^J \color{red}{\rho_m b_{jm}} \, \mathbf{1}(x \in R_{jm}) \\ &= F_{m-1}(x) + \sum_{j=1}^J \color{red}{\gamma_{jm}} \, \mathbf{1}(x \in R_{jm}) \end{align}

$$\{\gamma_{jm}\}_1^J = \displaystyle \operatorname{arg \, min}_{\{\gamma_j\}_1^J} \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \sum_{j=1}^J \gamma_j \mathbf{1}(x \in R_{jm}) \right )$$

$$\gamma_{jm} = \displaystyle \operatorname{arg \, min}_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$$

$$\gamma_{jm} = \displaystyle \operatorname{median}_{x_i \in R_{jm}} \{ y_i - F_{m-1}(x_i) \}$$

• $F_0(x) = \operatorname{median}\{y_i\}_1^N$
• For $m=1$ to $M$ do:
1. $\tilde{y}_i = \operatorname{sign}(y_i - F_{m-1}(x_i)), \, i=1, N$
2. $\{R_{jm}\}_1^J = \text{$J$-terminal node tree} \big (\{\tilde{y}_i, x_i\}_1^N \big )$
3. $\gamma_{jm} = \displaystyle \operatorname{median}_{x_i \in R_{jm}} \{ y_i - F_{m-1}(x_i)\}, \, j=1, J$
4. $F_m(x) = F_{m-1}(x) + \displaystyle \sum_{j=1}^J \gamma_{jm} \mathbf{1}(x \in R_{jm})$
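The four steps above can be sketched with a stock regression tree: fit it on the sign residuals, then overwrite each leaf with the median of the raw residuals that fall into it. This is a simplified sketch with made-up names (learning rate omitted, as in the text), not sklearn's code:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def lad_treeboost_fit(X, y, n_stages=30, max_leaf_nodes=8):
    F = np.full(len(y), np.median(y))      # F_0(x) = median{y_i}
    stages = []
    for _ in range(n_stages):
        y_tilde = np.sign(y - F)           # step 1: sign residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, y_tilde)               # step 2: terminal regions R_jm
        leaves = tree.apply(X)
        # step 3: gamma_jm = median of y_i - F(x_i) within each region
        gamma = {leaf: np.median((y - F)[leaves == leaf])
                 for leaf in np.unique(leaves)}
        F = F + np.array([gamma[leaf] for leaf in leaves])  # step 4
        stages.append((tree, gamma))
    return F, stages

rng = np.random.RandomState(0)
X = rng.rand(300, 3)
y = 2 * X[:, 0] + 0.1 * rng.standard_cauchy(300)  # heavy-tailed noise
F, _ = lad_treeboost_fit(X, y)
```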

305 class LeastAbsoluteError(RegressionLossFunction):
306     """Loss function for least absolute deviation (LAD) regression. """
307     def init_estimator(self):
308         return QuantileEstimator(alpha=0.5)
309
310     def __call__(self, y, pred, sample_weight=None):
311         if sample_weight is None:
312             return np.abs(y - pred.ravel()).mean()
313         else:
314             return (1.0 / sample_weight.sum() *
315                     np.sum(sample_weight * np.abs(y - pred.ravel())))
316
317     def negative_gradient(self, y, pred, **kargs):
318         """1.0 if y - pred > 0.0 else -1.0"""
319         pred = pred.ravel()
320         return 2.0 * (y - pred > 0.0) - 1.0
321
322     def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
323                                 residual, pred, sample_weight):
325         terminal_region = np.where(terminal_regions == leaf)[0]
326         sample_weight = sample_weight.take(terminal_region, axis=0)
327         diff = y.take(terminal_region, axis=0) - pred.take(terminal_region, axis=0)
328         tree.value[leaf, 0, 0] = _weighted_percentile(diff, sample_weight, percentile=50)


#### 2.2 M-Regression

LS_Boost runs fast and LAD_TreeBoost is robust; can we combine the two? That is exactly the motivation behind M-Regression: a threshold $\delta$, say three standard deviations, splits the error $|y - F|$ into two parts. Errors within the threshold are treated as normal and constrained as in LS_Boost; errors beyond it are treated as long tails and bad values and resisted as in LAD_TreeBoost. Tuning $\delta$ controls the balance, taking the best of both.

$$L(y, F) = \begin{cases} \frac{1}{2} (y - F)^2 \quad & |y - F| \leq \delta \\ \delta \cdot \left (\big |y - F \big | - \frac{\delta}{2} \right ) \quad & |y - F| > \delta \end{cases}$$
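The loss and its negative gradient follow directly from the definition. A sketch with a fixed $\delta$ (in Friedman's M_TreeBoost, $\delta_m$ is adapted per iteration as a quantile of the absolute residuals):

```python
import numpy as np

def huber_loss(y, F, delta):
    r = np.abs(y - F)
    return np.where(r <= delta,
                    0.5 * r ** 2,                # quadratic inside the threshold
                    delta * (r - 0.5 * delta))   # linear outside it

def huber_negative_gradient(y, F, delta):
    r = y - F
    # inside the threshold the gradient is the residual; outside it is clipped at +/- delta
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))
```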

#### 2.3 Two-class logistic regression trees

$$L(y, F) = \log (1 + e^{-2 y F}), \quad y \in \{-1, 1\}$$

\begin{align} \tilde{y}_i &= - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)} \\ &= - \frac{1}{1 + e^{-2yF}} \cdot e^{-2yF} \cdot (-2y) \\ &= 2y \frac{e^{-2yF}}{1 + e^{-2yF}} \\ &= \frac{2y}{1 + e^{2yF}} \\ &= \frac{2 y_i}{1 + e^{2 y_i F_{m-1}(x_i)}} \\ \end{align}

$$\rho_m = \displaystyle \operatorname{arg \, min}_\rho \sum_{i=1}^{N} \log \left ( 1 + e^{-2 y_i \big ( F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \big )} \right )$$

$$\gamma_{jm} = \displaystyle \operatorname{arg \, min}_\gamma \sum_{x_i \in R_{jm}} \log(1 + e^{-2 y_i ( F_{m-1}(x_i) + \gamma ) })$$

$$\gamma_{jm} = \displaystyle \sum_{x_i \in R_{jm}} \tilde{y}_i \Big / \sum_{x_i \in R_{jm}} |\tilde{y}_i| (2 - |\tilde{y}_i|)$$

• $F_0(x) = \frac{1}{2} \log \frac{1 + \bar{y}}{1 - \bar{y}}$

• For $m=1$ to $M$ do:

1. $\tilde{y}_i = \frac{2 y_i}{1 + e^{2 y_i F_{m-1}(x_i)}}, \quad i=1, N$
2. $\{R_{jm}\}_1^J = \text{$J$-terminal node tree} \big (\{\tilde{y}_i, x_i\}_1^N \big )$
3. $\gamma_{jm} = \displaystyle \sum_{x_i \in R_{jm}} \tilde{y}_i \Big / \sum_{x_i \in R_{jm}} |\tilde{y}_i| (2 - |\tilde{y}_i|), \quad j=1, J$
4. $F_m(x) = F_{m-1}(x) + \displaystyle \sum_{j=1}^J \gamma_{jm} \mathbf{1}(x \in R_{jm})$
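Steps 1 and 3 are easy to check numerically (a sketch with made-up names): at $F = 0$ the pseudo-response reduces to $\tilde{y}_i = y_i$, and since then $|\tilde{y}_i| = 1$, the Newton leaf value reduces to the mean of the labels.

```python
import numpy as np

def pseudo_response(y, F):
    # step 1, with y in {-1, +1}
    return 2.0 * y / (1.0 + np.exp(2.0 * y * F))

def newton_leaf_value(y_tilde):
    # step 3: one Newton-Raphson step for a terminal region
    return y_tilde.sum() / np.sum(np.abs(y_tilde) * (2.0 - np.abs(y_tilde)))

y = np.array([1.0, 1.0, -1.0, 1.0])
y_tilde = pseudo_response(y, np.zeros(4))
```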

The corresponding sklearn code:

106 class LogOddsEstimator(BaseEstimator):
107     """An estimator predicting the log odds ratio."""
108     scale = 1.0
109
110     def fit(self, X, y, sample_weight=None):
111         # pre-cond: pos, neg are encoded as 1, 0
112         if sample_weight is None:
113             pos = np.sum(y)
114             neg = y.shape[0] - pos
115         else:
116             pos = np.sum(sample_weight * y)
117             neg = np.sum(sample_weight * (1 - y))
118
119         if neg == 0 or pos == 0:
120             raise ValueError('y contains non binary labels.')
121         self.prior = self.scale * np.log(pos / neg)
122
123     def predict(self, X):
124         check_is_fitted(self, 'prior')
125
126         y = np.empty((X.shape[0], 1), dtype=np.float64)
127         y.fill(self.prior)
128         return y


466 class BinomialDeviance(ClassificationLossFunction):
467 #    """Binomial deviance loss function for binary classification.
468 #+-- 10 lines: Binary classification is a special case; here, we only need to-------------------------
478
479     def init_estimator(self):
480         return LogOddsEstimator()
481
482 #+--  9 lines: def __call__(self, y, pred, sample_weight=None):---------------------------------------
491
492     def negative_gradient(self, y, pred, **kargs):
493         """Compute the residual (= negative gradient). """
494         return y - expit(pred.ravel())
495
496     def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
497                                 residual, pred, sample_weight):
498         """Make a single Newton-Raphson step.
499
500         our node estimate is given by:
501
502             sum(w * (y - prob)) / sum(w * prob * (1 - prob))
503
504         we take advantage that: y - prob = residual
505         """
506         terminal_region = np.where(terminal_regions == leaf)[0]
507         residual = residual.take(terminal_region, axis=0)
508         y = y.take(terminal_region, axis=0)
509         sample_weight = sample_weight.take(terminal_region, axis=0)
510
511         numerator = np.sum(sample_weight * residual)
512         denominator = np.sum(sample_weight * (y - residual) * (1 - y + residual))
513
514         if denominator == 0.0:
515             tree.value[leaf, 0, 0] = 0.0
516         else:
517             tree.value[leaf, 0, 0] = numerator / denominator
518
519     def _score_to_proba(self, score):
520         proba = np.ones((score.shape[0], 2), dtype=np.float64)
521         proba[:, 1] = expit(score.ravel())
522         proba[:, 0] -= proba[:, 1]
523         return proba


[3]: Jerome Friedman - Additive logistic regression: a statistical view of boosting

\begin{align} & F(x) = \frac{1}{2} \log \left [ \frac{\operatorname{Pr}(y = 1 | x)}{\operatorname{Pr}(y = -1 | x)} \right ] \\ & \operatorname{Pr}(y = 1 | x) + \operatorname{Pr}(y = -1 | x) = 1 \end{align}

\begin{align} & \operatorname{Pr}(y = 1 | x) = \frac{1}{1 + e^{-2 F(x)}} \\ & \operatorname{Pr}(y = -1 | x) = \frac{1}{1 + e^{2 F(x)}} \\ \end{align}

\begin{align} p_{+}(x) &= \operatorname{\widehat{Pr}}(y = 1 | x) = \frac{1}{1 + e^{-2 F_M(x)}} \\ p_{-}(x) &= \operatorname{\widehat{Pr}}(y = -1 | x) = \frac{1}{1 + e^{2 F_M(x)}} \\ \end{align}

$$\hat{y}(x) = 2 \cdot \mathbf{1}[c(-1,1) p_{+}(x) > c(1,-1) p_{-}(x)] - 1$$

##### Influence trimming

\begin{align} \gamma_{jm} &= \displaystyle \operatorname{arg \, min}_\gamma \sum_{x_i \in R_{jm}} \log(1 + e^{-2 y_i ( F_{m-1}(x_i) + \gamma ) }) \\ &= \displaystyle \operatorname{arg \, min}_\gamma \sum_{x_i \in R_{jm}} \log(1 + e^{-2 \color{blue}{y_i F_{m-1}(x_i)}} \cdot e^{-2 y_i \gamma} ) \\ &= \displaystyle \operatorname{arg \, min}_\gamma \sum_{x_i \in R_{jm}} \phi(x_i) \end{align}

$$w_i = \frac{\partial^2 L}{\partial F^2} = |\tilde{y}_i| (2 - |\tilde{y}_i|)$$

$$\displaystyle \sum_{i=1}^{l(\alpha)} w_{(i)} = \alpha \sum_{i=1}^{N} w_i$$
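One way to realize the trimming rule above (a sketch; the function name is made up): sort the influences, accumulate from the smallest, and drop the observations whose influences sum to at most an $\alpha$ fraction of the total.

```python
import numpy as np

def influence_trim_mask(w, alpha=0.1):
    # keep[i] == False for the l(alpha) observations with the smallest influence,
    # where the l(alpha) smallest w sum to at most alpha * sum(w)
    order = np.argsort(w)
    cdf = np.cumsum(w[order])
    l_alpha = np.searchsorted(cdf, alpha * cdf[-1])  # number of dropped points
    keep = np.ones(len(w), dtype=bool)
    keep[order[:l_alpha]] = False
    return keep
```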

### 3. Parameters and Regularization

#### 3.0 Regularization

lack of fit (LOF):

• regression
    • average absolute error
• classification
    • minus twice log-likelihood (deviance)
    • misclassification error-rate

1. Simple: shrink only the final model: $F_\nu(x) = \bar{y} + \nu \cdot (F_M(x) - \bar{y})$
2. Complex: shrink at every iteration: $F_m(x) = F_{m-1}(x) + \nu \cdot \rho_m h(x; \mathbf{a}_m), \quad 0 < \nu \leq 1$

#### 3.1 Parameters of tree boosting

Meta-parameters:

• $M$: the number of iterations.
• $\nu$: the learning rate.
• $J$: the fixed number of terminal nodes.
The best tree size $J$ is governed by the effective interaction order of the target $F^*(x)$, which is usually unknown $\to$ cross-validation.
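In sklearn's public API these map to `n_estimators` ($M$), `learning_rate` ($\nu$), and `max_leaf_nodes` or `max_depth` for the tree size ($J$). A small usage sketch (data and settings illustrative only):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X[:, 0] + 0.1 * rng.randn(200)

# M = n_estimators, nu = learning_rate, J via max_leaf_nodes
est = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1,
                                max_leaf_nodes=4, random_state=0)
est.fit(X, y)

# one prediction per boosting stage -- useful for choosing M by validation error
staged_mse = [np.mean((y - p) ** 2) for p in est.staged_predict(X)]
```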

### 4. Interpretability

#### 4.0 Relative importance of input variables

The relative influences $I_j$, of the individual inputs $x_j$, on the variation of $\hat{F}(x)$ over the joint input variable distribution:

$$I_j = \left ( E_x \left [ \frac{\partial \hat{F}(x)}{\partial x_j} \right ]^2 \cdot \operatorname{var}_x[x_j] \right )^{1/2}$$

$$\hat{I}_j^2(T) = \displaystyle \sum_{t=1}^{J-1} \hat{i}_t^2 \mathbf{1}(v_t = j)$$

$$\hat{I}^2_j = \frac{1}{M} \displaystyle \sum_{m=1}^M \hat{I}^2_j (T_m)$$

1201     def feature_importances_(self):
1202         """Return the feature importances (the higher, the more important the
1203            feature).
1204
1205         Returns
1206         -------
1207         feature_importances_ : array, shape = [n_features]
1208         """
1209         self._check_initialized()
1210
1211         total_sum = np.zeros((self.n_features, ), dtype=np.float64)
1212         for stage in self.estimators_:
1213             stage_sum = sum(tree.feature_importances_
1214                             for tree in stage) / len(stage)
1215             total_sum += stage_sum
1216
1217         importances = total_sum / len(self.estimators_)
1218         return importances
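Usage sketch for the property above (data illustrative): the per-tree importances are normalized, so the averaged vector sums to one and an informative feature should dominate.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = 3.0 * X[:, 0] + 0.05 * rng.randn(300)  # only feature 0 carries signal

est = GradientBoostingRegressor(random_state=0).fit(X, y)
importances = est.feature_importances_
```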


#### 4.1 Partial dependence plots

Partial dependence is mainly used to visualize the effect of one or two variables on the overall result. The idea is very simple: given a feature split $X_s \cup X_c = X$, define $f(X_s = x_s) = \operatorname{Avg}_i \, f(X_s = x_s, X_{ci})$; that is, for a particular value $x_s$, compute the average response over the complement features.

ref: Partial Dependency Plots and GBM
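The averaging can be brute-forced in a few lines (the helper name is made up): for each grid value, overwrite that column in every row and average the model's predictions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def partial_dependence_1d(model, X, feature, grid):
    # f(X_s = x_s) = Avg_i f(X_s = x_s, X_ci): fix the feature, average over the rest
    averages = []
    for value in grid:
        X_fixed = X.copy()
        X_fixed[:, feature] = value
        averages.append(model.predict(X_fixed).mean())
    return np.array(averages)

rng = np.random.RandomState(0)
X = rng.rand(300, 3)
y = 2.0 * X[:, 0] + 0.1 * rng.randn(300)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
pd_curve = partial_dependence_1d(model, X, feature=0, grid=np.linspace(0.1, 0.9, 5))
```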

### 5. Summary
