#!/usr/bin/env python
# coding: utf-8

# # Introduction
# This notebook is a tutorial on the very powerful and elegant ensemble learning method known as Gradient Boosted Trees (GBTs). The tutorial is organized as follows:
#
# The following sections go into increasing technical depth about GBTs. The first part introduces the basic functional form of a GBT; this is the minimum one should learn to have a baseline understanding of the method. Next, the training algorithm is presented. Understanding the training algorithm leads to a nice intuitive sense of why the method works so well, and it also helps to solidify the roles of the different hyperparameters. The last section derives the GBT algorithm; knowing this is a nice way to geek out.
#
# ### Basic Functional Form
# A Gradient Boosted Tree classifier can be thought of as a linear classifier over features that are learned specifically on your training data. The GBT classifier has the following functional form:
#
# $$F^m(X) = \sum\limits_{j=1}^m f_j(X),$$
#
# where each $f_j(X)$ is a decision tree.
#
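# To see why this can be viewed as a linear model over learned features, note that each tree $f_j$ can be written as a weighted sum of leaf-indicator functions (a standard representation; the regions $R_{jk}$ and weights $\gamma_{jk}$ below are generic notation, not symbols defined elsewhere in this notebook):
#
# $$f_j(X) = \sum_{k} \gamma_{jk}\, \mathbb{1}\left[X \in R_{jk}\right],$$
#
# where $R_{jk}$ is the region covered by the $k$-th leaf of tree $j$ and $\gamma_{jk}$ is that leaf's weight. The leaf indicators are the "features learned on your training data," and the $\gamma_{jk}$ act as the linear coefficients.
#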
# Learning a GBT classifier fits within the empirical risk minimization (ERM) framework. We showed above that the basic functional form of a GBT is $F^m(X)=\sum\limits_{j=1}^m f_j(X)$. To find the optimal $F^m(X)$, we define the following optimization problem:
#
# $$\min_{f_1,\dots,f_m} \; \sum_{i=1}^{N} \mathbb{L}\left(y_i,\, F^m(x_i)\right),$$
#
# where $\mathbb{L}$ is a differentiable loss function (squared error for regression, log-loss for classification) and the sum runs over the $N$ training examples.
#
#
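# The "line search steps" referenced below follow the standard gradient-boosting recipe (Friedman's formulation, restated here in the notation of this notebook). At each boosting round $m$ we compute the pseudo-residuals, fit a tree to them by least squares, line-search for the step size, and update:
#
# $$r_i^{(m)} = -\left[\frac{\partial\, \mathbb{L}\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F=F^{m-1}}, \qquad f_m = \arg\min_{f}\sum_{i=1}^{N}\big(r_i^{(m)} - f(x_i)\big)^2,$$
#
# $$\rho_m = \arg\min_{\rho}\sum_{i=1}^{N}\mathbb{L}\big(y_i,\, F^{m-1}(x_i) + \rho\, f_m(x_i)\big), \qquad F^m(X) = F^{m-1}(X) + \rho_m f_m(X).$$
#
# The step size $\rho_m$ (and the shrinkage introduced later) can be absorbed into $f_m$, which recovers the simple form $F^m(X)=\sum_j f_j(X)$ used above.
#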
# In the line search steps above, we used a decision tree to make the best least-squares approximation to the (negative) gradient of the loss function. This is well motivated mathematically, but there is also a very intuitive interpretation. The negative gradient of $\mathbb{L}$ with respect to the current prediction is essentially the residual between $Y$ and $F^{m-1}(X)$. So for squared-error loss the negative gradient is $Y-F^{m-1}(X)$, and for classification with log-loss it is $Y-p(X)$, where $p(X)$ is the current predicted probability of the positive class.
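# For the classification case, this follows from one line of calculus. With $p(X)=\frac{1}{1+e^{-F(X)}}$ and log-loss $\mathbb{L} = -\left[Y\log p(X) + (1-Y)\log(1-p(X))\right]$, we have
#
# $$\frac{\partial \mathbb{L}}{\partial F(X)} = p(X) - Y, \qquad \text{so} \qquad -\frac{\partial \mathbb{L}}{\partial F(X)} = Y - p(X).$$
#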
# When we think about the tree-fitting step above, what we are really doing is fitting a tree to the residuals of the $(m-1)^{\text{th}}$ step. Each boosting step then emphasizes the data points where the predictions are most incorrect.
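# To make the residual-fitting intuition concrete, here is a minimal from-scratch sketch of least-squares boosting using scikit-learn's DecisionTreeRegressor. It is a toy illustration of the idea above, not how GradientBoostingClassifier is implemented internally; the number of rounds and the tree depth are arbitrary.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_ls_boost(X, y, n_rounds=50, max_depth=2):
    '''Toy least-squares boosting: each new tree is fit to the current residuals.'''
    F = np.full(len(y), y.mean())       # start from the constant mean prediction
    trees = []
    for m in range(n_rounds):
        residuals = y - F               # negative gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)          # least-squares fit to the residuals
        F = F + tree.predict(X)         # grow the additive model by one tree
        trees.append(tree)
    return trees, F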
#
# As with any prediction model, we need to take care to avoid over-fitting. This is done by applying a dampening (shrinkage) constant to each update of $F^m(X)$. Specifically, we introduce $\nu$, with $0 < \nu \le 1$, such that:
#
# $$F^m(X) = F^{m-1}(X) + \nu\, f_m(X).$$
#
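# In scikit-learn's GradientBoostingClassifier, this shrinkage constant corresponds to the learning_rate parameter; smaller values generally require more boosting rounds. The specific numbers below are only illustrative:

from sklearn.ensemble import GradientBoostingClassifier

# heavier shrinkage (small nu) paired with more trees
gbc_slow = GradientBoostingClassifier(learning_rate=0.05, n_estimators=2000, max_depth=2)
# lighter shrinkage (larger nu) with fewer trees
gbc_fast = GradientBoostingClassifier(learning_rate=0.5, n_estimators=200, max_depth=2)

#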
# A good way to understand gradient boosting is to work with a small dataset in which we know the true data-generating function $F^*(X)$. For this example we have $X=(X_1, X_2)$ uniformly distributed in the unit square, and $Y$ is distributed as follows:
#
# $$Y = \mathbb{1}\left[(X_1-0.5)^2 + (X_2-0.5)^2 + \epsilon < 0.25^2\right], \qquad \epsilon \sim \mathcal{N}(0,\, 0.1^2).$$
#
#
# In[40]:
# Generate random data with classes separated by a circle with some Gaussian noise
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

n = 10000
circle = [(0.5, 0.5), 0.25]   # circle center and radius
sig = 0.1                     # noise standard deviation
X1 = np.random.random(n)
X2 = np.random.random(n)
Y = 1 * (((X1 - circle[0][0])**2 + (X2 - circle[0][1])**2 + np.random.randn(n) * sig) < circle[1]**2)
X = pd.DataFrame({'X1': X1, 'X2': X2})

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(X1[(Y == 0)], X2[(Y == 0)], 'b.', markersize=3)
plt.plot(X1[(Y == 1)], X2[(Y == 1)], 'r.', markersize=3)
ax.axes.get_xaxis().set_visible(False)
ax.axes.get_yaxis().set_visible(False)
plt.title('Simulated 2-Class Data')

#
# Now let's compare training and test error as a function of the number of estimators. Note that since this data has only 2 features, we only need a max_depth of 2 for the decision trees to express any interactions that may exist in the data.
# In[2]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score

split_ind = int(0.7 * len(Y))
X_train = X[:split_ind]
Y_train = Y[:split_ind]
X_test = X[split_ind:]
Y_test = Y[split_ind:]

n_est_lim = 1000
gbc = GradientBoostingClassifier(n_estimators=n_est_lim, max_depth=2)
gbc.fit(X_train, Y_train)

#
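# Since roc_auc_score was imported above, a quick sanity check of the fully grown model's holdout performance could look like this (a side check only; the staged analysis below is the main event):

p_hat = gbc.predict_proba(X_test)[:, 1]
print('Holdout AUC: %0.4f' % roc_auc_score(Y_test, p_hat))

#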
# Now let's plot using the staged_predict_proba method of GradientBoostingClassifier. This method returns a generator that yields the predicted probabilities at each stage of the model-building process. This will let us see how training and test error (log-loss) evolve as we grow the model.
# In[17]:
from pylab import rcParams
rcParams['figure.figsize'] = 12, 5

def LogLossP(Pt, Yt):
    '''Computes mean log-loss given predicted probabilities Pt and binary labels Yt.'''
    return -1 * ((Yt == 1) * np.log(Pt) + (Yt == 0) * np.log(1 - Pt)).mean()

p_train = gbc.staged_predict_proba(X_train)
p_test = gbc.staged_predict_proba(X_test)

ll_train = []
ll_test = []
for p in p_train:
    ll_train.append(LogLossP(p[:, 1], Y_train))
for p in p_test:
    ll_test.append(LogLossP(p[:, 1], Y_test))

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, n_est_lim + 1), ll_train, 'b-', label='Train')
plt.plot(np.arange(1, n_est_lim + 1), ll_test, 'r-', label='Test')
plt.title('Training and Testing Error as a Function of N-Estimators')
plt.legend()

# Above we see the typical pattern as a classifier is made more complex. In this case, the power of the classifier improves dramatically in the first few dozen iterations. After that, training error continues to improve, but at the cost of test error. Notice, though, that the cost of overfitting is fairly mild here, so having too many iterations is better than having too few.
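# One practical use of these curves is to read off the number of boosts that minimizes the holdout log-loss, for example:

best_n = np.argmin(ll_test) + 1   # +1 because stages are numbered from 1
print('Best n_estimators by test log-loss: %d (log-loss %0.4f)' % (best_n, ll_test[best_n - 1]))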
#
# Next let's actually visualize what the classifier is doing. Remember that our function here is an additive function of weighted trees.
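# One way to see this additive structure directly is to inspect the individual regression trees that the fitted model stores in its estimators_ attribute (for a binary problem this is an array with one regression tree per boosting round):

print(gbc.estimators_.shape)          # (n_est_lim, 1): one regression tree per boost
first_tree = gbc.estimators_[0, 0]    # a max_depth-2 DecisionTreeRegressor
print(type(first_tree).__name__, 'with', first_tree.tree_.node_count, 'nodes')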
#
#
# Now let's plot the decision surface of the classifier. That is, we will look at the predicted probability $P(Y=1 \mid X_1, X_2)$ over a grid covering the unit square.
#
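# The plots discussed below can be produced, for example, by refitting the classifier with a few different values of n_estimators and drawing the predicted probability on a grid. This is a minimal sketch; the stage values chosen here are illustrative.

stages = [5, 50, 1000]                      # illustrative numbers of boosts
xx, yy = np.meshgrid(np.arange(0, 1, 0.01), np.arange(0, 1, 0.01))
grid = np.c_[xx.ravel(), yy.ravel()]

fig = plt.figure()
for i, n_est in enumerate(stages):
    gbc_i = GradientBoostingClassifier(n_estimators=n_est, max_depth=2)
    gbc_i.fit(X_train, Y_train)
    Z = gbc_i.predict_proba(grid)[:, 1].reshape(xx.shape)
    ax = fig.add_subplot(1, len(stages), i + 1)
    plt.contourf(xx, yy, Z, cmap=plt.cm.cool)
    plt.title('%d boosts' % n_est)
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)

#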
# In the above plots we can get a good sense of the bias-variance tradeoff of the GBT classifier. With a few iterations, we get a large box in the middle with very little curvature around it. As we add more boosts, the decision surface gets more spherical. With this increase we also see the prediction region becoming more speckled with predictions of the negative class (blue); this is the GBT fitting noisy points within the circle. We can see how the log-loss decreases as we add more trees, but eventually it goes up again. Overall, though, on this data it is better to err on the side of more trees than fewer: the method is more robust to high variance than to high bias.
#
# ## A Simulation Bakeoff
#
# In this section we'll compare Gradient Boosting to two other non-linear methods: Random Forests and an SVM with a radial basis function (RBF) kernel. We'll first generate two datasets that have highly non-linear decision boundaries. The two datasets are generated from the same underlying geometry, but one has substantial noise added to it.
#
# In[1]:
from course_utils import *

noisy_train = happyClass(0.015, 2000)
clean_train = happyClass(0.0001, 2000)
noisy_test = happyClass(0.015, 5000)
clean_test = happyClass(0.0001, 5000)

fig = plt.figure()
plt.subplot(1, 2, 1)
plt.title('Noisy Face')
plt.plot(noisy_test.X1[(noisy_test.Y == 0)], noisy_test.X2[(noisy_test.Y == 0)], 'b.')
plt.plot(noisy_test.X1[(noisy_test.Y == 1)], noisy_test.X2[(noisy_test.Y == 1)], 'r.')
plt.subplot(1, 2, 2)
plt.title('Clean Face')
plt.plot(clean_test.X1[(clean_test.Y == 0)], clean_test.X2[(clean_test.Y == 0)], 'b.')
plt.plot(clean_test.X1[(clean_test.Y == 1)], clean_test.X2[(clean_test.Y == 1)], 'r.')

# Next we'll do the bakeoff. The goal is to see how the above algorithms perform on the simulated data. We also want to compare the decision surfaces so that we can build some geometric intuition about how each classifier operates. For each classifier in this bakeoff we'll use scikit-learn's GridSearchCV to perform cross-validation and choose a set of optimal hyper-parameters.
#
# In[2]:
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import KFold
from sklearn.svm import SVC

boost_grid = {'n_estimators': [20, 50, 100, 500, 1000]}
rf_grid = {'n_estimators': [20, 50, 100, 500, 1000], 'max_features': [1, 2]}
svm_grid = {'C': [1000, 10000, 100000], 'gamma': [100, 10, 1, 0.1], 'kernel': ['rbf']}

# We define a KFold object so that all tests get the same folds
kf = KFold(noisy_train.shape[0], n_folds=5)

boost_cv = GridSearchCV(GradientBoostingClassifier(), boost_grid, cv=kf, scoring='roc_auc')
boost_cv.fit(noisy_train.drop('Y', 1), noisy_train.Y)

rf_cv = GridSearchCV(RandomForestClassifier(), rf_grid, cv=kf, scoring='roc_auc')
rf_cv.fit(noisy_train.drop('Y', 1), noisy_train.Y)

svm_cv = GridSearchCV(SVC(), svm_grid, cv=kf, scoring='roc_auc')
svm_cv.fit(noisy_train.drop('Y', 1), noisy_train.Y)

# We need these steps to get probability estimates out of the SVM
svc = svm_cv.best_estimator_
svc.probability = True
svc.fit(noisy_train.drop('Y', 1), noisy_train.Y)

# Now let's plot these
#
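# Before looking at decision surfaces, we can inspect which hyper-parameters each search settled on and how the tuned models score on the noisy holdout set (the SVM exposes predict_proba only because we refit it above with probability=True):

Xn_test = noisy_test.drop('Y', 1)

for name, model in [('GBT', boost_cv), ('Random Forest', rf_cv), ('SVM (RBF)', svm_cv)]:
    print('%s: best params %s, CV AUC %0.4f' % (name, model.best_params_, model.best_score_))

for name, model in [('GBT', boost_cv), ('Random Forest', rf_cv), ('SVM (RBF)', svc)]:
    auc = roc_auc_score(noisy_test.Y, model.predict_proba(Xn_test)[:, 1])
    print('%s holdout AUC: %0.4f' % (name, auc))

#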
# In[11]:
def plotZgen(clf, dat, pc, t, fig):
    '''
    Plots a 2-d decision surface given a trained classifier.
    Note the data must have two fields, X1 and X2, to work.
    '''
    plot_step = 0.02
    x_min, x_max = dat['X1'].min(), dat['X1'].max()
    y_min, y_max = dat['X2'].min(), dat['X2'].max()
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step))
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    Z = Z.reshape(xx.shape)
    ax = fig.add_subplot(pc[0], pc[1], pc[2])
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.cool)
    plt.plot(dat['X1'][(dat.Y == 1)], dat['X2'][(dat.Y == 1)], 'r.', markersize=2)
    plt.title(t)
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)

fig = plt.figure()
plotZgen(boost_cv, noisy_test, (1, 3, 1), 'Gradient Boosted Tree', fig)
plotZgen(rf_cv, noisy_test, (1, 3, 2), 'Random Forest', fig)
plotZgen(svc, noisy_test, (1, 3, 3), 'SVM - Gaussian Kernel', fig)

# # References
# The following references are recommended for a complete review of both the implementation and theory of Gradient Boosted Trees.
#