Bagging (bootstrap aggregating) is a procedure that uses bootstrapping to reduce the variance, and thereby improve the performance, of a prediction. The procedure can be defined as follows:
We have data $D=[(X_1, Y_1),(X_2, Y_2),\dots,(X_N, Y_N)]$ and we want to learn $E[Y|X]=\hat{f}(X)$. Define a bootstrap sample $D^b$ as $N$ samples drawn from $D$ with replacement, and let $E^b[Y|X]=\hat{f}^b(X)$ be the function learned from training set $D^b$.
Our bagged prediction is then the mean of the bootstrap estimates, i.e., $\hat{f}_{bag}(X)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}^b(X)$, where $B$ is the number of bootstrap samples.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import course_utils as bd
reload(bd)
#Generate Y, and X, where E[Y|X] is a 3rd order polynomial
betas = [0, 2, -2.5, 1]
n=200
sig=0.2
sp=20
x_init = np.random.uniform(0, 1, n)
e_init = np.random.normal(0, sig, n)
dat = bd.genY(x_init, e_init, betas)
dat = bd.makePolyFeat(dat, 6)
#Plot the data vs. the real curve
plt.plot(dat['x'], dat['y'], 'b.')
plt.plot(dat['x'], dat[['x','x2','x3']].dot(np.array(betas[1:])), 'r.')
#Now let's do bagged prediction using linear regression
from sklearn import linear_model
n_boots = 100
boot_est = dict()
x_grid = np.arange(0, 1, 0.01)
d_grid = pd.DataFrame(x_grid, columns=['x'])
d_grid = bd.makePolyFeat(d_grid, 6)
fig = plt.figure()
#Generate and plot each bootstrap model
for i in range(n_boots):
    D_b = dat.iloc[np.random.randint(0, dat.shape[0], size=dat.shape[0])]
    regr = linear_model.LinearRegression(fit_intercept=True)
    regr.fit(D_b.drop('y', 1), D_b['y'])
    boot_est[i] = regr.predict(d_grid)
    plt.plot(x_grid, boot_est[i], color='0.75')
#Now aggregate the bootstrapped models for a single prediction
bag_est = pd.DataFrame(boot_est).mean(axis=1)
#For comparison, we'll also see what a single fit on the original data looks like
regr = linear_model.LinearRegression(fit_intercept=True)
regr.fit(dat.drop('y', 1), dat['y'])
non_bag = regr.predict(d_grid)
#Truth
truth = d_grid[['x','x2','x3']].dot(np.array(betas[1:]))
#Now get the MSE of each estimate against the true curve
mse_bag = round(((truth-bag_est)**2).mean(), 3)
mse_non = round(((truth-non_bag)**2).mean(), 3)
plt.plot(x_grid, non_bag,'k-', label='Non Bagged Estimate, MSE={}'.format(mse_non))
plt.plot(x_grid, bag_est,'r-', label='Bagged Estimate, MSE={}'.format(mse_bag))
plt.plot(d_grid['x'], d_grid[['x','x2','x3']].dot(np.array(betas[1:])), 'b--', label='Truth')
plt.title('Bagging Example')
plt.legend(loc=4)
We can see in the above example that the variance of the bootstrapped predictions is very high, especially near the extreme values of $X$, where there is less data. The goal of bagging is to reduce the variance of the prediction (and thus improve accuracy). The lever that controls how much of this variance reduction we actually realize is the number of bootstrap samples used.
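To make that lever concrete, here is a small sketch (reusing the dat, d_grid, and betas objects defined above; the helper name bagged_curve is ours, not part of course_utils) that re-runs the bagging loop for several values of $B$ and reports the MSE of the bagged curve against the truth. You should typically see the error stabilize as $B$ grows, though the exact numbers vary from run to run.
import numpy as np
from sklearn import linear_model

def bagged_curve(data, grid, B, seed=0):
    # Average B bootstrap-refit regression curves evaluated on the grid
    rng = np.random.RandomState(seed)
    preds = []
    for _ in range(B):
        idx = rng.randint(0, data.shape[0], size=data.shape[0])
        boot = data.iloc[idx]
        m = linear_model.LinearRegression(fit_intercept=True)
        m.fit(boot.drop('y', axis=1), boot['y'])
        preds.append(m.predict(grid))
    return np.mean(preds, axis=0)

truth_vals = d_grid[['x', 'x2', 'x3']].dot(np.array(betas[1:])).values
for B in [5, 25, 100, 400]:
    print(B, round(np.mean((truth_vals - bagged_curve(dat, d_grid, B, seed=B)) ** 2), 4))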
When to use Bagging
According to the original bagging paper by Leo Breiman, bagging produces a much larger reduction in prediction error when the underlying classifier or regression model is unstable, i.e., sensitive to minor variations in the data $D$. The example above was meant to show bagging at work: we used linear regression on polynomial features, a fairly stable learner, and even though the model was misspecified (degree 6 where the truth is degree 3), the bagged estimate isn't far from the estimate without bagging. One must be careful when choosing to use bagging, because it can actually hurt performance if the underlying model is fairly stable (if you rerun the regression example above with different data sets or different numbers of bootstrap iterations, you may find that the bagged MSE is worse).
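To see the stability point in action, the sketch below (our own comparison, assuming dat, d_grid, and betas are still in scope) bags two different base learners on the synthetic data: a deep regression tree, which is unstable, and plain linear regression on the raw feature, which is stable. Typically bagging shrinks the tree's error noticeably while leaving the linear model's error essentially unchanged, though the exact numbers depend on the random draws.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

def single_vs_bagged_mse(make_model, X, y, X_grid, truth_curve, B=100, seed=0):
    # MSE of one fit on the full data vs. the average of B bootstrap fits
    rng = np.random.RandomState(seed)
    single = make_model().fit(X, y).predict(X_grid)
    boot_preds = []
    for _ in range(B):
        idx = rng.randint(0, len(y), size=len(y))
        boot_preds.append(make_model().fit(X[idx], y[idx]).predict(X_grid))
    bagged = np.mean(boot_preds, axis=0)
    return np.mean((truth_curve - single) ** 2), np.mean((truth_curve - bagged) ** 2)

X = dat[['x']].values
y = dat['y'].values
X_grid = d_grid[['x']].values
truth_vals = d_grid[['x', 'x2', 'x3']].dot(np.array(betas[1:])).values

# Unstable base learner: a deep regression tree
print(single_vs_bagged_mse(lambda: DecisionTreeRegressor(min_samples_leaf=2), X, y, X_grid, truth_vals))
# Stable base learner: plain linear regression on the raw feature
print(single_vs_bagged_mse(lambda: LinearRegression(), X, y, X_grid, truth_vals))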
The Random Forest algorithm is probably the best known and most widely used implementation of the bagging technique. A random forest (RF) is an ensemble of decision trees, where both bagging and random feature selection are used to reduce the variance of the forest. The basic algorithm goes as follows (a minimal sketch appears after the recipe):
Assume we have a data matrix $D=[X,Y]$ with $N$ records and $M$ features.
Train: for each $b$ of $B$ iterations, draw a bootstrap sample $D^b$ of size $N$ from $D$ and grow a decision tree $T_b$ on $D^b$, considering only a random subset of $m < M$ features as split candidates at each node.
Predict: aggregate the $B$ trees by averaging their predictions (for classification, average the predicted class probabilities or take a majority vote).
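Here is a minimal sketch of that recipe (an illustration of the idea, not sklearn's implementation; train_toy_forest and toy_forest_proba are hypothetical helper names, and X, y are assumed to be numpy arrays with the label separated out). Per-split feature sampling is delegated to the tree's max_features argument, and the class-1 probabilities assume a binary problem where both classes show up in every bootstrap sample.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_toy_forest(X, y, B=100, m='sqrt', seed=0):
    # For each of B iterations: draw a bootstrap sample of the N records and grow
    # a tree that considers only a random subset of m features at every split
    rng = np.random.RandomState(seed)
    trees = []
    for b in range(B):
        idx = rng.randint(0, X.shape[0], size=X.shape[0])
        tree = DecisionTreeClassifier(criterion='entropy', max_features=m,
                                      random_state=rng.randint(10**6))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def toy_forest_proba(trees, X):
    # Aggregate by averaging the per-tree class-1 probability estimates
    return np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)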
We'll start by building a forest using sklearn.ensemble. Check out here for a brief overview of ensemble methods in Python. We'll also compare this to a single tree. This test is close to an off-the-shelf test, in the sense that we are not doing any intelligent hyper-parameter optimization; however, we do use a few well-reasoned starting parameters.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import course_utils as bd
reload(bd)
f = '/Users/briand/Desktop/ds course/hw/Cell2Cell_data.csv'
dat=pd.read_csv(f, header=0, sep=',')
train, test = bd.trainTest(dat, 0.8)
lab = 'churndep'
#We'll build a RF and compare to a DT
clf_def = DecisionTreeClassifier(criterion='entropy', min_samples_leaf = 20)
clf_def = clf_def.fit(train.drop(lab, 1), train[lab])
dt_pred = clf_def.predict_proba(test.drop(lab,1))
rf_def = RandomForestClassifier(criterion='entropy', n_estimators=100)
rf_def = rf_def.fit(train.drop(lab, 1), train[lab])
rf_pred = rf_def.predict_proba(test.drop(lab,1))
Now compare the forest to the tree using AUC
from sklearn.metrics import roc_curve, auc, roc_auc_score
import course_utils as bd
reload(bd)
bd.plotAUC(test[lab], dt_pred[:,1], 'DT')
bd.plotAUC(test[lab], rf_pred[:,1], 'RF')
We can see that with little optimization involved, the RF is already a bit better. Let's see how much better we can do with a grid search on both the DT and the RF. We have already covered DT optimization in a different module, so here we'll focus on the input parameters of the RandomForestClassifier object that let us tune the forest.
The following two are forest specific: n_estimators, the number of trees grown in the forest, and max_features, the number of features considered as candidates at each split.
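For reference, a conventional cross-validated grid search over these two parameters would look roughly like the sketch below (the sklearn.model_selection interface and the particular parameter grid are our assumptions, not code from this notebook). Note that it fits cv copies of every candidate forest, which is exactly the cost the next paragraph points out.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid over the two forest-specific knobs
param_grid = {'n_estimators': [100, 500], 'max_features': [1, 3, 6, 11]}
gs = GridSearchCV(RandomForestClassifier(criterion='entropy'),
                  param_grid, scoring='roc_auc', cv=5)
gs = gs.fit(train.drop(lab, 1), train[lab])
print(gs.best_params_, round(gs.best_score_, 3))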
Usually we would want to use cross-validation for parameter tuning. For random forests this becomes problematic, as cross-validation with RFs can be painfully slow: each cross-validation run requires building k*n_estimators trees. One advantage of a random forest is that it supports what is called an "out-of-bag" (OOB) error calculation. Remember that each tree is built from a bootstrap sample of the data, so for each tree some portion of the data is not used to grow that tree. The RF method computes an out-of-bag prediction for each record $(x_i, y_i)$ by averaging the predictions $\hat{f}^b(x_i)$ over only those bootstrap iterations $b$ in which record $i$ was not drawn into the bootstrap sample. The out-of-bag predictions can then be used to compute an out-of-sample error estimate for model selection and validation; this is roughly equivalent to $N$-fold cross-validation.
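To make the OOB bookkeeping concrete, here is a sketch of computing OOB scores by hand for a plain bagged-tree ensemble (our own illustration, not sklearn's internal implementation; manual_oob_scores is a hypothetical helper, and X, y are assumed to be numpy arrays, e.g. train.drop(lab, 1).values and train[lab].values).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def manual_oob_scores(X, y, B=100, seed=0):
    # Accumulate class-1 probability estimates for each record, using only the
    # trees whose bootstrap sample did not include that record
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    score_sum = np.zeros(n)
    oob_count = np.zeros(n)
    for b in range(B):
        idx = rng.randint(0, n, size=n)            # bootstrap draw, with replacement
        oob = np.setdiff1d(np.arange(n), idx)      # records left out of this draw
        tree = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=20)
        tree.fit(X[idx], y[idx])
        score_sum[oob] += tree.predict_proba(X[oob])[:, 1]
        oob_count[oob] += 1
    return score_sum / np.maximum(oob_count, 1)    # average over each record's OOB trees
The resulting scores can be fed to roc_auc_score against the training labels, just as the oob_decision_function_ attribute is used below.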
We'll start by building a forest with OOB predictions enabled and then compare its performance to a true holdout set.
rf_def = RandomForestClassifier(criterion='entropy', n_estimators=100, oob_score=True)
rf_def = rf_def.fit(train.drop(lab, 1), train[lab])
rf_pred_test = rf_def.predict_proba(test.drop(lab,1))
rf_pred_train = rf_def.predict_proba(train.drop(lab,1))
bd.plotAUC(test[lab], rf_pred_test[:,1], 'Test')
bd.plotAUC(train[lab], rf_pred_train[:,1], 'Train')
bd.plotAUC(train[lab], rf_def.oob_decision_function_[:,1], 'OOB')
We can see two very important things here: the training AUC is dramatically higher than the test AUC (the fully grown trees essentially memorize the training data), while the OOB AUC closely tracks the test AUC, so the OOB estimate is a trustworthy stand-in for a true holdout. Next we'll run a grid over n_estimators and max_features, scoring each option with the OOB AUC (and recording the test AUC for comparison).
n_est = [50, 100, 200, 500, 1000]
m_feat = [1, 3, 6, 11]
aucs_oob = {}
aucs_test = {}
for m in m_feat:
    aucs_oob[m] = []
    aucs_test[m] = []
    for n in n_est:
        rf_oob = RandomForestClassifier(criterion='entropy', n_estimators=n, max_features=m, oob_score=True)
        rf_oob = rf_oob.fit(train.drop(lab, 1), train[lab])
        aucs_oob[m].append(roc_auc_score(train[lab], rf_oob.oob_decision_function_[:,1]))
        aucs_test[m].append(roc_auc_score(test[lab], rf_oob.predict_proba(test.drop(lab,1))[:,1]))
#We'll plot in this block
x = np.log2(np.array(n_est))
for m in m_feat:
    plt.plot(x, aucs_oob[m], label='max_feat={}'.format(m))
plt.title('OOB AUC by Max Feat and N-Estimators')
plt.xlabel('Log2(N-Estimators)')
plt.ylabel('OOB-AUC')
plt.legend(loc=4, ncol=2, prop={'size':10})
We can see in the plot above that the forest gets better with more trees, but this effect tapers off asymptotically. We also see something of a Goldilocks phenomenon with respect to the optimal number of features. Having too few features is suboptimal (i.e., max_feat=1), probably due to high bias in the individual trees. Conversely, having too many features (max_feat=11) is also bad, because the individual trees become too correlated with each other, which prevents the bagging approach from reducing the overall variance (see the analysis below to understand the math behind this statement). The right number of features is in the middle: choosing $3$ or $6$ performs about the same, and $3$ happens to be close to the common heuristic default of max_feat=sqrt(total features).
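To back up the correlation claim, this sketch (our own diagnostic, refitting small forests on the train/test split above, so numbers will differ a bit from the grid) measures the average pairwise correlation of the individual trees' test-set scores at a few max_features settings; you should typically see the correlation rise as max_features grows.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mean_tree_correlation(max_features, n_estimators=50):
    # Fit a small forest, score the test set with each tree separately, and
    # average the pairwise correlations between the trees' scores
    rf = RandomForestClassifier(criterion='entropy', n_estimators=n_estimators,
                                max_features=max_features)
    rf = rf.fit(train.drop(lab, 1), train[lab])
    X_test = test.drop(lab, 1).values
    scores = np.array([t.predict_proba(X_test)[:, 1] for t in rf.estimators_])
    corr = np.corrcoef(scores)                     # trees are the "variables" here
    return corr[np.triu_indices_from(corr, k=1)].mean()

for m in [1, 3, 11]:
    print(m, round(mean_tree_correlation(m), 3))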
Next, for our own education, we'll also look at the test-set AUCs for each of the design options above.
x = np.log2(np.array(n_est))
for m in m_feat:
    plt.plot(x, aucs_test[m], label='max_feat={}'.format(m))
plt.title('Test AUC by Max Feat and N-Estimators')
plt.xlabel('Log2(N-Estimators)')
plt.ylabel('Test-AUC')
plt.legend(loc=4, ncol=2, prop={'size':10})
We can see similar patterns in the true holdout evaluation as in the OOB evaluation, plus a few additional things: for some values of max_feat, adding more trees actually hurts test performance slightly, and the best choice for max_feat was indeed $3$.
Much like with Decision Trees, the Random Forest Classifier has a built in mechanism for evaluating feature importance. Quoting the sklearn documentation:
Features used at the top of the tree contribute to the final prediction of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.
The above computation is made for each feature in each tree and then averaged over all the trees in the forest. The RandomForestClassifier exposes the result in its feature_importances_ attribute, with one importance score per feature; the scores sum to $1$ across all features.
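As a quick sanity check on that description (this leans on sklearn implementation details, so treat it as illustrative), the forest-level attribute should match the re-normalized average of the per-tree importance vectors; here we check it on the rf_def forest fit earlier.
import numpy as np

# Per-tree importance vectors, each normalized to sum to 1
per_tree = np.array([t.feature_importances_ for t in rf_def.estimators_])
avg = per_tree.mean(axis=0)

# Expect True and 1.0 with a typical fit
print(np.allclose(rf_def.feature_importances_, avg / avg.sum()))
print(round(rf_def.feature_importances_.sum(), 3))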
We'll train a RF on the best options found above and then look at the feature importances.
from sklearn.ensemble import RandomForestClassifier
rf_best = RandomForestClassifier(criterion='entropy', n_estimators=500, max_features=3)
rf_best = rf_best.fit(train.drop(lab, 1), train[lab])
%matplotlib inline
cols = train.drop(lab, 1).columns.values
rf_fi = rf_best.feature_importances_
dt_fi = clf_def.feature_importances_
fig, ax = plt.subplots()
width=0.35
ax.bar(np.arange(len(cols)), rf_fi, width, color='r', label='RF')
ax.bar(np.arange(len(cols))+width, dt_fi, width, color='b', label='DT')
ax.set_xticks(np.arange(len(cols)))
ax.set_xticklabels(cols, rotation=45)
plt.title('Feature Importance from DT and RF')
ax.set_ylabel('Normalized Gini Importance')
plt.legend(loc=1)
We can show with a little algebra why Random Forests lead to generally better estimation performance.
Let's say that the variance of a single tree, trained on $N$ records and $p$ features, is $Var(T_b(X))=\sigma^2$. The variance of the random forest is then the variance of the average of such trees, i.e., $Var(RF(X))=Var\left(\frac{1}{B}\sum\limits_{b=1}^B T_b(X)\right)$. Because the individual trees are trained on overlapping records and features, the individual tree estimates are correlated, so the variance does not decompose cleanly into a sum of individual tree variances. We'll solve this by using the following relation: