This chapter will teach you how to make your XGBoost models as performant as possible. You'll learn about the variety of parameters that can be adjusted to alter the behavior of XGBoost, and how to tune them efficiently so that you can supercharge the performance of your models. This post is a summary of the lecture "Extreme Gradient Boosting with XGBoost" from DataCamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
Let's start parameter tuning by seeing how the number of boosting rounds (the number of trees you build) impacts the out-of-sample performance of your XGBoost model. You'll use xgb.cv() inside a for loop and build one model per num_boost_round value.

Here, you'll continue working with the Ames housing dataset. The features are available in the array X, and the target vector is contained in y.
df = pd.read_csv('./dataset/ames_housing_trimmed_processed.csv')
X, y = df.iloc[:, :-1], df.iloc[:, -1]
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary for each tree: params
params = {"objective":"reg:squarederror", "max_depth":3}
# Create list of number of boosting rounds
num_rounds = [5, 10, 15]
# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []
# Iterate over num_rounds and build one model per num_boost_round parameter
for curr_num_rounds in num_rounds:
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                        num_boost_round=curr_num_rounds, metrics='rmse',
                        as_pandas=True, seed=123)
    # Append final round RMSE
    final_rmse_per_round.append(cv_results['test-rmse-mean'].tail().values[-1])
# Print the result DataFrame
num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))
print(pd.DataFrame(num_rounds_rmses, columns=['num_boosting_rounds', 'rmse']))
   num_boosting_rounds          rmse
0                    5  50903.299479
1                   10  34774.194010
2                   15  32895.097656
Now, instead of attempting to cherry-pick the best possible number of boosting rounds, you can very easily have XGBoost automatically select the number of boosting rounds for you within xgb.cv(). This is done using a technique called early stopping.

Early stopping works by testing the XGBoost model after every boosting round against a hold-out dataset and stopping the creation of additional boosting rounds (thereby finishing training of the model early) if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds. Here you will use the early_stopping_rounds parameter in xgb.cv() with a large possible number of boosting rounds (50). Bear in mind that if the hold-out metric continuously improves up through when num_boost_round is reached, then early stopping does not occur.
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary for each tree: params
params = {"objective":"reg:squarederror", "max_depth":4}
# Perform cross-validation with early-stopping: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, nfold=3, params=params, metrics="rmse",
                    early_stopping_rounds=10, num_boost_round=50, as_pandas=True, seed=123)
# Print cv_results
print(cv_results)
    train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0     141871.635417      403.636200   142640.651042     705.559164
1     103057.033854       73.769960   104907.664063     111.119966
2      75975.966146      253.726099    79262.054687     563.764349
3      57420.529948      521.658354    61620.136719    1087.694282
4      44552.955729      544.170190    50437.561198    1846.446330
5      35763.947917      681.797248    43035.660156    2034.472841
6      29861.464193      769.572153    38600.881510    2169.798065
7      25994.675781      756.519243    36071.817708    2109.795430
8      23306.836588      759.238254    34383.184896    1934.546688
9      21459.770833      745.624687    33509.141276    1887.375284
10     20148.720703      749.612103    32916.807292    1850.894702
11     19215.382162      641.387014    32197.832682    1734.456935
12     18627.389323      716.256596    31770.852214    1802.156241
13     17960.694661      557.043073    31482.781901    1779.124534
14     17559.736979      631.412969    31389.990234    1892.319927
15     17205.712891      590.171393    31302.881511    1955.166733
16     16876.572591      703.632148    31234.059896    1880.707172
17     16597.663086      703.677647    31318.348308    1828.860391
18     16330.460937      607.274494    31323.634766    1775.910706
19     16005.972656      520.471365    31204.135417    1739.076156
20     15814.300456      518.605216    31089.863281    1756.020773
21     15493.405599      505.616447    31047.996094    1624.673955
22     15270.734375      502.018453    31056.916015    1668.042812
23     15086.382161      503.912447    31024.984375    1548.985605
24     14917.607747      486.205730    30983.686198    1663.131107
25     14709.589518      449.668010    30989.477865    1686.668378
26     14457.286458      376.787206    30952.113281    1613.172049
27     14185.567383      383.102234    31066.901042    1648.534606
28     13934.067057      473.464991    31095.641927    1709.225000
29     13749.645182      473.671021    31103.887370    1778.880069
30     13549.836263      454.898488    30976.085938    1744.515079
31     13413.485026      399.603470    30938.469401    1746.054047
32     13275.916016      415.408786    30930.999349    1772.470428
33     13085.878255      493.792509    30929.056641    1765.541578
34     12947.181641      517.790106    30890.629557    1786.510976
35     12846.027344      547.732372    30884.492839    1769.730062
36     12702.378906      505.523126    30833.541667    1691.002487
37     12532.244140      508.298516    30856.688151    1771.445059
38     12384.055013      536.225042    30818.016927    1782.786053
39     12198.444010      545.165502    30839.392578    1847.326928
40     12054.583333      508.841412    30776.964844    1912.779587
41     11897.036784      477.177937    30794.702474    1919.674832
42     11756.221354      502.992395    30780.957031    1906.820066
43     11618.846680      519.837469    30783.753906    1951.260704
44     11484.080404      578.428621    30776.731771    1953.446772
45     11356.552734      565.368794    30758.542969    1947.454481
46     11193.557943      552.299272    30729.972005    1985.699338
47     11071.315104      604.089876    30732.663411    1966.999196
48     10950.778320      574.862779    30712.240885    1957.751118
49     10824.865885      576.665756    30720.853516    1950.511977
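Here the hold-out RMSE never stops improving for 10 consecutive rounds, so early stopping does not trigger and all 50 rounds are built. A small follow-up sketch (not part of the original exercise, and assuming the cv_results DataFrame from the cell above is still in scope) shows one way to pull out the best round and visualize the train/test curves:

# Identify the boosting round with the lowest mean hold-out RMSE
best_round = cv_results['test-rmse-mean'].idxmin()
print("Best round:", best_round,
      "with test RMSE:", cv_results.loc[best_round, 'test-rmse-mean'])

# Plot train vs. test RMSE to see where the curves flatten out
plt.plot(cv_results['train-rmse-mean'], label='train')
plt.plot(cv_results['test-rmse-mean'], label='test')
plt.xlabel('Boosting round')
plt.ylabel('RMSE')
plt.legend()
plt.show()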
It's time to practice tuning other XGBoost hyperparameters in earnest and observing their effect on model performance! You'll begin by tuning "eta", also known as the learning rate.

The learning rate in XGBoost is a parameter that can range between 0 and 1. It scales the contribution of each newly added tree: lower values of "eta" shrink the feature weights more after each boosting round, making the boosting process more conservative and typically requiring more boosting rounds to reach the same performance.
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:squarederror", "max_depth":3}
# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []
# Systematically vary the eta
for curr_val in eta_vals:
    params['eta'] = curr_val
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                        early_stopping_rounds=5, num_boost_round=10, metrics='rmse', seed=123,
                        as_pandas=True)
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])
# Print the result DataFrame
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=['eta', 'best_rmse']))
     eta      best_rmse
0  0.001  195736.401042
1  0.010  179932.187500
2  0.100   79759.408854
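With only 10 boosting rounds, the smaller learning rates look far worse largely because they have not had enough rounds to converge. As a hedged sketch (not part of the original exercise, and the variable name is illustrative), you could rerun the smallest eta with a much larger round budget and early stopping to make the comparison fairer:

# Re-run the smallest learning rate with more rounds and early stopping
params['eta'] = 0.001
cv_results_slow = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                         early_stopping_rounds=10, num_boost_round=500,
                         metrics='rmse', seed=123, as_pandas=True)
print(cv_results_slow['test-rmse-mean'].tail(1))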
In this exercise, your job is to tune max_depth, the parameter that dictates the maximum depth to which each tree in a boosting round can grow. Smaller values lead to shallower trees, and larger values to deeper trees.
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary
params = {"objective":"reg:squarederror"}
# Create list of max_depth values
max_depths = [2, 5, 10, 20]
best_rmse = []
for curr_val in max_depths:
    params['max_depth'] = curr_val
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                        early_stopping_rounds=5, num_boost_round=10, metrics='rmse', seed=123,
                        as_pandas=True)
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])
# Print the result DataFrame
print(pd.DataFrame(list(zip(max_depths, best_rmse)), columns=['max_depth', 'best_rmse']))
   max_depth     best_rmse
0          2  37957.468750
1          5  35596.599610
2         10  36065.546875
3         20  36739.576172
Now, it's time to tune "colsample_bytree". You've already seen something similar if you've ever worked with scikit-learn's RandomForestClassifier or RandomForestRegressor, where it is called max_features. The idea is the same: the parameter specifies the fraction of features to sample from. The main difference is that max_features samples features at every split, whereas colsample_bytree samples the columns once for each tree that is constructed. In xgboost, colsample_bytree must be specified as a float between 0 and 1.
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary
params = {"objective":"reg:squarederror", "max_depth":3}
# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1, 0.5, 0.8, 1]
best_rmse = []
# Systematically vary the hyperparameter value
for curr_val in colsample_bytree_vals:
    params['colsample_bytree'] = curr_val
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                        num_boost_round=10, early_stopping_rounds=5,
                        metrics="rmse", as_pandas=True, seed=123)
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)),
                   columns=["colsample_bytree", "best_rmse"]))
   colsample_bytree     best_rmse
0               0.1  48193.453125
1               0.5  36013.544922
2               0.8  35932.962891
3               1.0  35836.042968
There are several other individual parameters that you can tune, such as "subsample", which dictates the fraction of the training data that is used during any given boosting round. A hedged sketch of tuning it in the same way is shown below; after that, we move on to Grid Search and Random Search to tune XGBoost hyperparameters more efficiently!
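This sketch is not part of the original exercise; it simply reuses the cross-validation loop pattern from above, swapping in "subsample" values (the candidate values and variable names are illustrative).

# Reuse the same cross-validation loop, this time varying "subsample"
params = {"objective": "reg:squarederror", "max_depth": 3}
subsample_vals = [0.3, 0.6, 0.9]
best_rmse = []
for curr_val in subsample_vals:
    params['subsample'] = curr_val
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                        num_boost_round=10, early_stopping_rounds=5,
                        metrics="rmse", as_pandas=True, seed=123)
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
print(pd.DataFrame(list(zip(subsample_vals, best_rmse)),
                   columns=["subsample", "best_rmse"]))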
Now that you've learned how to tune parameters individually with XGBoost, let's take your parameter tuning to the next level by using scikit-learn's GridSearchCV and RandomizedSearchCV, which search over parameter combinations with internal cross-validation. Grid search exhaustively evaluates every combination of the parameter values you supply, across multiple parameters simultaneously, and keeps the best-performing model. Let's get to work, starting with GridSearchCV!
from sklearn.model_selection import GridSearchCV
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
'colsample_bytree': [0.3, 0.7],
'n_estimators': [50],
'max_depth': [2, 5]
}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()
# Perform grid search: grid_mse
grid_mse = GridSearchCV(param_grid=gbm_param_grid, estimator=gbm,
                        scoring='neg_mean_squared_error', cv=4, verbose=1)
# Fit grid_mse to the data
grid_mse.fit(X, y)
# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
Fitting 4 folds for each of 4 candidates, totalling 16 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best parameters found:  {'colsample_bytree': 0.7, 'max_depth': 5, 'n_estimators': 50}
Lowest RMSE found:  30744.105707685176
[Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished
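Once the search has finished, the refitted best model is available as grid_mse.best_estimator_ (GridSearchCV refits it on the full data by default). The following follow-up is a sketch, not part of the original exercise:

# Use the refitted best estimator to generate predictions
best_model = grid_mse.best_estimator_
preds = best_model.predict(X)
# Training-set RMSE of the tuned model (optimistic, since it saw all of X)
print("Training RMSE of best model: ", np.sqrt(np.mean((y - preds) ** 2)))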
Often, GridSearchCV can be really time consuming, so in practice you may want to use RandomizedSearchCV instead, as you will do in this exercise. The good news is that you only have to make a few modifications to your GridSearchCV code to run RandomizedSearchCV. The key difference is that you specify a param_distributions argument instead of param_grid, along with n_iter, the number of parameter settings that are sampled.
from sklearn.model_selection import RandomizedSearchCV
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
'n_estimators': [25],
'max_depth': range(2, 12)
}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(n_estimators=10)
# Perform random search: randomized_mse
randomized_mse = RandomizedSearchCV(param_distributions=gbm_param_grid, estimator=gbm,
                                    scoring='neg_mean_squared_error', n_iter=5, cv=4,
                                    verbose=1)
# Fit randomized_mse to the data
randomized_mse.fit(X, y)
# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))
Fitting 4 folds for each of 5 candidates, totalling 20 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best parameters found:  {'n_estimators': 25, 'max_depth': 4}
Lowest RMSE found:  29998.4522530019
[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.5s finished
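Because param_distributions also accepts scipy.stats distributions (anything with an rvs method), you can sample hyperparameters from continuous or integer ranges instead of fixed lists. The following is a hedged sketch; the specific ranges and names are illustrative, not from the original exercise:

from scipy.stats import randint, uniform

# Distributions are sampled by RandomizedSearchCV, one draw per candidate
gbm_param_dist = {
    'n_estimators': randint(20, 60),       # integers in [20, 60)
    'max_depth': randint(2, 12),           # integers in [2, 12)
    'learning_rate': uniform(0.01, 0.29),  # floats in [0.01, 0.30)
}
randomized_mse = RandomizedSearchCV(estimator=xgb.XGBRegressor(),
                                    param_distributions=gbm_param_dist,
                                    scoring='neg_mean_squared_error',
                                    n_iter=10, cv=4, random_state=123, verbose=1)
randomized_mse.fit(X, y)
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))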