Abstract. This notebook is an example of how to use hyperopt to tune XGBoost and CatBoost classifiers. A cross-validation object is also defined.
For this notebook, we will use the breast cancer dataset:
from sklearn.datasets import load_breast_cancer
import pandas as pd
data = load_breast_cancer()
X = data['data']
y = data['target']
To prepare for training and validation, we will split this dataset into training and test sets:
from sklearn.model_selection import StratifiedShuffleSplit
train_size = 0.75
n_splits = 1

sss = StratifiedShuffleSplit(n_splits=n_splits, train_size=train_size)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
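As a quick sanity check, the class proportions should come out roughly equal in the full set and in both splits:

import numpy as np

# Relative class frequencies; StratifiedShuffleSplit should keep these aligned
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    counts = np.bincount(labels)
    print(name, counts / counts.sum())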
XGBClassifier (X) and CatBoostClassifier (C)

Some basic parameters:
- learning_rate [X/C]: learning rate (alias: eta)
- max_depth [X/C]: maximum depth of trees
- n_estimators [X/C]: number of boosting iterations
- min_child_weight [X]: minimum sum of instance (hessian) weight needed in a child
- min_child_samples [C]: minimum number of samples in one leaf
- subsample [X/C]: subsample ratio of the training instances (note that for CatBoost this parameter can be used only if either the Poisson or Bernoulli bootstrap_type is selected)
- colsample_bytree [X]: subsample ratio of columns when building each tree
- colsample_bylevel [X/C]: subsample ratio of columns for each level in tree building
- colsample_bynode [X]: subsample ratio of columns for each node
- tree_method [X]: tree construction method
- boosting_type [C]: Ordered for ordered boosting or Plain for the classic scheme
- early_stopping_rounds [X/C]: parameter for fit(); stops training if one metric on a validation set does not improve in the last early_stopping_rounds rounds
- eval_metric [X/C]: evaluation metrics for validation data
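For concreteness, here is a minimal sketch (with arbitrary illustrative values, not tuned ones) of how some of these parameters are passed to the two constructors:

from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# [X] colsample_bytree and min_child_weight are XGBoost-specific
xgb_clf = XGBClassifier(learning_rate=0.1, max_depth=6, n_estimators=100,
                        subsample=0.8, colsample_bytree=0.8, min_child_weight=1)

# [C] subsample requires a compatible bootstrap_type (Poisson or Bernoulli)
cat_clf = CatBoostClassifier(learning_rate=0.1, max_depth=6, n_estimators=100,
                             subsample=0.8, bootstrap_type='Bernoulli', verbose=False)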
hyperopt

There are two options for the search algorithm:
hyperopt.tpe.suggest
hyperopt.rand.suggest
The search space is defined using uniform, log-uniform, and normal distributions, along with their quantized variants. For instance, hp.uniform('x', -1, 1) defines a search space with label x that will be sampled uniformly between -1 and 1. The expressions currently recognized by hyperopt's optimization algorithms are:
- hp.choice(label, options): index of an option
- hp.randint(label, upper): random integer within $[0, \text{upper})$
- hp.uniform(label, low, high): uniform value between low and high
- hp.quniform(label, low, high, q): round(uniform(low, high)/q)*q (note that the value returned is a float, not an integer)
- hp.loguniform(label, low, high): exp(uniform(low, high))
- hp.qloguniform(label, low, high, q): round(exp(uniform(low, high))/q)*q
- hp.normal(label, mu, sigma): sample from a normal distribution
- hp.qnormal(label, mu, sigma, q): round(normal(mu, sigma)/q)*q
- hp.lognormal(label, mu, sigma): exp(normal(mu, sigma))
- hp.qlognormal(label, mu, sigma, q): round(exp(normal(mu, sigma))/q)*q
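A quick way to see what these expressions produce is to draw samples directly; a minimal sketch using hyperopt's stochastic sampler:

from hyperopt import hp
from hyperopt.pyll.stochastic import sample

demo_space = {
    'a': hp.uniform('a', -1, 1),
    'b': hp.quniform('b', 0, 10, 2),          # rounded, but still a float
    'c': hp.choice('c', ['gbtree', 'dart'])   # sampling yields the value itself
}
print(sample(demo_space))  # e.g. {'a': 0.12, 'b': 6.0, 'c': 'dart'}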
XGBClassifier using hyperopt

The search space for the hyperparameters is defined by the following distributions:
from hyperopt import hp
space = {
    'max_depth': hp.randint('max_depth', 10),
    'learning_rate': hp.quniform('learning_rate', 0.01, 0.5, 0.01),
    'n_estimators': hp.randint('n_estimators', 250),
    'gamma': hp.quniform('gamma', 0, 0.50, 0.01),
    'min_child_weight': hp.quniform('min_child_weight', 1, 10, 1),
    'subsample': hp.quniform('subsample', 0.1, 1, 0.01),
    'colsample_bytree': hp.quniform('colsample_bytree', 0.1, 1.0, 0.01)
}
A first cost function is defined for the XGBClassifier; it is composed of three steps:
import warnings
import numpy as np
from hyperopt import STATUS_FAIL, STATUS_OK, Trials, fmin, tpe
from sklearn.metrics import f1_score, log_loss, roc_auc_score
from xgboost import XGBClassifier
def xgb_cost(space, X_train, y_train):
    warnings.filterwarnings(action='ignore')
    # 1. Instantiate estimator with selected hyperparameters and fit
    classifier = XGBClassifier(n_estimators=space['n_estimators'],
                               max_depth=int(space['max_depth']),
                               learning_rate=space['learning_rate'],
                               gamma=space['gamma'],
                               min_child_weight=space['min_child_weight'],
                               subsample=space['subsample'],
                               colsample_bytree=space['colsample_bytree'],
                               use_label_encoder=False,
                               eval_metric="logloss")
    classifier.fit(X_train, y_train)
    # 2. Evaluate the estimator according to the chosen performance metrics
    y_proba_train = classifier.predict_proba(X_train)[:, 1]
    y_class_train = classifier.predict(X_train)
    rocauc_train = roc_auc_score(y_train, y_proba_train)
    f1_train = f1_score(y_train, y_class_train)
    # Note: the log loss is computed on the hard class predictions here,
    # which heavily penalises any misclassification
    logloss_train = log_loss(y_train, y_class_train)
    # 3. Calculate the loss as the distance of each metric from its optimum
    loss = ((1.0 - rocauc_train)**2 + (1.0 - f1_train)**2 + logloss_train**2)
    return {'loss': np.sqrt(loss), 'status': STATUS_OK}
In the next step, the objective function is defined from the cost function and the minimization problem is solved. The output is the set of optimal hyperparameters found in the search space.
objective = lambda x: xgb_cost(x, X_train=X_train, y_train=y_train)
trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=100,
            trials=trials)
print("Optimal hyperparameters: ", best)
100%|██████████| 100/100 [00:08<00:00, 11.28trial/s, best loss: 9.992007221626413e-16]
Optimal hyperparameters: {'colsample_bytree': 0.68, 'gamma': 0.24, 'learning_rate': 0.2, 'max_depth': 3, 'min_child_weight': 1.0, 'n_estimators': 38, 'subsample': 0.9400000000000001}
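The returned dictionary can be fed back into the constructor to refit a final model and score it on the held-out test set; a minimal sketch (note that randint and quniform values come back as numbers that may need casting to int):

best_clf = XGBClassifier(n_estimators=int(best['n_estimators']),
                         max_depth=int(best['max_depth']),
                         learning_rate=best['learning_rate'],
                         gamma=best['gamma'],
                         min_child_weight=best['min_child_weight'],
                         subsample=best['subsample'],
                         colsample_bytree=best['colsample_bytree'],
                         use_label_encoder=False, eval_metric="logloss")
best_clf.fit(X_train, y_train)
print("Test ROC AUC:", roc_auc_score(y_test, best_clf.predict_proba(X_test)[:, 1]))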
hyperopt for Cross-Validation

Here, a cross-validation class is defined:
import sys
import traceback
from collections.abc import Iterable
from typing import Union
from hyperopt import STATUS_FAIL, STATUS_OK, Trials, fmin, hp, tpe
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import get_scorer

class BayesSearchCV:
    def __init__(self, estimator, param_distributions: dict, scoring: dict
                 ,n_iter: int=10, weights_matrix: np.ndarray=None, cv: Union[int, Iterable]=5, random_state: int=None
                 ,algo=tpe.suggest, trials: Trials=None) -> None:
        """Use Bayesian optimisation to search for hyperparameters and select the best estimator based on validation sets

        Args:
            estimator: Estimator object
            param_distributions (dict): Search space containing hyperparameters
            scoring (dict): {metric: opt_value} Dict of performance metrics to measure the estimator performance and their corresponding optimal values.
                Select from sklearn.metrics.SCORERS.keys()
            n_iter (int, optional): Max number of iterations. Defaults to 10.
            weights_matrix (np.ndarray, optional): Symmetric positive definite matrix used to calculate the quadratic loss function. Defaults to the identity matrix.
            cv (int or Iterable, optional): Number of folds or a cross-validation generator. Defaults to 5.
            random_state (int, optional): Pseudo random number generator state used for shuffling. Defaults to None.
            algo (optional): Search algorithm. Defaults to tpe.suggest.
            trials (Trials, optional): hyperopt database of trial results. Defaults to a fresh Trials() instance.
        """
        self.estimator = estimator
        self.param_distributions = param_distributions
        self.n_iter = n_iter
        self.random_state = random_state
        # `weights_matrix or np.identity(...)` would raise for arrays, hence the explicit None check
        self.weights_matrix = weights_matrix if weights_matrix is not None else np.identity(len(scoring))
        self.cv = cv
        self.algo = algo
        self.trials = trials if trials is not None else Trials()
        self.scoring = scoring

    def fit(self, X: pd.DataFrame, y=None) -> None:
        """Find optimal hyperparameters and fit estimator

        Args:
            X (pd.DataFrame): Predictors
            y (pd.DataFrame): Target
        """
        self.cv_results_ = pd.DataFrame()
        self.min_loss = np.inf
        if not self._check_spd(self.weights_matrix):
            msg = "Expected weights matrix to be symmetric positive definite."
            raise ValueError(msg)
        for iteration, (train_index, val_index) in enumerate(self._get_splits(X, y)):
            X_train, X_val = X[train_index], X[val_index]
            y_train, y_val = y[train_index], y[val_index]
            objective = lambda space: self._cost(X=X_train, y=y_train, hyperparameters=space)
            try:
                # Note: self.trials is shared across folds, so fmin resumes from the
                # already-recorded evaluations instead of starting from scratch.
                # For hp.choice dimensions, fmin returns option indices, not values;
                # hyperopt.space_eval(space, best) maps them back.
                hyperparameters = fmin(fn=objective, space=self.param_distributions
                                       ,algo=self.algo, max_evals=self.n_iter
                                       ,trials=self.trials)
            except KeyError:
                exc_info = sys.exc_info()
                traceback.print_exception(*exc_info)
                return {'status': STATUS_FAIL,
                        'exception': str(sys.exc_info())}
            estimator = self._instantiate_estimator(X_train, y_train, hyperparameters=hyperparameters)
            loss_df, current_loss = self._cost(X_val, y_val, hyperparameters, estimator=estimator, return_loss_df=True)
            loss_df["cv_iteration"] = iteration
            if current_loss < self.min_loss:
                self.min_loss = current_loss
                self.best_estimator_ = estimator
                self.best_hyperparameters_ = hyperparameters
            self.cv_results_ = pd.concat([self.cv_results_, loss_df.copy()])
        self.cv_results_.rename(columns={col: f"{col}_loss" for col in self.cv_results_.columns if col != "loss"}, inplace=True)
        self.cv_results_.sort_values(by="loss", inplace=True)
        self.cv_results_.reset_index(inplace=True, drop=True)

    def _get_splits(self, X: pd.DataFrame, y=None):
        """Instantiate and/or get training and validation datasets

        Args:
            X (pd.DataFrame): Predictor
            y (pd.DataFrame): Target

        Yields:
            Train and validation indices

        Raises:
            NotImplementedError: Only KFold and StratifiedKFold are implemented
        """
        if isinstance(self.cv, int):
            # sklearn only accepts random_state together with shuffle=True
            shuffle = self.random_state is not None
            self.cv = KFold(n_splits=self.cv, shuffle=shuffle, random_state=self.random_state)
        elif isinstance(self.cv, (KFold, StratifiedKFold)):
            pass
        else:
            msg = f"Cross validation not yet implemented for type {type(self.cv)}"
            raise NotImplementedError(msg)
        for train_index, test_index in self.cv.split(X, y):
            yield train_index, test_index

    def _cost(self, X: pd.DataFrame, y: pd.DataFrame, hyperparameters: dict
              ,estimator=None, return_loss_df: bool=False) -> dict:
        """Evaluate the cost function for the trained estimator using a quadratic loss function

        Args:
            X (pd.DataFrame): Predictor
            y (pd.DataFrame): Target
            hyperparameters (dict): Estimator hyperparameters
            estimator (optional): Already fitted estimator; if None, a new one is instantiated and fitted
            return_loss_df (bool): Also return the per-metric loss data frame

        Returns:
            (dict)
        """
        loss_dict = {metric_name: [] for metric_name in self.scoring}
        if estimator is None:
            estimator = self._instantiate_estimator(X, y, hyperparameters)
        for p_metric, opt_value in self.scoring.items():
            scorer = get_scorer(p_metric)
            loss_dict[p_metric].append((opt_value - scorer(estimator, X, y))**2)
        loss_df = pd.DataFrame.from_dict(loss_dict)
        # Quadratic form: squared metric deviations weighted by the SPD matrix
        loss = (loss_df.values @ self.weights_matrix @ loss_df.values.T).item()
        loss_df["loss"] = loss
        if return_loss_df:
            return loss_df, loss
        return {'loss': np.sqrt(loss), 'status': STATUS_OK}

    def _instantiate_estimator(self, X: pd.DataFrame, y: pd.DataFrame
                               ,hyperparameters: dict):
        """Instantiate estimator with selected hyperparameters and fit it

        Args:
            X (pd.DataFrame): Predictors
            y (pd.DataFrame): Target
            hyperparameters (dict): Estimator hyperparameters

        Returns:
            Fitted estimator
        """
        estimator_cls = self.estimator.__class__
        estimator = estimator_cls(**hyperparameters)
        estimator.fit(X, y)
        return estimator

    @staticmethod
    def _check_spd(m: np.ndarray, rtol: float=1e-6, atol: float=1e-9) -> bool:
        """Check if a matrix is symmetric positive definite

        Args:
            m (np.ndarray): Matrix
            rtol (float): Relative tolerance to verify if m is symmetric
            atol (float): Absolute tolerance to verify if m is symmetric

        Returns:
            (bool): True if matrix is symmetric positive definite
        """
        try:
            # Cholesky decomposition succeeds only for positive definite matrices
            np.linalg.cholesky(m)
        except np.linalg.LinAlgError as err:
            if 'Matrix is not positive definite' in str(err):
                return False
            raise
        else:
            # Now that m is positive definite, check if it is symmetric
            return np.allclose(m, m.T, rtol=rtol, atol=atol)
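None of the runs below use weights_matrix, so here is a minimal sketch (with arbitrary weights) of how a non-identity SPD matrix would emphasise one metric over another, reusing the search space defined earlier:

# Hypothetical weighting: deviations in roc_auc count twice as much as in f1.
# The diagonal order must match the insertion order of the scoring dict.
W = np.diag([2.0, 1.0])
cv_weighted = BayesSearchCV(XGBClassifier(use_label_encoder=False, eval_metric="logloss"),
                            param_distributions=space, scoring={"roc_auc": 1, "f1": 1},
                            weights_matrix=W, cv=StratifiedKFold(), n_iter=10)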
XGBClassifier

space = {
    'max_depth': hp.randint('max_depth', 10),
    'learning_rate': hp.quniform('learning_rate', 0.01, 0.5, 0.01),
    'n_estimators': hp.randint('n_estimators', 250),
    'gamma': hp.quniform('gamma', 0, 0.50, 0.01),
    'min_child_weight': hp.quniform('min_child_weight', 1, 10, 1),
    'subsample': hp.quniform('subsample', 0.1, 1, 0.01),
    'colsample_bytree': hp.quniform('colsample_bytree', 0.1, 1.0, 0.01),
    'eval_metric': "logloss"
}
cv = BayesSearchCV(XGBClassifier(learning_rate=0.1), param_distributions=space,
                   scoring={"roc_auc": 1, "f1": 1, "neg_log_loss": 0},
                   cv=StratifiedKFold(), n_iter=100)
cv.fit(X_train, y_train)
print("Loss value according to each performance metric:\n", cv.cv_results_)
print("Optimal hyperparameters: ", cv.best_hyperparameters_)
100%|██████████| 100/100 [00:09<00:00, 10.47trial/s, best loss: 5.735027908132748e-05]
[15:52:35] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Loss value according to each performance metric:
   roc_auc_loss   f1_loss  neg_log_loss_loss      loss  cv_iteration_loss
0      0.000005  0.000343           0.003406  0.003754                  0
1      0.000023  0.000084           0.004783  0.004890                  1
2      0.000013  0.000786           0.006792  0.007591                  3
3      0.000126  0.003328           0.020047  0.023501                  4
4      0.001561  0.001424           0.060831  0.063816                  2
Optimal hyperparameters: {'colsample_bytree': 0.74, 'gamma': 0.04, 'learning_rate': 0.2, 'max_depth': 7, 'min_child_weight': 1.0, 'n_estimators': 245, 'subsample': 0.8300000000000001}
CatBoostClassifier

from catboost import CatBoostClassifier
space = {
    'learning_rate': hp.choice('learning_rate', np.arange(0.05, 0.31, 0.05)),
    'max_depth': hp.choice('max_depth', np.arange(5, 16, 1, dtype=int)),
    'n_estimators': 100,
    'verbose': False
}
cv = BayesSearchCV(CatBoostClassifier(), param_distributions=space,
                   scoring={"roc_auc": 1, "f1": 1}, cv=StratifiedKFold())
cv.fit(X_train, y_train)
print("Loss value according to each performance metric:\n", cv.cv_results_)
print("Optimal hyperparameters: ", cv.best_hyperparameters_)
Loss value according to each performance metric:
   roc_auc_loss   f1_loss      loss  cv_iteration_loss
0      0.000301  0.001322  0.001624                  0
1      0.000184  0.004619  0.004803                  4
2      0.000660  0.005917  0.006577                  1
3      0.001252  0.006818  0.008069                  3
4      0.001956  0.009612  0.011567                  2
Optimal hyperparameters: {'learning_rate': 3, 'max_depth': 4}
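Note that the reported hyperparameters are indices, not values: for hp.choice dimensions, fmin returns the position of the selected option. hyperopt's space_eval maps them back to the actual values, as in this short sketch:

from hyperopt import space_eval

# Resolves hp.choice indices to values, e.g.
# {'learning_rate': 0.2, 'max_depth': 9, 'n_estimators': 100, 'verbose': False}
print(space_eval(space, cv.best_hyperparameters_))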