This notebook demonstrates several additional tools for optimizing classification models provided by the Reproducible experiment platform (REP) package:
grid search for the best classifier hyperparameters
different optimization algorithms
different scoring models (optimization of an arbitrary figure of merit)
%pylab inline
Populating the interactive namespace from numpy and matplotlib
!cd toy_datasets; wget -O magic04.data -nc https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data
File `magic04.data' already there; not retrieving.
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score
columns = ['fLength', 'fWidth', 'fSize', 'fConc', 'fConc1', 'fAsym', 'fM3Long', 'fM3Trans', 'fAlpha', 'fDist', 'g']
data = pandas.read_csv('toy_datasets/magic04.data', names=columns)
labels = numpy.array(data['g'] == 'g', dtype=int)
data = data.drop('g', axis=1)
train_data, test_data, train_labels, test_labels = train_test_split(data, labels)
list(data.columns)
['fLength', 'fWidth', 'fSize', 'fConc', 'fConc1', 'fAsym', 'fM3Long', 'fM3Trans', 'fAlpha', 'fDist']
features = list(set(columns) - {'g'})
In the Higgs challenge the aim is to maximize the AMS metric.
To measure the quality one should choose not only a classifier, but also an optimal threshold, at which the maximal value of AMS is achieved.
Such metrics (which require a threshold) are called threshold-based.
rep.report.metrics contains the class OptimalMetric, which computes the maximal value of a threshold-based metric (and may itself be used as a metric).
Use this class to generate the metric and pass it to grid search.
First we define the AMS figure of merit; metrics.OptimalMetric then generates the corresponding threshold-optimized metric.
from rep.report import metrics
def AMS(s, b, s_norm=sum(test_labels == 1), b_norm=6*sum(test_labels == 0)):
    # s_norm and b_norm rescale the signal and background contributions;
    # the constant 10. regularizes the denominator when the background is small
    return s * s_norm / numpy.sqrt(b * b_norm + 10.)
optimal_AMS = metrics.OptimalMetric(AMS)
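For reference, the AMS used in the Higgs ML challenge is defined as AMS = sqrt(2 * ((s + b + b_r) * ln(1 + s / (b + b_r)) - s)) with a regularization term b_r = 10; the AMS function above is a simplified figure of merit in the same spirit, rescaling s and b by (multiples of) the test-set class counts and regularizing the denominator.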
sum(test_labels == 1), sum(test_labels == 0)
(3071, 1684)
Random predictions for signal and background are used here to illustrate the metric:
probs_rand = numpy.ndarray((1000, 2))
probs_rand[:, 1] = numpy.random.random(1000)
probs_rand[:, 0] = 1 - probs_rand[:, 1]
labels_rand = numpy.random.randint(0, high=2, size=1000)
optimal_AMS.plot_vs_cut(labels_rand, probs_rand)
Optimal cut=0.0096, quality=30.5758
optimal_AMS(labels_rand, probs_rand)
30.575776003303893
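As an illustrative sanity check (a sketch using only the objects defined above), perfectly separating probabilities should give a much higher optimal quality than the random predictions:
# probabilities that separate the two classes perfectly
probs_ideal = numpy.zeros((len(test_labels), 2))
probs_ideal[:, 1] = test_labels
probs_ideal[:, 0] = 1 - test_labels
optimal_AMS(test_labels, probs_ideal)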
AbstractParameterGenerator is an abstract class that generates new points at which the scorer function will be computed. It is used in grid search to obtain the next set of parameters for training a classifier.
Properties:
- best_params_ - return the best grid point
- best_score_ - return the best quality
- print_results(self, reorder=True) - print all points with corresponding quality
The following algorithms inherit from AbstractParameterGenerator:
- RandomParameterOptimizer - generates random points in the parameter space
- RegressionParameterOptimizer - generates the next point using a regression algorithm trained on previous results
- SubgridParameterOptimizer - uses subgrids if the grid is huge, plus an annealing-like technique (see the REP documentation for details)
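For example, RegressionParameterOptimizer can be used as a drop-in replacement for the RandomParameterOptimizer used below; a minimal sketch, assuming it accepts the same parameter-grid argument as the other generators:
from collections import OrderedDict
from rep.metaml.gridsearch import RegressionParameterOptimizer

# same kind of grid as used with RandomParameterOptimizer below
grid_param = OrderedDict()
grid_param['n_estimators'] = [30, 50]
grid_param['learning_rate'] = [0.2, 0.1, 0.05]

# proposes new points with a regression model trained on previously evaluated points
generator = RegressionParameterOptimizer(grid_param)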
GridOptimalSearchCV implements an optimal search over specified parameter values for an estimator. Its parameters are:
- estimator - object of a type that implements the "fit" and "predict" methods
- params_generator - parameter generator for the grid search algorithm (an AbstractParameterGenerator)
- scorer - object which implements a call method with kwargs: "base_estimator", "params", "X", "y", "sample_weight"
Important members are "fit" and "fit_best_estimator".
from rep.metaml import GridOptimalSearchCV
from rep.metaml.gridsearch import RandomParameterOptimizer, FoldingScorer
from rep.estimators import SklearnClassifier
from sklearn.ensemble import AdaBoostClassifier
from collections import OrderedDict
FoldingScorer provides k-fold cross-validation on the training dataset: the data is split into k folds, and the classifier is trained on k-1 folds and tested on the remaining fold. NOTE: if fold_checks > 1, the quality is averaged over that many test folds.
# define grid parameters
grid_param = OrderedDict()
grid_param['n_estimators'] = [30, 50]
grid_param['learning_rate'] = [0.2, 0.1, 0.05]
# use random hyperparameter optimization algorithm
generator = RandomParameterOptimizer(grid_param)
# define folding scorer
scorer = FoldingScorer(optimal_AMS, folds=6, fold_checks=4)
grid_sk = GridOptimalSearchCV(SklearnClassifier(AdaBoostClassifier(), features=features), generator, scorer)
grid_sk.fit(data, labels)
<rep.metaml.gridsearch.GridOptimalSearchCV at 0x10c4e7f50>
grid_sk.generator.best_params_
OrderedDict([('n_estimators', 50), ('learning_rate', 0.2)])
grid_sk.generator.print_results()
58.674: n_estimators=50, learning_rate=0.2
56.484: n_estimators=30, learning_rate=0.2
56.100: n_estimators=50, learning_rate=0.1
53.704: n_estimators=30, learning_rate=0.1
53.278: n_estimators=50, learning_rate=0.05
53.091: n_estimators=30, learning_rate=0.05
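The best quality itself is available via the best_score_ property listed above; it should match the top line of print_results():
grid_sk.generator.best_score_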
You can easily define your own scorer with custom logic. A scorer just has to be a callable that accepts base_estimator, params, X, y and sample_weight:
from sklearn import clone
def generate_scorer(test, test_labels, test_weight=None):
    """ Generate a scorer which calculates the metric on a fixed test dataset """
    def custom(base_estimator, params, X, y, sample_weight=None):
        cl = clone(base_estimator)
        cl.set_params(**params)
        cl.fit(X, y)
        # evaluate the optimal AMS on the fixed test dataset (with its own weights, if any)
        res = optimal_AMS(test_labels, cl.predict_proba(test), test_weight)
        return res
    return custom
# define grid parameters
grid_param = OrderedDict()
grid_param['n_estimators'] = [30, 50]
grid_param['learning_rate'] = [0.2, 0.1, 0.05]
grid_param['features'] = [features[:5], features[:8]]
# define random hyperparameter optimization algorithm
generator = RandomParameterOptimizer(grid_param)
# define specific scorer
scorer = generate_scorer(test_data, test_labels)
grid = GridOptimalSearchCV(SklearnClassifier(clf=AdaBoostClassifier(), features=features), generator, scorer)
grid.fit(train_data, train_labels)
<rep.metaml.gridsearch.GridOptimalSearchCV at 0x111badf90>
len(train_data), len(test_data)
(14265, 4755)
grid.generator.print_results()
60.269: n_estimators=50, learning_rate=0.2, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
57.120: n_estimators=30, learning_rate=0.2, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
56.238: n_estimators=50, learning_rate=0.1, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
55.923: n_estimators=50, learning_rate=0.05, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
55.677: n_estimators=30, learning_rate=0.1, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
55.221: n_estimators=30, learning_rate=0.05, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
37.438: n_estimators=30, learning_rate=0.2, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long']
37.281: n_estimators=50, learning_rate=0.05, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long']
37.218: n_estimators=50, learning_rate=0.1, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long']
37.063: n_estimators=30, learning_rate=0.05, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long']
from rep.report import ClassificationReport
from rep.data.storage import LabeledDataStorage
lds = LabeledDataStorage(test_data, test_labels)
classifiers = {'grid_fold': grid_sk.fit_best_estimator(train_data[features], train_labels),
'grid_test_dataset': grid.fit_best_estimator(train_data[features], train_labels) }
report = ClassificationReport(classifiers, lds)
report.roc().plot()
report.metrics_vs_cut(AMS, metric_label='AMS').plot()
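Optionally, as a sketch using only objects defined above, the two tuned classifiers can also be compared numerically on the held-out test set with the ROC AUC imported at the beginning and the same optimal AMS metric:
# compare the tuned classifiers on the held-out test dataset
for name, estimator in classifiers.items():
    proba = estimator.predict_proba(test_data)
    print('%s: ROC AUC = %.4f, optimal AMS = %.3f' % (
        name, roc_auc_score(test_labels, proba[:, 1]), optimal_AMS(test_labels, proba)))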