This notebook demonstrates a stacking machine learning technique called folding, which physicists use in their analyses.
%pylab inline
Populating the interactive namespace from numpy and matplotlib
!cd toy_datasets; wget -O MiniBooNE_PID.txt -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt
File `MiniBooNE_PID.txt' already there; not retrieving.
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score
# the first row of the file stores the numbers of signal and background events, so skip it
data = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep='\s+', skiprows=[0], header=None, engine='python')
labels = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)
# the first count is the number of signal events, the second the number of background events
labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]
data.columns = ['feature_{}'.format(key) for key in data.columns]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5)
variables = list(data.columns)
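As a quick sanity check (an addition to the notebook), we can verify that the number of labels matches the number of rows:
# expect one label per data row
print(data.shape, len(labels))
print('signal fraction:', numpy.mean(labels))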
FoldingClassifier implements the same interface as all other classifiers, with one important difference: it trains n_folds copies of the base estimator, and each training event is later predicted by the copy that did not see it during training (a k-fold scheme).
from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier
from rep.metaml import FoldingClassifier
n_folds = 4
folder = FoldingClassifier(GradientBoostingClassifier(), n_folds=n_folds, features=variables)
folder.fit(train_data, train_labels)
FoldingClassifier(base_estimator=GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance', max_depth=3, max_features=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, random_state=None, subsample=1.0, verbose=0, warm_start=False), features=['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'featu...', 'feature_43', 'feature_44', 'feature_45', 'feature_46', 'feature_47', 'feature_48', 'feature_49'], ipc_profile=None, n_folds=4, random_state=None)
folder.predict_proba(train_data)
KFold prediction using folds column
array([[ 0.96744155,  0.03255845],
       [ 0.9254746 ,  0.0745254 ],
       [ 0.98048878,  0.01951122],
       ...,
       [ 0.98744345,  0.01255655],
       [ 0.99227938,  0.00772062],
       [ 0.02780023,  0.97219977]])
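Because each event is predicted by the classifier that did not see it during training, these probabilities are unbiased; as a small addition we can estimate the out-of-fold quality with the roc_auc_score imported earlier:
# out-of-fold ROC AUC on the training set (not overfitted)
print('train ROC AUC:', roc_auc_score(train_labels, folder.predict_proba(train_data)[:, 1]))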
When predicting new data, all fold classifiers are applied and their results are combined with a vote_function:
# definition of mean function, which combines all predictions
def mean_vote(x):
return numpy.mean(x, axis=0)
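Any function that reduces the stacked fold predictions along the first axis (as mean_vote above assumes) will work; for instance, a hypothetical median vote, not part of the original notebook:
# hypothetical alternative: median over the folds instead of the mean
def median_vote(x):
    return numpy.median(x, axis=0)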
folder.predict_proba(test_data, vote_function=mean_vote)
Using voting KFold prediction
array([[ 0.14722266,  0.85277734],
       [ 0.95026913,  0.04973087],
       [ 0.99504987,  0.00495013],
       ...,
       [ 0.17173721,  0.82826279],
       [ 0.94682241,  0.05317759],
       [ 0.07535153,  0.92464847]])
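As an added check, the voted test predictions can be scored directly:
# ROC AUC of the mean-vote prediction on the held-out test set
proba = folder.predict_proba(test_data, vote_function=mean_vote)
print('test ROC AUC:', roc_auc_score(test_labels, proba[:, 1]))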
Again we use the ClassificationReport class to compare different results. For a folding classifier, this report uses only the default prediction.
from rep.data.storage import LabeledDataStorage
from rep.report import ClassificationReport
# add folds_column to dataset to use mask
train_data["FOLDS"] = folder._get_folds_column(len(train_data))
lds = LabeledDataStorage(train_data, train_labels)
report = ClassificationReport({'folding': folder}, lds)
KFold prediction using folds column
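A one-line check (an addition) shows that the folds column splits the training set into roughly equal parts:
# each fold should contain about len(train_data) / n_folds events
print(train_data['FOLDS'].value_counts())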
Use the mask parameter to plot the distribution for a specific fold:
for fold_num in range(n_folds):
    report.prediction_pdf(mask="FOLDS == %d" % fold_num, labels_dict={1: 'sig fold %d' % fold_num}).plot()
for fold_num in range(n_folds):
    report.prediction_pdf(mask="FOLDS == %d" % fold_num, labels_dict={0: 'bck fold %d' % fold_num}).plot()
for fold_num in range(n_folds):
    report.roc(mask="FOLDS == %d" % fold_num).plot()
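To attach numbers to the per-fold ROC curves, a short sketch (an addition, reusing the FOLDS column added above) computes the AUC fold by fold:
# per-fold out-of-fold ROC AUC
proba = folder.predict_proba(train_data)
labels_array = numpy.array(train_labels)
for fold_num in range(n_folds):
    fold_mask = (train_data['FOLDS'] == fold_num).values
    print('fold %d ROC AUC: %.4f' % (fold_num, roc_auc_score(labels_array[fold_mask], proba[fold_mask, 1])))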
NOTE: here vote_function is None, so the default prediction is used (each test event is predicted by a single fold classifier).
lds = LabeledDataStorage(test_data, test_labels)
report = ClassificationReport({'folding': folder}, lds)
KFold prediction using folds column
report.prediction_pdf().plot(new_plot=True, figsize=(9, 4))
report.roc().plot(xlim=(0.5, 1))
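Finally, as an addition, we can compare the default test prediction (one randomly assigned fold classifier per event, as the note above explains) with the voted one:
# default vs. voted prediction on the test set
default_proba = folder.predict_proba(test_data)
voted_proba = folder.predict_proba(test_data, vote_function=mean_vote)
print('default   ROC AUC:', roc_auc_score(test_labels, default_proba[:, 1]))
print('mean-vote ROC AUC:', roc_auc_score(test_labels, voted_proba[:, 1]))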