import numpy as np
from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target
The digits dataset of scikit-learn is the test set of the UCI optdigits dataset. Apparently, consecutive samples are more likely to stem from the same writer. Hence the samples are not independent and identically distributed (iid): samples with the same writing style grouped together effectively introduce a dependency. Unfortunately, the exact per-sample authorship metadata has not been kept in the optdigits dataset.
This is highlighted by the fact that shuffling the data significantly affects the test score estimated by K-Fold cross-validation. Let us build a model with non-optimal parameters to demonstrate the impact of dependent samples:
from sklearn.svm import SVC
model = SVC(C=10, gamma=0.005)
from sklearn.cross_validation import cross_val_score
def print_cv_score_summary(model, X, y, cv):
    scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
    print("mean: {:3f}, stdev: {:3f}".format(
        np.mean(scores), np.std(scores)))
KFold does not shuffle the data by default, hence it takes the dependency structure of the dataset into account for a small number of folds such as k=5:
from sklearn.cross_validation import KFold
cv = KFold(len(y), 5)
print_cv_score_summary(model, X, y, cv)
mean: 0.901543, stdev: 0.037016
If we shuffle the data, the estimated test score is much higher: shuffling hides the dependency structure from the cross-validation procedure, so we can no longer detect the overfitting caused by the writer-specific styles:
cv = KFold(len(y), 5, shuffle=True, random_state=0)
print_cv_score_summary(model, X, y, cv)
mean: 0.968836, stdev: 0.007350
cv = KFold(len(y), 5, shuffle=True, random_state=1)
print_cv_score_summary(model, X, y, cv)
mean: 0.967725, stdev: 0.004847
cv = KFold(len(y), 5, shuffle=True, random_state=2)
print_cv_score_summary(model, X, y, cv)
mean: 0.966622, stdev: 0.010217
There is an almost 7% discrepancy between the estimated scores, probably caused by the dependency between samples.
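The sklearn.cross_validation module used above was removed in later scikit-learn releases (0.20) in favour of sklearn.model_selection, where KFold takes n_splits and the data is passed to cross_val_score rather than to the CV object. A sketch of the same unshuffled-vs-shuffled comparison with the modern API (exact scores will differ slightly from the numbers above):

```python
# Same experiment with the modern scikit-learn API: compare K-Fold CV
# scores without and with shuffling on the digits dataset.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score

digits = load_digits()
X, y = digits.data, digits.target
model = SVC(C=10, gamma=0.005)  # same deliberately non-optimal parameters

results = {}
for shuffle in (False, True):
    cv = KFold(n_splits=5, shuffle=shuffle,
               random_state=0 if shuffle else None)
    scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
    results[shuffle] = scores.mean()
    print("shuffle={}: mean={:.3f}, stdev={:.3f}".format(
        shuffle, scores.mean(), scores.std()))
```

The unshuffled mean should again come out noticeably lower than the shuffled one, reproducing the discrepancy discussed above.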
Those shuffled KFold cv scores are in line with those of an equivalent ShuffleSplit:
from sklearn.cross_validation import ShuffleSplit
cv = ShuffleSplit(len(y), n_iter=5, test_size=0.2, random_state=0)
print_cv_score_summary(model, X, y, cv)
mean: 0.971667, stdev: 0.007115
cv = ShuffleSplit(len(y), n_iter=5, test_size=0.2, random_state=1)
print_cv_score_summary(model, X, y, cv)
mean: 0.973333, stdev: 0.003333
cv = ShuffleSplit(len(y), n_iter=5, test_size=0.2, random_state=2)
print_cv_score_summary(model, X, y, cv)
mean: 0.958333, stdev: 0.008784
Note that StratifiedKFold sorts the samples by class prior to computing the folds, hence it breaks the dependency too (at least in scikit-learn 0.14):
from sklearn.cross_validation import StratifiedKFold
cv = StratifiedKFold(y, 5)
print_cv_score_summary(model, X, y, cv)
mean: 0.969404, stdev: 0.010674
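Had the per-writer metadata been preserved, the principled fix would be to keep all samples from a given writer in the same fold. Modern scikit-learn exposes this as GroupKFold in sklearn.model_selection; the sketch below uses synthetic writer ids, since the real optdigits authorship labels are lost:

```python
# Group-aware CV sketch: GroupKFold guarantees that no group (here, a fake
# "writer" id) contributes samples to both the training and the test side
# of a split. The writer ids are synthetic -- real authorship was not kept.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.RandomState(0)
n_samples, n_writers = 120, 12
groups = np.repeat(np.arange(n_writers), n_samples // n_writers)
X = rng.randn(n_samples, 8)
y = rng.randint(0, 10, n_samples)

n_splits_checked = 0
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # no writer appears on both sides of the split
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
    n_splits_checked += 1
print("checked", n_splits_checked, "group-aware splits")
```

With such grouped splits the test folds contain only unseen writers, so the estimated score reflects generalization to new writing styles rather than memorization of known ones.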