import numpy as np
from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target
The digits dataset of scikit-learn is the test set of the UCI optdigits dataset. Apparently, consecutive samples are more likely to stem from the same writer. Hence the samples are not independent and identically distributed (iid): samples with the same writing style grouped together effectively introduce a dependency. Unfortunately, the exact per-sample authorship metadata has not been kept in the optdigits dataset.
This is highlighted by the fact that shuffling the data significantly affects the test score estimated by K-Fold cross-validation. Let us build a model with non-optimal parameters to demonstrate the impact of dependent samples:
from sklearn.svm import SVC
model = SVC(C=10, gamma=0.005)
from sklearn.cross_validation import cross_val_score
def print_cv_score_summary(model, X, y, cv):
    scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
    print("mean: {:3f}, stdev: {:3f}".format(
        np.mean(scores), np.std(scores)))
KFold does not shuffle the data by default, hence it takes the dependency structure of the dataset into account for a small number of folds such as k=5:
from sklearn.cross_validation import KFold
cv = KFold(len(y), 5)
print_cv_score_summary(model, X, y, cv)
mean: 0.901543, stdev: 0.037016
If we shuffle the data, the estimated test score is much higher: shuffling hides the dependency structure from the cross-validation procedure, so we can no longer detect the overfitting caused by the writer-specific styles:
cv = KFold(len(y), 5, shuffle=True, random_state=0)
print_cv_score_summary(model, X, y, cv)
mean: 0.968836, stdev: 0.007350
cv = KFold(len(y), 5, shuffle=True, random_state=1)
print_cv_score_summary(model, X, y, cv)
mean: 0.967725, stdev: 0.004847
cv = KFold(len(y), 5, shuffle=True, random_state=2)
print_cv_score_summary(model, X, y, cv)
mean: 0.966622, stdev: 0.010217
There is an almost 7% discrepancy between the estimated scores, probably caused by the dependency between samples.
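The sklearn.cross_validation module used above was removed in later scikit-learn releases (0.20) in favour of sklearn.model_selection, where KFold takes n_splits and the data is passed to cross_val_score rather than to the CV object. A sketch of the same unshuffled-vs-shuffled comparison with the modern API (exact scores will differ slightly from the numbers above):

```python
# Same experiment with the modern scikit-learn API: compare K-Fold CV
# scores without and with shuffling on the digits dataset.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score

digits = load_digits()
X, y = digits.data, digits.target
model = SVC(C=10, gamma=0.005)  # same deliberately non-optimal parameters

results = {}
for shuffle in (False, True):
    cv = KFold(n_splits=5, shuffle=shuffle,
               random_state=0 if shuffle else None)
    scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
    results[shuffle] = scores.mean()
    print("shuffle={}: mean={:.3f}, stdev={:.3f}".format(
        shuffle, scores.mean(), scores.std()))
```

The unshuffled mean should again come out noticeably lower than the shuffled one, reproducing the discrepancy discussed above.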
Those shuffled KFold cv scores are in line with those of an equivalent ShuffleSplit:
from sklearn.cross_validation import ShuffleSplit
cv = ShuffleSplit(len(y), n_iter=5, test_size=0.2, random_state=0)
print_cv_score_summary(model, X, y, cv)
mean: 0.971667, stdev: 0.007115
cv = ShuffleSplit(len(y), n_iter=5, test_size=0.2, random_state=1)
print_cv_score_summary(model, X, y, cv)
mean: 0.973333, stdev: 0.003333
cv = ShuffleSplit(len(y), n_iter=5, test_size=0.2, random_state=2)
print_cv_score_summary(model, X, y, cv)
mean: 0.958333, stdev: 0.008784
Note that StratifiedKFold sorts the samples by class prior to computing the folds, hence it breaks the dependency too (at least in scikit-learn 0.14):
from sklearn.cross_validation import StratifiedKFold
cv = StratifiedKFold(y, 5)
print_cv_score_summary(model, X, y, cv)
mean: 0.969404, stdev: 0.010674
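Had the per-writer metadata been preserved, the principled fix would be to keep all samples from a given writer in the same fold. Modern scikit-learn exposes this as GroupKFold in sklearn.model_selection; the sketch below uses synthetic writer ids, since the real optdigits authorship labels are lost:

```python
# Group-aware CV sketch: GroupKFold guarantees that no group (here, a fake
# "writer" id) contributes samples to both the training and the test side
# of a split. The writer ids are synthetic -- real authorship was not kept.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.RandomState(0)
n_samples, n_writers = 120, 12
groups = np.repeat(np.arange(n_writers), n_samples // n_writers)
X = rng.randn(n_samples, 8)
y = rng.randint(0, 10, n_samples)

n_splits_checked = 0
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # no writer appears on both sides of the split
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
    n_splits_checked += 1
print("checked", n_splits_checked, "group-aware splits")
```

With such grouped splits the test folds contain only unseen writers, so the estimated score reflects generalization to new writing styles rather than memorization of known ones.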