## Model evaluation via cross validation (CV)

We want to develop a model that performs well in the real world. To do that,

  1. Create a holdout sample.
  2. "Train" many models on a portion of the training sample, and use each to predict $y$ on the remaining/unused part of the training sample.
  3. Repeat step 2 $k$ times, to see how consistently accurate each model is.
  4. Pick your preferred model.
  5. One time only: Apply your preferred model to the holdout sample to see if it performs as well.
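Step 1 can be sketched in one line with scikit-learn. Here is a minimal illustration; the toy `X`, `y`, and the 20% holdout size are placeholder assumptions, not prescriptions:

```python
# Minimal sketch of step 1: carve off a holdout sample.
# X and y are toy data; the 20% split size is an illustrative choice.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # toy features (50 observations)
y = np.arange(50)                   # toy target

# Set aside 20% as the holdout sample; don't touch it until step 5.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Everything in steps 2-4 happens inside `X_train`/`y_train`; the holdout pair sits untouched until the very end.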

### Holdout sample

So we need a way to estimate the test error. We do that by creating a holdout sample (step \#2 in the machine learning workflow), which we use to test the model at step \#6, after our model development process is complete.

```{admonition} YOU CAN ONLY USE THE HOLDOUT SAMPLE ONCE!
:class: warning

If you use the holdout during the iterative training/evaluation process (steps 3-5), it stops being a holdout sample and effectively becomes part of the training set.
```

After we create the holdout sample in step \#2, we enter the grey box where all the ML magic happens, and pre-process the data (step \#3).

### Training sample

So, we are at step \#3. We have the holdout sample isolated on the side, and will develop our model on the training sample. 

If we fit the model on the training sample, and then examine its performance against the same sample, the error will be misleadingly low. (Duh! We fit the model on it!) 

```{admonition} A ha!

It would be nice to have an extra test set, wouldn't it?! 

Great idea: So, at step \#3, we split our training sample again, into a _**smaller training sample**_ and a _**validation sample**_. 
- We fit our model on the smaller training sample.
- We make predictions in the validation sample using the fitted model, and measure the "accuracy" of those predictions.
```

### Train-validate-holdout: How big should they be?

It depends. A common choice is 70%-15%-15% (train-validate-holdout).
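One way to produce that 70%-15%-15% split is two calls to `train_test_split`. A minimal sketch with toy data (the sizes match the common split above; the data itself is a placeholder):

```python
# Two-stage split into 70% train / 15% validation / 15% holdout.
# Toy data: 100 observations, so the target sizes are 70/15/15.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First, carve off the 15% holdout sample...
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.15, random_state=0
)
# ...then split the remaining 85 observations so that 15 of them
# (15% of the *original* data) become the validation sample.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=15, random_state=0
)
```

Note the second call uses an absolute count (`test_size=15`) rather than a fraction, since 15% of the original sample is a different fraction of the remaining 85%.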

### K-Fold Cross-Validation

Problem: 70% and 15% might not be enough data to train and evaluate your candidate models reliably. What if your 15% validation sample just happens, by chance, to work very well with your model in a way that doesn't replicate beyond it?

```{admonition} A ha!

It would be nice to have more validation samples, wouldn't it?!

Great idea: K-Fold Cross-Validation.

```

K-Fold Cross-Validation: Take the training data (85% of the original sample) and divide it into $k$ equally sized "folds." In each of $k$ rounds, hold out one fold as the validation sample, fit the model on the remaining $k-1$ folds, and compute the validation error on the held-out fold. Each fold serves as the validation sample exactly once, so you end up with $k$ validation errors to average.
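A sketch of k-fold CV "by hand" with scikit-learn's `KFold` splitter, so the $k$ train/validation splits are explicit. The simulated data and the linear model are illustrative assumptions:

```python
# K-fold CV done manually: 5 train/validation splits of the same data.
# The data-generating process and model choice are placeholders.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(85, 2))              # e.g., the 85% training data
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=85)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
val_scores = []
for train_idx, val_idx in kf.split(X):
    # Fit on k-1 folds, score on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    val_scores.append(model.score(X[val_idx], y[val_idx]))  # R^2 per fold

avg_val_score = np.mean(val_scores)       # the model's average validation score
```

In practice, `sklearn.model_selection.cross_val_score` wraps this loop in one call; the explicit version just makes the $k$ splits visible.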

Here, it looks like this:

(Figure: k-fold cross-validation splits, from DS100)

### Summary of CV

  1. Separate out part of the dataset as a "holdout" or "test" set. The rest of the data is the "training set".
  2. Work through steps #3-#5 of the flowchart:
    1. Split the "training set" up into a smaller "training" set and a "validation set".
    2. Estimate the model on the smaller training set. The model's error on this set is the "training error".
    3. Apply the model to the validation set to calculate the "validation error".
    4. Repeat these "training-validation" steps $k$ times and average the $k$ validation errors.
    5. Repeat these steps for each candidate model you are considering.
    6. The model with the lowest average validation error is your chosen model.
  3. Test the chosen model against the holdout test sample and compute your final "test error". You are done, and forbidden from tweaking the model now.
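The whole recipe above can be sketched end-to-end with scikit-learn. Everything here is an illustrative assumption: the simulated data, the two candidate models, $k=5$, and R² as the accuracy measure:

```python
# End-to-end sketch: holdout split, k-fold model selection, final test.
# Data, candidate models, and k=5 are illustrative choices.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Step 1: separate out the holdout/test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=1
)

# Step 2 (a-e): average validation score (R^2) for each candidate via k-fold CV.
candidates = {
    "ols": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=1),
}
avg_val = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}

# Step 2f: the candidate with the best average validation score wins.
best = max(avg_val, key=avg_val.get)

# Step 3: one time only - refit on the full training set, score the holdout.
final_model = candidates[best].fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
```

Since the simulated data are linear, OLS should win the validation contest here; with real data, the ranking is exactly what the CV loop is there to discover.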