We want to develop a model that performs well in the real world. To do that, we need a way to estimate the test error. We do that by creating a holdout sample (step #2 in the machine learning workflow) and using it to test the model at step #6, after our model development process is completed.
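In code, creating the holdout might look like this minimal sketch with scikit-learn's `train_test_split` (the toy data and the 15% holdout size are just illustrative choices):

```python
# A sketch of step #2: carve off a holdout sample and don't touch it until step #6.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# toy data standing in for your real features (X) and outcome (y)
X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=0)

# set aside 15% of the data as the holdout; random_state makes the split reproducible
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.15, random_state=0
)
```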
```{warning}
**If you use the holdout during the iterative training/evaluation process (steps 3-5), it stops being a holdout sample and effectively becomes part of the training set.**
```
After we create the holdout sample in step #2, we enter the grey box where all the ML magic happens, and pre-process the data (step #3).
So, we are at step #3. We have the holdout sample isolated and will develop our model on the training sample.
If we fit the model on the training sample and then examine its performance against the same sample, the error will be misleadingly low. (Duh! We fit the model on it!)
```{tip}
It would be nice to have an extra test set, wouldn't it?!
```
Great idea: At step #3, we split our training sample again, into a _**smaller training sample**_ and a _**validation sample**_:
- We estimate and fit our model on the training sample
- We make predictions in the validation sample using our fitted model and measure the "accuracy" of the prediction.
How big should each sample be? It depends, but a common split is 70%-15%-15% (training-validation-holdout).
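Continuing the sketch above, one way to get roughly 70%-15%-15% is to carve the validation sample out of the 85% that remains after the holdout (the model choice and names here are illustrative):

```python
# A sketch of step #3: split the remaining 85% into a smaller training sample and a
# validation sample, then fit on one and evaluate on the other.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 15% of the original sample is 0.15/0.85 (about 17.6%) of what's left after the holdout
X_small_train, X_valid, y_small_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.15 / 0.85, random_state=0
)

model = LinearRegression()
model.fit(X_small_train, y_small_train)   # estimate on the smaller training sample
print(model.score(X_valid, y_valid))      # "accuracy" (here, R2) on the validation sample
```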
Problem: 70% and 15% might not be enough to train and evaluate your candidate models. What if your 15% validation sample just randomly happens to work very well with your model in a way that doesn't replicate beyond it?
```{tip}
It would be nice to have more validation samples, wouldn't it?!
```
Great idea: **K-Fold Cross-Validation.**
K-Fold Cross-Validation: Take the training data (85% of the original sample), split it into a smaller training sample and a validation sample, fit the model on the smaller training sample, and test it on the validation sample. Then repeat this with a different training/validation split, and keep going until you've done it $k$ times, giving you $k$ validation samples.
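Continuing the earlier sketch, here is a minimal example of 5-fold cross-validation using scikit-learn's `cross_val_score`, which handles the repeated splitting, fitting, and scoring for you (the model and scoring choice are illustrative):

```python
# A sketch of k-fold cross-validation (here k=5) on the training data.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# fits the model on 4 folds and scores it on the held-out 5th fold, 5 times
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=5, scoring="r2")
print(scores)         # one validation score per fold
print(scores.mean())  # averaging across folds gives a less noisy estimate
```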
Here, it looks like this:
```{tip}
When you've estimated your model on each of the 5 folds, and applied your predictions to the corresponding validation sample, you will get some "accuracy score" ([which might be R2 or something else listed here](03d_whatToMax)) for that model for each of the 5 folds.
```
So if you run $N$ models, you'll end up with a dataset like this:
| Model | Fold 1 score | Fold 2 score | ... | Fold $k$ score |
| --- | --- | --- | --- | --- |
| 1 | .6 | .63 | ... | .57 |
| 2 | .8 | .4 | ... | .6 |
| ... | | | | |
| $N$ | .5 | .1 | ... | .2 |
Store that info, as it will help you decide which model to pick!
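One way you might build that table is to loop over your candidate models and collect the fold-by-fold scores, as in this sketch (the candidate models below are illustrative placeholders):

```python
# A sketch of the "model x fold" score table: one row per candidate model,
# one column per fold.
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

candidate_models = {
    "OLS": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "Ridge": Ridge(alpha=1.0),
}

rows = []
for name, model in candidate_models.items():
    fold_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    rows.append([name] + list(fold_scores))

score_table = pd.DataFrame(
    rows, columns=["Model"] + [f"Fold {i+1} score" for i in range(5)]
)
print(score_table)  # keep this around to help decide which model to pick
```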
```{warning}
The exact method of splitting the sample above (splitting the data into 5 random chunks) doesn't work well in many finance settings. The next page will explain!
```