This notebook will show

- the process of dividing train/test/validation sets
- an example of overfitting
- an example of estimator complexity

In [1]:

```
import pandas as pd
```

In [2]:

```
import sklearn.datasets as datasets
X, y = datasets.make_circles(n_samples=2000, factor=0.2, noise=0.24, random_state=42)
# X, y = datasets.make_blobs(n_samples=2000, cluster_std=1.0, random_state=0) # another dataset to try
# X, y = datasets.make_moons(n_samples=2000, noise=0.3, random_state=0) # another dataset to try
df_train = pd.DataFrame({"x0": X[:, 0], "x1": X[:, 1], "y": y})
df_train.plot.scatter(
x="x0", y="x1", c="y",
cmap="tab10", alpha=1.0,
vmax=10, colorbar=False, figsize=(4, 4),
)
```

Out[2]:

<AxesSubplot:xlabel='x0', ylabel='x1'>

In [3]:

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
```

In [4]:

```
from sklearn.tree import DecisionTreeClassifier
max_depth = 8
clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
clf.fit(X_train, y_train)
```

Out[4]:

DecisionTreeClassifier(max_depth=8, random_state=42)

Let's visualize the decision region:

In [5]:

```
# This cell produces a grid of points, that covers the full dataset
# Don't worry about the syntax for now!
import numpy as np
x0 = np.linspace(df_train["x0"].min(), df_train["x0"].max())
x1 = np.linspace(df_train["x1"].min(), df_train["x1"].max())
x0, x1 = np.meshgrid(x0, x1)
df_test = pd.DataFrame({"x0": x0.flat[:], "x1": x1.flat[:]})
```

In [6]:

```
y_pred = clf.predict(df_test[["x0", "x1"]])
df_test["prediction"] = y_pred
```

In [7]:

```
df_test.plot.scatter(
x="x0", y="x1", c="prediction", cmap="tab10", vmax=10, colorbar=False, figsize=(4, 4)
)
```

Out[7]:

<AxesSubplot:xlabel='x0', ylabel='x1'>

Evaluate the accuracy of the classifier (predict on the test set)

In [8]:

```
y_pred = clf.predict(X_test)
```

In [9]:

```
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
```

Out[9]:

0.945

It looks like the choice of `max_depth`

could be a little better; the model looks to be a little rough around the edges.

`max_depth`

¶Let's choose the best `max_depth`

. One rule to keep in mind:

**Do not test the model on any data used to train.**

Or, said a different way,

**Only use the test data once at the very end**

That means that all hyperparameter selection should be performed with the train dataset.

Let's wrap the above code used to get intuition on the classifier into a function. This will be done two different ways:

- Using a manual train/test split.
- Using a fancier method that provides some niceties

The nicer function will make it less clear what's happening, but offers a more robust scoring process.

Let's understand the process. What data is used for the model selection process?

In [10]:

```
def max_depth_info(X_train, y_train, max_depth=4):
"""
Find the best depth for DecisionTreeClassifier.
This function should take the train data
(which is why "_train" appended to input names).
Because training data should never be used for testing,
it splits the input data one more time into dataset for
training and validation.
Returns
-------
dictionary with keys "max_depth", "train_accuracy", "val_accuracy".
"""
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
clf.fit(X_train, y_train)
val_score = clf.score(X_val, y_val)
# This is only to compare with val_score
train_score = clf.score(X_train, y_train)
return {"max_depth": max_depth,
"train_accuracy": train_score,
"val_accuracy": val_score}
```

`train_test_split`

splits a dataset into two parts. See the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

`cross_val_score`

¶k-fold cross validation is defined as a process that...

- splits the data into $k$ chunks
- for $i = 1, ..., k$,
- $k - 1$ chunks are used for training
- $1$ chunk is used for testing

Then, a list of scores of length $k$ is returned.

In [11]:

```
from sklearn.model_selection import cross_val_score
def max_depth_info(X_train, y_train, max_depth=4):
"""
This is function is very slightly different than the previous definition:
It has the same inputs and outputs.
However, it uses Scikit-Learn's `cross_val_score` instead of
the manual train_test_split. This means that this function trains
and scores 5 different models. The "validation score" is defined to
be the mean of these 5 different scores.
"""
clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
clf.fit(X_train, y_train)
scores = cross_val_score(clf, X, y, cv=5)
return {"max_depth": max_depth,
"train_accuracy": clf.score(X_train, y_train),
"val_accuracy": scores.mean()}
```

**do not test the model on any data used to train.** What if the model just memorizes the training points?

Now, let's call that function repeatedly to see

In [12]:

```
data = [max_depth_info(X_train, y_train, max_depth=k)
for k in range(1,15)]
```

In [13]:

```
df = pd.DataFrame(data)
ax = df.plot(x="max_depth", y=["train_accuracy", "val_accuracy"], style="o-", grid=True)
```

In [14]:

```
df.head(8)
```

Out[14]:

max_depth | train_accuracy | val_accuracy | |
---|---|---|---|

0 | 1 | 0.655556 | 0.6240 |

1 | 2 | 0.800000 | 0.7915 |

2 | 3 | 0.868333 | 0.8580 |

3 | 4 | 0.940000 | 0.9330 |

4 | 5 | 0.945000 | 0.9285 |

5 | 6 | 0.953333 | 0.9225 |

6 | 7 | 0.958889 | 0.9195 |

7 | 8 | 0.967778 | 0.9150 |

Let's find the highest `val_accuracy`

in this dataframe, then pull out the best depth from that:

In [15]:

```
best_row = df.val_accuracy.idxmax()
best_depth = df.loc[best_row]["max_depth"]
df.loc[best_row]
```

Out[15]:

max_depth 4.000 train_accuracy 0.940 val_accuracy 0.933 Name: 3, dtype: float64

`max_depth=4`

is the best hyper-parameter. Let's train a model with `max_depth=4`

, and test it to see the accuracy:

In [16]:

```
clf = DecisionTreeClassifier(max_depth=best_depth)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
```

Out[16]:

0.945

What data should given as input to the hyper-parameter optimization process? i.e., what's the input to

`get_best_depth`

below?X, y = pd.read_csv(...) X_train, X_test, y_train, y_test = train_test_split(X, y) max_depth = get_best_depth(...) # what goes here? print(max_depth) # prints "4" clf = DecisionTreeClassifier(max_depth=max_depth) clf.fit(X_train, y_train) clf.score(X_test, y_test)

- Input: the train data.
- Input: the test data.
- Input: the test and train data.
- Input: only the (untrained) model. The best hyper-parameters only depend on the model.

Why shouldn't the same dataset be used to train a model and evaluate performance of that model?

(the data used to train a model is the "train data", and the data used to evaluate the performance is the "test data")

- Because the model has seen the training data before. What if the model just memorized the answers for the training data?
- Because the model's goal is to perform well on
*unseen*data. The best way to do that is train on one dataset and test on another. - Because the model's goal is to perform well on
*unseen*data. Why would testing on data it's already seen before be a good evaluation of that goal? - It's okay to train on the test data because of "big data" and with the underlying algorithms.

Rerun this notebook for the two other datasets in the first cell. Which one has prodces the largest gap between "train_accuracy" and "val_accuracy" in the plot? (either definition of

`get_max_depth_score`

can be used)

- make_circles dataset
- make_blobs dataset
- make_moons dataset

In [ ]:

```
```