The process of learning a predictive model is driven by a set of internal parameters and a set of training data. These internal parameters are called hyper-parameters and are specific to each family of models. In addition, the optimal set of hyper-parameters is specific to each dataset and thus always needs to be optimized.
This notebook shows how to optimize these hyper-parameters: first manually, then with a grid search, then with a randomized search, and finally how to evaluate a tuned model with nested cross-validation.
import pandas as pd
df = pd.read_csv(
"https://www.openml.org/data/get_csv/1595261/adult-census.csv")
# Or use the local copy (requires `import os`):
# df = pd.read_csv(os.path.join("..", "datasets", "adult-census.csv"))
target_name = "class"
target = df[target_name].to_numpy()
target
array([' <=50K', ' <=50K', ' >50K', ..., ' <=50K', ' <=50K', ' >50K'], dtype=object)
data = df.drop(columns=[target_name, "fnlwgt"])
data.head()
  | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25 | Private | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
1 | 38 | Private | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
2 | 28 | Local-gov | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
3 | 44 | Private | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
4 | 18 | ? | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States |
Once the dataset is loaded, we split it into training and testing sets.
from sklearn.model_selection import train_test_split
df_train, df_test, target_train, target_test = train_test_split(
data, target, random_state=42)
Then, we define the preprocessing pipeline to transform the numerical and categorical data differently.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
categorical_columns = [
'workclass', 'education', 'marital-status', 'occupation',
'relationship', 'race', 'native-country', 'sex']
categories = [
data[column].unique() for column in data[categorical_columns]]
categorical_preprocessor = OrdinalEncoder(categories=categories)
preprocessor = ColumnTransformer([
    ('cat-preprocessor', categorical_preprocessor, categorical_columns),
], remainder='passthrough', sparse_threshold=0)
Finally, we use a tree-based classifier (histogram gradient boosting) to predict whether or not a person earns more than 50,000 dollars a year.
%%time
# for the moment this line is required to import HistGradientBoostingClassifier
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", HistGradientBoostingClassifier(
        max_leaf_nodes=16, learning_rate=0.05, random_state=42)),
])
model.fit(df_train, target_train)
print(
f"The test accuracy score of the gradient boosting pipeline is: "
f"{model.score(df_test, target_test):.2f}")
The test accuracy score of the gradient boosting pipeline is: 0.88
CPU times: user 10.9 s, sys: 342 ms, total: 11.2 s
Wall time: 5.49 s
* What is the default value of the `learning_rate` parameter of the `HistGradientBoostingClassifier` class? (link to the API documentation)
* Try to edit the code of the previous cell to set the learning-rate parameter to 10. Does this increase the accuracy of the model?
* Decrease progressively the value of `learning_rate`: can you find a value that yields an accuracy higher than 0.87?
* Fix `learning_rate` to 0.05 and try setting the value of `max_leaf_nodes` to the minimum value of 2. Does this improve the accuracy?
* Try to progressively increase the value of `max_leaf_nodes` to 256 by taking powers of 2. What do you observe?

A starting-point sketch for these questions is shown below.
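The following sketch shows one way to re-evaluate the pipeline for different `learning_rate` values; the `candidate_model` name and the chosen grid of values are ours, introduced only for illustration.

```python
# Sketch for the exercise: rebuild the pipeline with different
# learning-rate values and re-score it on the test set.
for lr in (10, 1, 0.1, 0.05, 0.01):
    candidate_model = Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", HistGradientBoostingClassifier(
            max_leaf_nodes=16, learning_rate=lr, random_state=42)),
    ])
    candidate_model.fit(df_train, target_train)
    print(f"learning_rate={lr}: test accuracy = "
          f"{candidate_model.score(df_test, target_test):.2f}")
```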
In the previous example, we created a histogram gradient-boosting classifier with the default parameters by omitting to set them explicitly. However, there is no reason this set of parameters is optimal for our dataset. For instance, fine-tuning the histogram gradient-boosting can be achieved by finding the best combination of the following parameters: (i) `learning_rate`, (ii) `min_samples_leaf`, and (iii) `max_leaf_nodes`.
Nevertheless, finding this combination manually will be tedious. Indeed, there are relationships between these parameters which are difficult to discover manually: for instance, increasing the depth of the trees (increasing `max_leaf_nodes`) should go together with a lower learning rate.
Scikit-learn provides tools to explore and evaluate the parameter space. Our goal is to find the best combination of the parameters stated above.

In short, we will set these parameters to some defined values, train our model on some data, and evaluate the model performance on some left-out data. Ideally, we will select the parameters leading to the optimal performance on the testing set.
The first step is to find the names of the parameters to be set. We use the method `get_params()` to get this information. For instance, for a single model like the `HistGradientBoostingClassifier`, we can get the list as follows:
print("The hyper-parameters are for a histogram GBDT model are:")
for param_name in HistGradientBoostingClassifier().get_params().keys(
):
print(param_name)
The hyper-parameters for a histogram GBDT model are:
l2_regularization
learning_rate
loss
max_bins
max_depth
max_iter
max_leaf_nodes
min_samples_leaf
n_iter_no_change
random_state
scoring
tol
validation_fraction
verbose
warm_start
When the model of interest is a `Pipeline`, i.e. a series of transformers and a predictor, the name of the estimator will be added at the front of the parameter name with a double underscore ("dunder") in-between (e.g. `estimator__parameter`).
print("The hyper-parameters are for the full-pipeline are:")
for param_name in model.get_params().keys():
print(param_name)
The hyper-parameters for the full pipeline are:
memory
steps
verbose
preprocessor
classifier
preprocessor__n_jobs
preprocessor__remainder
preprocessor__sparse_threshold
preprocessor__transformer_weights
preprocessor__transformers
preprocessor__verbose
preprocessor__cat-preprocessor
preprocessor__cat-preprocessor__categories
preprocessor__cat-preprocessor__dtype
classifier__l2_regularization
classifier__learning_rate
classifier__loss
classifier__max_bins
classifier__max_depth
classifier__max_iter
classifier__max_leaf_nodes
classifier__min_samples_leaf
classifier__n_iter_no_change
classifier__random_state
classifier__scoring
classifier__tol
classifier__validation_fraction
classifier__verbose
classifier__warm_start
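With these names, any hyper-parameter of the pipeline can be set directly. A small illustration (the value 0.05 simply restates the current setting):

```python
# Set a nested hyper-parameter with the "dunder" syntax and read it back.
model.set_params(classifier__learning_rate=0.05)
print(model.get_params()["classifier__learning_rate"])
```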
The parameters that we want to set are:

* `'classifier__learning_rate'`: this parameter will control the ability of a new tree to correct the error of the previous sequence of trees;
* `'classifier__max_leaf_nodes'`: this parameter will control the depth of each tree.

Using the previously defined model (called `model`) and two nested `for` loops, search for the best combination of the `learning_rate` and `max_leaf_nodes` parameters. You will need to train and test the model by setting these parameters, and the evaluation should be performed using `cross_val_score`. We propose the following parameter search:

* `learning_rate` for the values 0.01, 0.1, and 1;
* `max_leaf_nodes` for the values 5, 25, and 45.

A possible sketch is shown after this list.
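A minimal sketch of this manual search could look as follows; the `best_score`/`best_params` bookkeeping names are ours, and `cv=2` is only chosen to keep the run cheap:

```python
from sklearn.model_selection import cross_val_score

# Manual grid search over the two parameters, evaluated by cross-validation.
best_score, best_params = -1.0, None
for lr in (0.01, 0.1, 1):
    for mln in (5, 25, 45):
        model.set_params(classifier__learning_rate=lr,
                         classifier__max_leaf_nodes=mln)
        scores = cross_val_score(model, df_train, target_train, cv=2)
        if scores.mean() > best_score:
            best_score, best_params = scores.mean(), (lr, mln)
print(f"Best CV accuracy: {best_score:.3f} with "
      f"learning_rate={best_params[0]} and max_leaf_nodes={best_params[1]}")
```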
Instead of manually writing the two `for` loops, scikit-learn provides a class called `GridSearchCV` which implements this exhaustive search.
Let's see how to use the `GridSearchCV` estimator to perform such a search. Since the grid search will be costly, we will only explore combinations of the learning rate and the maximum number of leaf nodes.
%%time
import numpy as np
from sklearn.model_selection import GridSearchCV
param_grid = {
'classifier__learning_rate': (0.05, 0.1, 0.5, 1, 5),
'classifier__max_leaf_nodes': (3, 10, 30, 100),}
model_grid_search = GridSearchCV(model, param_grid=param_grid,
n_jobs=4, cv=2)
model_grid_search.fit(df_train, target_train)
print(f"The test accuracy score of the grid-searched pipeline is: "
f"{model_grid_search.score(df_test, target_test):.2f}")
The test accuracy score of the grid-searched pipeline is: 0.88
CPU times: user 8.52 s, sys: 302 ms, total: 8.82 s
Wall time: 22.1 s
The `GridSearchCV` estimator takes a `param_grid` parameter which defines all hyper-parameters and their associated values. The grid search will be in charge of creating all possible combinations and testing them.

The number of combinations will be equal to the Cartesian product of the number of values to explore for each parameter (e.g. in our example 5 x 4 = 20 combinations). Thus, adding new parameters with their associated values to explore quickly becomes computationally expensive.
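As a quick sanity check of this combinatorics, we can count the candidates with `ParameterGrid`, the helper scikit-learn uses internally (a small illustration):

```python
from sklearn.model_selection import ParameterGrid

# 5 learning-rate values x 4 max_leaf_nodes values = 20 candidates.
print(len(ParameterGrid(param_grid)))
```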
Once the grid search is fitted, it can be used like any other predictor by calling `predict` and `predict_proba`. Internally, it will use the model with the best parameters found during `fit`.
Get predictions for the first 5 samples using the estimator with the best parameters.
model_grid_search.predict(df_test.iloc[0:5])
array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)
You can inspect these best parameters by looking at the `best_params_` attribute.
print(f"The best set of parameters is: "
f"{model_grid_search.best_params_}")
The best set of parameters is: {'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 30}
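Since the grid search refits a model with the best parameters on the full training set by default (`refit=True`), this model is also directly available. A small illustration (the `best_model` name is ours):

```python
# The refitted best pipeline is stored in the `best_estimator_` attribute.
best_model = model_grid_search.best_estimator_
print(f"{best_model.score(df_test, target_test):.2f}")
```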
In addition, we can inspect all the results, which are stored in the `cv_results_` attribute of the grid search. We will filter some specific columns from these results:
cv_results = pd.DataFrame(model_grid_search.cv_results_).sort_values(
"mean_test_score", ascending=False)
cv_results.head()
  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_classifier__learning_rate | param_classifier__max_leaf_nodes | params | split0_test_score | split1_test_score | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 1.365516 | 0.027808 | 0.305787 | 0.036408 | 0.1 | 30 | {'classifier__learning_rate': 0.1, 'classifier... | 0.868967 | 0.869233 | 0.869100 | 0.000133 | 1 |
2 | 1.361445 | 0.143995 | 0.351852 | 0.029814 | 0.05 | 30 | {'classifier__learning_rate': 0.05, 'classifie... | 0.867766 | 0.867922 | 0.867844 | 0.000078 | 2 |
3 | 3.468121 | 0.072954 | 0.497876 | 0.035772 | 0.05 | 100 | {'classifier__learning_rate': 0.05, 'classifie... | 0.868366 | 0.866940 | 0.867653 | 0.000713 | 3 |
5 | 0.884160 | 0.022427 | 0.309784 | 0.026364 | 0.1 | 10 | {'classifier__learning_rate': 0.1, 'classifier... | 0.866401 | 0.867977 | 0.867189 | 0.000788 | 4 |
8 | 0.576520 | 0.043608 | 0.228952 | 0.031958 | 0.5 | 3 | {'classifier__learning_rate': 0.5, 'classifier... | 0.864763 | 0.867158 | 0.865961 | 0.001198 | 5 |
Let us focus on the most interesting columns and shorten the parameter names by removing the `"param_classifier__"` prefix for readability:
# get the parameter names
column_results = [f"param_{name}" for name in param_grid.keys()]
column_results += [
"mean_test_score", "std_test_score", "rank_test_score"]
cv_results = cv_results[column_results]
def shorten_param(param_name):
    if "__" in param_name:
        return param_name.rsplit("__", 1)[1]
    return param_name
cv_results = cv_results.rename(shorten_param, axis=1)
cv_results
  | learning_rate | max_leaf_nodes | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|
6 | 0.1 | 30 | 0.869100 | 0.000133 | 1 |
2 | 0.05 | 30 | 0.867844 | 0.000078 | 2 |
3 | 0.05 | 100 | 0.867653 | 0.000713 | 3 |
5 | 0.1 | 10 | 0.867189 | 0.000788 | 4 |
8 | 0.5 | 3 | 0.865961 | 0.001198 | 5 |
7 | 0.1 | 100 | 0.865933 | 0.001014 | 6 |
9 | 0.5 | 10 | 0.865769 | 0.000359 | 7 |
12 | 1 | 3 | 0.865251 | 0.000106 | 8 |
1 | 0.05 | 10 | 0.860637 | 0.000679 | 9 |
10 | 0.5 | 30 | 0.857771 | 0.000932 | 10 |
13 | 1 | 10 | 0.853730 | 0.005682 | 11 |
4 | 0.1 | 3 | 0.853457 | 0.000168 | 12 |
0 | 0.05 | 3 | 0.848926 | 0.001415 | 13 |
11 | 0.5 | 100 | 0.845759 | 0.001424 | 14 |
14 | 1 | 30 | 0.825121 | 0.003335 | 15 |
15 | 1 | 100 | 0.800988 | 0.000049 | 16 |
16 | 5 | 3 | 0.758947 | 0.000007 | 17 |
17 | 5 | 10 | 0.758947 | 0.000007 | 17 |
19 | 5 | 100 | 0.758947 | 0.000007 | 17 |
18 | 5 | 30 | 0.499993 | 0.258947 | 20 |
With only 2 parameters, we might want to visualize the grid search as a heatmap. To do so, we need to transform our `cv_results` into a dataframe where the rows correspond to the learning-rate values, the columns correspond to the maximum number of leaf nodes, and the content of the dataframe is the mean test scores.
pivoted_cv_results = cv_results.pivot_table(
values="mean_test_score", index=["learning_rate"],
columns=["max_leaf_nodes"])
pivoted_cv_results
learning_rate \ max_leaf_nodes | 3 | 10 | 30 | 100 |
---|---|---|---|---|
0.05 | 0.848926 | 0.860637 | 0.867844 | 0.867653 |
0.10 | 0.853457 | 0.867189 | 0.869100 | 0.865933 |
0.50 | 0.865961 | 0.865769 | 0.857771 | 0.845759 |
1.00 | 0.865251 | 0.853730 | 0.825121 | 0.800988 |
5.00 | 0.758947 | 0.758947 | 0.499993 | 0.758947 |
import matplotlib.pyplot as plt
from seaborn import heatmap
ax = heatmap(pivoted_cv_results, annot=True, cmap="YlGnBu", vmin=0.7,
vmax=0.9)
ax.invert_yaxis()
The above table and heatmap highlight the following:

* for too high values of `learning_rate`, the performance of the model is degraded and adjusting the value of `max_leaf_nodes` cannot fix that problem;
* the optimal value of `max_leaf_nodes` depends on the value of `learning_rate`;
* when the value of `max_leaf_nodes` is increased, one should decrease the value of `learning_rate` accordingly to preserve a good accuracy.

The precise meaning of those two parameters will be explained in a later notebook.
For now, we will note that, in general, there is no unique optimal parameter setting: several of the 20 parameter configurations reach near-maximal accuracy (up to small random fluctuations caused by the sampling of the training set).
With the `GridSearchCV` estimator, the parameters need to be specified explicitly. We mentioned that exploring a large number of values for different parameters quickly becomes intractable.
Instead, we can randomly generate the parameter candidates. The `RandomizedSearchCV` class allows for such a stochastic search. It is used similarly to `GridSearchCV`, but the sampling distributions need to be specified instead of the parameter values. For instance, we will draw candidates using a log-uniform distribution (also called the reciprocal distribution). In addition, we will optimize a few other parameters:

* `l2_regularization`: it corresponds to the strength of the L2 regularization;
* `min_samples_leaf`: it corresponds to the minimum number of samples required in a leaf;
* `max_bins`: it corresponds to the maximum number of bins used to discretize the numerical features.

from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV
from pprint import pprint
class reciprocal_int:
    """Integer-valued version of the log-uniform distribution"""
    def __init__(self, a, b):
        self._distribution = reciprocal(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)
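To get a feeling for what this wrapper produces, we can draw a few samples from it (a quick sketch; the bounds match the search below):

```python
# Draw a handful of integer samples from the log-uniform distribution.
print(reciprocal_int(2, 256).rvs(size=5, random_state=0))
```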
param_distributions = {
'classifier__l2_regularization': reciprocal(1e-6, 1e3),
'classifier__learning_rate': reciprocal(0.001, 10),
'classifier__max_leaf_nodes': reciprocal_int(2, 256),
'classifier__min_samples_leaf': reciprocal_int(1, 100),
'classifier__max_bins': reciprocal_int(2, 255),}
model_random_search = RandomizedSearchCV(
model, param_distributions=param_distributions, n_iter=10,
n_jobs=4, cv=5)
model_random_search.fit(df_train, target_train)
print(f"The test accuracy score of the best model is "
f"{model_random_search.score(df_test, target_test):.2f}")
The test accuracy score of the best model is 0.88
print("The best parameters are:")
pprint(model_random_search.best_params_)
The best parameters are: {'classifier__l2_regularization': 2.855328641639947e-05, 'classifier__learning_rate': 0.713602698128458, 'classifier__max_bins': 191, 'classifier__max_leaf_nodes': 5, 'classifier__min_samples_leaf': 2}
We can inspect the results using the `cv_results_` attribute, as we did previously.
# get the parameter names
column_results = [
f"param_{name}" for name in param_distributions.keys()]
column_results += [
"mean_test_score", "std_test_score", "rank_test_score"]
cv_results = pd.DataFrame(model_random_search.cv_results_)
cv_results = cv_results[column_results].sort_values(
"mean_test_score", ascending=False)
cv_results = cv_results.rename(shorten_param, axis=1)
cv_results
  | l2_regularization | learning_rate | max_leaf_nodes | min_samples_leaf | max_bins | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|
4 | 2.85533e-05 | 0.713603 | 5 | 2 | 191 | 0.866343 | 0.002710 | 1 |
7 | 0.00179218 | 0.0946236 | 160 | 51 | 55 | 0.854468 | 0.002846 | 2 |
1 | 24.4374 | 0.371015 | 180 | 74 | 61 | 0.853676 | 0.002474 | 3 |
6 | 0.000202098 | 0.0261017 | 27 | 3 | 6 | 0.840845 | 0.003666 | 4 |
9 | 1.7478e-06 | 0.0176826 | 5 | 16 | 30 | 0.837542 | 0.002270 | 5 |
8 | 1.93292e-05 | 0.0788139 | 2 | 33 | 17 | 0.837433 | 0.004329 | 6 |
2 | 1.97283 | 1.57895 | 75 | 30 | 22 | 0.813300 | 0.004203 | 7 |
0 | 0.000197835 | 0.00500485 | 6 | 23 | 4 | 0.758947 | 0.000013 | 8 |
3 | 14.6707 | 0.00407391 | 4 | 22 | 4 | 0.758947 | 0.000013 | 8 |
5 | 2.14164e-06 | 0.00250808 | 70 | 59 | 140 | 0.758947 | 0.000013 | 8 |
In practice, a randomized hyper-parameter search is usually run with a large number of iterations. In order to avoid the computation cost and still make a decent analysis, we load the results obtained from a similar search with 500 iterations.
# model_random_search = RandomizedSearchCV(
# model, param_distributions=param_distributions, n_iter=500,
# n_jobs=4, cv=5)
# model_random_search.fit(df_train, target_train)
# cv_results = pd.DataFrame(model_random_search.cv_results_)
# cv_results.to_csv("../figures/randomized_search_results.csv")
cv_results = pd.read_csv("../figures/randomized_search_results.csv",
index_col=0)
As we have more than 2 parameters in our search, we cannot visualize the results using a heatmap. However, we can use a parallel coordinates plot.
(cv_results[column_results].rename(
shorten_param, axis=1).sort_values("mean_test_score"))
  | l2_regularization | learning_rate | max_leaf_nodes | min_samples_leaf | max_bins | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|
357 | 0.000026 | 3.075318 | 3 | 68 | 31 | 0.241053 | 0.000013 | 500 |
200 | 0.000444 | 6.236325 | 2 | 2 | 30 | 0.344629 | 0.207156 | 499 |
413 | 0.000001 | 8.828574 | 64 | 1 | 144 | 0.448205 | 0.253714 | 497 |
344 | 0.000003 | 7.091079 | 5 | 1 | 95 | 0.448205 | 0.253714 | 497 |
232 | 0.000097 | 9.976823 | 28 | 5 | 3 | 0.448205 | 0.253714 | 496 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
327 | 4.733808 | 0.036786 | 61 | 5 | 241 | 0.869673 | 0.002417 | 5 |
328 | 2.036232 | 0.224702 | 28 | 49 | 236 | 0.869837 | 0.000808 | 4 |
21 | 4.994918 | 0.077047 | 53 | 7 | 192 | 0.870793 | 0.001993 | 3 |
343 | 0.000404 | 0.244503 | 15 | 15 | 229 | 0.871339 | 0.002741 | 2 |
208 | 0.011775 | 0.076653 | 24 | 2 | 155 | 0.871393 | 0.001588 | 1 |
500 rows × 8 columns
import plotly.express as px
fig = px.parallel_coordinates(
cv_results.rename(shorten_param, axis=1).apply({
"learning_rate": np.log10,
"max_leaf_nodes": np.log2,
"max_bins": np.log2,
"min_samples_leaf": np.log10,
"l2_regularization": np.log10,
"mean_test_score": lambda x: x,}),
color="mean_test_score",
color_continuous_scale=px.colors.sequential.Viridis,
)
fig.show()
The parallel coordinates plot displays the values of the hyper-parameters on different columns while the performance metric is color-coded. Thus, we are able to quickly inspect whether there are ranges of hyper-parameters that work well or not.

Note that we transformed most axis values with a log10 or log2 scale to spread the active ranges and improve the readability of the plot.

It is possible to select a range of results by clicking and holding on any axis of the parallel coordinates plot. You can then slide (move) the range selection and cross two selections to see the intersections.
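The same kind of selection can also be done programmatically on the results table. A small pandas sketch (the `good_results` name and the 0.86 threshold are ours):

```python
# Programmatic counterpart of an interactive selection: inspect the
# learning-rate range among the well-performing candidates.
good_results = cv_results[cv_results["mean_test_score"] > 0.86]
print(good_results["param_classifier__learning_rate"].agg(["min", "max"]))
```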
Quiz

Select the worst performing models (for instance, models with a "mean_test_score" lower than 0.7): what do all these models have in common (choose one)?

* too large `l2_regularization`
* too small `l2_regularization`
* too large `learning_rate`
* too low `learning_rate`
* too large `max_bins`
* too small `max_bins`
Using the above plot, identify ranges of hyper-parameter values that always prevent the model from reaching a test score higher than 0.86, irrespective of the other values:

 | True | False |
---|---|---|
too large `l2_regularization` | | |
too small `l2_regularization` | | |
too large `learning_rate` | | |
too low `learning_rate` | | |
too large `max_bins` | | |
too small `max_bins` | | |
Build a machine learning pipeline:

* preprocess the categorical columns using a `OneHotEncoder` and use a `StandardScaler` to normalize the numerical data;
* use a `LogisticRegression` as a predictive model.

Make a hyper-parameter search using `RandomizedSearchCV`, tuning the following parameters:

* `C` with values ranging from 0.001 to 10; you can use a reciprocal distribution (i.e. `scipy.stats.reciprocal`);
* `solver` with possible values being `"liblinear"` and `"lbfgs"`;
* `penalty` with possible values being `"l2"` and `"l1"`;
* `drop` with possible values being `None` or `"first"`.

You might get some `FitFailedWarning`; try to explain why. A possible sketch is given after this list.
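One possible sketch for this exercise is shown below; the `lr_model` and related names are ours, `max_iter=1000` is only there to help convergence, and reusing the `categories` list defined earlier avoids unseen-category errors during cross-validation.

```python
from scipy.stats import reciprocal
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sketch for the exercise: a linear pipeline with one-hot encoded
# categories and scaled numerical features.
numerical_columns = [
    col for col in data.columns if col not in categorical_columns]
lr_preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(categories=categories), categorical_columns),
    ("num", StandardScaler(), numerical_columns),
])
lr_model = Pipeline([
    ("preprocessor", lr_preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
lr_param_distributions = {
    "classifier__C": reciprocal(0.001, 10),
    "classifier__solver": ["liblinear", "lbfgs"],
    "classifier__penalty": ["l2", "l1"],
    "preprocessor__cat__drop": [None, "first"],
}
# Candidates combining solver="lbfgs" with penalty="l1" fail to fit,
# because "lbfgs" only supports the "l2" penalty: this is what triggers
# the FitFailedWarning mentioned above.
lr_search = RandomizedSearchCV(
    lr_model, param_distributions=lr_param_distributions,
    n_iter=10, cv=5, n_jobs=4)
lr_search.fit(df_train, target_train)
print(lr_search.best_params_)
```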
Cross-validation was used for searching for the best model parameters. We previously evaluated model performance through cross-validation as well. If we would like to combine both aspects, we need to perform a "nested" cross-validation. The "outer" cross-validation is applied to assess the model while the "inner" cross-validation sets the hyper-parameters of the model on the data set provided by the "outer" cross-validation.
In practice, it can be implemented by calling `cross_val_score` or `cross_validate` on an instance of `GridSearchCV`, `RandomizedSearchCV`, or any other `EstimatorCV` class.
from sklearn.model_selection import cross_val_score
# recall the definition of our grid-search
param_distributions = {
'classifier__max_iter': reciprocal_int(10, 50),
'classifier__learning_rate': reciprocal(0.01, 10),
'classifier__max_leaf_nodes': reciprocal_int(2, 16),
'classifier__min_samples_leaf': reciprocal_int(1, 50),}
model_random_search = RandomizedSearchCV(
model, param_distributions=param_distributions, n_iter=10,
n_jobs=4, cv=5)
scores = cross_val_score(model_random_search, data, target, n_jobs=4,
cv=5)
print(f"The cross-validated accuracy score is:"
f" {scores.mean():.3f} +- {scores.std():.3f}")
The cross-validated accuracy score is: 0.866 +- 0.004
print("The scores obtained for each CV split are:")
print(scores)
The scores obtained for each CV split are: [0.86569761 0.86794964 0.86885749 0.86885749 0.85933661]
Be aware that the best model found on each split of the outer cross-validation loop might not share the same hyper-parameter values. When analyzing such a model, you should not only look at the overall model performance but also at the variations of the hyper-parameters across the cross-validation splits.
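One way to inspect those variations is to keep the fitted search object of each outer fold with `cross_validate` and its `return_estimator` option (a sketch; the variable names are ours):

```python
from sklearn.model_selection import cross_validate

# Keep the fitted RandomizedSearchCV of each outer fold so that the
# per-fold best parameters can be compared.
nested_cv_results = cross_validate(
    model_random_search, data, target, cv=5, n_jobs=4,
    return_estimator=True)
for fold_idx, fitted_search in enumerate(nested_cv_results["estimator"]):
    print(f"Fold #{fold_idx}: {fitted_search.best_params_}")
```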