This chapter introduces a popular automated hyperparameter tuning methodology called Grid Search. You will learn what it is and how it works, and practice running a grid search with Scikit Learn. You will then learn how to analyze the output of a grid search and gain practical experience doing so. This is a summary of the lecture "Hyperparameter Tuning in Python" from DataCamp.
import pandas as pd
import numpy as np
from pprint import pprint
In data science it is a great idea to try building algorithms, models and processes 'from scratch' so you can really understand what is happening at a deeper level. Of course there are great packages and libraries for this work (and we will get to that very soon!) but building from scratch will give you a great edge in your data science work.
In this exercise, you will create a function to take in 2 hyperparameters, build models and return results. You will use this function in a future exercise.
from sklearn.model_selection import train_test_split
credit_card = pd.read_csv('./dataset/credit-card-full.csv')
# Replace the categorical variables with dummy variables
credit_card = pd.get_dummies(credit_card, columns=['SEX', 'EDUCATION', 'MARRIAGE'], drop_first=True)
X = credit_card.drop(['ID', 'default payment next month'], axis=1)
y = credit_card['default payment next month']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Create the function
def gbm_grid_search(learn_rate, max_depth):
    # Create the model
    model = GradientBoostingClassifier(learning_rate=learn_rate, max_depth=max_depth)
    # Use the model to make predictions
    predictions = model.fit(X_train, y_train).predict(X_test)
    # Return the hyperparameters and score
    return [learn_rate, max_depth, accuracy_score(y_test, predictions)]
In this exercise, you will build on the function you previously created to take in 2 hyperparameters, build a model and return the results. You will now use that to loop through some values and then extend this function and loop with another hyperparameter.
# Create the relevant lists
results_list = []
learn_rate_list = [0.01, 0.1, 0.5]
max_depth_list = [2, 4, 6]
# Create the for loop
for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
        results_list.append(gbm_grid_search(learn_rate, max_depth))
# Print the results
pprint(results_list)
[[0.01, 2, 0.8214444444444444], [0.01, 4, 0.8198888888888889], [0.01, 6, 0.8172222222222222], [0.1, 2, 0.8205555555555556], [0.1, 4, 0.8204444444444444], [0.1, 6, 0.8204444444444444], [0.5, 2, 0.8188888888888889], [0.5, 4, 0.8042222222222222], [0.5, 6, 0.7894444444444444]]
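Once the results are collected as `[learn_rate, max_depth, accuracy]` rows, the best combination can be picked out programmatically rather than by eye. The following is a minimal sketch that reuses the nine rows printed above; it assumes the score is always the last element of each row.

```python
# Rows copied from the printed results: [learn_rate, max_depth, accuracy]
results_list = [
    [0.01, 2, 0.8214444444444444],
    [0.01, 4, 0.8198888888888889],
    [0.01, 6, 0.8172222222222222],
    [0.1, 2, 0.8205555555555556],
    [0.1, 4, 0.8204444444444444],
    [0.1, 6, 0.8204444444444444],
    [0.5, 2, 0.8188888888888889],
    [0.5, 4, 0.8042222222222222],
    [0.5, 6, 0.7894444444444444],
]

# Pick the row with the highest accuracy (the last element of each row)
best = max(results_list, key=lambda row: row[-1])
print(best)  # [0.01, 2, 0.8214444444444444]
```

Because the train/test split is not seeded, a rerun may produce slightly different scores and a different winner.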
# Extend the function input
def gbm_grid_search_extended(learn_rate, max_depth, subsample):
    # Extend the model creation section
    model = GradientBoostingClassifier(learning_rate=learn_rate, max_depth=max_depth,
                                       subsample=subsample)
    predictions = model.fit(X_train, y_train).predict(X_test)
    # Extend the return part
    return [learn_rate, max_depth, subsample, accuracy_score(y_test, predictions)]
# Create the new list to test
subsample_list = [0.4, 0.6]
for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
        # Extend the for loop
        for subsample in subsample_list:
            # Extend the results to include the new hyperparameter
            results_list.append(gbm_grid_search_extended(learn_rate, max_depth, subsample))
# Print the results
pprint(results_list)
[[0.01, 2, 0.8214444444444444], [0.01, 4, 0.8198888888888889], [0.01, 6, 0.8172222222222222], [0.1, 2, 0.8205555555555556], [0.1, 4, 0.8204444444444444], [0.1, 6, 0.8204444444444444], [0.5, 2, 0.8188888888888889], [0.5, 4, 0.8042222222222222], [0.5, 6, 0.7894444444444444], [0.01, 2, 0.4, 0.8192222222222222], [0.01, 2, 0.6, 0.8208888888888889], [0.01, 4, 0.4, 0.8183333333333334], [0.01, 4, 0.6, 0.8195555555555556], [0.01, 6, 0.4, 0.8177777777777778], [0.01, 6, 0.6, 0.8196666666666667], [0.1, 2, 0.4, 0.821], [0.1, 2, 0.6, 0.8201111111111111], [0.1, 4, 0.4, 0.8207777777777778], [0.1, 4, 0.6, 0.8196666666666667], [0.1, 6, 0.4, 0.8155555555555556], [0.1, 6, 0.6, 0.8183333333333334], [0.5, 2, 0.4, 0.8128888888888889], [0.5, 2, 0.6, 0.8156666666666667], [0.5, 4, 0.4, 0.7945555555555556], [0.5, 4, 0.6, 0.8065555555555556], [0.5, 6, 0.4, 0.7714444444444445], [0.5, 6, 0.6, 0.7743333333333333]]
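Each extra hyperparameter adds another level of nesting and multiplies the number of models to build. Note that the printed list above holds 27 rows because `results_list` still contained the nine two-hyperparameter rows when the 18 new ones were appended. As a sketch of one way to keep the loop flat, `itertools.product` can enumerate the same combinations; model fitting is left out here so the snippet runs on its own.

```python
from itertools import product
from math import prod

learn_rate_list = [0.01, 0.1, 0.5]
max_depth_list = [2, 4, 6]
subsample_list = [0.4, 0.6]

# One flat iterable replaces the three nested for loops
combinations = list(product(learn_rate_list, max_depth_list, subsample_list))

# The grid size is the product of the list lengths: 3 * 3 * 2
n_models = prod(len(values) for values in (learn_rate_list, max_depth_list, subsample_list))

print(len(combinations), n_models)  # 18 18
print(combinations[:3])             # [(0.01, 2, 0.4), (0.01, 2, 0.6), (0.01, 4, 0.4)]
```

Each tuple could then be unpacked into a call such as `gbm_grid_search_extended(*combo)`.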
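The GridSearchCV class from Scikit Learn provides many useful features to assist with efficiently undertaking a grid search. You will now put your learning into practice by creating a GridSearchCV object with certain parameters.
The desired options are:
- A Random Forest estimator with the split criterion set to 'entropy'
- 5-fold cross-validation
- The hyperparameters max_depth (2, 4, 8, 15) and max_features ('auto' vs 'sqrt')
- roc_auc to score the models
- 4 parallel jobs (n_jobs=4)
- Refit the best model and return training scores
from sklearn.model_selection import GridSearchCV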
from sklearn.ensemble import RandomForestClassifier
# Create a Random Forest Classifier with specified criterion
rf_class = RandomForestClassifier(criterion='entropy')
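# Create the parameter grid
# Note: 'auto' was removed from max_features in scikit-learn 1.3;
# for classifiers it was equivalent to 'sqrt'
param_grid = {
    'max_depth': [2, 4, 8, 15],
    'max_features': ['auto', 'sqrt']
}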
# Create a GridSearchCV object
grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=4,
    cv=5,
    refit=True,
    return_train_score=True
)
print(grid_rf_class)
GridSearchCV(cv=5, estimator=RandomForestClassifier(criterion='entropy'), n_jobs=4, param_grid={'max_depth': [2, 4, 8, 15], 'max_features': ['auto', 'sqrt']}, return_train_score=True, scoring='roc_auc')
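It helps to know how much work this object will do once it is fitted. With cross-validation, each hyperparameter combination is fitted once per fold, and with `refit=True` the best combination is fitted one more time on the full training data. The sketch below uses scikit-learn's `ParameterGrid` helper to enumerate the combinations; the variable names are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same grid and fold count as the GridSearchCV object above
param_grid = {'max_depth': [2, 4, 8, 15], 'max_features': ['auto', 'sqrt']}
cv = 5

# ParameterGrid only enumerates combinations; it does not validate or fit anything
candidates = list(ParameterGrid(param_grid))

n_candidates = len(candidates)     # 4 * 2 hyperparameter combinations
n_cv_fits = n_candidates * cv      # one fit per combination per fold
n_total_fits = n_cv_fits + 1       # plus the final refit because refit=True

print(n_candidates, n_cv_fits, n_total_fits)  # 8 40 41
```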
You will now explore the cv_results_ property of the GridSearchCV object defined in the video. This is a dictionary that we can read into a pandas DataFrame and contains a lot of useful information about the grid search we just undertook.
A reminder of the different column types in this property:
- time_ columns
- param_ columns (one for each hyperparameter) and the singular params column (with all hyperparameter settings)
- a train_score column for each cv fold including the mean_train_score and std_train_score columns
- a test_score column for each cv fold including the mean_test_score and std_test_score columns
- a rank_test_score column with a number from 1 to n (number of iterations) ranking the rows based on their mean_test_score
grid_rf_class.fit(X_train, y_train)
# Read the cv_results_ property into a DataFrame & print it out
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
print(cv_results_df)
# Extract and print the column with a dictionary of hyperparameters used
column = cv_results_df.loc[:, ["params"]]
print(column)
# Extract and print the row that had the best mean test score
best_row = cv_results_df[cv_results_df['rank_test_score'] == 1]
print(best_row)
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
0       0.633362      0.012019         0.022155        0.000390
1       0.641156      0.005938         0.021984        0.000733
2       1.075554      0.013504         0.025514        0.000493
3       1.078453      0.010363         0.025199        0.000147
4       1.944722      0.012599         0.033582        0.000308
5       1.945425      0.028632         0.034088        0.000777
6       3.206863      0.097319         0.053727        0.002878
7       3.147933      0.044344         0.051127        0.000389

  param_max_depth param_max_features  \
0               2               auto
1               2               sqrt
2               4               auto
3               4               sqrt
4               8               auto
5               8               sqrt
6              15               auto
7              15               sqrt

                                      params  split0_test_score  \
0   {'max_depth': 2, 'max_features': 'auto'}           0.762140
1   {'max_depth': 2, 'max_features': 'sqrt'}           0.764059
2   {'max_depth': 4, 'max_features': 'auto'}           0.770780
3   {'max_depth': 4, 'max_features': 'sqrt'}           0.771145
4   {'max_depth': 8, 'max_features': 'auto'}           0.777029
5   {'max_depth': 8, 'max_features': 'sqrt'}           0.775533
6  {'max_depth': 15, 'max_features': 'auto'}           0.764570
7  {'max_depth': 15, 'max_features': 'sqrt'}           0.768063

   split1_test_score  split2_test_score  ...  mean_test_score  std_test_score  \
0           0.765329           0.761279  ...         0.766216        0.006197
1           0.763412           0.762643  ...         0.766549        0.006738
2           0.767779           0.767878  ...         0.772387        0.005807
3           0.767072           0.768481  ...         0.772561        0.006485
4           0.774657           0.774855  ...         0.778609        0.005532
5           0.773319           0.775794  ...         0.777846        0.005482
6           0.767184           0.773903  ...         0.771833        0.005240
7           0.770179           0.775381  ...         0.773331        0.003961

   rank_test_score  split0_train_score  split1_train_score  \
0                8            0.769725            0.770268
1                7            0.769879            0.770792
2                5            0.780189            0.780651
3                4            0.780138            0.781665
4                1            0.829683            0.829852
5                2            0.830955            0.830563
6                6            0.977494            0.973979
7                3            0.975494            0.973033

   split2_train_score  split3_train_score  split4_train_score  \
0            0.770262            0.769099            0.766369
1            0.771637            0.767746            0.765142
2            0.781025            0.780711            0.777592
3            0.781055            0.780815            0.777521
4            0.829912            0.829288            0.829285
5            0.830177            0.831415            0.827246
6            0.974724            0.974242            0.976732
7            0.976973            0.974084            0.973877

   mean_train_score  std_train_score
0          0.769145         0.001453
1          0.769039         0.002340
2          0.780034         0.001249
3          0.780239         0.001444
4          0.829604         0.000270
5          0.830071         0.001471
6          0.975434         0.001412
7          0.974692         0.001388

[8 rows x 22 columns]
                                      params
0   {'max_depth': 2, 'max_features': 'auto'}
1   {'max_depth': 2, 'max_features': 'sqrt'}
2   {'max_depth': 4, 'max_features': 'auto'}
3   {'max_depth': 4, 'max_features': 'sqrt'}
4   {'max_depth': 8, 'max_features': 'auto'}
5   {'max_depth': 8, 'max_features': 'sqrt'}
6  {'max_depth': 15, 'max_features': 'auto'}
7  {'max_depth': 15, 'max_features': 'sqrt'}
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
4       1.944722      0.012599         0.033582        0.000308

  param_max_depth param_max_features  \
4               8               auto

                                     params  split0_test_score  \
4  {'max_depth': 8, 'max_features': 'auto'}           0.777029

   split1_test_score  split2_test_score  ...  mean_test_score  std_test_score  \
4           0.774657           0.774855  ...         0.778609        0.005532

   rank_test_score  split0_train_score  split1_train_score  \
4                1            0.829683            0.829852

   split2_train_score  split3_train_score  split4_train_score  \
4            0.829912            0.829288            0.829285

   mean_train_score  std_train_score
4          0.829604          0.00027

[1 rows x 22 columns]
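One useful way to read this output is to compare mean_train_score with mean_test_score for each row: a large gap hints that a setting may be overfitting. The sketch below does this in plain Python with the values copied from the output above, so that it runs without the fitted grid; with the real object the same comparison could be made on the columns of cv_results_df.

```python
# (max_depth, max_features, mean_test_score, mean_train_score), copied from the output above
rows = [
    (2, 'auto', 0.766216, 0.769145),
    (2, 'sqrt', 0.766549, 0.769039),
    (4, 'auto', 0.772387, 0.780034),
    (4, 'sqrt', 0.772561, 0.780239),
    (8, 'auto', 0.778609, 0.829604),
    (8, 'sqrt', 0.777846, 0.830071),
    (15, 'auto', 0.771833, 0.975434),
    (15, 'sqrt', 0.773331, 0.974692),
]

# Sort by mean test score, best first, and show the train/test gap for each row
for depth, feats, test, train in sorted(rows, key=lambda r: r[2], reverse=True):
    print(f"max_depth={depth:<2} max_features={feats}  "
          f"test={test:.4f}  train={train:.4f}  gap={train - test:.4f}")

# Row with the best test score and row with the widest train/test gap
best_row = max(rows, key=lambda r: r[2])
widest_gap_row = max(rows, key=lambda r: r[3] - r[2])
print(best_row[:2], widest_gap_row[:2])  # (8, 'auto') (15, 'auto')
```

In this run the max_depth=15 rows score about 0.97 on the training folds but only about 0.77 on the held-out folds, while max_depth=8 gives the best test score with a much smaller gap.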
At the end of the day, we primarily care about the best performing 'square' in a grid search. Luckily Scikit Learn's GridSearchCV objects have a number of attributes that provide key information on just the best square (or row in cv_results_).
Three properties you will explore are:
- best_score_ – The score (here ROC_AUC) from the best-performing square.
- best_index_ – The index of the row in cv_results_ containing information on the best-performing square.
- best_params_ – A dictionary of the parameters that gave the best score, for example 'max_depth': 10
# Print out the ROC_AUC score from the best-performing square
best_score = grid_rf_class.best_score_
print(best_score)
# Create a variable from the row related to the best-performing square
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
best_row = cv_results_df.loc[[grid_rf_class.best_index_]]
print(best_row)
# Get the max_depth parameter from the best-performing square and print
best_max_depth = grid_rf_class.best_params_['max_depth']
print(best_max_depth)
0.7786085423910816
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
4       1.944722      0.012599         0.033582        0.000308

  param_max_depth param_max_features  \
4               8               auto

                                     params  split0_test_score  \
4  {'max_depth': 8, 'max_features': 'auto'}           0.777029

   split1_test_score  split2_test_score  ...  mean_test_score  std_test_score  \
4           0.774657           0.774855  ...         0.778609        0.005532

   rank_test_score  split0_train_score  split1_train_score  \
4                1            0.829683            0.829852

   split2_train_score  split3_train_score  split4_train_score  \
4            0.829912            0.829288            0.829285

   mean_train_score  std_train_score
4          0.829604          0.00027

[1 rows x 22 columns]
8
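The three properties are all views onto the same row of cv_results_. The sketch below checks that relationship on a small synthetic dataset; the data, the decision tree estimator and the names `X_demo`, `y_demo` and `grid` are assumptions made so the example runs quickly on its own, and are not part of the credit card exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4, 8]},
    scoring='roc_auc',
    cv=3,
)
grid.fit(X_demo, y_demo)

results = grid.cv_results_
i = grid.best_index_

# The best row is ranked first, and its score and params match the shortcuts
print(results['rank_test_score'][i] == 1)
print(results['mean_test_score'][i] == grid.best_score_)
print(results['params'][i] == grid.best_params_)
```

Each comparison should print True, which is why `cv_results_df.loc[[grid_rf_class.best_index_]]` and filtering on `rank_test_score == 1` returned the same row earlier.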
While it is interesting to analyze the results of our grid search, our final goal is practical in nature; we want to make predictions on our test set using our estimator object.
We can access this object through the best_estimator_ property of our grid search object.
In this exercise we will take a look inside the best_estimator_ property and then use this to make predictions on our test set for credit card defaults and generate a variety of scores. Remember to use predict_proba rather than predict since we need probability values rather than class labels for our roc_auc score. We use a slice [:,1] to get probabilities of the positive class.
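To see why probabilities matter for ROC AUC, recall that the score can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counting as half. The sketch below is a hand-rolled illustration of that pairwise definition, not scikit-learn's implementation, and the labels and probabilities are made up for the example.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
proba = [0.1, 0.6, 0.7, 0.9]             # made-up probabilities of the positive class
labels = [int(p >= 0.5) for p in proba]  # hard labels at a 0.5 threshold: [0, 1, 1, 1]

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_proba)   # 1.0
print(auc_from_labels)  # 0.75
```

Thresholding turns distinct probabilities into ties, so the ranking information that ROC AUC measures is partly lost when class labels are passed in.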
from sklearn.metrics import confusion_matrix, roc_auc_score
# See what type of object the best_estimator_ property is
print(type(grid_rf_class.best_estimator_))
# Create an array of predictions directly using the best_estimator_ property
predictions = grid_rf_class.best_estimator_.predict(X_test)
# Take a look to confirm it worked, this should be an array of 1's and 0's
print(predictions[0:5])
# Now create a confusion matrix
print("Confusion Matrix \n", confusion_matrix(y_test, predictions))
# Get the ROC-AUC score
predictions_proba = grid_rf_class.best_estimator_.predict_proba(X_test)[:, 1]
print("ROC-AUC Score \n", roc_auc_score(y_test, predictions_proba))
<class 'sklearn.ensemble._forest.RandomForestClassifier'>
[1 0 0 1 0]
Confusion Matrix
 [[6685  323]
 [1292  700]]
ROC-AUC Score
 0.7767071783137115
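The confusion matrix can be turned into more familiar summary metrics by hand. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so for a binary problem it reads [[TN, FP], [FN, TP]]. A short sketch using the counts printed above:

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 6685, 323
fn, tp = 1292, 700

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted defaults that were real defaults
recall = tp / (tp + fn)        # share of real defaults that were caught

print(total)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

For this run, accuracy is about 0.82 while recall is only about 0.35, so the model misses most of the actual defaults despite the reasonable-looking accuracy.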
The .best_estimator_ property is a really powerful property to understand for streamlining your machine learning model building process. You now can run a grid search and seamlessly use the best model from that search to make predictions.