A summary of the lecture "Machine Learning with Tree-Based Models in Python", via DataCamp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
A classification tree has several hyperparameters that can be tuned, such as max_depth, min_samples_leaf, and the splitting criterion. In the following exercises you'll revisit the Indian Liver Patient dataset, which was introduced in a previous chapter. Your task is to tune the hyperparameters of a classification tree. Given that this dataset is imbalanced, you'll be using the ROC AUC score as a metric instead of accuracy.
indian = pd.read_csv('./datasets/indian_liver_patient_preprocessed.csv', index_col=0)
indian.head()
|   | Age_std | Total_Bilirubin_std | Direct_Bilirubin_std | Alkaline_Phosphotase_std | Alamine_Aminotransferase_std | Aspartate_Aminotransferase_std | Total_Protiens_std | Albumin_std | Albumin_and_Globulin_Ratio_std | Is_male_std | Liver_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.247403 | -0.420320 | -0.495414 | -0.428870 | -0.355832 | -0.319111 | 0.293722 | 0.203446 | -0.147390 | 0 | 1 |
| 1 | 1.062306 | 1.218936 | 1.423518 | 1.675083 | -0.093573 | -0.035962 | 0.939655 | 0.077462 | -0.648461 | 1 | 1 |
| 2 | 1.062306 | 0.640375 | 0.926017 | 0.816243 | -0.115428 | -0.146459 | 0.478274 | 0.203446 | -0.178707 | 1 | 1 |
| 3 | 0.815511 | -0.372106 | -0.388807 | -0.449416 | -0.366760 | -0.312205 | 0.293722 | 0.329431 | 0.165780 | 1 | 1 |
| 4 | 1.679294 | 0.093956 | 0.179766 | -0.395996 | -0.295731 | -0.177537 | 0.755102 | -0.930414 | -1.713237 | 1 | 1 |
X = indian.drop('Liver_disease', axis='columns')
y = indian['Liver_disease']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
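Since the ROC AUC choice above is motivated by class imbalance, a quick check of the target distribution helps confirm it. A minimal sketch using the y defined above:
# Share of each class in the target; a noticeable skew away from 50/50 is what
# makes plain accuracy misleading and ROC AUC the preferred metric here.
print(y.value_counts(normalize=True))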
from sklearn.tree import DecisionTreeClassifier
# Instantiate dt
dt = DecisionTreeClassifier()
# Check the default hyperparameters
dt.get_params()
{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': None, 'splitter': 'best'}
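Before running a full grid search, a single hand-picked configuration can be scored with cross-validation. A minimal sketch (the specific values here are illustrative, not the course's choices):
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# One candidate configuration, evaluated with 5-fold CV on ROC AUC
candidate = DecisionTreeClassifier(max_depth=3, min_samples_leaf=0.12, random_state=1)
cv_auc = cross_val_score(candidate, X_train, y_train, scoring='roc_auc', cv=5)
print(cv_auc.mean())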
In this exercise, you'll manually set the grid of hyperparameters that will be used to tune the classification tree dt and find the optimal classifier in the next exercise.
# Define params_dt
params_dt = {
'max_depth': [2, 3, 4],
'min_samples_leaf': [0.12, 0.14, 0.16, 0.18],
}
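Grid search will exhaustively evaluate every combination in params_dt: 3 values of max_depth times 4 values of min_samples_leaf gives 12 candidates, i.e. 60 fits under 5-fold cross-validation. A minimal sketch that counts them with sklearn's ParameterGrid (not part of the exercise):
from sklearn.model_selection import ParameterGrid

# Enumerate every hyperparameter combination the grid search will try
candidates = list(ParameterGrid(params_dt))
print(len(candidates))   # 12
print(candidates[0])     # e.g. {'max_depth': 2, 'min_samples_leaf': 0.12}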
In this exercise, you'll perform grid search using 5-fold cross-validation to find dt's optimal hyperparameters. Note that because grid search is an exhaustive process, it may take a lot of time to train the model. Here you'll instantiate the GridSearchCV object and then, as discussed in the video, train it like any other scikit-learn estimator by using the .fit() method:
grid_object.fit(X_train, y_train)
from sklearn.model_selection import GridSearchCV
# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt, param_grid=params_dt, scoring='roc_auc', cv=5, n_jobs=-1)
grid_dt.fit(X_train, y_train)
GridSearchCV(cv=5, error_score=nan, estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=None, splitter='best'), iid='deprecated', n_jobs=-1, param_grid={'max_depth': [2, 3, 4], 'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='roc_auc', verbose=0)
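After fitting, the GridSearchCV object exposes the winning hyperparameter combination and its mean cross-validated score. A minimal sketch (the printed values depend on the run, so they are not shown here):
# Best hyperparameter combination found by the search
print(grid_dt.best_params_)
# Mean cross-validated ROC AUC of that combination
print(grid_dt.best_score_)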
In this exercise, you'll evaluate the test set ROC AUC score of grid_dt's optimal model.
In order to do so, you will first determine the probability of obtaining the positive label for each test set observation. You can use the predict_proba() method of an sklearn classifier to compute a 2D array containing the probabilities of the negative and positive class labels, respectively, along its columns.
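The column order of predict_proba() follows the classes_ attribute of the fitted estimator, so it is worth confirming that column 1 really is the positive (Liver_disease = 1) class. A minimal sketch:
# classes_ gives the label corresponding to each column of predict_proba()
print(grid_dt.best_estimator_.classes_)   # expected: [0 1], so column 1 is the positive class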
from sklearn.metrics import roc_auc_score
# Extract the best estimator
best_model = grid_dt.best_estimator_
# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:, 1]
# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)
# Print test_roc_auc
print("Test set ROC AUC score: {:.3f}".format(test_roc_auc))
Test set ROC AUC score: 0.681
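Although not part of the exercise, the matplotlib and seaborn imports at the top can be used to visualize the full ROC curve behind this score. A minimal sketch using sklearn's roc_curve:
from sklearn.metrics import roc_curve

# False/true positive rates over all probability thresholds
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)

sns.set()
plt.plot(fpr, tpr, label='Tuned tree (AUC = {:.3f})'.format(test_roc_auc))
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guess')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve - classification tree')
plt.legend()
plt.show()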
In the following exercises, you'll be revisiting the Bike Sharing Demand dataset that was introduced in a previous chapter. Recall that your task is to predict the bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C. For this purpose, you'll be tuning the hyperparameters of a Random Forest regressor.
bike = pd.read_csv('./datasets/bikes.csv')
bike.head()
|   | hr | holiday | workingday | temp | hum | windspeed | cnt | instant | mnth | yr | Clear to partly cloudy | Light Precipitation | Misty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.76 | 0.66 | 0.0000 | 149 | 13004 | 7 | 1 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0.74 | 0.70 | 0.1343 | 93 | 13005 | 7 | 1 | 1 | 0 | 0 |
| 2 | 2 | 0 | 0 | 0.72 | 0.74 | 0.0896 | 90 | 13006 | 7 | 1 | 1 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0.72 | 0.84 | 0.1343 | 33 | 13007 | 7 | 1 | 1 | 0 | 0 |
| 4 | 4 | 0 | 0 | 0.70 | 0.79 | 0.1940 | 4 | 13008 | 7 | 1 | 1 | 0 | 0 |
X = bike.drop('cnt', axis='columns')
y = bike['cnt']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
from sklearn.ensemble import RandomForestRegressor
# Instantiate rf
rf = RandomForestRegressor()
# Get hyperparameters
rf.get_params()
{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
In this exercise, you'll manually set the grid of hyperparameters that will be used to tune rf and find the optimal regressor. For this purpose, you will construct a grid of hyperparameters to tune the number of estimators, the maximum number of features used when splitting each node, and the minimum number of samples (or fraction of samples) per leaf.
# Define the dictionary 'params_rf'
params_rf = {
'n_estimators': [100, 350, 500],
'max_features': ['log2', 'auto', 'sqrt'],
'min_samples_leaf': [2, 10, 30],
}
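For reference, with the older scikit-learn version used in this notebook, max_features='auto' means a regressor considers all features at each split, while 'sqrt' and 'log2' consider roughly the square root and base-2 logarithm of the feature count. A minimal sketch of what that means for the 12 feature columns here (an illustration, not part of the exercise):
n_features = X_train.shape[1]   # 12 feature columns after dropping 'cnt'
# Approximate number of features examined at each split for each grid value
# ('auto' == all features for a regressor in this scikit-learn version)
print({'log2': int(np.log2(n_features)),
       'auto': n_features,
       'sqrt': int(np.sqrt(n_features))})   # e.g. {'log2': 3, 'auto': 12, 'sqrt': 3}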
In this exercise, you'll perform grid search using 3-fold cross-validation to find rf's optimal hyperparameters. To evaluate each model in the grid, you'll be using the negative mean squared error metric.
Note that because grid search is an exhaustive search process, it may take a lot of time to train the model. Here you'll instantiate the GridSearchCV object and then, as discussed in the video, train it like any other scikit-learn estimator by using the .fit() method:
grid_object.fit(X_train, y_train)
from sklearn.model_selection import GridSearchCV
# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf, param_grid=params_rf, scoring='neg_mean_squared_error', cv=3,
verbose=1, n_jobs=-1)
# fit model
grid_rf.fit(X_train, y_train)
Fitting 3 folds for each of 27 candidates, totalling 81 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 1.6s
[Parallel(n_jobs=-1)]: Done 81 out of 81 | elapsed: 3.7s finished
GridSearchCV(cv=3, error_score=nan, estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False), iid='deprecated', n_jobs=-1, param_grid={'max_features': ['log2', 'auto', 'sqrt'], 'min_samples_leaf': [2, 10, 30], 'n_estimators': [100, 350, 500]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='neg_mean_squared_error', verbose=1)
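Because the scoring metric is negative MSE (scikit-learn always maximizes the score, so errors are negated), the best cross-validated score can be converted to an RMSE by flipping its sign and taking the square root. A minimal sketch:
# Best hyperparameter combination and its cross-validated RMSE
print(grid_rf.best_params_)
best_cv_rmse = (-grid_rf.best_score_) ** 0.5
print('CV RMSE of best model: {:.3f}'.format(best_cv_rmse))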
In this last exercise of the course, you'll evaluate the test set RMSE of grid_rf's optimal model.
from sklearn.metrics import mean_squared_error as MSE
# Extract the best estimator
best_model = grid_rf.best_estimator_
# Predict test set labels
y_pred = best_model.predict(X_test)
# Compute rmse_test
rmse_test = MSE(y_test, y_pred) ** 0.5
# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test))
Test RMSE of best model: 54.358
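As a rough sanity check (not part of the exercise), the test RMSE can be compared with the spread of the rental counts themselves:
# An RMSE of ~54 rentals is best judged against the scale and variability of cnt
print('Mean cnt: {:.1f}, std: {:.1f}'.format(y_test.mean(), y_test.std()))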