This notebook shows how to use sklearn-genetic-opt
for hyperparameter optimization based on genetic algorithms (evolutionary computation). If you are interested in understanding how it works: sklearn-genetic-opt
uses DEAP under the hood.
%load_ext watermark
%watermark -p scikit-learn,sklearn,deap,sklearn_genetic
scikit-learn   : 1.0
sklearn        : 1.0
deap           : 1.3.1
sklearn_genetic: 0.7.0
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn import datasets
data = datasets.load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
X_train_sub, X_valid, y_train_sub, y_valid = \
train_test_split(X_train, y_train, test_size=0.2, random_state=1, stratify=y_train)
print('Train/Valid/Test sizes:', y_train.shape[0], y_valid.shape[0], y_test.shape[0])
Train/Valid/Test sizes: 398 80 171
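As a sanity check, the reported sizes follow directly from the split fractions. A minimal sketch, assuming the dataset's 569 total samples and scikit-learn's behavior of rounding the test fraction up:

```python
import math

n_total = 569                        # samples in load_breast_cancer()

n_test = math.ceil(n_total * 0.3)    # test_size=0.3 -> 171
n_train = n_total - n_test           # 398
n_valid = math.ceil(n_train * 0.2)   # test_size=0.2 of the train split -> 80
n_train_sub = n_train - n_valid      # 318

print(n_train, n_valid, n_test)      # 398 80 171
```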
Install: pip install sklearn-genetic-opt[all]
More info: https://sklearn-genetic-opt.readthedocs.io/en/stable/#
import numpy as np
import scipy.stats
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Integer, Categorical, Continuous
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=123)
params = {
'min_samples_split': Integer(2, 12),
'min_impurity_decrease': Continuous(0.0, 0.5),
'max_depth': Categorical([6, 16, None])
}
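Conceptually, each space type tells the genetic algorithm how to draw one gene of a candidate. The sketch below illustrates this with plain-Python stand-ins (uniform sampling is assumed; these are not the sklearn_genetic classes themselves):

```python
import random

random.seed(123)

def sample_integer(low, high):      # Integer: uniform int in [low, high]
    return random.randint(low, high)

def sample_continuous(low, high):   # Continuous: uniform float in [low, high)
    return random.uniform(low, high)

def sample_categorical(choices):    # Categorical: uniform pick from choices
    return random.choice(choices)

# One randomly sampled candidate from the search space defined above
candidate = {
    'min_samples_split': sample_integer(2, 12),
    'min_impurity_decrease': sample_continuous(0.0, 0.5),
    'max_depth': sample_categorical([6, 16, None]),
}
print(candidate)
```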
search = GASearchCV(
estimator=clf,
cv=5,
population_size=15,
generations=20,
tournament_size=3,
elitism=True,
keep_top_k=4,
crossover_probability=0.9,
mutation_probability=0.05,
param_grid=params,
criteria='max',
algorithm='eaMuCommaLambda',
n_jobs=-1)
search.fit(X_train, y_train)
search.best_score_
gen  nevals  fitness   fitness_std  fitness_max  fitness_min
0    15      0.773962  0.131052     0.914778     0.628165
1    28      0.888608  0.0588224    0.914778     0.673165
2    29      0.911424  0.00855215   0.914778     0.88962
3    28      0.914778  4.44089e-16  0.914778     0.914778
4    28      0.914778  4.44089e-16  0.914778     0.914778
5    28      0.914778  4.44089e-16  0.914778     0.914778
6    29      0.914778  4.44089e-16  0.914778     0.914778
7    27      0.918297  0.00703797   0.932373     0.914778
8    27      0.922989  0.0087779    0.932373     0.914778
9    29      0.928854  0.00703797   0.932373     0.914778
10   29      0.932373  3.33067e-16  0.932373     0.932373
11   29      0.932373  3.33067e-16  0.932373     0.932373
12   29      0.932373  3.33067e-16  0.932373     0.932373
13   29      0.932861  0.000974684  0.93481      0.932373
14   29      0.933023  0.00107755   0.93481      0.932373
15   28      0.93416   0.00107755   0.93481      0.932373
16   29      0.93481   3.33067e-16  0.93481      0.93481
17   29      0.93481   3.33067e-16  0.93481      0.93481
18   29      0.93481   3.33067e-16  0.93481      0.93481
19   28      0.93481   3.33067e-16  0.93481      0.93481
20   29      0.93481   3.33067e-16  0.93481      0.93481
0.9348101265822784
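The log above shows the population's fitness improving generation by generation. To make the mechanics concrete, here is a toy generational GA on a one-dimensional problem, mirroring the settings used above (population 15, 20 generations, tournament size 3, crossover/mutation probabilities 0.9/0.05). It is only a conceptual sketch: it uses blend crossover, Gaussian mutation, and simple elitism, not DEAP's actual eaMuCommaLambda algorithm.

```python
import random

random.seed(1)

# Toy fitness to maximize: f(x) = -(x - 3)^2, optimum at x = 3.
def fitness(x):
    return -(x - 3.0) ** 2

POP, GENS, TOURN = 15, 20, 3
CXPB, MUTPB = 0.9, 0.05

population = [random.uniform(-10, 10) for _ in range(POP)]

def tournament(pop):
    # Sample TOURN individuals at random; the fittest becomes a parent.
    return max(random.sample(pop, TOURN), key=fitness)

for gen in range(GENS):
    elite = max(population, key=fitness)   # elitism: best individual survives
    offspring = [elite]
    while len(offspring) < POP:
        p1, p2 = tournament(population), tournament(population)
        # Blend crossover with probability CXPB, else copy a parent
        child = (p1 + p2) / 2 if random.random() < CXPB else p1
        if random.random() < MUTPB:
            child += random.gauss(0, 1)    # Gaussian mutation
        offspring.append(child)
    population = offspring

best = max(population, key=fitness)
print(best)  # close to the optimum x = 3
```

GASearchCV does the same kind of loop, except each individual is a hyperparameter set and its fitness is the cross-validated score of the estimator.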
search.best_params_
{'min_samples_split': 8, 'min_impurity_decrease': 0.006258039752250311, 'max_depth': 16}
print(f"Training Accuracy: {search.best_estimator_.score(X_train, y_train):0.2f}")
print(f"Test Accuracy: {search.best_estimator_.score(X_test, y_test):0.2f}")
Training Accuracy: 0.99
Test Accuracy: 0.94