Last Update: 06/08/2020
PyCaret Version: 2.0
Author: Richard Cornelius Suwandi
The goal of this project is to build a binary classification model to predict the sex of Palmer Penguins using PyCaret 2.0.
Installing PyCaret is very easy and takes only a few minutes. If you're using Azure Notebooks or Google Colab, simply run the following code to install PyCaret.
# Install PyCaret 2.0
!pip install pycaret==2.0
# Import pandas for data loading and manipulation
import pandas as pd
# To render interactive plots in Google Colab
from pycaret.utils import enable_colab
enable_colab()
Colab mode activated.
The data used in this project is the Palmer Penguins dataset, which was released as an R package by Allison Horst.
The dataset consists of 4 numerical features:
- bill_length_mm
- bill_depth_mm
- flipper_length_mm
- body_mass_g

And 2 categorical features:
- species
- island

You can find the cleaned CSV file on my GitHub.
# Load the data
data = pd.read_csv('penguins.csv')
data.head()
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex
---|---|---|---|---|---|---|---
0 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | Female |
3 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | Female |
4 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | Male |
# View the data description
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   species            333 non-null    object
 1   island             333 non-null    object
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    int64
 5   body_mass_g        333 non-null    int64
 6   sex                333 non-null    object
dtypes: float64(2), int64(2), object(3)
memory usage: 18.3+ KB
To start with PyCaret, the first step is to import all methods and attributes from PyCaret’s classification class.
from pycaret.classification import *
The PyCaret workflow always starts with the `setup` function, which prepares the environment for the entire machine learning pipeline. Thus, `setup` must be executed before any other function.
clf = setup(
data=data,
target='sex',
train_size=0.8,
normalize=True,
session_id=123
)
Setup Succesfully Completed!
 | Description | Value
---|---|---
0 | session_id | 123 |
1 | Target Type | Binary |
2 | Label Encoded | Female: 0, Male: 1 |
3 | Original Data | (333, 7) |
4 | Missing Values | False |
5 | Numeric Features | 4 |
6 | Categorical Features | 2 |
7 | Ordinal Features | False |
8 | High Cardinality Features | False |
9 | High Cardinality Method | None |
10 | Sampled Data | (333, 7) |
11 | Transformed Train Set | (266, 10) |
12 | Transformed Test Set | (67, 10) |
13 | Numeric Imputer | mean |
14 | Categorical Imputer | constant |
15 | Normalize | True |
16 | Normalize Method | zscore |
17 | Transformation | False |
18 | Transformation Method | None |
19 | PCA | False |
20 | PCA Method | None |
21 | PCA Components | None |
22 | Ignore Low Variance | False |
23 | Combine Rare Levels | False |
24 | Rare Level Threshold | None |
25 | Numeric Binning | False |
26 | Remove Outliers | False |
27 | Outliers Threshold | None |
28 | Remove Multicollinearity | False |
29 | Multicollinearity Threshold | None |
30 | Clustering | False |
31 | Clustering Iteration | None |
32 | Polynomial Features | False |
33 | Polynomial Degree | None |
34 | Trignometry Features | False |
35 | Polynomial Threshold | None |
36 | Group Features | False |
37 | Feature Selection | False |
38 | Features Selection Threshold | None |
39 | Feature Interaction | False |
40 | Feature Ratio | False |
41 | Interaction Threshold | None |
42 | Fix Imbalance | False |
43 | Fix Imbalance Method | SMOTE |
As you can see above, the `setup` function handles the data preprocessing steps. It also splits the data into training and test sets.
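Conceptually, what `setup` did here is close to a standard scikit-learn preprocessing recipe: an 80/20 stratified split, z-score normalization, and one-hot encoding of the categorical columns. The following is a rough standalone sketch of that equivalence (illustrative toy data, not PyCaret's actual implementation):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the penguins data (values are illustrative only)
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Gentoo"] * 5,
    "body_mass_g": [3750, 5000, 3800, 5100] * 5,
    "sex": ["Male", "Female", "Female", "Male"] * 5,
})

# One-hot encode the categorical features, like PyCaret does internally
X = pd.get_dummies(df.drop(columns="sex"))
y = df["sex"]

# 80/20 split, analogous to train_size=0.8 in setup()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=123
)

# z-score normalization, analogous to normalize=True (zscore method)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Note that the scaler is fit on the training set only and then applied to the test set, which is also how PyCaret avoids leaking test-set statistics into training.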
Once `setup` is executed, we can use `compare_models` to quickly evaluate the performance of all the models in PyCaret's model library. This function trains every available model and scores them using Stratified Cross-Validation. The output prints a score grid with Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC (averaged across folds), determined by the fold parameter.
# Compare and sort models by AUC
compare_models(sort='AUC')
 | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec)
---|---|---|---|---|---|---|---|---|---
0 | Extra Trees Classifier | 0.9209 | 0.9720 | 0.9176 | 0.9266 | 0.9194 | 0.8420 | 0.8466 | 0.2486 |
1 | Linear Discriminant Analysis | 0.9137 | 0.9710 | 0.9027 | 0.9235 | 0.9106 | 0.8274 | 0.8312 | 0.0068 |
2 | CatBoost Classifier | 0.8981 | 0.9705 | 0.9093 | 0.8919 | 0.8976 | 0.7961 | 0.8015 | 0.9557 |
3 | Extreme Gradient Boosting | 0.9135 | 0.9689 | 0.9324 | 0.9016 | 0.9152 | 0.8268 | 0.8307 | 0.0253 |
4 | Logistic Regression | 0.9023 | 0.9686 | 0.9253 | 0.8879 | 0.9029 | 0.8045 | 0.8110 | 0.0163 |
5 | K Neighbors Classifier | 0.8981 | 0.9655 | 0.9093 | 0.8953 | 0.8993 | 0.7960 | 0.8013 | 0.0042 |
6 | Light Gradient Boosting Machine | 0.8984 | 0.9645 | 0.9104 | 0.8958 | 0.8998 | 0.7966 | 0.8033 | 0.0229 |
7 | Random Forest Classifier | 0.8986 | 0.9633 | 0.9022 | 0.8995 | 0.8982 | 0.7969 | 0.8019 | 0.1125 |
8 | Gradient Boosting Classifier | 0.8870 | 0.9631 | 0.8863 | 0.8896 | 0.8846 | 0.7738 | 0.7794 | 0.0878 |
9 | Ada Boost Classifier | 0.8875 | 0.9513 | 0.8874 | 0.8933 | 0.8871 | 0.7743 | 0.7800 | 0.0837 |
10 | Decision Tree Classifier | 0.8724 | 0.8728 | 0.8742 | 0.8757 | 0.8728 | 0.7447 | 0.7484 | 0.0055 |
11 | Naive Bayes | 0.6691 | 0.8032 | 0.6341 | 0.6886 | 0.6539 | 0.3394 | 0.3451 | 0.0063 |
12 | Quadratic Discriminant Analysis | 0.6722 | 0.7663 | 0.6780 | 0.7273 | 0.6529 | 0.3413 | 0.3876 | 0.0053 |
13 | SVM - Linear Kernel | 0.8947 | 0.0000 | 0.9093 | 0.8963 | 0.8978 | 0.7886 | 0.7984 | 0.0073 |
14 | Ridge Classifier | 0.9135 | 0.0000 | 0.9099 | 0.9171 | 0.9115 | 0.8270 | 0.8301 | 0.0079 |
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False)
The next step is to create a model with a selected algorithm using the `create_model` function. We just need to pass in the abbreviation (ID) of the model. You can check the function's docstring for the list of abbreviations.
# Get the ID of the models
help(create_model)
Help on function create_model in module pycaret.classification:

create_model(estimator=None, ensemble=False, method=None, fold=10, round=4, cross_validation=True, verbose=True, system=True, **kwargs)
    Description:
    ------------
    This function creates a model and scores it using Stratified Cross Validation.
    The output prints a score grid that shows Accuracy, AUC, Recall, Precision,
    F1, Kappa and MCC by fold (default = 10 Fold).

    This function returns a trained model object.

    setup() function must be called before using create_model()

    Example
    -------
    from pycaret.datasets import get_data
    juice = get_data('juice')
    experiment_name = setup(data = juice, target = 'Purchase')
    lr = create_model('lr')

    This will create a trained Logistic Regression model.

    Parameters
    ----------
    estimator : string / object, default = None
        Enter ID of the estimators available in model library or pass an untrained
        model object consistent with fit / predict API to train and evaluate model.
        All estimators support binary or multiclass problem.
        List of estimators in model library:

        ID          Name
        --------    ----------
        'lr'        Logistic Regression
        'knn'       K Nearest Neighbour
        'nb'        Naive Bayes
        'dt'        Decision Tree Classifier
        'svm'       SVM - Linear Kernel
        'rbfsvm'    SVM - Radial Kernel
        'gpc'       Gaussian Process Classifier
        'mlp'       Multi Level Perceptron
        'ridge'     Ridge Classifier
        'rf'        Random Forest Classifier
        'qda'       Quadratic Discriminant Analysis
        'ada'       Ada Boost Classifier
        'gbc'       Gradient Boosting Classifier
        'lda'       Linear Discriminant Analysis
        'et'        Extra Trees Classifier
        'xgboost'   Extreme Gradient Boosting
        'lightgbm'  Light Gradient Boosting
        'catboost'  CatBoost Classifier

    ensemble: Boolean, default = False
        True would result in an ensemble of estimator using the method parameter defined.

    method: String, 'Bagging' or 'Boosting', default = None.
        method must be defined when ensemble is set to True. Default method is set to None.

    fold: integer, default = 10
        Number of folds to be used in Kfold CV. Must be at least 2.

    round: integer, default = 4
        Number of decimal places the metrics in the score grid will be rounded to.

    cross_validation: bool, default = True
        When cross_validation set to False fold parameter is ignored and model is
        trained on entire training dataset. No metric evaluation is returned.

    verbose: Boolean, default = True
        Score grid is not printed when verbose is set to False.

    system: Boolean, default = True
        Must remain True all times. Only to be changed by internal functions.

    **kwargs:
        Additional keyword arguments to pass to the estimator.

    Returns:
    --------
    score grid:   A table containing the scores of the model across the kfolds.
                  Scoring metrics used are Accuracy, AUC, Recall, Precision, F1,
                  Kappa and MCC. Mean and standard deviation of the scores across
                  the folds are highlighted in yellow.

    model:        trained model object

    Warnings:
    ---------
    - 'svm' and 'ridge' doesn't support predict_proba method. As such, AUC will
      be returned as zero (0.0)
    - If target variable is multiclass (more than 2 classes), AUC will be
      returned as zero (0.0)
    - 'rbfsvm' and 'gpc' uses non-linear kernel and hence the fit time complexity
      is more than quadratic. These estimators are hard to scale on datasets with
      more than 10,000 samples.
# Create an Extra Trees Classifier
et = create_model('et')
 | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC
---|---|---|---|---|---|---|---
0 | 0.8889 | 0.9670 | 0.9231 | 0.8571 | 0.8889 | 0.7781 | 0.7802 |
1 | 0.9630 | 0.9945 | 1.0000 | 0.9286 | 0.9630 | 0.9260 | 0.9286 |
2 | 0.9630 | 1.0000 | 1.0000 | 0.9333 | 0.9655 | 0.9256 | 0.9282 |
3 | 0.9259 | 0.9615 | 1.0000 | 0.8750 | 0.9333 | 0.8508 | 0.8605 |
4 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
5 | 0.8148 | 0.9451 | 0.7143 | 0.9091 | 0.8000 | 0.6322 | 0.6481 |
6 | 0.8846 | 0.9290 | 0.8462 | 0.9167 | 0.8800 | 0.7692 | 0.7715 |
7 | 0.9231 | 0.9941 | 0.9231 | 0.9231 | 0.9231 | 0.8462 | 0.8462 |
8 | 0.9231 | 0.9763 | 0.9231 | 0.9231 | 0.9231 | 0.8462 | 0.8462 |
9 | 0.9231 | 0.9527 | 0.8462 | 1.0000 | 0.9167 | 0.8462 | 0.8563 |
Mean | 0.9209 | 0.9720 | 0.9176 | 0.9266 | 0.9194 | 0.8420 | 0.8466 |
SD | 0.0484 | 0.0238 | 0.0888 | 0.0433 | 0.0524 | 0.0962 | 0.0931 |
# Tune the classifier
tuned_et = tune_model(et, optimize='AUC')
 | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC
---|---|---|---|---|---|---|---
0 | 0.9259 | 0.9670 | 0.9231 | 0.9231 | 0.9231 | 0.8516 | 0.8516 |
1 | 0.9630 | 1.0000 | 1.0000 | 0.9286 | 0.9630 | 0.9260 | 0.9286 |
2 | 0.9630 | 1.0000 | 1.0000 | 0.9333 | 0.9655 | 0.9256 | 0.9282 |
3 | 0.9259 | 0.9615 | 1.0000 | 0.8750 | 0.9333 | 0.8508 | 0.8605 |
4 | 0.9630 | 0.9973 | 0.9286 | 1.0000 | 0.9630 | 0.9260 | 0.9286 |
5 | 0.8519 | 0.9451 | 0.7857 | 0.9167 | 0.8462 | 0.7049 | 0.7127 |
6 | 0.8846 | 0.9349 | 0.8462 | 0.9167 | 0.8800 | 0.7692 | 0.7715 |
7 | 0.9231 | 0.9941 | 0.9231 | 0.9231 | 0.9231 | 0.8462 | 0.8462 |
8 | 0.9231 | 0.9763 | 0.9231 | 0.9231 | 0.9231 | 0.8462 | 0.8462 |
9 | 0.9231 | 0.9675 | 0.8462 | 1.0000 | 0.9167 | 0.8462 | 0.8563 |
Mean | 0.9246 | 0.9744 | 0.9176 | 0.9339 | 0.9237 | 0.8493 | 0.8530 |
SD | 0.0336 | 0.0222 | 0.0694 | 0.0363 | 0.0359 | 0.0670 | 0.0660 |
tuned_et
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=60, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=175, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False)
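Under the hood, `tune_model` performs a random search over a predefined hyperparameter grid with cross-validation. A simplified standalone sketch of the same idea with scikit-learn's `RandomizedSearchCV` (the grid values here are illustrative, not PyCaret's actual internal grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data as a stand-in
X, y = make_classification(n_samples=200, n_features=6, random_state=123)

# Illustrative hyperparameter grid; PyCaret's internal grid differs
param_grid = {
    "n_estimators": [100, 150, 175, 200],
    "max_depth": [None, 20, 40, 60],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=123),
    param_distributions=param_grid,
    n_iter=10,           # number of random candidates to try
    scoring="roc_auc",   # analogous to optimize='AUC'
    cv=10,               # 10-fold CV, matching PyCaret's default
    random_state=123,
)
search.fit(X, y)
print(search.best_params_)
```

The tuned hyperparameters visible in the output above (`max_depth=60`, `n_estimators=175`) are exactly the kind of values such a search selects from its grid.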
The `plot_model` function provides tools to further analyze the performance of a model. It takes the model as input and returns the specified plot. Let's go over some examples.
plot_model(tuned_et)
As we can see, the AUC for each class is nearly perfect: 0.99.
plot_model(tuned_et, plot = 'confusion_matrix')
The confusion matrix also shows that our model does a great job of classifying the two classes.
# Precision Recall Curve
plot_model(tuned_et, plot = 'pr')
The PR curve shows an average precision of around 0.99, which is almost perfect.
plot_model(tuned_et, plot = 'class_report')
The model performs well on the metrics for both classes, with an F1 score of roughly 94%.
plot_model(tuned_et, plot='feature')
The feature importance plot above clearly shows how each feature affects the predicted class. The body mass and the bill size of the penguins are the most important features in our dataset. This agrees with our intuition that male penguins tend to be heavier and have bigger bills.
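For tree-based models like ours, the values behind this plot come from the fitted estimator's `feature_importances_` attribute. A standalone sketch of ranking them with plain scikit-learn (synthetic data, penguin-style feature names used only for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data with 4 features, as a stand-in for the penguin measurements
X, y = make_classification(n_samples=300, n_features=4, n_informative=2, random_state=123)
feature_names = ["body_mass_g", "bill_length_mm", "bill_depth_mm", "flipper_length_mm"]

model = ExtraTreesClassifier(n_estimators=100, random_state=123).fit(X, y)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances sum to 1, so each value can be read as the feature's relative share of the model's decisions.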
Alternatively, we can use the `evaluate_model` function, which creates a user interface with all available plots for a given model.
evaluate_model(tuned_et)
The `interpret_model` function returns an interpretation plot based on the test / hold-out set. It only supports tree-based algorithms. This function is based on SHAP (SHapley Additive exPlanations), a unified approach to explaining the output of any machine learning model. SHAP connects game theory with local explanations.
# Interpret the model using SHAP values
interpret_model(tuned_et)
The test set consists of the remaining 20% of the data that PyCaret automatically held out during `setup`. Evaluating on it is important to check that the model is not overfitting.
# Make predictions on the test set
predict_model(tuned_et)
 | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC
---|---|---|---|---|---|---|---|---
0 | Extra Trees Classifier | 0.9403 | 0.9902 | 0.9118 | 0.9688 | 0.9394 | 0.8807 | 0.8822 |
 | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | species_Adelie | species_Chinstrap | species_Gentoo | island_Biscoe | island_Dream | island_Torgersen | sex | Label | Score
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.221082 | -0.287276 | -0.712196 | -1.190361 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | 0 | 0.0514 |
1 | 0.019633 | 1.289308 | -0.354918 | 0.239977 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 | 1 | 0.9771 |
2 | -0.584712 | 0.628159 | -0.426373 | -0.381909 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 | 1 | 0.8971 |
3 | 0.678920 | -1.100997 | 1.074194 | 0.675297 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0 | 0 | 0.3771 |
4 | -0.987609 | -0.083846 | -0.926562 | -1.625681 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
62 | -0.529772 | 0.729875 | -0.855107 | -1.097078 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 0 | 0.3429 |
63 | 0.221082 | -0.388992 | 1.574384 | 2.167824 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1 | 1 | 0.9486 |
64 | -0.419891 | -1.253570 | 0.645461 | 0.613109 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0 | 0 | 0.0057 |
65 | 0.862055 | -0.744994 | 0.502550 | 1.421560 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1 | 1 | 0.8686 |
66 | 0.276022 | -0.083846 | -0.354918 | -0.879418 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | 0 | 0.0686 |
67 rows × 13 columns
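The `Label` and `Score` columns above are simply the predicted class and the probability of that prediction. With a plain scikit-learn classifier, the equivalent is `predict` plus `predict_proba` (standalone sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data, split 80/20 like our PyCaret setup
X, y = make_classification(n_samples=150, n_features=4, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=123
)

model = ExtraTreesClassifier(random_state=123).fit(X_train, y_train)

labels = model.predict(X_test)                     # like the Label column
scores = model.predict_proba(X_test).max(axis=1)   # like the Score column
print(labels[:5], scores[:5])
```

Because the score is the probability of the predicted class, it is always at least 0.5 in a binary problem.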
# Finalize the model
finalize_model(tuned_et)
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=60, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=175, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False)
PyCaret allows us to save the entire pipeline along with the trained model so that it is ready to be deployed. It's recommended to include the date of the experiment in the file name.
# Save the model
save_model(tuned_et, 'et_model_05082020')
Transformation Pipeline and Model Succesfully Saved
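`save_model` writes the transformation pipeline and the trained model to a single pickle file, which can later be restored in PyCaret with `load_model('et_model_05082020')`. Under the hood this is ordinary joblib serialization; a plain-sklearn sketch of the same save/restore round trip:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=4, random_state=123)

# A preprocessing + model pipeline, analogous to what save_model persists
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", ExtraTreesClassifier(random_state=123)),
]).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "et_model.pkl")
joblib.dump(pipe, path)       # like save_model(tuned_et, 'et_model_05082020')

restored = joblib.load(path)  # like load_model('et_model_05082020')
print((restored.predict(X) == pipe.predict(X)).all())
```

Because the preprocessing steps are saved inside the same object, the restored pipeline can score raw, untransformed data directly.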
If you have any queries or suggestions, feel free to contact me on LinkedIn.