Last Update: 06/08/2020
PyCaret Version: 2.0
Author: Richard Cornelius Suwandi
The goal of this project is to build a binary classification model to predict the sex of Palmer Penguins using PyCaret 2.0.
Installing PyCaret is very easy and takes only a few minutes. If you're using Azure Notebooks or Google Colab, simply run the following code to install PyCaret.
# Install PyCaret 2.0
!pip install pycaret==2.0
# Import pandas for data loading and manipulation
import pandas as pd
# To render interactive plots in Google Colab
from pycaret.utils import enable_colab
enable_colab()
Colab mode activated.
The data used in this project is the Palmer Penguins dataset, which was released as an R package by Allison Horst.
The dataset consists of 4 numerical features:
- bill_length_mm
- bill_depth_mm
- flipper_length_mm
- body_mass_g

And 2 categorical features:
- species
- island

You can find the cleaned CSV file on my GitHub.
# Load the data
data = pd.read_csv('penguins.csv')
data.head()
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex
---|---|---|---|---|---|---|---
0 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | Female |
3 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | Female |
4 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | Male |
# View the data description
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   species            333 non-null    object
 1   island             333 non-null    object
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    int64
 5   body_mass_g        333 non-null    int64
 6   sex                333 non-null    object
dtypes: float64(2), int64(2), object(3)
memory usage: 18.3+ KB
To start with PyCaret, the first step is to import all methods and attributes from PyCaret’s classification class.
from pycaret.classification import *
The PyCaret workflow always starts with the `setup` function, which prepares the environment for the entire machine learning pipeline. Thus, `setup` must be executed before any other function.
clf = setup(
data=data,
target='sex',
train_size=0.8,
normalize=True,
session_id=123
)
Setup Succesfully Completed!
 | Description | Value
---|---|---
0 | session_id | 123 |
1 | Target Type | Binary |
2 | Label Encoded | Female: 0, Male: 1 |
3 | Original Data | (333, 7) |
4 | Missing Values | False |
5 | Numeric Features | 4 |
6 | Categorical Features | 2 |
7 | Ordinal Features | False |
8 | High Cardinality Features | False |
9 | High Cardinality Method | None |
10 | Sampled Data | (333, 7) |
11 | Transformed Train Set | (266, 10) |
12 | Transformed Test Set | (67, 10) |
13 | Numeric Imputer | mean |
14 | Categorical Imputer | constant |
15 | Normalize | True |
16 | Normalize Method | zscore |
17 | Transformation | False |
18 | Transformation Method | None |
19 | PCA | False |
20 | PCA Method | None |
21 | PCA Components | None |
22 | Ignore Low Variance | False |
23 | Combine Rare Levels | False |
24 | Rare Level Threshold | None |
25 | Numeric Binning | False |
26 | Remove Outliers | False |
27 | Outliers Threshold | None |
28 | Remove Multicollinearity | False |
29 | Multicollinearity Threshold | None |
30 | Clustering | False |
31 | Clustering Iteration | None |
32 | Polynomial Features | False |
33 | Polynomial Degree | None |
34 | Trignometry Features | False |
35 | Polynomial Threshold | None |
36 | Group Features | False |
37 | Feature Selection | False |
38 | Features Selection Threshold | None |
39 | Feature Interaction | False |
40 | Feature Ratio | False |
41 | Interaction Threshold | None |
42 | Fix Imbalance | False |
43 | Fix Imbalance Method | SMOTE |
As you can see above, the `setup` function handles the data preprocessing steps. It also splits the data into training and test sets.
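Conceptually, what `setup` did here is close to a standard scikit-learn preprocessing recipe: an 80/20 stratified split, z-score normalization, and one-hot encoding of the categorical columns. The following is a rough standalone sketch of that equivalence (illustrative toy data, not PyCaret's actual implementation):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the penguins data (values are illustrative only)
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Gentoo"] * 5,
    "body_mass_g": [3750, 5000, 3800, 5100] * 5,
    "sex": ["Male", "Female", "Female", "Male"] * 5,
})

# One-hot encode the categorical features, like PyCaret does internally
X = pd.get_dummies(df.drop(columns="sex"))
y = df["sex"]

# 80/20 split, analogous to train_size=0.8 in setup()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=123
)

# z-score normalization, analogous to normalize=True (zscore method)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Note that the scaler is fit on the training set only and then applied to the test set, which is also how PyCaret avoids leaking test-set statistics into training.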
Once `setup` is executed, we can use `compare_models` to quickly evaluate the performance of all the models in PyCaret's model library. This function trains every available model and scores them using Stratified Cross-Validation. The output prints a score grid with Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC (averaged across folds), determined by the fold parameter.
# Compare and sort models by AUC
compare_models(sort='AUC')
 | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec)
---|---|---|---|---|---|---|---|---|---
0 | Extra Trees Classifier | 0.9209 | 0.9720 | 0.9176 | 0.9266 | 0.9194 | 0.8420 | 0.8466 | 0.2486 |
1 | Linear Discriminant Analysis | 0.9137 | 0.9710 | 0.9027 | 0.9235 | 0.9106 | 0.8274 | 0.8312 | 0.0068 |
2 | CatBoost Classifier | 0.8981 | 0.9705 | 0.9093 | 0.8919 | 0.8976 | 0.7961 | 0.8015 | 0.9557 |
3 | Extreme Gradient Boosting | 0.9135 | 0.9689 | 0.9324 | 0.9016 | 0.9152 | 0.8268 | 0.8307 | 0.0253 |
4 | Logistic Regression | 0.9023 | 0.9686 | 0.9253 | 0.8879 | 0.9029 | 0.8045 | 0.8110 | 0.0163 |
5 | K Neighbors Classifier | 0.8981 | 0.9655 | 0.9093 | 0.8953 | 0.8993 | 0.7960 | 0.8013 | 0.0042 |
6 | Light Gradient Boosting Machine | 0.8984 | 0.9645 | 0.9104 | 0.8958 | 0.8998 | 0.7966 | 0.8033 | 0.0229 |
7 | Random Forest Classifier | 0.8986 | 0.9633 | 0.9022 | 0.8995 | 0.8982 | 0.7969 | 0.8019 | 0.1125 |
8 | Gradient Boosting Classifier | 0.8870 | 0.9631 | 0.8863 | 0.8896 | 0.8846 | 0.7738 | 0.7794 | 0.0878 |
9 | Ada Boost Classifier | 0.8875 | 0.9513 | 0.8874 | 0.8933 | 0.8871 | 0.7743 | 0.7800 | 0.0837 |
10 | Decision Tree Classifier | 0.8724 | 0.8728 | 0.8742 | 0.8757 | 0.8728 | 0.7447 | 0.7484 | 0.0055 |
11 | Naive Bayes | 0.6691 | 0.8032 | 0.6341 | 0.6886 | 0.6539 | 0.3394 | 0.3451 | 0.0063 |
12 | Quadratic Discriminant Analysis | 0.6722 | 0.7663 | 0.6780 | 0.7273 | 0.6529 | 0.3413 | 0.3876 | 0.0053 |
13 | SVM - Linear Kernel | 0.8947 | 0.0000 | 0.9093 | 0.8963 | 0.8978 | 0.7886 | 0.7984 | 0.0073 |
14 | Ridge Classifier | 0.9135 | 0.0000 | 0.9099 | 0.9171 | 0.9115 | 0.8270 | 0.8301 | 0.0079 |
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False)
The next step is to create a model with a selected algorithm using the `create_model` function. We just need to pass in the abbreviation (ID) of the model. You can check the function's docstring for the list of abbreviations.
# Get the ID of the models
help(create_model)
Help on function create_model in module pycaret.classification:

create_model(estimator=None, ensemble=False, method=None, fold=10, round=4, cross_validation=True, verbose=True, system=True, **kwargs)
    Description:
    ------------
    This function creates a model and scores it using Stratified Cross Validation.
    The output prints a score grid that shows Accuracy, AUC, Recall, Precision,
    F1, Kappa and MCC by fold (default = 10 Fold).

    This function returns a trained model object.

    setup() function must be called before using create_model()

    Example
    -------
    from pycaret.datasets import get_data
    juice = get_data('juice')
    experiment_name = setup(data = juice, target = 'Purchase')
    lr = create_model('lr')

    This will create a trained Logistic Regression model.

    Parameters
    ----------
    estimator : string / object, default = None
        Enter ID of the estimators available in model library or pass an untrained
        model object consistent with fit / predict API to train and evaluate model.
        All estimators support binary or multiclass problem.
        List of estimators in model library:

        ID          Name
        --------    ----------
        'lr'        Logistic Regression
        'knn'       K Nearest Neighbour
        'nb'        Naive Bayes
        'dt'        Decision Tree Classifier
        'svm'       SVM - Linear Kernel
        'rbfsvm'    SVM - Radial Kernel
        'gpc'       Gaussian Process Classifier
        'mlp'       Multi Level Perceptron
        'ridge'     Ridge Classifier
        'rf'        Random Forest Classifier
        'qda'       Quadratic Discriminant Analysis
        'ada'       Ada Boost Classifier
        'gbc'       Gradient Boosting Classifier
        'lda'       Linear Discriminant Analysis
        'et'        Extra Trees Classifier
        'xgboost'   Extreme Gradient Boosting
        'lightgbm'  Light Gradient Boosting
        'catboost'  CatBoost Classifier

    ensemble: Boolean, default = False
        True would result in an ensemble of estimator using the method parameter defined.

    method: String, 'Bagging' or 'Boosting', default = None.
        method must be defined when ensemble is set to True. Default method is set to None.

    fold: integer, default = 10
        Number of folds to be used in Kfold CV. Must be at least 2.

    round: integer, default = 4
        Number of decimal places the metrics in the score grid will be rounded to.

    cross_validation: bool, default = True
        When cross_validation set to False fold parameter is ignored and model is
        trained on entire training dataset. No metric evaluation is returned.

    verbose: Boolean, default = True
        Score grid is not printed when verbose is set to False.

    system: Boolean, default = True
        Must remain True all times. Only to be changed by internal functions.

    **kwargs:
        Additional keyword arguments to pass to the estimator.

    Returns:
    --------
    score grid:   A table containing the scores of the model across the kfolds.
                  Scoring metrics used are Accuracy, AUC, Recall, Precision, F1,
                  Kappa and MCC. Mean and standard deviation of the scores across
                  the folds are highlighted in yellow.

    model:        trained model object

    Warnings:
    ---------
    - 'svm' and 'ridge' doesn't support predict_proba method. As such, AUC will
      be returned as zero (0.0)
    - If target variable is multiclass (more than 2 classes), AUC will be
      returned as zero (0.0)
    - 'rbfsvm' and 'gpc' uses non-linear kernel and hence the fit time complexity
      is more than quadratic. These estimators are hard to scale on datasets with
      more than 10,000 samples.
# Create an Extra Trees Classifier
et = create_model('et')
 | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC
---|---|---|---|---|---|---|---
0 | 0.8889 | 0.9670 | 0.9231 | 0.8571 | 0.8889 | 0.7781 | 0.7802 |
1 | 0.9630 | 0.9945 | 1.0000 | 0.9286 | 0.9630 | 0.9260 | 0.9286 |
2 | 0.9630 | 1.0000 | 1.0000 | 0.9333 | 0.9655 | 0.9256 | 0.9282 |
3 | 0.9259 | 0.9615 | 1.0000 | 0.8750 | 0.9333 | 0.8508 | 0.8605 |
4 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
5 | 0.8148 | 0.9451 | 0.7143 | 0.9091 | 0.8000 | 0.6322 | 0.6481 |
6 | 0.8846 | 0.9290 | 0.8462 | 0.9167 | 0.8800 | 0.7692 | 0.7715 |
7 | 0.9231 | 0.9941 | 0.9231 | 0.9231 | 0.9231 | 0.8462 | 0.8462 |
8 | 0.9231 | 0.9763 | 0.9231 | 0.9231 | 0.9231 | 0.8462 | 0.8462 |
9 | 0.9231 | 0.9527 | 0.8462 | 1.0000 | 0.9167 | 0.8462 | 0.8563 |
Mean | 0.9209 | 0.9720 | 0.9176 | 0.9266 | 0.9194 | 0.8420 | 0.8466 |
SD | 0.0484 | 0.0238 | 0.0888 | 0.0433 | 0.0524 | 0.0962 | 0.0931 |
# Tune the classifier
tuned_et = tune_model(et, optimize='AUC')
 | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC
---|---|---|---|---|---|---|---
0 | 0.9259 | 0.9670 | 0.9231 | 0.9231 | 0.9231 | 0.8516 | 0.8516 |
1 | 0.9630 | 1.0000 | 1.0000 | 0.9286 | 0.9630 | 0.9260 | 0.9286 |
2 | 0.9630 | 1.0000 | 1.0000 | 0.9333 | 0.9655 | 0.9256 | 0.9282 |
3 | 0.9259 | 0.9615 | 1.0000 | 0.8750 | 0.9333 | 0.8508 | 0.8605 |
4 | 0.9630 | 0.9973 | 0.9286 | 1.0000 | 0.9630 | 0.9260 | 0.9286 |
5 | 0.8519 | 0.9451 | 0.7857 | 0.9167 | 0.8462 | 0.7049 | 0.7127 |
6 | 0.8846 | 0.9349 | 0.8462 | 0.9167 | 0.8800 | 0.7692 | 0.7715 |
7 | 0.9231 | 0.9941 | 0.9231 | 0.9231 | 0.9231 | 0.8462 | 0.8462 |
8 | 0.9231 | 0.9763 | 0.9231 | 0.9231 | 0.9231 | 0.8462 | 0.8462 |
9 | 0.9231 | 0.9675 | 0.8462 | 1.0000 | 0.9167 | 0.8462 | 0.8563 |
Mean | 0.9246 | 0.9744 | 0.9176 | 0.9339 | 0.9237 | 0.8493 | 0.8530 |
SD | 0.0336 | 0.0222 | 0.0694 | 0.0363 | 0.0359 | 0.0670 | 0.0660 |
tuned_et
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=60, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=175, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False)
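Under the hood, `tune_model` performs a random search over a predefined hyperparameter grid with cross-validation. A simplified standalone sketch of the same idea with scikit-learn's `RandomizedSearchCV` (the grid values here are illustrative, not PyCaret's actual internal grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data as a stand-in
X, y = make_classification(n_samples=200, n_features=6, random_state=123)

# Illustrative hyperparameter grid; PyCaret's internal grid differs
param_grid = {
    "n_estimators": [100, 150, 175, 200],
    "max_depth": [None, 20, 40, 60],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=123),
    param_distributions=param_grid,
    n_iter=10,           # number of random candidates to try
    scoring="roc_auc",   # analogous to optimize='AUC'
    cv=10,               # 10-fold CV, matching PyCaret's default
    random_state=123,
)
search.fit(X, y)
print(search.best_params_)
```

The tuned hyperparameters visible in the output above (`max_depth=60`, `n_estimators=175`) are exactly the kind of values such a search selects from its grid.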
The `plot_model` function provides tools to further analyze the performance of a model. It takes the model as input and returns the specified plot. Let's go over some examples.
plot_model(tuned_et)
As we can see, the AUC for each class is nearly perfect: 0.99.
plot_model(tuned_et, plot = 'confusion_matrix')
The confusion matrix also shows that our model does a great job of classifying the two classes.
# Precision Recall Curve
plot_model(tuned_et, plot = 'pr')
The PR curve shows an average precision of around 0.99, which is almost perfect.
plot_model(tuned_et, plot = 'class_report')
The model performs well on the metrics for both classes, with an F1 score of roughly 94%.
plot_model(tuned_et, plot='feature')
The feature importance plot above clearly shows how each feature affects the predicted class. The body mass and the bill size of the penguins are the most important features in our dataset. This agrees with our intuition that male penguins tend to be heavier and have bigger bills.
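For tree-based models like ours, the values behind this plot come from the fitted estimator's `feature_importances_` attribute. A standalone sketch of ranking them with plain scikit-learn (synthetic data, penguin-style feature names used only for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data with 4 features, as a stand-in for the penguin measurements
X, y = make_classification(n_samples=300, n_features=4, n_informative=2, random_state=123)
feature_names = ["body_mass_g", "bill_length_mm", "bill_depth_mm", "flipper_length_mm"]

model = ExtraTreesClassifier(n_estimators=100, random_state=123).fit(X, y)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances sum to 1, so each value can be read as the feature's relative share of the model's decisions.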
Alternatively, we can use the `evaluate_model` function, which creates a user interface with all available plots for a given model.
evaluate_model(tuned_et)
The `interpret_model` function returns an interpretation plot based on the test / hold-out set. It only supports tree-based algorithms. This function is based on SHAP (SHapley Additive exPlanations), a unified approach to explaining the output of any machine learning model. SHAP connects game theory with local explanations.
# Interpret the model using SHAP values
interpret_model(tuned_et)
The test set consists of the remaining 20% of the data that PyCaret automatically held out during `setup`. Evaluating on it is important to check that the model is not overfitting.
# Make predictions on the test set
predict_model(tuned_et)
 | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC
---|---|---|---|---|---|---|---|---
0 | Extra Trees Classifier | 0.9403 | 0.9902 | 0.9118 | 0.9688 | 0.9394 | 0.8807 | 0.8822 |
 | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | species_Adelie | species_Chinstrap | species_Gentoo | island_Biscoe | island_Dream | island_Torgersen | sex | Label | Score
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.221082 | -0.287276 | -0.712196 | -1.190361 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | 0 | 0.0514 |
1 | 0.019633 | 1.289308 | -0.354918 | 0.239977 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 | 1 | 0.9771 |
2 | -0.584712 | 0.628159 | -0.426373 | -0.381909 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 | 1 | 0.8971 |
3 | 0.678920 | -1.100997 | 1.074194 | 0.675297 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0 | 0 | 0.3771 |
4 | -0.987609 | -0.083846 | -0.926562 | -1.625681 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
62 | -0.529772 | 0.729875 | -0.855107 | -1.097078 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 0 | 0.3429 |
63 | 0.221082 | -0.388992 | 1.574384 | 2.167824 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1 | 1 | 0.9486 |
64 | -0.419891 | -1.253570 | 0.645461 | 0.613109 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0 | 0 | 0.0057 |
65 | 0.862055 | -0.744994 | 0.502550 | 1.421560 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1 | 1 | 0.8686 |
66 | 0.276022 | -0.083846 | -0.354918 | -0.879418 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | 0 | 0.0686 |
67 rows × 13 columns
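The `Label` and `Score` columns above are simply the predicted class and the probability of that prediction. With a plain scikit-learn classifier, the equivalent is `predict` plus `predict_proba` (standalone sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data, split 80/20 like our PyCaret setup
X, y = make_classification(n_samples=150, n_features=4, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=123
)

model = ExtraTreesClassifier(random_state=123).fit(X_train, y_train)

labels = model.predict(X_test)                     # like the Label column
scores = model.predict_proba(X_test).max(axis=1)   # like the Score column
print(labels[:5], scores[:5])
```

Because the score is the probability of the predicted class, it is always at least 0.5 in a binary problem.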
# Finalize the model
finalize_model(tuned_et)
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=60, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=175, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False)
PyCaret allows us to save the entire pipeline along with the trained model so that it is ready to be deployed. It's recommended to include the date of the experiment in the file name.
# Save the model
save_model(tuned_et, 'et_model_05082020')
Transformation Pipeline and Model Succesfully Saved
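`save_model` writes the transformation pipeline and the trained model to a single pickle file, which can later be restored in PyCaret with `load_model('et_model_05082020')`. Under the hood this is ordinary joblib serialization; a plain-sklearn sketch of the same save/restore round trip:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=4, random_state=123)

# A preprocessing + model pipeline, analogous to what save_model persists
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", ExtraTreesClassifier(random_state=123)),
]).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "et_model.pkl")
joblib.dump(pipe, path)       # like save_model(tuned_et, 'et_model_05082020')

restored = joblib.load(path)  # like load_model('et_model_05082020')
print((restored.predict(X) == pipe.predict(X)).all())
```

Because the preprocessing steps are saved inside the same object, the restored pipeline can score raw, untransformed data directly.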
If you have any queries or suggestions, feel free to contact me on LinkedIn.