Binary Classification Tutorial (CLF102) - Level Intermediate

Created using: PyCaret 2.2
Date Updated: November 20, 2020

1.0 Tutorial Objective

Welcome to the Binary Classification Tutorial (CLF102) - Level Intermediate. This tutorial assumes that you have completed Binary Classification Tutorial (CLF101) - Level Beginner. If you haven't used PyCaret before and this is your first tutorial, we strongly recommend going back and progressing through the beginner tutorial to understand the basics of working in PyCaret.

In this tutorial we will use the pycaret.classification module to learn:

  • Normalization: How to normalize and scale the dataset
  • Transformation: How to apply transformations that make the data linear and approximately normal
  • Ignore Low Variance: How to remove features with statistically insignificant variances to make the experiment more efficient
  • Remove Multi-collinearity: How to remove multi-collinearity from the dataset to boost performance of Linear algorithms
  • Group Features: How to extract statistical information from related features in the dataset
  • Bin Numeric Variables: How to bin numeric variables and transform numeric features into categorical ones using Sturges' rule
  • Model Ensembling and Stacking: How to boost model performance using several ensembling techniques such as Bagging, Boosting, Soft/hard Voting and Generalized Stacking
  • Model Calibration: How to calibrate probabilities of a classification model
  • Experiment Logging: How to log experiments in PyCaret using MLFlow backend

Read Time: Approx. 60 Minutes

1.1 Installing PyCaret

If you haven't installed PyCaret yet, please follow the link to Beginner's Tutorial for instructions on how to install.

1.2 Pre-Requisites

  • Python 3.6 or greater
  • PyCaret 2.0 or greater
  • Internet connection to load data from pycaret's repository
  • Completion of Binary Classification Tutorial (CLF101) - Level Beginner

1.3 For Google colab users:

If you are running this notebook on Google Colab, run the following code at the top of your notebook to display interactive visuals.

from pycaret.utils import enable_colab
enable_colab()

2.0 Brief overview of techniques covered in this tutorial

Before we get into the practical execution of the techniques mentioned in Section 1, it is important to understand what these techniques are and when to use them. More often than not, these techniques help linear and parametric algorithms; however, it is not surprising to also see performance gains in tree-based models. The explanations below are only brief, and we recommend extra reading to dive deeper and get a more thorough understanding of these techniques.

  • Normalization: Normalization / Scaling (often used interchangeably with standardization) is used to transform the actual values of numeric variables in a way that provides helpful properties for machine learning. Many algorithms such as Logistic Regression, Support Vector Machine, K Nearest Neighbors and Naive Bayes assume that all features are centered around zero and have variances of the same order of magnitude. If a particular feature in a dataset has a variance that is larger by orders of magnitude than the others, the model may be dominated by that feature and could perform poorly. For instance, in the dataset we are using for this example, the AGE feature ranges between 21 and 79 while other numeric features range from 10,000 to 1,000,000. Read more
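In PyCaret this is enabled with `setup(..., normalize=True)` (the default `normalize_method` is z-score). As a minimal sketch of what z-score scaling does, using made-up values rather than the actual credit dataset:

```python
import numpy as np

# Two features on very different scales, similar to AGE vs. BILL_AMT
age = np.array([21.0, 35.0, 50.0, 79.0])
bill = np.array([10_000.0, 250_000.0, 600_000.0, 1_000_000.0])

def zscore(x):
    # Center around zero and scale to unit variance
    return (x - x.mean()) / x.std()

age_z, bill_z = zscore(age), zscore(bill)
# After scaling, both features have mean ~0 and standard deviation 1,
# so neither dominates a distance- or gradient-based algorithm.
```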

  • Transformation: While normalization rescales the range of the data to remove the impact of magnitude in variance, transformation is a more radical technique as it changes the shape of the distribution so that the transformed data can be represented by a normal or approximately normal distribution. In general, you should transform the data if you are using algorithms that assume normality or a gaussian distribution. Examples of such models are Logistic Regression, Linear Discriminant Analysis (LDA) and Gaussian Naive Bayes. (Pro tip: any method with “Gaussian” in the name probably assumes normality.) Read more
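PyCaret applies this via `setup(..., transformation=True)`. To sketch the effect, here is a simple log transform (one member of the power-transform family) applied to synthetic right-skewed data; the skewness measure drops sharply after transforming:

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=10_000)  # strongly right-skewed

def sample_skew(x):
    # Third standardized moment: ~0 for a symmetric distribution
    return np.mean(((x - x.mean()) / x.std()) ** 3)

transformed = np.log1p(skewed)
# sample_skew(transformed) is far closer to zero than sample_skew(skewed)
```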

  • Ignore Low Variance: Datasets can sometimes contain categorical features with a single unique value or a small number of values across samples. These kinds of features are not only non-informative and add no value, but can also be harmful for some algorithms. Features with only one unique value, or a few dominant values, across samples can be removed from the dataset using the ignore low variance option in PyCaret.
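In PyCaret this is the `ignore_low_variance=True` option of `setup`. Here is a hand-rolled sketch of the idea, dropping columns whose most frequent value dominates the samples (the column names and the 95% threshold are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "constant": ["A"] * 100,                    # single unique value
    "near_constant": ["X"] * 98 + ["Y", "Z"],   # one dominant value
    "informative": [str(i % 5) for i in range(100)],
})

def drop_low_variance(df, dominance=0.95):
    # Keep only columns whose most frequent value covers
    # at most `dominance` of the samples
    keep = [c for c in df.columns
            if df[c].value_counts(normalize=True).iloc[0] <= dominance]
    return df[keep]

filtered = drop_low_variance(df)
# Only the "informative" column survives
```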

  • Multi-collinearity: Multi-collinearity is a state of very high intercorrelations or inter-associations among the independent features in the dataset. It is a type of disturbance in the data that is not handled well by machine learning models (mostly linear algorithms). Multi-collinearity inflates the variance of the model's coefficient estimates, making them unstable and hard to interpret. This can lead to overfitting, where the model performs well on a known training set but fails on an unseen test set. Read more
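PyCaret removes correlated features with `setup(..., remove_multicollinearity=True, multicollinearity_threshold=0.9)`. A minimal sketch of the underlying idea on synthetic data, dropping one feature from each highly correlated pair:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=500)
df = pd.DataFrame({
    "x": x,
    "x_copy": x * 2 + rng.normal(scale=0.01, size=500),  # nearly collinear with x
    "z": rng.normal(size=500),                            # independent
})

def drop_correlated(df, threshold=0.9):
    corr = df.corr().abs()
    # Look only at the upper triangle so each pair is tested once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)

reduced = drop_correlated(df)
# "x_copy" is dropped; "x" and "z" remain
```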

  • Group Features: Sometimes datasets contain features that are related at a sample level. For example, in the credit dataset there are features named BILL_AMT1 through BILL_AMT6, related in such a way that BILL_AMT1 is the bill amount from 1 month ago and BILL_AMT6 is the bill amount from 6 months ago. Such groups can be used to extract additional features based on statistical properties of the distribution, such as the mean, median, variance and standard deviation.
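PyCaret does this via the `group_features` parameter of `setup`. As a sketch with toy data shaped like BILL_AMT1..BILL_AMT6 (the values are made up):

```python
import pandas as pd

# Toy version of BILL_AMT1..BILL_AMT6 for three customers
df = pd.DataFrame({f"BILL_AMT{i}": [1000 * i, 2000 * i, 500 * i]
                   for i in range(1, 7)})

group = [f"BILL_AMT{i}" for i in range(1, 7)]
# Derive per-sample statistics across the related columns
df["BILL_AMT_mean"] = df[group].mean(axis=1)
df["BILL_AMT_std"] = df[group].std(axis=1)
df["BILL_AMT_min"] = df[group].min(axis=1)
df["BILL_AMT_max"] = df[group].max(axis=1)
```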

  • Bin Numeric Variables: Binning or discretization is the process of transforming numerical variables into categorical features. An example would be the Age variable, which is a continuous distribution of numeric values that can be discretized into intervals (10-20 years, 21-30, etc.). Binning may improve the accuracy of a predictive model by reducing the noise or non-linearity in the data. PyCaret automatically determines the number and size of bins using Sturges' rule. Read more
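Sturges' rule sets the number of bins to ceil(1 + log2(n)), where n is the number of samples. A sketch with made-up ages (in PyCaret, binning is requested with the `bin_numeric_features` parameter of `setup`):

```python
import numpy as np
import pandas as pd

ages = pd.Series([21, 24, 30, 35, 41, 47, 52, 58, 63, 70, 74, 79])

# Sturges' rule: number of bins = 1 + log2(n), rounded up
n_bins = int(np.ceil(1 + np.log2(len(ages))))  # 12 samples -> 5 bins

# Discretize the continuous ages into equal-width intervals
age_binned = pd.cut(ages, bins=n_bins)
```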

  • Model Ensembling and Stacking: Ensemble modeling is a process where multiple diverse models are created to predict an outcome. This is achieved either by using many different modeling algorithms or using different samples of training data sets. The ensemble model then aggregates the predictions of each base model resulting in one final prediction for the unseen data. The motivation for using ensemble models is to reduce the generalization error of the prediction. As long as the base models are diverse and independent, the prediction error of the model decreases when the ensemble approach is used. The two most common methods in ensemble learning are Bagging and Boosting. Stacking is also a type of ensemble learning where predictions from multiple models are used as input features for a meta model that predicts the final outcome. Read more
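As a sketch of hard voting, the simplest of these aggregation schemes: given 0/1 predictions from three hypothetical base classifiers (the arrays below are made up), the ensemble takes the per-sample majority. In PyCaret, `blend_models` implements this for you.

```python
import numpy as np

# Hypothetical 0/1 predictions from three diverse base classifiers
pred_a = np.array([1, 0, 1, 1, 0])
pred_b = np.array([1, 1, 1, 0, 0])
pred_c = np.array([0, 0, 1, 1, 0])

# Hard voting: predict the majority class for each sample
votes = pred_a + pred_b + pred_c
ensemble = (votes >= 2).astype(int)
# → [1, 0, 1, 1, 0]
```

Soft voting works the same way but averages predicted probabilities instead of counting class labels, which is why it benefits from calibrated models.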

3.0 Dataset for the Tutorial

For this tutorial we will be using the same dataset that was used in Binary Classification Tutorial (CLF101) - Level Beginner.

Dataset Acknowledgements:

Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

The original dataset and data dictionary can be found here at the UCI Machine Learning Repository.

4.0 Getting the Data

You can download the data from the original source found here and load it using the pandas read_csv function, or you can use PyCaret's data repository to load the data using the get_data function (this will require an internet connection).

In [1]:
from pycaret.datasets import get_data
dataset = get_data('credit', profile=True)