Regression Tutorial (REG102) - Level Intermediate

Created using: PyCaret 2.2
Date Updated: November 25, 2020

1.0 Tutorial Objective

Welcome to the regression tutorial (REG102) - Level Intermediate. This tutorial assumes that you have completed Regression Tutorial (REG101) - Level Beginner. If you haven't used PyCaret before and this is your first tutorial, we strongly recommend that you go back and work through the beginner tutorial first to understand the basics of working in PyCaret.

In this tutorial we will use the pycaret.regression module to learn:

  • Normalization: How to normalize and scale the dataset
  • Transformation: How to apply transformations that make the data linear and approximately normal
  • Target Transformation: How to apply transformations to the target variable
  • Combine Rare Levels: How to combine rare levels in categorical features
  • Bin Numeric Variables: How to bin numeric variables and transform numeric features into categorical ones using Sturges' rule
  • Model Ensembling and Stacking: How to boost model performance using several ensembling techniques such as Bagging, Boosting, Voting and Generalized Stacking
  • Experiment Logging: How to log experiments in PyCaret using the MLflow backend
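
As a rough preview, most of the objectives above correspond to keyword arguments of the pycaret.regression setup function. The sketch below collects them in a plain dictionary. The argument names are those used in PyCaret 2.x (later major versions renamed or removed some of them), while the target column 'Price', the seed, the 0.05 threshold and the experiment name are assumptions made for illustration only. The helper run_setup is hypothetical and imports PyCaret lazily, so the dictionary can be inspected without PyCaret installed.

```python
# Keyword arguments for pycaret.regression.setup, as named in PyCaret 2.x.
# Values marked "assumed" or "hypothetical" are illustrative, not prescribed by the tutorial.
setup_kwargs = {
    "target": "Price",                         # assumed target column of the diamond data
    "session_id": 123,                         # arbitrary seed for reproducibility
    "normalize": True,                         # Normalization
    "transformation": True,                    # Transformation of the features
    "transform_target": True,                  # Target Transformation
    "combine_rare_levels": True,               # Combine Rare Levels
    "rare_level_threshold": 0.05,              # assumed cut-off below which a level is "rare"
    "bin_numeric_features": ["Carat Weight"],  # Bin Numeric Variables
    "log_experiment": True,                    # Experiment Logging (MLflow backend)
    "experiment_name": "diamond_reg102",       # hypothetical experiment name
}


def run_setup(data):
    """Initialise a PyCaret 2.x regression experiment with the options above."""
    # Imported here so the dictionary above can be used without PyCaret installed.
    from pycaret.regression import setup
    return setup(data=data, **setup_kwargs)
```

Calling run_setup(dataset) with a loaded DataFrame would start the experiment. Model ensembling and stacking are not setup options; they are applied to trained models later in the workflow.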

Read Time : Approx 60 Minutes

1.1 Installing PyCaret

If you haven't installed PyCaret yet, please follow the link to the Beginner's Tutorial for instructions on how to install it.

1.2 Pre-Requisites

  • Python 3.6 or greater
  • PyCaret 2.0 or greater
  • Internet connection to load data from pycaret's repository
  • Completion of Regression Tutorial (REG101) - Level Beginner

1.3 For Google Colab users:

If you are running this notebook on Google Colab, run the following code at the top of your notebook to display interactive visuals.

from pycaret.utils import enable_colab
enable_colab()

1.4 See also:

2.0 Brief overview of techniques covered in this tutorial

Before we get into the practical execution of the techniques mentioned above in Section 1, it is important to understand what these techniques are and when to use them. More often than not, these techniques help linear and parametric algorithms; however, it is not surprising to also see performance gains in tree-based models. The explanations below are only brief, and we recommend that you do extra reading to get a more thorough understanding of these techniques.

  • Normalization: Normalization / Scaling (often used interchangeably with standardization) transforms the actual values of numeric variables in a way that provides helpful properties for machine learning. Many algorithms such as Linear Regression, Support Vector Machine and K Nearest Neighbors assume that all features are centered around zero and have variances of the same order of magnitude. If a particular feature in a dataset has a variance that is orders of magnitude larger than that of the other features, it can dominate the model, which may then fail to learn from the other features and perform poorly.

  • Transformation: While normalization rescales the data to remove the impact of magnitude on variance, transformation is a more radical technique: it changes the shape of the distribution so that the transformed data can be represented by a normal or approximately normal distribution. In general, you should transform the data if you are using algorithms that assume normality or a Gaussian distribution. Examples of such models are Linear Regression, Lasso Regression and Ridge Regression.

  • Target Transformation: This is similar to the transformation technique explained above with the exception that this is only applied to the target variable. Read more to understand the effects of transforming the target variable in regression.

  • Combine Rare Levels: Sometimes categorical features have levels that are insignificant in the frequency distribution. As such, they may introduce noise into the dataset due to a limited sample size for learning. One way to deal with rare levels in categorical features is to combine them into a new class.

  • Bin Numeric Variables: Binning or discretization is the process of transforming numerical variables into categorical features. An example would be Carat Weight in this experiment. It is a continuous distribution of numeric values that can be discretized into intervals. Binning may improve the accuracy of a predictive model by reducing the noise or non-linearity in the data. PyCaret automatically determines the number and size of bins using Sturges' rule.

  • Model Ensembling and Stacking: Ensemble modeling is a process where multiple diverse models are created to predict an outcome. This is achieved either by using many different modeling algorithms or by using different samples of the training data. The ensemble model then aggregates the predictions of each base model, resulting in one final prediction for the unseen data. The motivation for using ensemble models is to reduce the generalization error of the prediction. As long as the base models are diverse and independent, the prediction error of the model decreases when the ensemble approach is used. The two most common methods in ensemble learning are Bagging and Boosting. Stacking is also a type of ensemble learning where predictions from multiple models are used as input features for a meta model that predicts the final outcome.

  • Tuning Hyperparameters of Ensemblers: Similar to hyperparameter tuning for a single machine learning model, we will also learn how to tune hyperparameters for an ensemble model.
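
The data preparation techniques in the list above can be illustrated with a few lines of plain Python. This is a minimal sketch under stated assumptions, not PyCaret's implementation: the price and color values are made up, a log transform stands in for whatever power transform the library applies, the 10% rarity threshold is arbitrary, and equal-width bins are only one possible way to turn the Sturges bin count into intervals.

```python
import math
import statistics
from collections import Counter


def zscore(values):
    """Normalization: rescale to mean 0 and standard deviation 1 (shape unchanged)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def skewness(values):
    """Population skewness; 0 for a perfectly symmetric sample."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return statistics.mean([((v - mean) / std) ** 3 for v in values])


def combine_rare_levels(levels, threshold=0.10, new_label="rare"):
    """Replace levels whose relative frequency is below `threshold` with one label."""
    counts = Counter(levels)
    n = len(levels)
    rare = {level for level, count in counts.items() if count / n < threshold}
    return [new_label if level in rare else level for level in levels]


def sturges_bin_count(n):
    """Sturges' rule: ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1


def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 over k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]


# Hypothetical right-skewed prices (not taken from the tutorial's dataset).
prices = [500, 700, 900, 1200, 1500, 2000, 3000, 5000, 12000, 40000]
scaled = zscore(prices)                  # normalization
logged = [math.log(p) for p in prices]   # transformation (log as a stand-in)

# Normalization leaves the skewness untouched; the transformation reduces it.
print("skew raw / scaled / logged:",
      round(skewness(prices), 3), round(skewness(scaled), 3), round(skewness(logged), 3))

# Hypothetical categorical feature in which two levels appear only once in 20 rows.
colors = ["G"] * 8 + ["H"] * 6 + ["F"] * 4 + ["D"] + ["I"]
merged = combine_rare_levels(colors)
print("levels after merging:", sorted(set(merged)))

# Binning: the number of bins comes from Sturges' rule.
k = sturges_bin_count(len(prices))
bins = equal_width_bins(prices, k)
print("bin count:", k, "bin index per price:", bins)
```

Because z-scoring is a linear rescaling, it cannot change the shape of a distribution, which is why the tutorial treats transformation as the more radical of the two steps.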

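The claim that combining diverse, independent base models reduces prediction error can be checked with a small simulation. The sketch below is only an illustration under simplifying assumptions: the "models" are not trained at all but simulated as the true value plus independent random noise, the target values and noise level are made up, and the ensemble is a plain average (the regression analogue of voting). It does not implement bagging, boosting or stacking.

```python
import math
import random

random.seed(42)  # fixed seed so the sketch is repeatable


def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def noisy_model(targets, noise_sd):
    """Stand-in for a trained base model: the truth plus independent random error."""
    return [t + random.gauss(0, noise_sd) for t in targets]


# Hypothetical target values (not taken from the tutorial's dataset).
targets = [random.uniform(1000, 10000) for _ in range(500)]

# Five simulated base models whose errors are independent of each other.
base_predictions = [noisy_model(targets, noise_sd=800) for _ in range(5)]

# Averaging ensemble: the mean of the base predictions for each row.
ensemble = [sum(row) / len(row) for row in zip(*base_predictions)]

base_errors = [rmse(p, targets) for p in base_predictions]
ensemble_error = rmse(ensemble, targets)

print("base model RMSEs:", [round(e) for e in base_errors])
print("ensemble RMSE:", round(ensemble_error))
```

With independent errors, averaging cancels part of the noise, so the ensemble error falls below that of every base model. Real base models trained on the same data have correlated errors, so the gain in practice is usually smaller.
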
3.0 Dataset for the Tutorial

For this tutorial we will be using the same dataset that was used in Regression Tutorial (REG101) - Level Beginner.

Dataset Acknowledgements:

This case was prepared by Greg Mills (MBA ’07) under the supervision of Phillip E. Pfeifer, Alumni Research Professor of Business Administration. Copyright (c) 2007 by the University of Virginia Darden School Foundation, Charlottesville, VA. All rights reserved.

The original dataset and description can be found here.

4.0 Getting the Data

You can download the data from the original source found here and load it using the pandas read_csv function, or you can use PyCaret's data repository to load the data using the get_data function (this requires an internet connection).

In [1]:
from pycaret.datasets import get_data
dataset = get_data('diamond', profile=True)