Regression Tutorial (REG101) - Level Beginner

Created using: PyCaret 2.2
Date Updated: November 25, 2020

1.0 Tutorial Objective

Welcome to Regression Tutorial (REG101) - Level Beginner. This tutorial assumes that you are new to PyCaret and looking to get started with Regression using the pycaret.regression Module.

In this tutorial we will learn:

  • Getting Data: How to import data from the PyCaret repository
  • Setting up Environment: How to set up an experiment in PyCaret and get started with building regression models
  • Create Model: How to create a model, perform cross validation and evaluate regression metrics
  • Tune Model: How to automatically tune the hyperparameters of a regression model
  • Plot Model: How to analyze model performance using various plots
  • Finalize Model: How to finalize the best model at the end of the experiment
  • Predict Model: How to make predictions on new / unseen data
  • Save / Load Model: How to save / load a model for future use

Read Time : Approx. 30 Minutes

1.1 Installing PyCaret

The first step to get started with PyCaret is to install PyCaret. Installation is easy and will only take a few minutes. Follow the instructions below:

Installing PyCaret in Local Jupyter Notebook

pip install pycaret

Installing PyCaret on Google Colab or Azure Notebooks

!pip install pycaret

1.2 Pre-Requisites

  • Python 3.6 or greater
  • PyCaret 2.0 or greater
  • Internet connection to load data from PyCaret's repository
  • Basic Knowledge of Regression

1.3 For Google Colab Users:

If you are running this notebook on Google Colab, run the following code at the top of your notebook to display interactive visuals.

from pycaret.utils import enable_colab
enable_colab()

2.0 What is Regression?

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable', or 'target') and one or more independent variables (often called 'features', 'predictors', or 'covariates'). The objective of regression in machine learning is to predict continuous values such as sales amount, quantity, temperature etc.

Learn More about Regression

3.0 Overview of the Regression Module in PyCaret

PyCaret's Regression module (pycaret.regression) is a supervised machine learning module which is used for predicting continuous values / outcomes using various techniques and algorithms. Regression can be used for predicting values / outcomes such as sales, units sold, temperature or any number which is continuous.

PyCaret's regression module has over 25 algorithms and 10 plots to analyze the performance of models. Be it hyper-parameter tuning, ensembling or advanced techniques like stacking, PyCaret's regression module has it all.

4.0 Dataset for the Tutorial

For this tutorial we will use a dataset based on a case study called "Sarah Gets a Diamond". This case was presented in the first year decision analysis course at Darden School of Business (University of Virginia). The basis for the data is a case regarding a hopeless romantic MBA student choosing the right diamond for his bride-to-be, Sarah. The data contains 6000 records for training. Short descriptions of each column are as follows:

  • ID: Uniquely identifies each observation (diamond)
  • Carat Weight: The weight of the diamond in metric carats. One carat is equal to 0.2 grams, roughly the same weight as a paperclip
  • Cut: One of five values indicating the cut of the diamond in the following order of desirability (Signature-Ideal, Ideal, Very Good, Good, Fair)
  • Color: One of six values indicating the diamond's color in the following order of desirability (D, E, F - Colorless, G, H, I - Near colorless)
  • Clarity: One of seven values indicating the diamond's clarity in the following order of desirability (F - Flawless, IF - Internally Flawless, VVS1 or VVS2 - Very, Very Slightly Included, or VS1 or VS2 - Very Slightly Included, SI1 - Slightly Included)
  • Polish: One of four values indicating the diamond's polish (ID - Ideal, EX - Excellent, VG - Very Good, G - Good)
  • Symmetry: One of four values indicating the diamond's symmetry (ID - Ideal, EX - Excellent, VG - Very Good, G - Good)
  • Report: One of two values "AGSL" or "GIA" indicating which grading agency reported the qualities of the diamond
  • Price: The amount in USD that the diamond is valued at (Target Column)

Dataset Acknowledgement:

This case was prepared by Greg Mills (MBA ’07) under the supervision of Phillip E. Pfeifer, Alumni Research Professor of Business Administration. Copyright (c) 2007 by the University of Virginia Darden School Foundation, Charlottesville, VA. All rights reserved.

The original dataset and description can be found here.

5.0 Getting the Data

You can download the data from the original source found here and load it using pandas (Learn How), or you can use PyCaret's data repository to load the data using the get_data() function (this requires an internet connection).
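If you would rather load the file yourself, a minimal sketch using pandas is shown below (the filename is illustrative only; point it at wherever you saved the downloaded CSV):

import pandas as pd

# Illustrative path only -- replace with the location of your downloaded copy.
dataset = pd.read_csv('diamond.csv')
dataset.head()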

In [1]:
from pycaret.datasets import get_data
dataset = get_data('diamond')
Carat Weight Cut Color Clarity Polish Symmetry Report Price
0 1.10 Ideal H SI1 VG EX GIA 5169
1 0.83 Ideal H VS1 ID ID AGSL 3470
2 0.85 Ideal H SI1 EX EX GIA 3183
3 0.91 Ideal E SI1 VG VG GIA 4370
4 0.83 Ideal G SI1 EX EX GIA 3171
In [2]:
#check the shape of data
dataset.shape
Out[2]:
(6000, 8)

In order to demonstrate the predict_model() function on unseen data, a sample of 600 records has been withheld from the original dataset to be used for predictions. This should not be confused with a train/test split, as this particular split is performed to simulate a real-life scenario. Another way to think about this is that these 600 records were not available at the time the machine learning experiment was performed.

In [3]:
data = dataset.sample(frac=0.9, random_state=786)
data_unseen = dataset.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (5400, 8)
Unseen Data For Predictions: (600, 8)

6.0 Setting up Environment in PyCaret

The setup() function initializes the environment in PyCaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in PyCaret. It takes two mandatory parameters: a pandas dataframe and the name of the target column. All other parameters are optional and are used to customize the pre-processing pipeline (we will see them in later tutorials).

When setup() is executed, PyCaret's inference algorithm will automatically infer the data types for all features based on certain properties. The data types should be inferred correctly, but this is not always the case. To account for this, PyCaret displays a table containing the features and their inferred data types after setup() is executed. If all of the data types are correctly identified, enter can be pressed to continue, or quit can be typed to end the experiment. Ensuring that the data types are correct is of fundamental importance in PyCaret as it automatically performs a few pre-processing tasks which are imperative to any machine learning experiment. These tasks are performed differently for each data type, which means it is very important for them to be correctly configured.

In later tutorials we will learn how to overwrite PyCaret's inferred data type using the numeric_features and categorical_features parameters in setup().
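For reference, a minimal sketch of what such an override could look like for this dataset is shown below. It is illustrative only and is not used in this tutorial; the silent parameter (available in PyCaret 2.x) skips the data type confirmation prompt.

from pycaret.regression import setup

# Illustrative only: explicitly declare column types instead of relying on inference.
exp_override = setup(data = data, target = 'Price', session_id = 123,
                     numeric_features = ['Carat Weight'],
                     categorical_features = ['Cut', 'Color', 'Clarity',
                                             'Polish', 'Symmetry', 'Report'],
                     silent = True)   # skip the data type confirmation prompt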

In [4]:
from pycaret.regression import *
exp_reg101 = setup(data = data, target = 'Price', session_id=123) 
  Description Value
0 session_id 123
1 Target Price
2 Original Data (5400, 8)
3 Missing Values False
4 Numeric Features 1
5 Categorical Features 6
6 Ordinal Features False
7 High Cardinality Features False
8 High Cardinality Method None
9 Transformed Train Set (3779, 28)
10 Transformed Test Set (1621, 28)
11 Shuffle Train-Test True
12 Stratify Train-Test False
13 Fold Generator KFold
14 Fold Number 10
15 CPU Jobs -1
16 Use GPU False
17 Log Experiment False
18 Experiment Name reg-default-name
19 USI c90d
20 Imputation Type simple
21 Iterative Imputation Iteration None
22 Numeric Imputer mean
23 Iterative Imputation Numeric Model None
24 Categorical Imputer constant
25 Iterative Imputation Categorical Model None
26 Unknown Categoricals Handling least_frequent
27 Normalize False
28 Normalize Method None
29 Transformation False
30 Transformation Method None
31 PCA False
32 PCA Method None
33 PCA Components None
34 Ignore Low Variance False
35 Combine Rare Levels False
36 Rare Level Threshold None
37 Numeric Binning False
38 Remove Outliers False
39 Outliers Threshold None
40 Remove Multicollinearity False
41 Multicollinearity Threshold None
42 Remove Perfect Collinearity True
43 Clustering False
44 Clustering Iteration None
45 Polynomial Features False
46 Polynomial Degree None
47 Trignometry Features False
48 Polynomial Threshold None
49 Group Features False
50 Feature Selection False
51 Feature Selection Method classic
52 Features Selection Threshold None
53 Feature Interaction False
54 Feature Ratio False
55 Interaction Threshold None
56 Transform Target False
57 Transform Target Method box-cox

Once the setup has been successfully executed it prints the information grid which contains several important pieces of information. Most of the information is related to the pre-processing pipeline which is constructed when setup() is executed. The majority of these features are out of scope for the purposes of this tutorial. However, a few important things to note at this stage include:

  • session_id : A pseudo-random number distributed as a seed in all functions for later reproducibility. If no session_id is passed, a random number is automatically generated that is distributed to all functions. In this experiment, the session_id is set as 123 for later reproducibility.

  • Original Data : Displays the original shape of dataset. In this experiment (5400, 8) means 5400 samples and 8 features including the target column.

  • Missing Values : When there are missing values in the original data, this will show as True. For this experiment there are no missing values in the dataset.

  • Numeric Features : Number of features inferred as numeric. In this dataset, 1 out of 8 features is inferred as numeric.

  • Categorical Features : Number of features inferred as categorical. In this dataset, 6 out of 8 features are inferred as categorical.

  • Transformed Train Set : Displays the shape of the transformed training set. Notice that the original shape of (5400, 8) is transformed into (3779, 28) for the transformed train set. The number of features has increased from 8 to 28 due to categorical encoding.

  • Transformed Test Set : Displays the shape of the transformed test/hold-out set. There are 1621 samples in the test/hold-out set. This split is based on the default value of 70/30, which can be changed using the train_size parameter in setup() (a sketch follows after this list).
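For example, a sketch of changing the split to 80/20 (illustrative only; not used in this tutorial):

# Illustrative only: hold out 20% for test instead of the default 30%.
exp_80_20 = setup(data = data, target = 'Price', train_size = 0.8, session_id = 123)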

Notice how a few tasks that are imperative for modeling are handled automatically, such as missing value imputation (in this case there are no missing values in the training data, but we still need imputers for unseen data), categorical encoding, etc. Most of the parameters in setup() are optional and are used for customizing the pre-processing pipeline. These parameters are out of scope for this tutorial, but as you progress to the intermediate and expert levels, we will cover them in much greater detail.

7.0 Comparing All Models

Comparing all models to evaluate performance is the recommended starting point for modeling once the setup is completed (unless you know exactly what kind of model you need, which is often not the case). This function trains all models in the model library and scores them using k-fold cross validation for metric evaluation. The output prints a score grid that shows the average MAE, MSE, RMSE, R2, RMSLE and MAPE across the folds (10 by default), along with the training time.

In [5]:
best = compare_models(exclude = ['ransac'])
  Model MAE MSE RMSE R2 RMSLE MAPE TT (Sec)
et Extra Trees Regressor 762.0118 2763999.1585 1612.2410 0.9729 0.0817 0.0607 0.1340
xgboost Extreme Gradient Boosting 708.8427 2799609.2534 1607.9791 0.9724 0.0743 0.0541 0.1600
rf Random Forest Regressor 760.6304 2929683.1860 1663.0148 0.9714 0.0818 0.0597 0.1260
lightgbm Light Gradient Boosting Machine 752.6446 3056347.8515 1687.9907 0.9711 0.0773 0.0567 0.0260
gbr Gradient Boosting Regressor 920.2913 3764303.9252 1901.1793 0.9633 0.1024 0.0770 0.0400
dt Decision Tree Regressor 1003.1237 5305620.3379 2228.7271 0.9476 0.1083 0.0775 0.0100
ridge Ridge Regression 2413.5704 14120492.3795 3726.1643 0.8621 0.6689 0.2875 0.5400
lasso Lasso Regression 2412.1922 14246798.1211 3744.2305 0.8608 0.6767 0.2866 0.5130
llar Lasso Least Angle Regression 2355.6152 14272020.4389 3745.3095 0.8607 0.6391 0.2728 0.0090
br Bayesian Ridge 2415.8031 14270771.8397 3746.9951 0.8606 0.6696 0.2873 0.0100
lr Linear Regression 2418.7036 14279370.2389 3748.9580 0.8604 0.6690 0.2879 1.3240
huber Huber Regressor 1936.1465 18599243.6579 4252.8771 0.8209 0.4333 0.1657 0.0220
par Passive Aggressive Regressor 1944.1634 19955672.9330 4400.2133 0.8083 0.4317 0.1594 0.0150
omp Orthogonal Matching Pursuit 2792.7313 23728654.4124 4829.3171 0.7678 0.5818 0.2654 0.0100
ada AdaBoost Regressor 4232.2217 25201423.0703 5012.4175 0.7467 0.5102 0.5970 0.0380
knn K Neighbors Regressor 2968.0750 29627913.0479 5421.7241 0.7051 0.3664 0.2730 0.0220
catboost CatBoost Regressor 380.6615 1130071.5563 787.8300 0.5899 0.0407 0.0299 0.3640
en Elastic Net 5029.5913 56399795.8780 7467.6598 0.4472 0.5369 0.5845 0.0100
dummy Dummy Regressor 7280.3308 101221941.4046 10032.1624 -0.0014 0.7606 0.8969 0.0100
lar Least Angle Regression 11020.5511 1563301113.8194 20750.5953 -16.8045 0.8989 1.5590 0.0090

Two simple words of code (not even a full line) have trained and evaluated over 20 models using cross validation. The score grid printed above highlights the highest performing metric for comparison purposes only. The grid is sorted by R2 (highest to lowest) by default, which can be changed by passing the sort parameter. For example, compare_models(sort = 'RMSLE') will sort the grid by RMSLE (lowest to highest, since lower is better). If you want to change the fold parameter from the default value of 10 to a different value, you can use the fold parameter. For example, compare_models(fold = 5) will compare all models using 5-fold cross validation. Reducing the number of folds will improve the training time. By default, compare_models returns the best performing model based on the default sort order, but it can also return a list of the top N models by using the n_select parameter.

Notice how the exclude parameter is used to exclude certain models (in this case RANSAC).
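Putting these options together, a quick sketch (illustrative only; not run in this tutorial):

# Sort by RMSLE, use 5-fold CV, exclude RANSAC, and return the top 3 models as a list.
top3 = compare_models(exclude = ['ransac'], sort = 'RMSLE', fold = 5, n_select = 3)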

8.0 Create a Model

create_model is the most granular function in PyCaret and is often the foundation behind most PyCaret functionality. As the name suggests, this function trains and evaluates a model using cross validation, which can be set with the fold parameter. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold.

For the remaining part of this tutorial, we will work with the below models as our candidate models. The selections are for illustration purposes only and do not necessarily mean they are the top performing or ideal for this type of data.

  • AdaBoost Regressor ('ada')
  • Light Gradient Boosting Machine ('lightgbm')
  • Decision Tree ('dt')

There are 25 regressors available in the model library of PyCaret. To see the list of all regressors, either check the docstring or use the models() function, as shown below.

In [6]:
models()
Out[6]:
Name Reference Turbo
ID
lr Linear Regression sklearn.linear_model._base.LinearRegression True
lasso Lasso Regression sklearn.linear_model._coordinate_descent.Lasso True
ridge Ridge Regression sklearn.linear_model._ridge.Ridge True
en Elastic Net sklearn.linear_model._coordinate_descent.Elast... True
lar Least Angle Regression sklearn.linear_model._least_angle.Lars True
llar Lasso Least Angle Regression sklearn.linear_model._least_angle.LassoLars True
omp Orthogonal Matching Pursuit sklearn.linear_model._omp.OrthogonalMatchingPu... True
br Bayesian Ridge sklearn.linear_model._bayes.BayesianRidge True
ard Automatic Relevance Determination sklearn.linear_model._bayes.ARDRegression False
par Passive Aggressive Regressor sklearn.linear_model._passive_aggressive.Passi... True
ransac Random Sample Consensus sklearn.linear_model._ransac.RANSACRegressor False
tr TheilSen Regressor sklearn.linear_model._theil_sen.TheilSenRegressor False
huber Huber Regressor sklearn.linear_model._huber.HuberRegressor True
kr Kernel Ridge sklearn.kernel_ridge.KernelRidge False
svm Support Vector Regression sklearn.svm._classes.SVR False
knn K Neighbors Regressor sklearn.neighbors._regression.KNeighborsRegressor True
dt Decision Tree Regressor sklearn.tree._classes.DecisionTreeRegressor True
rf Random Forest Regressor sklearn.ensemble._forest.RandomForestRegressor True
et Extra Trees Regressor sklearn.ensemble._forest.ExtraTreesRegressor True
ada AdaBoost Regressor sklearn.ensemble._weight_boosting.AdaBoostRegr... True
gbr Gradient Boosting Regressor sklearn.ensemble._gb.GradientBoostingRegressor True
mlp MLP Regressor sklearn.neural_network._multilayer_perceptron.... False
xgboost Extreme Gradient Boosting xgboost.sklearn.XGBRegressor True
lightgbm Light Gradient Boosting Machine lightgbm.sklearn.LGBMRegressor True
catboost CatBoost Regressor catboost.core.CatBoostRegressor True
dummy Dummy Regressor sklearn.dummy.DummyRegressor True

8.1 AdaBoost Regressor

In [7]:
ada = create_model('ada')
  MAE MSE RMSE R2 RMSLE MAPE
Fold            
0 4101.8809 23013830.0177 4797.2732 0.7473 0.4758 0.5470
1 4251.5693 29296751.6657 5412.6474 0.7755 0.4940 0.5702
2 4047.8474 22291660.1785 4721.4045 0.7955 0.5068 0.5871
3 4298.3867 23482783.6839 4845.9038 0.7409 0.5089 0.5960
4 3888.5584 24461807.7242 4945.8880 0.6949 0.4764 0.5461
5 4566.4889 29733914.8752 5452.8813 0.7462 0.5462 0.6598
6 4628.7271 27841092.1974 5276.4659 0.7384 0.5549 0.6676
7 4316.4317 25979752.0083 5097.0336 0.6715 0.5034 0.5858
8 3931.2163 21097072.3513 4593.1549 0.7928 0.4858 0.5513
9 4291.1097 24815566.0009 4981.5225 0.7637 0.5495 0.6592
Mean 4232.2217 25201423.0703 5012.4175 0.7467 0.5102 0.5970
Std 233.2282 2804219.3826 277.6577 0.0375 0.0284 0.0457
In [8]:
print(ada)
AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
                  n_estimators=50, random_state=123)

8.2 Light Gradient Boosting Machine

In [9]:
lightgbm = create_model('lightgbm')
  MAE MSE RMSE R2 RMSLE MAPE
Fold            
0 625.1813 1051762.9578 1025.5550 0.9885 0.0715 0.0526
1 797.6185 5638866.1771 2374.6297 0.9568 0.0727 0.0537
2 829.4586 3328375.4390 1824.3836 0.9695 0.0860 0.0619
3 720.3923 1697211.3816 1302.7707 0.9813 0.0714 0.0554
4 645.6800 1799949.1196 1341.6218 0.9775 0.0745 0.0534
5 830.7176 6423604.0184 2534.4830 0.9452 0.0810 0.0567
6 799.9136 3353992.2636 1831.3908 0.9685 0.0793 0.0585
7 714.3607 1930222.6458 1389.3245 0.9756 0.0732 0.0556
8 784.7648 2211933.1546 1487.2569 0.9783 0.0766 0.0582
9 778.3590 3127561.3571 1768.4913 0.9702 0.0872 0.0609
Mean 752.6446 3056347.8515 1687.9907 0.9711 0.0773 0.0567
Std 69.3829 1661349.5128 455.0112 0.0119 0.0055 0.0030

8.3 Decision Tree

In [10]:
dt = create_model('dt')
  MAE MSE RMSE R2 RMSLE MAPE
Fold            
0 859.1907 2456840.0599 1567.4310 0.9730 0.1016 0.0727
1 1122.9409 9852564.2047 3138.8795 0.9245 0.1102 0.0758
2 911.3452 2803662.6885 1674.4141 0.9743 0.0988 0.0729
3 1002.5575 3926739.3726 1981.6002 0.9567 0.1049 0.0772
4 1167.8154 9751516.1909 3122.7418 0.8784 0.1226 0.0876
5 1047.7778 7833770.7037 2798.8874 0.9331 0.1128 0.0791
6 1010.0816 3989282.4802 1997.3188 0.9625 0.1106 0.0803
7 846.8085 2182534.9007 1477.3405 0.9724 0.0933 0.0709
8 1001.8451 4904945.0821 2214.7111 0.9518 0.1053 0.0734
9 1060.8742 5354347.6956 2313.9463 0.9490 0.1230 0.0847
Mean 1003.1237 5305620.3379 2228.7271 0.9476 0.1083 0.0775
Std 100.2165 2734194.7557 581.7181 0.0280 0.0091 0.0052

Notice that the Mean score of each model matches the score printed by compare_models(). This is because the metrics printed in the compare_models() score grid are the average scores across all CV folds. Similar to compare_models(), if you want to change the fold parameter from the default value of 10 to a different value, you can use the fold parameter. For example, create_model('dt', fold = 5) will create a Decision Tree using 5-fold cross validation.
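For example, a sketch combining the fold override with pull(), a PyCaret 2.x utility that returns the last printed score grid as a pandas DataFrame (assumption: pull() is available in pycaret.regression in your installed version):

# 5-fold cross validation instead of the default 10.
dt5 = create_model('dt', fold = 5)

# Capture the score grid printed above as a DataFrame for further analysis.
dt5_results = pull()
print(dt5_results)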

9.0 Tune a Model

When a model is created using the create_model function, it uses the default hyperparameters to train the model. In order to tune hyperparameters, the tune_model function is used. This function automatically tunes the hyperparameters of a model using random grid search over a pre-defined search space. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold. To use a custom search grid, you can pass the custom_grid parameter to the tune_model function (see 9.2 LightGBM tuning below).

9.1 AdaBoost Regressor

In [11]:
tuned_ada = tune_model(ada)
  MAE MSE RMSE R2 RMSLE MAPE
Fold            
0 2629.7158 16222922.0054 4027.7689 0.8219 0.2553 0.2244
1 2764.7250 25273189.9003 5027.2448 0.8063 0.2714 0.2357
2 2605.9909 16883405.3119 4108.9421 0.8451 0.2617 0.2352
3 2588.0395 14475338.1062 3804.6469 0.8403 0.2685 0.2271
4 2403.7173 13602075.2435 3688.0991 0.8303 0.2672 0.2223
5 2538.7416 20724600.2592 4552.4280 0.8231 0.2644 0.2260
6 2720.2195 19796302.1522 4449.3036 0.8140 0.2644 0.2280
7 2707.6016 17084596.1502 4133.3517 0.7839 0.2743 0.2475
8 2444.0262 16340453.5625 4042.3327 0.8395 0.2623 0.2199
9 2545.6132 19267454.7853 4389.4709 0.8165 0.2680 0.2247
Mean 2594.8391 17967033.7477 4222.3589 0.8221 0.2657 0.2291
Std 111.1423 3238932.6224 372.4506 0.0174 0.0051 0.0078
In [12]:
print(tuned_ada)
AdaBoostRegressor(base_estimator=None, learning_rate=0.05, loss='linear',
                  n_estimators=90, random_state=123)

9.2 Light Gradient Boosting Machine

In [13]:
import numpy as np
lgbm_params = {'num_leaves': np.arange(10, 200, 10),
               'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
               'learning_rate': np.arange(0.1, 1, 0.1)}
In [14]:
tuned_lightgbm = tune_model(lightgbm, custom_grid = lgbm_params)
  MAE MSE RMSE R2 RMSLE MAPE
Fold            
0 649.2541 1131046.4835 1063.5067 0.9876 0.0721 0.0544
1 785.8158 5518411.7880 2349.1300 0.9577 0.0730 0.0522
2 808.0977 3024520.4058 1739.1148 0.9723 0.0836 0.0597
3 749.7881 1774260.2775 1332.0136 0.9804 0.0724 0.0556
4 694.0351 1974576.4174 1405.1962 0.9754 0.0838 0.0585
5 841.6462 6725524.0654 2593.3615 0.9426 0.0824 0.0582
6 796.0240 3324498.6208 1823.3208 0.9688 0.0774 0.0564
7 713.1006 1872493.1136 1368.3907 0.9763 0.0715 0.0551
8 775.9760 2274682.3424 1508.2050 0.9777 0.0766 0.0579
9 768.3451 3247098.5445 1801.9707 0.9691 0.0885 0.0594
Mean 758.2083 3086711.2059 1698.4210 0.9708 0.0781 0.0567
Std 54.9147 1678033.7774 449.5301 0.0120 0.0058 0.0023
In [15]:
print(tuned_lightgbm)
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=60,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=120, objective=None,
              random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent='warn',
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

9.3 Decision Tree

In [16]:
tuned_dt = tune_model(dt)
  MAE MSE RMSE R2 RMSLE MAPE
Fold            
0 1000.7122 2895159.1309 1701.5167 0.9682 0.1076 0.0828
1 1080.2841 6686388.0416 2585.8051 0.9488 0.1053 0.0814
2 1002.3163 3275429.6329 1809.8148 0.9700 0.1051 0.0812
3 1080.7850 4037154.5985 2009.2672 0.9555 0.1172 0.0870
4 1101.6333 7889520.5391 2808.8290 0.9016 0.1189 0.0842
5 1275.5901 11021312.1970 3319.8362 0.9059 0.1250 0.0895
6 1068.6534 4463866.3029 2112.7864 0.9581 0.1076 0.0809
7 975.9364 3271028.5175 1808.5985 0.9586 0.1099 0.0807
8 1101.9207 4441966.3616 2107.5973 0.9564 0.1114 0.0873
9 1065.1662 5192339.2748 2278.6705 0.9506 0.1224 0.0873
Mean 1075.2997 5317416.4597 2254.2722 0.9474 0.1130 0.0842
Std 79.0463 2416581.2427 485.4621 0.0227 0.0069 0.0031

By default, tune_model optimizes R2, but this can be changed using the optimize parameter. For example, tune_model(dt, optimize = 'MAE') will search for the hyperparameters of a Decision Tree Regressor that result in the lowest MAE instead of the highest R2. For the purposes of this example, we have used the default metric R2 for the sake of simplicity only. The methodology behind selecting the right metric to evaluate a regressor is beyond the scope of this tutorial, but if you would like to learn more about it, you can click here to develop an understanding of regression error metrics.
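For example, a sketch of tuning the Decision Tree for MAE with a larger search budget (the n_iter parameter, which controls the number of random grid candidates, is an additional tune_model option in PyCaret 2.x; illustrative only):

# Optimize MAE instead of R2 and try 50 random grid candidates instead of the default 10.
tuned_dt_mae = tune_model(dt, optimize = 'MAE', n_iter = 50)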

Metrics alone are not the only criteria you should consider when finalizing the best model for production. Other factors to consider include training time, the standard deviation of the scores across the k folds, etc. As you progress through the tutorial series we will discuss those factors in detail at the intermediate and expert levels. For now, let's move forward considering the Tuned Light Gradient Boosting Machine stored in the tuned_lightgbm variable as our best model for the remainder of this tutorial.

10.0 Plot a Model

Before model finalization, the plot_model() function can be used to analyze the performance across different aspects such as Residuals Plot, Prediction Error, Feature Importance etc. This function takes a trained model object and returns a plot based on the test / hold-out set.

There are over 10 plots available, please see the plot_model() docstring for the list of available plots.
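A plot can also be written to disk instead of being displayed by using the save parameter of plot_model() (a sketch, assuming the PyCaret 2.x behavior of saving a .png to the working directory):

# Illustrative only: save the residuals plot instead of displaying it.
plot_model(tuned_lightgbm, plot = 'residuals', save = True)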

10.1 Residual Plot

In [17]:
plot_model(tuned_lightgbm)

10.2 Prediction Error Plot

In [18]:
plot_model(tuned_lightgbm, plot = 'error')

10.3 Feature Importance Plot

In [19]:
plot_model(tuned_lightgbm, plot='feature')

Another way to analyze the performance of models is to use the evaluate_model() function which displays a user interface for all of the available plots for a given model. It internally uses the plot_model() function.

In [20]:
evaluate_model(tuned_lightgbm)

11.0 Predict on Test / Hold-out Sample

Before finalizing the model, it is advisable to perform one final check by predicting on the test/hold-out set and reviewing the evaluation metrics. If you look at the information grid in Section 6 above, you will see that 30% (1621 samples) of the data was separated out as a test/hold-out sample. All of the evaluation metrics we have seen above are cross-validated results based on the training set (70%) only. Now, using our final trained model stored in the tuned_lightgbm variable, we will predict on the hold-out sample and evaluate the metrics to see if they are materially different from the CV results.

In [21]:
predict_model(tuned_lightgbm);
  Model MAE MSE RMSE R2 RMSLE MAPE
0 Light Gradient Boosting Machine 781.5572 3816757.2761 1953.6523 0.9652 0.0787 0.0558

The R2 on the test/hold-out set is 0.9652, compared to 0.9708 achieved in the tuned_lightgbm CV results (in section 9.2 above). This is not a significant difference. If there is a large variation between the test/hold-out and CV results, this would normally indicate over-fitting, but it could also be due to several other factors and would require further investigation. In this case, we will move forward with finalizing the model and predicting on unseen data (the 10% that we separated at the beginning and never exposed to PyCaret).

(TIP : It's always good to look at the standard deviation of CV results when using create_model.)

12.0 Finalize Model for Deployment

Model finalization is the last step in the experiment. A normal machine learning workflow in PyCaret starts with setup(), followed by comparing all models using compare_models() and shortlisting a few candidate models (based on the metric of interest) to perform several modeling techniques such as hyperparameter tuning, ensembling, stacking etc. This workflow will eventually lead you to the best model for use in making predictions on new and unseen data. The finalize_model() function fits the model onto the complete dataset including the test/hold-out sample (30% in this case). The purpose of this function is to train the model on the complete dataset before it is deployed in production.

In [22]:
final_lightgbm = finalize_model(tuned_lightgbm)
In [23]:
print(final_lightgbm)
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=60,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=120, objective=None,
              random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent='warn',
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

Caution: One final word of caution. Once the model is finalized using finalize_model(), the entire dataset including the test/hold-out set is used for training. As such, if the model is used for predictions on the hold-out set after finalize_model() is used, the information grid printed will be misleading as you are trying to predict on the same data that was used for modeling. In order to demonstrate this point only, we will use final_lightgbm under predict_model() to compare the information grid with the one above in section 11.

In [24]:
predict_model(final_lightgbm);
  Model MAE MSE RMSE R2 RMSLE MAPE
0 Light Gradient Boosting Machine 459.9160 1199892.0334 1095.3958 0.9891 0.0498 0.0362

Notice how the R2 for final_lightgbm has increased to 0.9891 from 0.9652, even though the model is the same. This is because the final_lightgbm variable is trained on the complete dataset, including the test/hold-out set.

13.0 Predict on Unseen Data

The predict_model() function is also used to predict on the unseen dataset. The only difference from section 11 above is that this time we pass data_unseen to the data parameter. data_unseen is the variable created at the beginning of the tutorial and contains 10% (600 samples) of the original dataset, which was never exposed to PyCaret (see section 5 for an explanation).

In [25]:
unseen_predictions = predict_model(final_lightgbm, data=data_unseen)
unseen_predictions.head()
  Model MAE MSE RMSE R2 RMSLE MAPE
0 Light Gradient Boosting Machine 707.9033 2268889.5439 1506.2834 0.9779 0.0696 0.0513
Out[25]:
Carat Weight Cut Color Clarity Polish Symmetry Report Price Label
0 1.53 Ideal E SI1 ID ID AGSL 12791 12262.949782
1 1.50 Fair F SI1 VG VG GIA 10450 10122.442382
2 1.01 Good E SI1 G G GIA 5161 5032.520456
3 2.51 Very Good G VS2 VG VG GIA 34361 34840.379469
4 1.01 Good I SI1 VG VG GIA 4238 4142.695964

The Label column is added to the data_unseen set. Label is the value predicted by the final_lightgbm model. If you want predictions to be rounded, you can use the round parameter inside predict_model(). You can also check the metrics on this data, since the actual target column Price is available. To do that we will use the pycaret.utils module. See the example below:

In [26]:
from pycaret.utils import check_metric
check_metric(unseen_predictions.Price, unseen_predictions.Label, 'R2')
Out[26]:
0.9779
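As mentioned above, predict_model() also accepts a round parameter; a sketch of rounding the predictions (illustrative only):

# Illustrative only: round the predicted Label column to whole numbers.
rounded_predictions = predict_model(final_lightgbm, data = data_unseen, round = 0)
rounded_predictions.head()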

14.0 Saving the Model

We have now finished the experiment by finalizing the tuned_lightgbm model, which is now stored in the final_lightgbm variable. We have also used the model stored in final_lightgbm to predict data_unseen. This brings us to the end of our experiment, but one question remains: what happens when you have more new data to predict? Do you have to go through the entire experiment again? The answer is no; PyCaret's inbuilt save_model() function allows you to save the model, along with the entire transformation pipeline, for later use.

In [27]:
save_model(final_lightgbm,'Final LightGBM Model 25Nov2020')
Transformation Pipeline and Model Successfully Saved
Out[27]:
(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=[], target='Price',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strategy='...
                  LGBMRegressor(boosting_type='gbdt', class_weight=None,
                                colsample_bytree=1.0, importance_type='split',
                                learning_rate=0.1, max_depth=60,
                                min_child_samples=20, min_child_weight=0.001,
                                min_split_gain=0.0, n_estimators=100, n_jobs=-1,
                                num_leaves=120, objective=None, random_state=123,
                                reg_alpha=0.0, reg_lambda=0.0, silent='warn',
                                subsample=1.0, subsample_for_bin=200000,
                                subsample_freq=0)]],
          verbose=False),
 'Final LightGBM Model 25Nov2020.pkl')

(TIP : It's always good to include the date in the filename when saving models; it helps with version control.)

15.0 Loading the Saved Model

To load a saved model at a future date in the same or an alternative environment, we would use PyCaret's load_model() function and then easily apply the saved model on new unseen data for prediction.

In [28]:
saved_final_lightgbm = load_model('Final LightGBM Model 25Nov2020')
Transformation Pipeline and Model Successfully Loaded

Once the model is loaded in the environment, you can simply use it to predict on any new data using the same predict_model() function. Below we have applied the loaded model to predict the same data_unseen that we used in section 13 above.

In [29]:
new_prediction = predict_model(saved_final_lightgbm, data=data_unseen)
  Model MAE MSE RMSE R2 RMSLE MAPE
0 Light Gradient Boosting Machine 707.9033 2268889.5439 1506.2834 0.9779 0.0696 0.0513
In [30]:
new_prediction.head()
Out[30]:
Carat Weight Cut Color Clarity Polish Symmetry Report Price Label
0 1.53 Ideal E SI1 ID ID AGSL 12791 12262.949782
1 1.50 Fair F SI1 VG VG GIA 10450 10122.442382
2 1.01 Good E SI1 G G GIA 5161 5032.520456
3 2.51 Very Good G VS2 VG VG GIA 34361 34840.379469
4 1.01 Good I SI1 VG VG GIA 4238 4142.695964

Notice that the results of unseen_predictions and new_prediction are identical.

In [31]:
from pycaret.utils import check_metric
check_metric(new_prediction.Price, new_prediction.Label, 'R2')
Out[31]:
0.9779

16.0 Wrap-up / Next Steps?

This tutorial has covered the entire machine learning pipeline, from data ingestion and pre-processing to training the model, hyperparameter tuning, prediction, and saving the model for later use. We have completed all of these steps in less than 10 commands which are naturally constructed and very intuitive to remember, such as create_model(), tune_model() and compare_models(). Re-creating the entire experiment without PyCaret would have taken well over 100 lines of code in most libraries.

We have only covered the basics of pycaret.regression. In the following tutorials we will go deeper into advanced pre-processing, ensembling, generalized stacking and other techniques that allow you to fully customize your machine learning pipeline and are must-know for any data scientist.

See you at the next tutorial. Follow the link to Regression Tutorial (REG102) - Level Intermediate