Created using: PyCaret 2.2
Date Updated: November 25, 2020
Welcome to Regression Tutorial (REG101) - Level Beginner. This tutorial assumes that you are new to PyCaret and looking to get started with regression using the pycaret.regression module.
In this tutorial we will learn:

- Getting Data: how to import data from PyCaret's data repository
- Setting up Environment: how to set up an experiment in PyCaret and get started with building regression models
- Create Model: how to create a model and evaluate it with k-fold cross validation
- Tune Model: how to automatically tune the hyperparameters of a regression model
- Plot Model: how to analyze model performance using various plots
- Predict Model: how to make predictions on the hold-out sample and on unseen data
- Finalize Model: how to finalize the best model at the end of the experiment
- Save / Load Model: how to save and load a model along with its transformation pipeline for later use

Read Time: Approx. 30 Minutes
The first step is to install PyCaret. Installation is easy and only takes a few minutes. Follow the instructions below:
pip install pycaret

If you are running this in a notebook (for example Jupyter or Google Colab), prefix the command with an exclamation mark to run it inside a cell:

!pip install pycaret
If you are running this notebook on Google Colab, run the following code at the top of your notebook to display interactive visuals.
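A minimal sketch of that step, assuming PyCaret 2.x's enable_colab() helper (only needed on Colab):

from pycaret.utils import enable_colab

enable_colab()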
Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable', or 'target') and one or more independent variables (often called 'features', 'predictors', or 'covariates'). The objective of regression in machine learning is to predict continuous values such as sales amount, quantity, temperature etc.
PyCaret's Regression module (pycaret.regression) is a supervised machine learning module which is used for predicting continuous values / outcomes using various techniques and algorithms. Regression can be used to predict values / outcomes such as sales, units sold, temperature or any other continuous number.
PyCaret's regression module has over 25 algorithms and 10 plots to analyze the performance of models. Be it hyper-parameter tuning, ensembling or advanced techniques like stacking, PyCaret's regression module has it all.
For this tutorial we will use a dataset based on a case study called "Sarah Gets a Diamond". This case was presented in the first-year decision analysis course at the Darden School of Business (University of Virginia). The basis for the data is a case regarding a hopeless romantic MBA student choosing the right diamond for his bride-to-be, Sarah. The data contains 6000 records for training. Short descriptions of each column are as follows:

- Carat Weight: the weight of the diamond in metric carats
- Cut: quality of the cut (e.g. Fair, Good, Very Good, Ideal)
- Color: colour grade of the diamond
- Clarity: clarity grade of the diamond
- Polish: polish grade (ID, EX, VG, G)
- Symmetry: symmetry grade (ID, EX, VG, G)
- Report: grading agency that certified the diamond (GIA or AGSL)
- Price: the price of the diamond in USD. Target Column
This case was prepared by Greg Mills (MBA ’07) under the supervision of Phillip E. Pfeifer, Alumni Research Professor of Business Administration. Copyright (c) 2007 by the University of Virginia Darden School Foundation, Charlottesville, VA. All rights reserved.
The original dataset and description can be found here.
You can download the data from the original source found here and load it using pandas (Learn How), or you can use PyCaret's data repository to load the data using the get_data() function (this requires an internet connection).
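If you choose the pandas route, a minimal sketch is shown below; the filename diamond.csv is an assumption based on wherever you saved the download. The rest of this tutorial uses get_data(), as shown next.

import pandas as pd

# load the downloaded file into a DataFrame (hypothetical filename)
dataset = pd.read_csv('diamond.csv')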
from pycaret.datasets import get_data
dataset = get_data('diamond')
 | Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price
---|---|---|---|---|---|---|---|---|
0 | 1.10 | Ideal | H | SI1 | VG | EX | GIA | 5169 |
1 | 0.83 | Ideal | H | VS1 | ID | ID | AGSL | 3470 |
2 | 0.85 | Ideal | H | SI1 | EX | EX | GIA | 3183 |
3 | 0.91 | Ideal | E | SI1 | VG | VG | GIA | 4370 |
4 | 0.83 | Ideal | G | SI1 | EX | EX | GIA | 3171 |
#check the shape of data
dataset.shape
(6000, 8)
In order to demonstrate the predict_model() function on unseen data, a sample of 600 records has been withheld from the original dataset to be used for predictions. This should not be confused with a train/test split, as this particular split is performed to simulate a real-life scenario. Another way to think about this is that these 600 records were not available at the time the machine learning experiment was performed.
# keep 90% of the records for modeling; withhold the remaining 10% as unseen data
data = dataset.sample(frac=0.9, random_state=786)
data_unseen = dataset.drop(data.index)

# reset the indices of both splits
data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (5400, 8)
Unseen Data For Predictions: (600, 8)
The setup() function initializes the environment in PyCaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in PyCaret. It takes two mandatory parameters: a pandas dataframe and the name of the target column. All other parameters are optional and are used to customize the pre-processing pipeline (we will see them in later tutorials).
When setup() is executed, PyCaret's inference algorithm will automatically infer the data types for all features based on certain properties. The data types should be inferred correctly, but this is not always the case. To account for this, PyCaret displays a table containing the features and their inferred data types after setup() is executed. If all of the data types are correctly identified, enter can be pressed to continue, or quit can be typed to end the experiment. Ensuring that the data types are correct is of fundamental importance in PyCaret, as it automatically performs a few pre-processing tasks which are imperative to any machine learning experiment. These tasks are performed differently for each data type, which means it is very important for them to be correctly configured.
In later tutorials we will learn how to overwrite PyCaret's inferred data types using the numeric_features and categorical_features parameters in setup().
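For illustration only (this call is not executed in the tutorial; the actual setup() used here follows below), overriding the inferred types might look like the sketch below, where the feature lists are assumptions based on the columns shown earlier:

exp_illustration = setup(data = data, target = 'Price', session_id = 123,
                         numeric_features = ['Carat Weight'],
                         categorical_features = ['Cut', 'Color', 'Clarity', 'Polish', 'Symmetry', 'Report'])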
from pycaret.regression import *
exp_reg101 = setup(data = data, target = 'Price', session_id=123)
 | Description | Value
---|---|---|
0 | session_id | 123 |
1 | Target | Price |
2 | Original Data | (5400, 8) |
3 | Missing Values | False |
4 | Numeric Features | 1 |
5 | Categorical Features | 6 |
6 | Ordinal Features | False |
7 | High Cardinality Features | False |
8 | High Cardinality Method | None |
9 | Transformed Train Set | (3779, 28) |
10 | Transformed Test Set | (1621, 28) |
11 | Shuffle Train-Test | True |
12 | Stratify Train-Test | False |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | c90d |
20 | Imputation Type | simple |
21 | Iterative Imputation Iteration | None |
22 | Numeric Imputer | mean |
23 | Iterative Imputation Numeric Model | None |
24 | Categorical Imputer | constant |
25 | Iterative Imputation Categorical Model | None |
26 | Unknown Categoricals Handling | least_frequent |
27 | Normalize | False |
28 | Normalize Method | None |
29 | Transformation | False |
30 | Transformation Method | None |
31 | PCA | False |
32 | PCA Method | None |
33 | PCA Components | None |
34 | Ignore Low Variance | False |
35 | Combine Rare Levels | False |
36 | Rare Level Threshold | None |
37 | Numeric Binning | False |
38 | Remove Outliers | False |
39 | Outliers Threshold | None |
40 | Remove Multicollinearity | False |
41 | Multicollinearity Threshold | None |
42 | Remove Perfect Collinearity | True |
43 | Clustering | False |
44 | Clustering Iteration | None |
45 | Polynomial Features | False |
46 | Polynomial Degree | None |
47 | Trignometry Features | False |
48 | Polynomial Threshold | None |
49 | Group Features | False |
50 | Feature Selection | False |
51 | Feature Selection Method | classic |
52 | Features Selection Threshold | None |
53 | Feature Interaction | False |
54 | Feature Ratio | False |
55 | Interaction Threshold | None |
56 | Transform Target | False |
57 | Transform Target Method | box-cox |
Once the setup has been successfully executed it prints an information grid which contains several important pieces of information. Most of the information is related to the pre-processing pipeline which is constructed when setup() is executed. The majority of these features are out of scope for the purposes of this tutorial. However, a few important things to note at this stage include:

- session_id: a pseudo-random number distributed as a seed in all functions for later reproducibility. If no session_id is passed, a random number is automatically generated and distributed to all functions. In this experiment, the session_id is set as 123 for later reproducibility.
- Original Data: the shape of the original data. In this experiment it is (5400, 8), meaning 5400 samples and 8 features including the target column.
- Missing Values: True when there are missing values in the original data. For this experiment there are no missing values.
- Numeric Features: the number of features inferred as numeric. In this dataset, 1 of 8 features is inferred as numeric.
- Categorical Features: the number of features inferred as categorical. In this dataset, 6 of 8 features are inferred as categorical.
- Transformed Train Set / Transformed Test Set: the shapes of the transformed train and test sets. Notice that the original (5400, 8) shape becomes (3779, 28) for the transformed train set; the number of features increases from 8 to 28 because of categorical encoding. The 70/30 train/test split is performed automatically and can be changed with the train_size parameter in setup().

Notice how a few tasks that are imperative to perform modeling are automatically handled, such as missing value imputation (in this case there are no missing values in the training data, but we still need imputers for unseen data), categorical encoding, etc. Most of the parameters in setup() are optional and used for customizing the pre-processing pipeline. These parameters are out of scope for this tutorial, but as you progress to the intermediate and expert levels we will cover them in much greater detail.
Comparing all models to evaluate performance is the recommended starting point for modeling once the setup is completed (unless you know exactly what kind of model you need, which is often not the case). This function trains all models in the model library and scores them using k-fold cross validation for metric evaluation. The output prints a score grid that shows the average MAE, MSE, RMSE, R2, RMSLE and MAPE across the folds (10 by default) along with training times.
best = compare_models(exclude = ['ransac'])
 | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec)
---|---|---|---|---|---|---|---|---|
et | Extra Trees Regressor | 762.0118 | 2763999.1585 | 1612.2410 | 0.9729 | 0.0817 | 0.0607 | 0.1340 |
xgboost | Extreme Gradient Boosting | 708.8427 | 2799609.2534 | 1607.9791 | 0.9724 | 0.0743 | 0.0541 | 0.1600 |
rf | Random Forest Regressor | 760.6304 | 2929683.1860 | 1663.0148 | 0.9714 | 0.0818 | 0.0597 | 0.1260 |
lightgbm | Light Gradient Boosting Machine | 752.6446 | 3056347.8515 | 1687.9907 | 0.9711 | 0.0773 | 0.0567 | 0.0260 |
gbr | Gradient Boosting Regressor | 920.2913 | 3764303.9252 | 1901.1793 | 0.9633 | 0.1024 | 0.0770 | 0.0400 |
dt | Decision Tree Regressor | 1003.1237 | 5305620.3379 | 2228.7271 | 0.9476 | 0.1083 | 0.0775 | 0.0100 |
ridge | Ridge Regression | 2413.5704 | 14120492.3795 | 3726.1643 | 0.8621 | 0.6689 | 0.2875 | 0.5400 |
lasso | Lasso Regression | 2412.1922 | 14246798.1211 | 3744.2305 | 0.8608 | 0.6767 | 0.2866 | 0.5130 |
llar | Lasso Least Angle Regression | 2355.6152 | 14272020.4389 | 3745.3095 | 0.8607 | 0.6391 | 0.2728 | 0.0090 |
br | Bayesian Ridge | 2415.8031 | 14270771.8397 | 3746.9951 | 0.8606 | 0.6696 | 0.2873 | 0.0100 |
lr | Linear Regression | 2418.7036 | 14279370.2389 | 3748.9580 | 0.8604 | 0.6690 | 0.2879 | 1.3240 |
huber | Huber Regressor | 1936.1465 | 18599243.6579 | 4252.8771 | 0.8209 | 0.4333 | 0.1657 | 0.0220 |
par | Passive Aggressive Regressor | 1944.1634 | 19955672.9330 | 4400.2133 | 0.8083 | 0.4317 | 0.1594 | 0.0150 |
omp | Orthogonal Matching Pursuit | 2792.7313 | 23728654.4124 | 4829.3171 | 0.7678 | 0.5818 | 0.2654 | 0.0100 |
ada | AdaBoost Regressor | 4232.2217 | 25201423.0703 | 5012.4175 | 0.7467 | 0.5102 | 0.5970 | 0.0380 |
knn | K Neighbors Regressor | 2968.0750 | 29627913.0479 | 5421.7241 | 0.7051 | 0.3664 | 0.2730 | 0.0220 |
catboost | CatBoost Regressor | 380.6615 | 1130071.5563 | 787.8300 | 0.5899 | 0.0407 | 0.0299 | 0.3640 |
en | Elastic Net | 5029.5913 | 56399795.8780 | 7467.6598 | 0.4472 | 0.5369 | 0.5845 | 0.0100 |
dummy | Dummy Regressor | 7280.3308 | 101221941.4046 | 10032.1624 | -0.0014 | 0.7606 | 0.8969 | 0.0100 |
lar | Least Angle Regression | 11020.5511 | 1563301113.8194 | 20750.5953 | -16.8045 | 0.8989 | 1.5590 | 0.0090 |
Two simple words of code *(not even a line)* have trained and evaluated over 20 models using cross validation. The score grid printed above highlights the best value of each metric for comparison purposes only. By default the grid is sorted by R2 (highest to lowest), which can be changed with the sort parameter. For example, compare_models(sort = 'RMSLE') will sort the grid by RMSLE (lowest to highest, since lower is better). If you want to change the fold parameter from the default value of 10, you can use the fold parameter; for example, compare_models(fold = 5) will compare all models using 5-fold cross validation. Reducing the number of folds will reduce the training time. By default, compare_models returns the single best-performing model based on the sort order, but it can also return a list of the top N models via the n_select parameter.

Notice how the exclude parameter is used to block certain models (in this case RANSAC).
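As a sketch (not executed in this tutorial), these options can be combined; the call below would return the top three models by RMSLE from 5-fold cross validation while still excluding RANSAC:

top3 = compare_models(exclude = ['ransac'], sort = 'RMSLE', fold = 5, n_select = 3)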
create_model is the most granular function in PyCaret and is often the foundation behind most of PyCaret's functionality. As the name suggests, this function trains and evaluates a model using cross validation that can be set with the fold parameter. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold.
For the remaining part of this tutorial, we will work with the models below as our candidate models. The selections are for illustration purposes only and do not necessarily mean they are the top performers or ideal for this type of data.

There are 25 regressors available in the model library of PyCaret. To see the list of all regressors, either check the docstring or use the models() function to view the library.
models()
ID | Name | Reference | Turbo
---|---|---|---
lr | Linear Regression | sklearn.linear_model._base.LinearRegression | True |
lasso | Lasso Regression | sklearn.linear_model._coordinate_descent.Lasso | True |
ridge | Ridge Regression | sklearn.linear_model._ridge.Ridge | True |
en | Elastic Net | sklearn.linear_model._coordinate_descent.Elast... | True |
lar | Least Angle Regression | sklearn.linear_model._least_angle.Lars | True |
llar | Lasso Least Angle Regression | sklearn.linear_model._least_angle.LassoLars | True |
omp | Orthogonal Matching Pursuit | sklearn.linear_model._omp.OrthogonalMatchingPu... | True |
br | Bayesian Ridge | sklearn.linear_model._bayes.BayesianRidge | True |
ard | Automatic Relevance Determination | sklearn.linear_model._bayes.ARDRegression | False |
par | Passive Aggressive Regressor | sklearn.linear_model._passive_aggressive.Passi... | True |
ransac | Random Sample Consensus | sklearn.linear_model._ransac.RANSACRegressor | False |
tr | TheilSen Regressor | sklearn.linear_model._theil_sen.TheilSenRegressor | False |
huber | Huber Regressor | sklearn.linear_model._huber.HuberRegressor | True |
kr | Kernel Ridge | sklearn.kernel_ridge.KernelRidge | False |
svm | Support Vector Regression | sklearn.svm._classes.SVR | False |
knn | K Neighbors Regressor | sklearn.neighbors._regression.KNeighborsRegressor | True |
dt | Decision Tree Regressor | sklearn.tree._classes.DecisionTreeRegressor | True |
rf | Random Forest Regressor | sklearn.ensemble._forest.RandomForestRegressor | True |
et | Extra Trees Regressor | sklearn.ensemble._forest.ExtraTreesRegressor | True |
ada | AdaBoost Regressor | sklearn.ensemble._weight_boosting.AdaBoostRegr... | True |
gbr | Gradient Boosting Regressor | sklearn.ensemble._gb.GradientBoostingRegressor | True |
mlp | MLP Regressor | sklearn.neural_network._multilayer_perceptron.... | False |
xgboost | Extreme Gradient Boosting | xgboost.sklearn.XGBRegressor | True |
lightgbm | Light Gradient Boosting Machine | lightgbm.sklearn.LGBMRegressor | True |
catboost | CatBoost Regressor | catboost.core.CatBoostRegressor | True |
dummy | Dummy Regressor | sklearn.dummy.DummyRegressor | True |
ada = create_model('ada')
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE
---|---|---|---|---|---|---
0 | 4101.8809 | 23013830.0177 | 4797.2732 | 0.7473 | 0.4758 | 0.5470 |
1 | 4251.5693 | 29296751.6657 | 5412.6474 | 0.7755 | 0.4940 | 0.5702 |
2 | 4047.8474 | 22291660.1785 | 4721.4045 | 0.7955 | 0.5068 | 0.5871 |
3 | 4298.3867 | 23482783.6839 | 4845.9038 | 0.7409 | 0.5089 | 0.5960 |
4 | 3888.5584 | 24461807.7242 | 4945.8880 | 0.6949 | 0.4764 | 0.5461 |
5 | 4566.4889 | 29733914.8752 | 5452.8813 | 0.7462 | 0.5462 | 0.6598 |
6 | 4628.7271 | 27841092.1974 | 5276.4659 | 0.7384 | 0.5549 | 0.6676 |
7 | 4316.4317 | 25979752.0083 | 5097.0336 | 0.6715 | 0.5034 | 0.5858 |
8 | 3931.2163 | 21097072.3513 | 4593.1549 | 0.7928 | 0.4858 | 0.5513 |
9 | 4291.1097 | 24815566.0009 | 4981.5225 | 0.7637 | 0.5495 | 0.6592 |
Mean | 4232.2217 | 25201423.0703 | 5012.4175 | 0.7467 | 0.5102 | 0.5970 |
Std | 233.2282 | 2804219.3826 | 277.6577 | 0.0375 | 0.0284 | 0.0457 |
print(ada)
AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear', n_estimators=50, random_state=123)
lightgbm = create_model('lightgbm')
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE
---|---|---|---|---|---|---
0 | 625.1813 | 1051762.9578 | 1025.5550 | 0.9885 | 0.0715 | 0.0526 |
1 | 797.6185 | 5638866.1771 | 2374.6297 | 0.9568 | 0.0727 | 0.0537 |
2 | 829.4586 | 3328375.4390 | 1824.3836 | 0.9695 | 0.0860 | 0.0619 |
3 | 720.3923 | 1697211.3816 | 1302.7707 | 0.9813 | 0.0714 | 0.0554 |
4 | 645.6800 | 1799949.1196 | 1341.6218 | 0.9775 | 0.0745 | 0.0534 |
5 | 830.7176 | 6423604.0184 | 2534.4830 | 0.9452 | 0.0810 | 0.0567 |
6 | 799.9136 | 3353992.2636 | 1831.3908 | 0.9685 | 0.0793 | 0.0585 |
7 | 714.3607 | 1930222.6458 | 1389.3245 | 0.9756 | 0.0732 | 0.0556 |
8 | 784.7648 | 2211933.1546 | 1487.2569 | 0.9783 | 0.0766 | 0.0582 |
9 | 778.3590 | 3127561.3571 | 1768.4913 | 0.9702 | 0.0872 | 0.0609 |
Mean | 752.6446 | 3056347.8515 | 1687.9907 | 0.9711 | 0.0773 | 0.0567 |
Std | 69.3829 | 1661349.5128 | 455.0112 | 0.0119 | 0.0055 | 0.0030 |
dt = create_model('dt')
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE
---|---|---|---|---|---|---
0 | 859.1907 | 2456840.0599 | 1567.4310 | 0.9730 | 0.1016 | 0.0727 |
1 | 1122.9409 | 9852564.2047 | 3138.8795 | 0.9245 | 0.1102 | 0.0758 |
2 | 911.3452 | 2803662.6885 | 1674.4141 | 0.9743 | 0.0988 | 0.0729 |
3 | 1002.5575 | 3926739.3726 | 1981.6002 | 0.9567 | 0.1049 | 0.0772 |
4 | 1167.8154 | 9751516.1909 | 3122.7418 | 0.8784 | 0.1226 | 0.0876 |
5 | 1047.7778 | 7833770.7037 | 2798.8874 | 0.9331 | 0.1128 | 0.0791 |
6 | 1010.0816 | 3989282.4802 | 1997.3188 | 0.9625 | 0.1106 | 0.0803 |
7 | 846.8085 | 2182534.9007 | 1477.3405 | 0.9724 | 0.0933 | 0.0709 |
8 | 1001.8451 | 4904945.0821 | 2214.7111 | 0.9518 | 0.1053 | 0.0734 |
9 | 1060.8742 | 5354347.6956 | 2313.9463 | 0.9490 | 0.1230 | 0.0847 |
Mean | 1003.1237 | 5305620.3379 | 2228.7271 | 0.9476 | 0.1083 | 0.0775 |
Std | 100.2165 | 2734194.7557 | 581.7181 | 0.0280 | 0.0091 | 0.0052 |
Notice that the Mean score of each model matches the score printed in compare_models(). This is because the metrics printed in the compare_models() score grid are the average scores across all CV folds. Similar to compare_models(), if you want to change the fold parameter from the default value of 10, you can use the fold parameter. For example, create_model('dt', fold = 5) will create a Decision Tree Regressor using 5-fold cross validation.
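A minimal sketch (not run in this tutorial): the fold parameter changes the number of CV folds, and in PyCaret 2.x the pull() function retrieves the last printed score grid as a pandas DataFrame if you want to work with it programmatically.

dt5 = create_model('dt', fold = 5)   # 5-fold CV instead of the default 10
dt5_results = pull()                 # score grid of the last trained model as a DataFrame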
When a model is created using the create_model function, it uses the default hyperparameters to train the model. In order to tune hyperparameters, the tune_model function is used. This function automatically tunes the hyperparameters of a model using a random grid search on a pre-defined search space. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold. To use a custom search grid, you can pass the custom_grid parameter in the tune_model function (see 9.2 LightGBM tuning below).
tuned_ada = tune_model(ada)
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE
---|---|---|---|---|---|---
0 | 2629.7158 | 16222922.0054 | 4027.7689 | 0.8219 | 0.2553 | 0.2244 |
1 | 2764.7250 | 25273189.9003 | 5027.2448 | 0.8063 | 0.2714 | 0.2357 |
2 | 2605.9909 | 16883405.3119 | 4108.9421 | 0.8451 | 0.2617 | 0.2352 |
3 | 2588.0395 | 14475338.1062 | 3804.6469 | 0.8403 | 0.2685 | 0.2271 |
4 | 2403.7173 | 13602075.2435 | 3688.0991 | 0.8303 | 0.2672 | 0.2223 |
5 | 2538.7416 | 20724600.2592 | 4552.4280 | 0.8231 | 0.2644 | 0.2260 |
6 | 2720.2195 | 19796302.1522 | 4449.3036 | 0.8140 | 0.2644 | 0.2280 |
7 | 2707.6016 | 17084596.1502 | 4133.3517 | 0.7839 | 0.2743 | 0.2475 |
8 | 2444.0262 | 16340453.5625 | 4042.3327 | 0.8395 | 0.2623 | 0.2199 |
9 | 2545.6132 | 19267454.7853 | 4389.4709 | 0.8165 | 0.2680 | 0.2247 |
Mean | 2594.8391 | 17967033.7477 | 4222.3589 | 0.8221 | 0.2657 | 0.2291 |
Std | 111.1423 | 3238932.6224 | 372.4506 | 0.0174 | 0.0051 | 0.0078 |
print(tuned_ada)
AdaBoostRegressor(base_estimator=None, learning_rate=0.05, loss='linear', n_estimators=90, random_state=123)
import numpy as np
lgbm_params = {'num_leaves': np.arange(10,200,10),
'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
'learning_rate': np.arange(0.1,1,0.1)
}
tuned_lightgbm = tune_model(lightgbm, custom_grid = lgbm_params)
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE
---|---|---|---|---|---|---
0 | 649.2541 | 1131046.4835 | 1063.5067 | 0.9876 | 0.0721 | 0.0544 |
1 | 785.8158 | 5518411.7880 | 2349.1300 | 0.9577 | 0.0730 | 0.0522 |
2 | 808.0977 | 3024520.4058 | 1739.1148 | 0.9723 | 0.0836 | 0.0597 |
3 | 749.7881 | 1774260.2775 | 1332.0136 | 0.9804 | 0.0724 | 0.0556 |
4 | 694.0351 | 1974576.4174 | 1405.1962 | 0.9754 | 0.0838 | 0.0585 |
5 | 841.6462 | 6725524.0654 | 2593.3615 | 0.9426 | 0.0824 | 0.0582 |
6 | 796.0240 | 3324498.6208 | 1823.3208 | 0.9688 | 0.0774 | 0.0564 |
7 | 713.1006 | 1872493.1136 | 1368.3907 | 0.9763 | 0.0715 | 0.0551 |
8 | 775.9760 | 2274682.3424 | 1508.2050 | 0.9777 | 0.0766 | 0.0579 |
9 | 768.3451 | 3247098.5445 | 1801.9707 | 0.9691 | 0.0885 | 0.0594 |
Mean | 758.2083 | 3086711.2059 | 1698.4210 | 0.9708 | 0.0781 | 0.0567 |
Std | 54.9147 | 1678033.7774 | 449.5301 | 0.0120 | 0.0058 | 0.0023 |
print(tuned_lightgbm)
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=60, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=120, objective=None, random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent='warn', subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
tuned_dt = tune_model(dt)
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE
---|---|---|---|---|---|---
0 | 1000.7122 | 2895159.1309 | 1701.5167 | 0.9682 | 0.1076 | 0.0828 |
1 | 1080.2841 | 6686388.0416 | 2585.8051 | 0.9488 | 0.1053 | 0.0814 |
2 | 1002.3163 | 3275429.6329 | 1809.8148 | 0.9700 | 0.1051 | 0.0812 |
3 | 1080.7850 | 4037154.5985 | 2009.2672 | 0.9555 | 0.1172 | 0.0870 |
4 | 1101.6333 | 7889520.5391 | 2808.8290 | 0.9016 | 0.1189 | 0.0842 |
5 | 1275.5901 | 11021312.1970 | 3319.8362 | 0.9059 | 0.1250 | 0.0895 |
6 | 1068.6534 | 4463866.3029 | 2112.7864 | 0.9581 | 0.1076 | 0.0809 |
7 | 975.9364 | 3271028.5175 | 1808.5985 | 0.9586 | 0.1099 | 0.0807 |
8 | 1101.9207 | 4441966.3616 | 2107.5973 | 0.9564 | 0.1114 | 0.0873 |
9 | 1065.1662 | 5192339.2748 | 2278.6705 | 0.9506 | 0.1224 | 0.0873 |
Mean | 1075.2997 | 5317416.4597 | 2254.2722 | 0.9474 | 0.1130 | 0.0842 |
Std | 79.0463 | 2416581.2427 | 485.4621 | 0.0227 | 0.0069 | 0.0031 |
By default, tune_model optimizes R2, but this can be changed using the optimize parameter. For example, tune_model(dt, optimize = 'MAE') will search for the hyperparameters of a Decision Tree Regressor that result in the lowest MAE instead of the highest R2. For the purposes of this example, we have used the default metric R2 for simplicity only. The methodology behind selecting the right metric to evaluate a regressor is beyond the scope of this tutorial, but if you would like to learn more about it, you can click here to develop an understanding of regression error metrics.
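As a sketch (not executed here), the optimize parameter can be combined with n_iter, which controls how many hyperparameter candidates the random grid search evaluates (10 by default in PyCaret 2.x):

tuned_dt_mae = tune_model(dt, optimize = 'MAE', n_iter = 50)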
Metrics alone are not the only criteria you should consider when finalizing the best model for production. Other factors to consider include training time, the standard deviation of the k-fold scores, etc. As you progress through the tutorial series, we will discuss those factors in detail at the intermediate and expert levels. For now, let's move forward considering the Tuned Light Gradient Boosting Machine stored in the tuned_lightgbm variable as our best model for the remainder of this tutorial.
Before model finalization, the plot_model() function can be used to analyze performance across different aspects such as the Residuals Plot, Prediction Error, Feature Importance, etc. This function takes a trained model object and returns a plot based on the test/hold-out set.

There are over 10 plots available; please see the plot_model() docstring for the full list.
plot_model(tuned_lightgbm)
plot_model(tuned_lightgbm, plot = 'error')
plot_model(tuned_lightgbm, plot='feature')
Another way to analyze the performance of models is to use the evaluate_model() function, which displays a user interface for all of the available plots for a given model. It internally uses the plot_model() function.
evaluate_model(tuned_lightgbm)
(The output is an interactive widget with toggle buttons for selecting the plot type.)
Before finalizing the model, it is advisable to perform one final check by predicting on the test/hold-out set and reviewing the evaluation metrics. If you look at the information grid in Section 6 above, you will see that 30% of the data (1621 samples) has been separated out as a test/hold-out sample. All of the evaluation metrics we have seen above are cross-validated results based on the training set (70%) only. Now, using the final trained model stored in the tuned_lightgbm variable, we will predict on the hold-out sample and evaluate the metrics to see if they are materially different from the CV results.
predict_model(tuned_lightgbm);
 | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE
---|---|---|---|---|---|---|---|
0 | Light Gradient Boosting Machine | 781.5572 | 3816757.2761 | 1953.6523 | 0.9652 | 0.0787 | 0.0558 |
The R2 on the test/hold-out set is 0.9652, compared to 0.9708 achieved in the tuned_lightgbm CV results (section 9.2 above). This is not a significant difference. If there is a large variation between the test/hold-out and CV results, this would normally indicate over-fitting, but it could also be due to several other factors and would require further investigation. In this case, we will move forward with finalizing the model and predicting on unseen data (the 10% that we separated at the beginning and never exposed to PyCaret).

(TIP: It's always good to look at the standard deviation of the CV results when using create_model.)
Model finalization is the last step in the experiment. A normal machine learning workflow in PyCaret starts with setup(), followed by comparing all models using compare_models() and shortlisting a few candidate models (based on the metric of interest) to perform several modeling techniques such as hyperparameter tuning, ensembling, stacking, etc. This workflow will eventually lead you to the best model for use in making predictions on new and unseen data. The finalize_model() function fits the model onto the complete dataset, including the test/hold-out sample (30% in this case). The purpose of this function is to train the model on the complete dataset before it is deployed in production.
final_lightgbm = finalize_model(tuned_lightgbm)
print(final_lightgbm)
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=60, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=120, objective=None, random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent='warn', subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
One final word of caution: once the model is finalized using finalize_model(), the entire dataset, including the test/hold-out set, is used for training. As such, if the model is used to predict on the hold-out set after finalize_model() has been applied, the information grid printed will be misleading, as you are trying to predict on the same data that was used for modeling. To demonstrate this point, we will use final_lightgbm under predict_model() and compare the information grid with the one in section 11 above.
predict_model(final_lightgbm);
 | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE
---|---|---|---|---|---|---|---|
0 | Light Gradient Boosting Machine | 459.9160 | 1199892.0334 | 1095.3958 | 0.9891 | 0.0498 | 0.0362 |
Notice how the R2 of final_lightgbm has increased to 0.9891 from 0.9652, even though the model is the same. This is because the final_lightgbm variable was trained on the complete dataset, including the test/hold-out set.
The predict_model() function is also used to predict on the unseen dataset. The only difference from section 11 above is that this time we pass the data_unseen parameter. data_unseen is the variable created at the beginning of the tutorial and contains 10% of the original dataset (600 samples) which was never exposed to PyCaret (see section 5 for an explanation).
unseen_predictions = predict_model(final_lightgbm, data=data_unseen)
unseen_predictions.head()
 | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE
---|---|---|---|---|---|---|---|
0 | Light Gradient Boosting Machine | 707.9033 | 2268889.5439 | 1506.2834 | 0.9779 | 0.0696 | 0.0513 |
 | Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price | Label
---|---|---|---|---|---|---|---|---|---|
0 | 1.53 | Ideal | E | SI1 | ID | ID | AGSL | 12791 | 12262.949782 |
1 | 1.50 | Fair | F | SI1 | VG | VG | GIA | 10450 | 10122.442382 |
2 | 1.01 | Good | E | SI1 | G | G | GIA | 5161 | 5032.520456 |
3 | 2.51 | Very Good | G | VS2 | VG | VG | GIA | 34361 | 34840.379469 |
4 | 1.01 | Good | I | SI1 | VG | VG | GIA | 4238 | 4142.695964 |
The Label column has been added to the data_unseen set. Label is the value predicted by the final_lightgbm model. If you want the predictions to be rounded, you can use the round parameter inside predict_model(). You can also check the metrics on these predictions, since the actual target column Price is available. To do that we will use the pycaret.utils module. See the example below:
from pycaret.utils import check_metric
check_metric(unseen_predictions.Price, unseen_predictions.Label, 'R2')
0.9779
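As a sketch (not run above), rounding the predictions would use the round parameter mentioned earlier; the call below assumes it behaves as described, rounding the predicted Label to the given number of decimal places:

# predictions rounded to whole dollars (round = 0)
unseen_predictions_rounded = predict_model(final_lightgbm, data = data_unseen, round = 0)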
We have now finished the experiment by finalizing the tuned_lightgbm model, which is now stored in the final_lightgbm variable. We have also used the model stored in final_lightgbm to predict on data_unseen. This brings us to the end of our experiment, but one question remains: what happens when you have more new data to predict? Do you have to go through the entire experiment again? The answer is no. PyCaret's built-in function save_model() allows you to save the model along with the entire transformation pipeline for later use.
save_model(final_lightgbm,'Final LightGBM Model 25Nov2020')
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=True, features_todrop=[], id_columns=[], ml_usecase='regression', numerical_features=[], target='Price', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='not_available', fill_value_categorical=None, fill_value_numerical=None, numeric_strategy='... LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=60, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=120, objective=None, random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent='warn', subsample=1.0, subsample_for_bin=200000, subsample_freq=0)]], verbose=False), 'Final LightGBM Model 25Nov2020.pkl')
(TIP: It's always good to include the date in the filename when saving models; it helps with version control.)
To load a saved model at a future date in the same or an alternative environment, we would use PyCaret's load_model() function and then easily apply the saved model to new unseen data for prediction.
saved_final_lightgbm = load_model('Final LightGBM Model 25Nov2020')
Transformation Pipeline and Model Successfully Loaded
Once the model is loaded in the environment, you can simply use it to predict on any new data using the same predict_model() function. Below we have applied the loaded model to predict on the same data_unseen that we used in section 13 above.
new_prediction = predict_model(saved_final_lightgbm, data=data_unseen)
 | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE
---|---|---|---|---|---|---|---|
0 | Light Gradient Boosting Machine | 707.9033 | 2268889.5439 | 1506.2834 | 0.9779 | 0.0696 | 0.0513 |
new_prediction.head()
 | Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price | Label
---|---|---|---|---|---|---|---|---|---|
0 | 1.53 | Ideal | E | SI1 | ID | ID | AGSL | 12791 | 12262.949782 |
1 | 1.50 | Fair | F | SI1 | VG | VG | GIA | 10450 | 10122.442382 |
2 | 1.01 | Good | E | SI1 | G | G | GIA | 5161 | 5032.520456 |
3 | 2.51 | Very Good | G | VS2 | VG | VG | GIA | 34361 | 34840.379469 |
4 | 1.01 | Good | I | SI1 | VG | VG | GIA | 4238 | 4142.695964 |
Notice that the results of unseen_predictions and new_prediction are identical.
from pycaret.utils import check_metric
check_metric(new_prediction.Price, new_prediction.Label, 'R2')
0.9779
This tutorial has covered the entire machine learning pipeline, from data ingestion and pre-processing to training a model, hyperparameter tuning, prediction and saving the model for later use. We have completed all of these steps in fewer than 10 commands which are naturally constructed and intuitive to remember, such as create_model(), tune_model() and compare_models(). Re-creating the entire experiment without PyCaret would have taken well over 100 lines of code in most libraries.
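As a recap, the whole experiment condenses to roughly the following calls, assembled from the steps above (nothing new is introduced here; the names match those used throughout the tutorial):

from pycaret.regression import *
from pycaret.datasets import get_data
import numpy as np

# data ingestion and hold-out of unseen records
dataset = get_data('diamond')
data = dataset.sample(frac=0.9, random_state=786)
data_unseen = dataset.drop(data.index)

# experiment setup and model selection
exp_reg101 = setup(data = data, target = 'Price', session_id = 123)
best = compare_models(exclude = ['ransac'])
lightgbm = create_model('lightgbm')

# hyperparameter tuning with the custom grid from section 9.2
lgbm_params = {'num_leaves': np.arange(10, 200, 10),
               'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
               'learning_rate': np.arange(0.1, 1, 0.1)}
tuned_lightgbm = tune_model(lightgbm, custom_grid = lgbm_params)

# analysis, hold-out check, finalization and unseen-data prediction
plot_model(tuned_lightgbm)
predict_model(tuned_lightgbm)
final_lightgbm = finalize_model(tuned_lightgbm)
unseen_predictions = predict_model(final_lightgbm, data = data_unseen)

# persist the model and its transformation pipeline for later use
save_model(final_lightgbm, 'Final LightGBM Model 25Nov2020')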
We have only covered the basics of pycaret.regression. In the following tutorials we will go deeper into advanced pre-processing, ensembling, generalized stacking and other techniques that allow you to fully customize your machine learning pipeline and are must-knows for any data scientist.
See you at the next tutorial. Follow the link to Regression Tutorial (REG102) - Level Intermediate