Last updated: 15 Feb 2023
PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle and makes you more productive.
Compared with other open-source machine learning libraries, PyCaret is a low-code alternative that can replace hundreds of lines of code with only a few. This makes experiments much faster and more efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few more.
The design and simplicity of PyCaret are inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen data scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more technical expertise.
PyCaret is tested and supported on the following 64-bit systems:
You can install PyCaret with Python's pip package manager:
pip install pycaret
PyCaret's default installation will not install all the extra dependencies automatically. For that you will have to install the full version:
pip install pycaret[full]
Or, depending on your use case, you may install one of the following variants:
pip install pycaret[analysis]
pip install pycaret[models]
pip install pycaret[tuner]
pip install pycaret[mlops]
pip install pycaret[parallel]
pip install pycaret[test]
# check installed version (must be >= 3.0)
import pycaret
pycaret.__version__
'3.0.0'
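Since this tutorial targets the 3.x releases, a programmatic version check can be handy. The sketch below is an illustration only: `parse_version`, `is_supported`, and `MIN_VERSION` are hypothetical names, not part of PyCaret, and the parsing assumes a plain numeric `major.minor.patch` string (a pre-release suffix such as `rc1` would need extra handling).

```python
# Hypothetical helpers (not part of PyCaret): compare a version string
# such as pycaret.__version__ against a minimum required release.
MIN_VERSION = (3, 0, 0)

def parse_version(version):
    """Turn '3.0.0' into (3, 0, 0); assumes purely numeric components."""
    return tuple(int(part) for part in version.split(".")[:3])

def is_supported(version):
    """True when the version is at least MIN_VERSION."""
    return parse_version(version) >= MIN_VERSION

print(is_supported("3.0.0"))   # True
print(is_supported("2.3.10"))  # False
```

With PyCaret installed, `is_supported(pycaret.__version__)` would run the same check against the active environment.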
PyCaret's Regression Module is a supervised machine learning module that is used for estimating the relationships between a dependent variable (often called the outcome variable, or target) and one or more independent variables (often called features, predictors, or covariates).
The objective of regression is to predict continuous values such as sales amount, quantity, or temperature. The regression module provides several preprocessing features, available through the `setup` function, to prepare the data for modeling.
PyCaret's regression module has many preprocessing capabilities and comes with over 25 ready-to-use algorithms and several plots to analyze the performance of trained models.
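To make the idea of estimating a relationship between a feature and a continuous target concrete, here is a minimal sketch of ordinary least squares with a single feature, written in plain Python. The data values and the names `fit_line` and `predict` are invented for illustration; PyCaret itself delegates the fitting to the underlying libraries.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for a single feature."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Toy data (invented): the target is exactly 2 * feature + 1.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
slope, intercept = fit_line(x, y)

def predict(value):
    """Predict a continuous target from the fitted line."""
    return slope * value + intercept

print(slope, intercept, predict(5))
```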
A typical workflow in the PyCaret Regression module consists of the following five steps, in this order: setup, compare models, analyze the model, predict, and save the model.
# load sample dataset from pycaret dataset module
from pycaret.datasets import get_data
data = get_data('insurance')
| | age | sex | bmi | children | smoker | region | charges |
---|---|---|---|---|---|---|---|
0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
The `setup` function initializes the training environment and creates the transformation pipeline. It must be called before executing any other function in PyCaret. It has only two required parameters: `data` and `target`. All the other parameters are optional.
# import pycaret regression and init setup
from pycaret.regression import *
s = setup(data, target = 'charges', session_id = 123)
| | Description | Value |
---|---|---|
0 | Session id | 123 |
1 | Target | charges |
2 | Target type | Regression |
3 | Original data shape | (1338, 7) |
4 | Transformed data shape | (1338, 10) |
5 | Transformed train set shape | (936, 10) |
6 | Transformed test set shape | (402, 10) |
7 | Ordinal features | 2 |
8 | Numeric features | 3 |
9 | Categorical features | 3 |
10 | Preprocess | True |
11 | Imputation type | simple |
12 | Numeric imputation | mean |
13 | Categorical imputation | mode |
14 | Maximum one-hot encoding | 25 |
15 | Encoding method | None |
16 | Fold Generator | KFold |
17 | Fold Number | 10 |
18 | CPU Jobs | -1 |
19 | Use GPU | False |
20 | Log Experiment | False |
21 | Experiment Name | reg-default-name |
22 | USI | 9f1c |
Once the setup has been successfully executed, it shows an information grid containing experiment-level information. If no `session_id` is passed, a random number is automatically generated and distributed to all functions.
PyCaret has two sets of APIs that you can work with: (1) the Functional API (as seen above) and (2) the Object-Oriented API.
With the Object-Oriented API, instead of executing functions directly, you import a class and call its methods.
# import RegressionExperiment and init the class
from pycaret.regression import RegressionExperiment
exp = RegressionExperiment()
# check the type of exp
type(exp)
pycaret.regression.oop.RegressionExperiment
# init setup on exp
exp.setup(data, target = 'charges', session_id = 123)
| | Description | Value |
---|---|---|
0 | Session id | 123 |
1 | Target | charges |
2 | Target type | Regression |
3 | Original data shape | (1338, 7) |
4 | Transformed data shape | (1338, 10) |
5 | Transformed train set shape | (936, 10) |
6 | Transformed test set shape | (402, 10) |
7 | Ordinal features | 2 |
8 | Numeric features | 3 |
9 | Categorical features | 3 |
10 | Preprocess | True |
11 | Imputation type | simple |
12 | Numeric imputation | mean |
13 | Categorical imputation | mode |
14 | Maximum one-hot encoding | 25 |
15 | Encoding method | None |
16 | Fold Generator | KFold |
17 | Fold Number | 10 |
18 | CPU Jobs | -1 |
19 | Use GPU | False |
20 | Log Experiment | False |
21 | Experiment Name | reg-default-name |
22 | USI | 063d |
<pycaret.regression.oop.RegressionExperiment at 0x1697e9336a0>
You can use either method, Functional or OOP, and even switch back and forth between the two sets of APIs. The choice of method does not affect the results and has been tested for consistency.
The `compare_models` function trains and evaluates the performance of all the estimators available in the model library using cross-validation. The output of this function is a scoring grid with average cross-validated scores. Metrics evaluated during CV can be accessed using the `get_metrics` function. Custom metrics can be added or removed using the `add_metric` and `remove_metric` functions.
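The scoring grid reports standard regression metrics (MAE, MSE, RMSE, R2, RMSLE, MAPE). The sketch below computes them in plain Python from their textbook definitions. The function name `regression_metrics` and the sample numbers are illustrative, and PyCaret's own implementation may differ in details such as how RMSLE treats negative values.

```python
import math

def regression_metrics(y_true, y_pred):
    """Textbook definitions of common regression metrics."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    y_bar = sum(y_true) / n
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    r2 = 1 - sum(e * e for e in errors) / ss_tot
    # RMSLE works on log(1 + value), so it assumes non-negative inputs.
    rmsle = math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )
    # MAPE divides by the true value, so it assumes no zeros in y_true.
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "RMSLE": rmsle, "MAPE": mape}

# Invented values, not taken from the insurance dataset.
scores = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Lower is better for every metric here except R2, where higher is better.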
# compare baseline models
best = compare_models()
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
---|---|---|---|---|---|---|---|---|
gbr | Gradient Boosting Regressor | 2701.9919 | 23548657.1177 | 4832.9329 | 0.8320 | 0.4447 | 0.3137 | 0.0570 |
rf | Random Forest Regressor | 2771.4583 | 25416502.3827 | 5028.6343 | 0.8172 | 0.4690 | 0.3303 | 0.0690 |
catboost | CatBoost Regressor | 2899.3783 | 25762701.9552 | 5057.5721 | 0.8163 | 0.4815 | 0.3522 | 0.0800 |
lightgbm | Light Gradient Boosting Machine | 2992.1828 | 25521038.3331 | 5042.0978 | 0.8149 | 0.5378 | 0.3751 | 0.1890 |
et | Extra Trees Regressor | 2833.3624 | 28427844.2412 | 5305.6516 | 0.7991 | 0.4877 | 0.3363 | 0.0710 |
ada | AdaBoost Regressor | 4316.0568 | 29220505.6498 | 5398.4561 | 0.7903 | 0.6368 | 0.7394 | 0.0420 |
xgboost | Extreme Gradient Boosting | 3443.6091 | 32824626.4000 | 5711.2140 | 0.7626 | 0.6224 | 0.4469 | 0.0420 |
llar | Lasso Least Angle Regression | 4298.6038 | 38369142.0849 | 6174.9424 | 0.7309 | 0.5786 | 0.4424 | 0.0400 |
ridge | Ridge Regression | 4317.6984 | 38396435.9578 | 6177.2329 | 0.7306 | 0.5891 | 0.4459 | 0.0380 |
br | Bayesian Ridge | 4311.2349 | 38391950.0874 | 6176.8896 | 0.7306 | 0.5910 | 0.4447 | 0.0400 |
lar | Least Angle Regression | 4303.5559 | 38388058.4578 | 6176.5920 | 0.7306 | 0.5949 | 0.4433 | 0.0340 |
lasso | Lasso Regression | 4303.7697 | 38386797.6709 | 6176.4824 | 0.7306 | 0.5952 | 0.4434 | 0.0340 |
lr | Linear Regression | 4303.5559 | 38388058.4578 | 6176.5920 | 0.7306 | 0.5949 | 0.4433 | 0.8830 |
huber | Huber Regressor | 3463.2216 | 48801106.4612 | 6963.9984 | 0.6544 | 0.4927 | 0.2212 | 0.0440 |
dt | Decision Tree Regressor | 3383.4916 | 47823199.0729 | 6895.7016 | 0.6497 | 0.5602 | 0.4013 | 0.0390 |
omp | Orthogonal Matching Pursuit | 5754.7769 | 57503207.7233 | 7566.7086 | 0.5997 | 0.7418 | 0.8990 | 0.0430 |
par | Passive Aggressive Regressor | 4537.0122 | 67346309.9218 | 8142.7826 | 0.5422 | 0.5276 | 0.3207 | 0.0420 |
en | Elastic Net | 7372.5238 | 90450782.5713 | 9468.3193 | 0.3792 | 0.7342 | 0.9184 | 0.0390 |
knn | K Neighbors Regressor | 8007.7997 | 131387268.8000 | 11425.3695 | 0.0859 | 0.8535 | 0.9232 | 0.0430 |
dummy | Dummy Regressor | 9192.5418 | 148516792.8000 | 12132.4733 | -0.0175 | 1.0154 | 1.5637 | 0.0410 |
# compare models using OOP
# exp.compare_models()
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
---|---|---|---|---|---|---|---|---|
gbr | Gradient Boosting Regressor | 2701.9919 | 23548657.1177 | 4832.9329 | 0.8320 | 0.4447 | 0.3137 | 0.0540 |
rf | Random Forest Regressor | 2771.4583 | 25416502.3827 | 5028.6343 | 0.8172 | 0.4690 | 0.3303 | 0.0710 |
catboost | CatBoost Regressor | 2899.3783 | 25762701.9552 | 5057.5721 | 0.8163 | 0.4815 | 0.3522 | 0.0370 |
lightgbm | Light Gradient Boosting Machine | 2992.1828 | 25521038.3331 | 5042.0978 | 0.8149 | 0.5378 | 0.3751 | 0.0470 |
et | Extra Trees Regressor | 2833.3624 | 28427844.2412 | 5305.6516 | 0.7991 | 0.4877 | 0.3363 | 0.0730 |
ada | AdaBoost Regressor | 4316.0568 | 29220505.6498 | 5398.4561 | 0.7903 | 0.6368 | 0.7394 | 0.0430 |
xgboost | Extreme Gradient Boosting | 3443.6091 | 32824626.4000 | 5711.2140 | 0.7626 | 0.6224 | 0.4469 | 0.0390 |
llar | Lasso Least Angle Regression | 4298.6038 | 38369142.0849 | 6174.9424 | 0.7309 | 0.5786 | 0.4424 | 0.0460 |
ridge | Ridge Regression | 4317.6984 | 38396435.9578 | 6177.2329 | 0.7306 | 0.5891 | 0.4459 | 0.0400 |
br | Bayesian Ridge | 4311.2349 | 38391950.0874 | 6176.8896 | 0.7306 | 0.5910 | 0.4447 | 0.0400 |
lar | Least Angle Regression | 4303.5559 | 38388058.4578 | 6176.5920 | 0.7306 | 0.5949 | 0.4433 | 0.0360 |
lasso | Lasso Regression | 4303.7697 | 38386797.6709 | 6176.4824 | 0.7306 | 0.5952 | 0.4434 | 0.0360 |
lr | Linear Regression | 4303.5559 | 38388058.4578 | 6176.5920 | 0.7306 | 0.5949 | 0.4433 | 0.0420 |
huber | Huber Regressor | 3463.2216 | 48801106.4612 | 6963.9984 | 0.6544 | 0.4927 | 0.2212 | 0.0460 |
dt | Decision Tree Regressor | 3383.4916 | 47823199.0729 | 6895.7016 | 0.6497 | 0.5602 | 0.4013 | 0.0390 |
omp | Orthogonal Matching Pursuit | 5754.7769 | 57503207.7233 | 7566.7086 | 0.5997 | 0.7418 | 0.8990 | 0.0420 |
par | Passive Aggressive Regressor | 4537.0122 | 67346309.9218 | 8142.7826 | 0.5422 | 0.5276 | 0.3207 | 0.0440 |
en | Elastic Net | 7372.5238 | 90450782.5713 | 9468.3193 | 0.3792 | 0.7342 | 0.9184 | 0.0460 |
knn | K Neighbors Regressor | 8007.7997 | 131387268.8000 | 11425.3695 | 0.0859 | 0.8535 | 0.9232 | 0.0430 |
dummy | Dummy Regressor | 9192.5418 | 148516792.8000 | 12132.4733 | -0.0175 | 1.0154 | 1.5637 | 0.0440 |
GradientBoostingRegressor(random_state=123)
Notice that the output of the functional and OOP APIs is consistent. The rest of the functions in this notebook are shown using the functional API only.
The `plot_model` function is used to analyze the performance of a trained model on the test set. It may require re-training the model in certain cases.
# plot residuals
plot_model(best, plot = 'residuals')
# plot error
plot_model(best, plot = 'error')
# plot feature importance
plot_model(best, plot = 'feature')
# check docstring to see available plots
# help(plot_model)
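As a rough guide to what the residuals plot visualizes, a residual is the gap between the actual and the predicted target value. The plain-Python sketch below uses the five hold-out rows printed in the `predict_model` output of this tutorial (rounded to two decimals) and the common convention of actual minus predicted. Plotting libraries may use the opposite sign, and the variable names are illustrative.

```python
# Actual charges and predictions for five hold-out rows (rounded).
actual = [9800.89, 4667.61, 34838.87, 5125.22, 12142.58]
predicted = [10681.51, 8043.45, 36153.10, 7435.52, 14676.54]

# Residual = actual - predicted; negative means the model over-predicted.
residuals = [a - p for a, p in zip(actual, predicted)]

for a, p, r in zip(actual, predicted, residuals):
    print(f"actual={a:>10.2f}  predicted={p:>10.2f}  residual={r:>10.2f}")
```

A well-behaved model shows residuals scattered around zero with no visible pattern; for these five rows every residual is negative, but five rows are far too few to draw conclusions.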
An alternative to the `plot_model` function is `evaluate_model`. It can only be used in a notebook since it uses `ipywidget`.
evaluate_model(best)
The `predict_model` function returns `prediction_label` as a new column in the input dataframe. When `data` is `None` (default), it uses the test set (created during the `setup` function) for scoring.
# predict on test set
holdout_pred = predict_model(best)
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
---|---|---|---|---|---|---|---|
0 | Gradient Boosting Regressor | 2392.5661 | 17148355.3169 | 4141.0573 | 0.8800 | 0.3928 | 0.2875 |
# show predictions df
holdout_pred.head()
| | age | sex | bmi | children | smoker | region | charges | prediction_label |
---|---|---|---|---|---|---|---|---|
936 | 49 | female | 42.680000 | 2 | no | southeast | 9800.888672 | 10681.513104 |
937 | 32 | male | 37.334999 | 1 | no | northeast | 4667.607422 | 8043.453463 |
938 | 27 | female | 31.400000 | 0 | yes | southwest | 34838.871094 | 36153.097686 |
939 | 35 | male | 24.129999 | 1 | no | northwest | 5125.215820 | 7435.516853 |
940 | 60 | male | 25.740000 | 0 | no | southeast | 12142.578125 | 14676.544334 |
The same function works for predicting the labels on an unseen dataset. Let's create a copy of the original data and drop the `charges` column. We can then use the new dataframe without labels for scoring.
# copy data and drop charges
new_data = data.copy()
new_data.drop('charges', axis=1, inplace=True)
new_data.head()
| | age | sex | bmi | children | smoker | region |
---|---|---|---|---|---|---|
0 | 19 | female | 27.900 | 0 | yes | southwest |
1 | 18 | male | 33.770 | 1 | no | southeast |
2 | 28 | male | 33.000 | 3 | no | southeast |
3 | 33 | male | 22.705 | 0 | no | northwest |
4 | 32 | male | 28.880 | 0 | no | northwest |
# predict model on new_data
predictions = predict_model(best, data = new_data)
predictions.head()
| | age | sex | bmi | children | smoker | region | prediction_label |
---|---|---|---|---|---|---|---|
0 | 19 | female | 27.900000 | 0 | yes | southwest | 18464.334448 |
1 | 18 | male | 33.770000 | 1 | no | southeast | 4020.345384 |
2 | 28 | male | 33.000000 | 3 | no | southeast | 6555.388388 |
3 | 33 | male | 22.705000 | 0 | no | northwest | 9627.045725 |
4 | 32 | male | 28.879999 | 0 | no | northwest | 3325.531292 |
Finally, you can save the entire pipeline to disk for later use with PyCaret's `save_model` function.
# save pipeline
save_model(best, 'my_first_pipeline')
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=FastMemory(location=C:\Users\owner\AppData\Local\Temp\joblib), steps=[('numerical_imputer', TransformerWrapper(include=['age', 'bmi', 'children'], transformer=SimpleImputer())), ('categorical_imputer', TransformerWrapper(include=['sex', 'smoker', 'region'], transformer=SimpleImputer(strategy='most_frequent'))), ('ordinal_encoding', TransformerW... handle_missing='return_nan', mapping=[{'col': 'sex', 'mapping': {nan: -1, 'female': 0, 'male': 1}}, {'col': 'smoker', 'mapping': {nan: -1, 'no': 0, 'yes': 1}}]))), ('onehot_encoding', TransformerWrapper(include=['region'], transformer=OneHotEncoder(cols=['region'], handle_missing='return_nan', use_cat_names=True))), ('trained_model', GradientBoostingRegressor(random_state=123))]), 'my_first_pipeline.pkl')
# load pipeline
loaded_best_pipeline = load_model('my_first_pipeline')
loaded_best_pipeline
Transformation Pipeline and Model Successfully Loaded
Pipeline(memory=FastMemory(location=C:\Users\owner\AppData\Local\Temp\joblib), steps=[('numerical_imputer', TransformerWrapper(include=['age', 'bmi', 'children'], transformer=SimpleImputer())), ('categorical_imputer', TransformerWrapper(include=['sex', 'smoker', 'region'], transformer=SimpleImputer(strategy='most_frequent'))), ('ordinal_encoding', TransformerW... handle_missing='return_nan', mapping=[{'col': 'sex', 'mapping': {nan: -1, 'female': 0, 'male': 1}}, {'col': 'smoker', 'mapping': {nan: -1, 'no': 0, 'yes': 1}}]))), ('onehot_encoding', TransformerWrapper(include=['region'], transformer=OneHotEncoder(cols=['region'], handle_missing='return_nan', use_cat_names=True))), ('trained_model', GradientBoostingRegressor(random_state=123))])
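Conceptually, saving and loading a pipeline is object serialization to a `.pkl` file. The sketch below shows the round trip with Python's standard `pickle` module and a toy dictionary standing in for a fitted pipeline. It is an illustration under those assumptions, not a description of how `save_model` is implemented, and the file name is made up.

```python
import os
import pickle
import tempfile

# Toy stand-in for a fitted pipeline, built only from built-in types.
pipeline = {
    "steps": ["numerical_imputer", "categorical_imputer", "trained_model"],
    "params": {"random_state": 123},
}

# Save: serialize the object to a .pkl file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.pkl")
with open(path, "wb") as handle:
    pickle.dump(pipeline, handle)

# Load: read the file back into an equivalent object.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == pipeline)  # True
```

As with any pickle-based format, only load files that come from a source you trust.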
The `setup` function initializes the experiment in PyCaret and creates the transformation pipeline based on all the parameters passed to the function. It must be called before executing any other function. It takes two required parameters: `data` and `target`. All the other parameters are optional and are used to configure the data preprocessing pipeline.
s = setup(data, target = 'charges', session_id = 123)
| | Description | Value |
---|---|---|
0 | Session id | 123 |
1 | Target | charges |
2 | Target type | Regression |
3 | Original data shape | (1338, 7) |
4 | Transformed data shape | (1338, 10) |
5 | Transformed train set shape | (936, 10) |
6 | Transformed test set shape | (402, 10) |
7 | Ordinal features | 2 |
8 | Numeric features | 3 |
9 | Categorical features | 3 |
10 | Preprocess | True |
11 | Imputation type | simple |
12 | Numeric imputation | mean |
13 | Categorical imputation | mode |
14 | Maximum one-hot encoding | 25 |
15 | Encoding method | None |
16 | Fold Generator | KFold |
17 | Fold Number | 10 |
18 | CPU Jobs | -1 |
19 | Use GPU | False |
20 | Log Experiment | False |
21 | Experiment Name | reg-default-name |
22 | USI | 02ce |
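The shapes reported in the grid above can be checked by hand. The sketch below assumes a 70/30 train/test split (PyCaret's default `train_size` of 0.7) and that the fractional row count is floored for the training set, with the remainder going to the test set. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows, train_size=0.7):
    """Row counts for a train/test split, flooring the training share."""
    n_train = math.floor(n_rows * train_size)
    return n_train, n_rows - n_train

n_train, n_test = split_sizes(1338)
print(n_train, n_test)  # 936 402

# Column count after preprocessing: 3 numeric features, 2 binary
# categoricals encoded as single ordinal columns (sex, smoker),
# 4 one-hot columns for region, plus the target column.
n_columns = 3 + 2 + 4 + 1
print(n_columns)  # 10
```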
To access the variables created by the `setup` function, such as the transformed dataset, random_state, etc., you can use the `get_config` function.
# check all available config
get_config()
{'USI', 'X', 'X_test', 'X_test_transformed', 'X_train', 'X_train_transformed', 'X_transformed', '_available_plots', '_ml_usecase', 'data', 'dataset', 'dataset_transformed', 'exp_id', 'exp_name_log', 'fold_generator', 'fold_groups_param', 'fold_shuffle_param', 'gpu_n_jobs_param', 'gpu_param', 'html_param', 'idx', 'is_multiclass', 'log_plots_param', 'logging_param', 'memory', 'n_jobs_param', 'pipeline', 'seed', 'target_param', 'test', 'test_transformed', 'train', 'train_transformed', 'transform_target_param', 'variable_and_property_keys', 'variables', 'y', 'y_test', 'y_test_transformed', 'y_train', 'y_train_transformed', 'y_transformed'}
# let's access X_train_transformed
get_config('X_train_transformed')
| | age | sex | bmi | children | smoker | region_northeast | region_southwest | region_southeast | region_northwest |
---|---|---|---|---|---|---|---|---|---|
0 | 36.0 | 1.0 | 27.549999 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
1 | 60.0 | 0.0 | 35.099998 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | 30.0 | 1.0 | 31.570000 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 49.0 | 1.0 | 25.600000 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4 | 26.0 | 1.0 | 32.900002 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
931 | 37.0 | 1.0 | 22.705000 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
932 | 20.0 | 0.0 | 31.920000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
933 | 19.0 | 0.0 | 28.400000 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
934 | 18.0 | 1.0 | 23.084999 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
935 | 53.0 | 0.0 | 36.860001 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
936 rows × 9 columns
# another example: let's access seed
print("The current seed is: {}".format(get_config('seed')))
# now let's change it using set_config
set_config('seed', 786)
print("The new seed is: {}".format(get_config('seed')))
The current seed is: 123
The new seed is: 786
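The seed matters because it drives every random step, such as shuffling rows before a split. A minimal sketch with Python's standard `random` module shows that the same seed reproduces the same shuffle. This is not PyCaret's internal code, and `shuffled_indices` is a hypothetical helper.

```python
import random

def shuffled_indices(n_rows, seed):
    """Return row indices shuffled with a dedicated, seeded generator."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return indices

first = shuffled_indices(10, seed=123)
second = shuffled_indices(10, seed=123)
other = shuffled_indices(10, seed=786)

# Same seed, same order.
print(first == second)  # True
# A different seed still permutes the same set of rows.
print(sorted(other) == list(range(10)))  # True
```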
All the preprocessing configurations and experiment settings/parameters are passed into the `setup` function. To see all available parameters, check the docstring:
# help(setup)
# init setup with normalize = True
s = setup(data, target = 'charges', session_id = 123,
normalize = True, normalize_method = 'minmax')
| | Description | Value |
---|---|---|
0 | Session id | 123 |
1 | Target | charges |
2 | Target type | Regression |
3 | Original data shape | (1338, 7) |
4 | Transformed data shape | (1338, 10) |
5 | Transformed train set shape | (936, 10) |
6 | Transformed test set shape | (402, 10) |
7 | Ordinal features | 2 |
8 | Numeric features | 3 |
9 | Categorical features | 3 |
10 | Preprocess | True |
11 | Imputation type | simple |
12 | Numeric imputation | mean |
13 | Categorical imputation | mode |
14 | Maximum one-hot encoding | 25 |
15 | Encoding method | None |
16 | Normalize | True |
17 | Normalize method | minmax |
18 | Fold Generator | KFold |
19 | Fold Number | 10 |
20 | CPU Jobs | -1 |
21 | Use GPU | False |
22 | Log Experiment | False |
23 | Experiment Name | reg-default-name |
24 | USI | 3dce |
# let's check X_train_transformed to see the effect of the params passed
get_config('X_train_transformed')['age'].hist()
<AxesSubplot:>
Notice that all the values are between 0 and 1. That is because we passed `normalize=True` in the `setup` function. If you don't remember how that compares to the actual data, no problem: we can also access the non-transformed values using `get_config` and then compare. See below, and compare the range of values on the x-axis with the histogram above.
get_config('X_train')['age'].hist()
<AxesSubplot:>
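Min-max scaling itself is a short formula: each value is shifted by the column minimum and divided by the column range, which maps the fitted values onto the interval [0, 1]. A minimal sketch in plain Python, using the ages from the first five rows of the dataset (the helper name `minmax_scale` is illustrative):

```python
def minmax_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [19, 18, 28, 33, 32]
scaled = minmax_scale(ages)
print(scaled)
```

In a real pipeline the minimum and maximum are learned on the training set and reused on new data, so unseen values can fall slightly outside [0, 1].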
The `compare_models` function trains and evaluates the performance of all estimators available in the model library using cross-validation. The output of this function is a scoring grid with average cross-validated scores. Metrics evaluated during CV can be accessed using the `get_metrics` function. Custom metrics can be added or removed using the `add_metric` and `remove_metric` functions.
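The averaging behind the scoring grid can be sketched as follows: the training rows are divided into k folds (10 by default, per the setup grid), each fold is held out once for scoring while the remaining folds are used for fitting, and the per-fold scores are averaged. The plain-Python sketch below uses a mean-predicting baseline as the "model" and invented target values; `kfold_indices` and `cross_validated_mae` are hypothetical helpers, not PyCaret functions.

```python
def kfold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def cross_validated_mae(y, k):
    """Average MAE of a mean-predicting baseline across k folds."""
    fold_scores = []
    for train_idx, test_idx in kfold_indices(len(y), k):
        # "Fit" on the training folds: the baseline is just their mean.
        prediction = sum(y[i] for i in train_idx) / len(train_idx)
        # Score on the held-out fold.
        fold_mae = sum(abs(y[i] - prediction) for i in test_idx) / len(test_idx)
        fold_scores.append(fold_mae)
    return sum(fold_scores) / k

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(cross_validated_mae(y, k=3))
```

Each row is used for scoring exactly once, which is why the averaged score is a less optimistic estimate than a score computed on the rows used for fitting.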
best = compare_models()
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
---|---|---|---|---|---|---|---|---|
gbr | Gradient Boosting Regressor | 2701.9135 | 23548622.1598 | 4832.9291 | 0.8320 | 0.4447 | 0.3137 | 0.0620 |
rf | Random Forest Regressor | 2772.9195 | 25409792.9692 | 5028.1973 | 0.8173 | 0.4687 | 0.3298 | 0.0750 |
catboost | CatBoost Regressor | 2899.4825 | 25762752.2096 | 5057.5778 | 0.8163 | 0.4815 | 0.3522 | 0.0430 |
lightgbm | Light Gradient Boosting Machine | 3001.8884 | 25547324.5813 | 5044.5767 | 0.8147 | 0.5445 | 0.3784 | 0.0520 |
et | Extra Trees Regressor | 2833.3624 | 28427844.2412 | 5305.6516 | 0.7991 | 0.4877 | 0.3363 | 0.0800 |
ada | AdaBoost Regressor | 4175.5916 | 28401799.0579 | 5321.7006 | 0.7976 | 0.6263 | 0.7144 | 0.0490 |
xgboost | Extreme Gradient Boosting | 3439.8892 | 32826514.4000 | 5711.7335 | 0.7626 | 0.6221 | 0.4465 | 0.0450 |
llar | Lasso Least Angle Regression | 4298.6038 | 38369142.0849 | 6174.9424 | 0.7309 | 0.5786 | 0.4424 | 0.0360 |
ridge | Ridge Regression | 4296.0642 | 38392999.7849 | 6176.6160 | 0.7308 | 0.5710 | 0.4397 | 0.0390 |
br | Bayesian Ridge | 4300.6286 | 38387539.9069 | 6176.4192 | 0.7307 | 0.5881 | 0.4419 | 0.0500 |
lasso | Lasso Regression | 4302.2469 | 38386534.5553 | 6176.4463 | 0.7306 | 0.5913 | 0.4430 | 0.0410 |
lar | Least Angle Regression | 4303.5559 | 38388058.4578 | 6176.5920 | 0.7306 | 0.5949 | 0.4433 | 0.0390 |
lr | Linear Regression | 4312.6186 | 38452749.8007 | 6182.4796 | 0.7298 | 0.6285 | 0.4460 | 0.0380 |
knn | K Neighbors Regressor | 3778.4582 | 38143971.2000 | 6165.0463 | 0.7277 | 0.5027 | 0.3690 | 0.0400 |
par | Passive Aggressive Regressor | 3536.1733 | 48501878.1363 | 6940.1967 | 0.6566 | 0.4785 | 0.2154 | 0.0430 |
huber | Huber Regressor | 3461.7327 | 49057640.5613 | 6981.8576 | 0.6528 | 0.4815 | 0.2188 | 0.0450 |
dt | Decision Tree Regressor | 3399.1402 | 48100203.3847 | 6915.2984 | 0.6476 | 0.5629 | 0.4052 | 0.0410 |
omp | Orthogonal Matching Pursuit | 5754.7769 | 57503207.7233 | 7566.7086 | 0.5997 | 0.7418 | 0.8990 | 0.0440 |
en | Elastic Net | 7571.4598 | 104738034.4707 | 10182.3291 | 0.2846 | 0.8954 | 1.2888 | 0.0380 |
dummy | Dummy Regressor | 9192.5418 | 148516792.8000 | 12132.4733 | -0.0175 | 1.0154 | 1.5637 | 0.0420 |
By default, `compare_models` uses all the estimators in the model library (all except models with `Turbo=False`). To see all available models, you can use the `models()` function.
# check available models
models()
ID | Name | Reference | Turbo |
---|---|---|---|
lr | Linear Regression | sklearn.linear_model._base.LinearRegression | True |
lasso | Lasso Regression | sklearn.linear_model._coordinate_descent.Lasso | True |
ridge | Ridge Regression | sklearn.linear_model._ridge.Ridge | True |
en | Elastic Net | sklearn.linear_model._coordinate_descent.Elast... | True |
lar | Least Angle Regression | sklearn.linear_model._least_angle.Lars | True |
llar | Lasso Least Angle Regression | sklearn.linear_model._least_angle.LassoLars | True |
omp | Orthogonal Matching Pursuit | sklearn.linear_model._omp.OrthogonalMatchingPu... | True |
br | Bayesian Ridge | sklearn.linear_model._bayes.BayesianRidge | True |
ard | Automatic Relevance Determination | sklearn.linear_model._bayes.ARDRegression | False |
par | Passive Aggressive Regressor | sklearn.linear_model._passive_aggressive.Passi... | True |
ransac | Random Sample Consensus | sklearn.linear_model._ransac.RANSACRegressor | False |
tr | TheilSen Regressor | sklearn.linear_model._theil_sen.TheilSenRegressor | False |
huber | Huber Regressor | sklearn.linear_model._huber.HuberRegressor | True |
kr | Kernel Ridge | sklearn.kernel_ridge.KernelRidge | False |
svm | Support Vector Regression | sklearn.svm._classes.SVR | False |
knn | K Neighbors Regressor | sklearn.neighbors._regression.KNeighborsRegressor | True |
dt | Decision Tree Regressor | sklearn.tree._classes.DecisionTreeRegressor | True |
rf | Random Forest Regressor | sklearn.ensemble._forest.RandomForestRegressor | True |
et | Extra Trees Regressor | sklearn.ensemble._forest.ExtraTreesRegressor | True |
ada | AdaBoost Regressor | sklearn.ensemble._weight_boosting.AdaBoostRegr... | True |
gbr | Gradient Boosting Regressor | sklearn.ensemble._gb.GradientBoostingRegressor | True |
mlp | MLP Regressor | sklearn.neural_network._multilayer_perceptron.... | False |
xgboost | Extreme Gradient Boosting | xgboost.sklearn.XGBRegressor | True |
lightgbm | Light Gradient Boosting Machine | lightgbm.sklearn.LGBMRegressor | True |
catboost | CatBoost Regressor | catboost.core.CatBoostRegressor | True |
dummy | Dummy Regressor | sklearn.dummy.DummyRegressor | True |
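The `Turbo` column determines which models take part in a default comparison. The sketch below mimics that selection logic in plain Python for a handful of IDs from the table. `select_models` is a hypothetical helper, and the assumption that an explicit include list bypasses the Turbo filter is mine rather than something stated in this tutorial.

```python
# A few (ID -> Turbo) pairs copied from the table above.
MODEL_LIBRARY = {
    "lr": True,
    "ridge": True,
    "ard": False,
    "svm": False,
    "rf": True,
    "gbr": True,
}

def select_models(include=None, exclude=None):
    """Pick model IDs: an explicit include list, else Turbo models minus exclusions."""
    if include is not None:
        return [model_id for model_id in include if model_id in MODEL_LIBRARY]
    excluded = set(exclude or [])
    return [
        model_id
        for model_id, turbo in MODEL_LIBRARY.items()
        if turbo and model_id not in excluded
    ]

print(select_models())                       # ['lr', 'ridge', 'rf', 'gbr']
print(select_models(exclude=["lr"]))         # ['ridge', 'rf', 'gbr']
print(select_models(include=["svm", "rf"]))  # ['svm', 'rf']
```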
You can use the `include` and `exclude` parameters of `compare_models` to train only selected models, or to exclude specific models from training, by passing the model IDs.
compare_tree_models = compare_models(include = ['dt', 'rf', 'et', 'gbr', 'xgboost', 'lightgbm', 'catboost'])
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
---|---|---|---|---|---|---|---|---|
gbr | Gradient Boosting Regressor | 2701.9135 | 23548622.1598 | 4832.9291 | 0.8320 | 0.4447 | 0.3137 | 0.0640 |
rf | Random Forest Regressor | 2772.9195 | 25409792.9692 | 5028.1973 | 0.8173 | 0.4687 | 0.3298 | 0.0750 |
catboost | CatBoost Regressor | 2899.4825 | 25762752.2096 | 5057.5778 | 0.8163 | 0.4815 | 0.3522 | 0.0460 |
lightgbm | Light Gradient Boosting Machine | 3001.8884 | 25547324.5813 | 5044.5767 | 0.8147 | 0.5445 | 0.3784 | 0.0480 |
et | Extra Trees Regressor | 2833.3624 | 28427844.2412 | 5305.6516 | 0.7991 | 0.4877 | 0.3363 | 0.0760 |
xgboost | Extreme Gradient Boosting | 3439.8892 | 32826514.4000 | 5711.7335 | 0.7626 | 0.6221 | 0.4465 | 0.0420 |
dt | Decision Tree Regressor | 3399.1402 | 48100203.3847 | 6915.2984 | 0.6476 | 0.5629 | 0.4052 | 0.0410 |
compare_tree_models
GradientBoostingRegressor(random_state=123)
The function above returns the trained model object as its output. The scoring grid is only displayed and not returned. If you need access to the scoring grid, you can use the `pull` function to access the dataframe.
compare_tree_models_results = pull()
compare_tree_models_results
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
---|---|---|---|---|---|---|---|---|
gbr | Gradient Boosting Regressor | 2701.9135 | 2.354862e+07 | 4832.9291 | 0.8320 | 0.4447 | 0.3137 | 0.064 |
rf | Random Forest Regressor | 2772.9195 | 2.540979e+07 | 5028.1973 | 0.8173 | 0.4687 | 0.3298 | 0.075 |
catboost | CatBoost Regressor | 2899.4825 | 2.576275e+07 | 5057.5778 | 0.8163 | 0.4815 | 0.3522 | 0.046 |
lightgbm | Light Gradient Boosting Machine | 3001.8884 | 2.554732e+07 | 5044.5767 | 0.8147 | 0.5445 | 0.3784 | 0.048 |
et | Extra Trees Regressor | 2833.3624 | 2.842784e+07 | 5305.6516 | 0.7991 | 0.4877 | 0.3363 | 0.076 |
xgboost | Extreme Gradient Boosting | 3439.8892 | 3.282651e+07 | 5711.7335 | 0.7626 | 0.6221 | 0.4465 | 0.042 |
dt | Decision Tree Regressor | 3399.1402 | 4.810020e+07 | 6915.2984 | 0.6476 | 0.5629 | 0.4052 | 0.041 |
By default, `compare_models` returns the single best-performing model based on the metric defined in the `sort` parameter. Let's change our code to return the top 3 models based on `MAE`.
best_mae_models_top3 = compare_models(sort = 'MAE', n_select = 3)
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
---|---|---|---|---|---|---|---|---|
gbr | Gradient Boosting Regressor | 2701.9135 | 23548622.1598 | 4832.9291 | 0.8320 | 0.4447 | 0.3137 | 0.0640 |
rf | Random Forest Regressor | 2772.9195 | 25409792.9692 | 5028.1973 | 0.8173 | 0.4687 | 0.3298 | 0.0800 |
et | Extra Trees Regressor | 2833.3624 | 28427844.2412 | 5305.6516 | 0.7991 | 0.4877 | 0.3363 | 0.0800 |
catboost | CatBoost Regressor | 2899.4825 | 25762752.2096 | 5057.5778 | 0.8163 | 0.4815 | 0.3522 | 0.0420 |
lightgbm | Light Gradient Boosting Machine | 3001.8884 | 25547324.5813 | 5044.5767 | 0.8147 | 0.5445 | 0.3784 | 0.0500 |
dt | Decision Tree Regressor | 3399.1402 | 48100203.3847 | 6915.2984 | 0.6476 | 0.5629 | 0.4052 | 0.0430 |
xgboost | Extreme Gradient Boosting | 3439.8892 | 32826514.4000 | 5711.7335 | 0.7626 | 0.6221 | 0.4465 | 0.0530 |
huber | Huber Regressor | 3461.7327 | 49057640.5613 | 6981.8576 | 0.6528 | 0.4815 | 0.2188 | 0.0490 |
par | Passive Aggressive Regressor | 3536.1733 | 48501878.1363 | 6940.1967 | 0.6566 | 0.4785 | 0.2154 | 0.0480 |
knn | K Neighbors Regressor | 3778.4582 | 38143971.2000 | 6165.0463 | 0.7277 | 0.5027 | 0.3690 | 0.0470 |
ada | AdaBoost Regressor | 4175.5916 | 28401799.0579 | 5321.7006 | 0.7976 | 0.6263 | 0.7144 | 0.0470 |
ridge | Ridge Regression | 4296.0642 | 38392999.7849 | 6176.6160 | 0.7308 | 0.5710 | 0.4397 | 0.0420 |
llar | Lasso Least Angle Regression | 4298.6038 | 38369142.0849 | 6174.9424 | 0.7309 | 0.5786 | 0.4424 | 0.0450 |
br | Bayesian Ridge | 4300.6286 | 38387539.9069 | 6176.4192 | 0.7307 | 0.5881 | 0.4419 | 0.0480 |
lasso | Lasso Regression | 4302.2469 | 38386534.5553 | 6176.4463 | 0.7306 | 0.5913 | 0.4430 | 0.0430 |
lar | Least Angle Regression | 4303.5559 | 38388058.4578 | 6176.5920 | 0.7306 | 0.5949 | 0.4433 | 0.0420 |
lr | Linear Regression | 4312.6186 | 38452749.8007 | 6182.4796 | 0.7298 | 0.6285 | 0.4460 | 0.0430 |
omp | Orthogonal Matching Pursuit | 5754.7769 | 57503207.7233 | 7566.7086 | 0.5997 | 0.7418 | 0.8990 | 0.0460 |
en | Elastic Net | 7571.4598 | 104738034.4707 | 10182.3291 | 0.2846 | 0.8954 | 1.2888 | 0.0450 |
dummy | Dummy Regressor | 9192.5418 | 148516792.8000 | 12132.4733 | -0.0175 | 1.0154 | 1.5637 | 0.0400 |
# list of top 3 models by MAE
best_mae_models_top3
[GradientBoostingRegressor(random_state=123), RandomForestRegressor(n_jobs=-1, random_state=123), ExtraTreesRegressor(n_jobs=-1, random_state=123)]
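The ranking step itself is simple to reproduce. The sketch below sorts a few rows copied from the scoring grid above and keeps the top `n_select` entries. `top_models` is a hypothetical helper, and it assumes that error metrics such as MAE are ranked in ascending order while R2 is ranked in descending order.

```python
# (ID, MAE, R2) copied from a few rows of the scoring grid above.
GRID = [
    ("catboost", 2899.4825, 0.8163),
    ("et", 2833.3624, 0.7991),
    ("gbr", 2701.9135, 0.8320),
    ("rf", 2772.9195, 0.8173),
    ("dt", 3399.1402, 0.6476),
]
COLUMNS = {"MAE": 1, "R2": 2}
HIGHER_IS_BETTER = {"R2"}

def top_models(sort="R2", n_select=1):
    """Return the IDs of the n_select best rows for the chosen metric."""
    column = COLUMNS[sort]
    ranked = sorted(GRID, key=lambda row: row[column], reverse=sort in HIGHER_IS_BETTER)
    return [row[0] for row in ranked[:n_select]]

print(top_models(sort="MAE", n_select=3))  # ['gbr', 'rf', 'et']
print(top_models(sort="R2", n_select=3))   # ['gbr', 'rf', 'catboost']
```

Note how the third model changes with the metric: `et` has a lower MAE than `catboost`, but `catboost` has the higher R2.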
`compare_models` has several other parameters that you might find very useful. You can check the docstring of the function for more info.
# help(compare_models)
PyCaret integrates with many different types of experiment loggers (default = 'mlflow'). To turn on experiment tracking in PyCaret, you can set the `log_experiment` and `experiment_name` parameters. It will automatically track all the metrics, hyperparameters, and artifacts based on the defined logger.
# from pycaret.regression import *
# s = setup(data, target = 'charges', log_experiment='mlflow', experiment_name='insurance_experiment')
# compare models
# best = compare_models()