#!/usr/bin/env python
# coding: utf-8

# Last updated: 16 Feb 2023
#
# # 👋 PyCaret Time Series Forecasting Tutorial
#
# PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that substantially speeds up the experiment cycle and makes you more productive.
#
# Compared with other open-source machine learning libraries, PyCaret is a low-code alternative that can replace hundreds of lines of code with only a few. This makes experiments much faster and more efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few more.
#
# The design and simplicity of PyCaret are inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen data scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more technical expertise.
#
# # 💻 Installation
#
# PyCaret is tested and supported on the following 64-bit systems:
# - Python 3.7 – 3.10
# - Python 3.9 for Ubuntu only
# - Ubuntu 16.04 or later
# - Windows 7 or later
#
# You can install PyCaret with Python's pip package manager:
#
# `pip install pycaret`
#
# PyCaret's default installation does not install all the optional dependencies. To get them, install the full version:
#
# `pip install pycaret[full]`
#
# or, depending on your use case, install one of the following variants:
#
# - `pip install pycaret[analysis]`
# - `pip install pycaret[models]`
# - `pip install pycaret[tuner]`
# - `pip install pycaret[mlops]`
# - `pip install pycaret[parallel]`
# - `pip install pycaret[test]`

# In[1]:

# check installed version
import pycaret
pycaret.__version__

# # 🚀 Quick start

# PyCaret's time series forecasting module is now available. The module is currently suitable for univariate / multivariate time series forecasting tasks. The API of the time series module is consistent with the other modules of PyCaret.
#
# It comes with built-in preprocessing capabilities and over 30 algorithms comprising statistical / time-series methods as well as machine learning based models. In addition to model training, this module has many other capabilities such as automated hyperparameter tuning, ensembling, model analysis, model packaging, and deployment.
#
# A typical workflow in PyCaret consists of the following 5 steps, in this order:
#
# ### **Setup** ➡️ **Compare Models** ➡️ **Analyze Model** ➡️ **Prediction** ➡️ **Save Model**
# In[2]:

# load the sample dataset from the pycaret datasets module
from pycaret.datasets import get_data
data = get_data('airline')

# In[3]:

# plot the dataset
data.plot()

# ## Setup
# This function initializes the training environment and creates the transformation pipeline. The `setup` function must be called before executing any other function in PyCaret. `setup` has only one required parameter, `data`. All other parameters are optional.

# In[4]:

# import pycaret time series and init setup
from pycaret.time_series import *
s = setup(data, fh = 3, session_id = 123)

# Once the setup has been successfully executed, it shows an information grid containing experiment-level information.
#
# - **Session id:** A pseudo-random number distributed as a seed in all functions for later reproducibility. If no `session_id` is passed, a random number is automatically generated and distributed to all functions.
#
# - **Approach:** Univariate or multivariate.
#
# - **Exogenous Variables:** Exogenous variables to be used in the model.
#
# - **Original data shape:** Shape of the original data prior to any transformations.
#
# - **Transformed train set shape:** Shape of the transformed train set.
#
# - **Transformed test set shape:** Shape of the transformed test set.
#
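# The two shape rows above follow from `fh`. As a rough plain-Python sketch (not PyCaret's own code, and assuming the hold-out is simply the last `fh` observations), the split can be pictured as below. `temporal_split` and the stand-in series are hypothetical names used only for illustration.

```python
# Plain-Python illustration only; PyCaret performs its split internally.
def temporal_split(series, fh):
    """Hold out the last `fh` observations as the test set."""
    if not 0 < fh < len(series):
        raise ValueError("fh must be between 1 and len(series) - 1")
    return series[:-fh], series[-fh:]


# Stand-in for a 144-point monthly series (assumed length of the airline data).
series = list(range(144))
train, test = temporal_split(series, fh=3)
print(len(train), len(test))
```

# With a 144-row series and `fh = 3`, this would leave 141 rows for training and 3 for testing.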
# PyCaret has two sets of APIs that you can work with: (1) Functional (as seen above) and (2) Object-Oriented.
#
# With the Object-Oriented API, instead of executing functions directly, you import a class and execute its methods.

# In[5]:

# import TSForecastingExperiment and init the class
from pycaret.time_series import TSForecastingExperiment
exp = TSForecastingExperiment()

# In[6]:

# check the type of exp
type(exp)

# In[7]:

# init setup on exp
exp.setup(data, fh = 3, session_id = 123)

# You can use either of the two methods, Functional or OOP, and even switch back and forth between the two sets of APIs. The choice of method does not impact the results and has been tested for consistency.

# ## Check Stats
# The `check_stats` function is used to get summary statistics and run statistical tests on the original data or model residuals.

# In[8]:

# check statistical tests on original data
check_stats()

# ## Compare Models
#
# This function trains and evaluates the performance of all the estimators available in the model library using cross-validation. The output of this function is a scoring grid with average cross-validated scores. Metrics evaluated during CV can be accessed using the `get_metrics` function. Custom metrics can be added or removed using the `add_metric` and `remove_metric` functions.

# In[9]:

# compare baseline models
best = compare_models()

# In[10]:

# compare models using OOP
exp.compare_models()

# Notice that the output of the functional and OOP APIs is consistent. The rest of the functions in this notebook are shown using the functional API only.
#
# ___

# ## Analyze Model
# You can use the `plot_model` function to analyze the performance of a trained model on the test set. It may require re-training the model in certain cases.
# In[11]:

# plot forecast
plot_model(best, plot = 'forecast')

# In[12]:

# plot forecast for 36 months in the future
plot_model(best, plot = 'forecast', data_kwargs = {'fh' : 36})

# In[13]:

# residuals plot
plot_model(best, plot = 'residuals')

# In[14]:

# check docstring to see available plots
# help(plot_model)

# An alternative to the `plot_model` function is `evaluate_model`. It can only be used in a notebook since it uses ipywidgets.

# ## Prediction
# The `predict_model` function returns `y_pred`. When `data` is `None` (default), it uses the `fh` defined in the `setup` function.

# In[15]:

# predict on test set
holdout_pred = predict_model(best)

# In[16]:

# show predictions df
holdout_pred.head()

# In[17]:

# generate forecast for 36 periods in the future
predict_model(best, fh = 36)

# ## Save Model
# Finally, you can save the entire pipeline on disk for later use, using PyCaret's `save_model` function.

# In[18]:

# save pipeline
save_model(best, 'my_first_pipeline')

# In[19]:

# load pipeline
loaded_best_pipeline = load_model('my_first_pipeline')
loaded_best_pipeline

# # 👇 Detailed function-by-function overview

# ## ✅ Setup
# This function initializes the training environment and creates the transformation pipeline. The `setup` function must be called before executing any other function in PyCaret. `setup` has only one required parameter, `data`. All other parameters are optional.

# In[20]:

s = setup(data, fh = 3, session_id = 123)

# To access the variables created by the `setup` function, such as the transformed dataset, random_state, etc., you can use the `get_config` function.
# In[21]:

# check all available config
get_config()

# In[22]:

# let's access y_train_transformed
get_config('y_train_transformed')

# In[23]:

# another example: let's access seed
print("The current seed is: {}".format(get_config('seed')))

# now let's change it using set_config
set_config('seed', 786)
print("The new seed is: {}".format(get_config('seed')))

# All the preprocessing configurations and experiment settings/parameters are passed into the `setup` function. To see all available parameters, check the docstring:

# In[24]:

# help(setup)

# In[25]:

# init setup with fold_strategy = 'expanding' and drift imputation of the target
s = setup(data, fh = 3, session_id = 123,
          fold_strategy = 'expanding',
          numeric_imputation_target = 'drift')

# ## ✅ Compare Models
# This function trains and evaluates the performance of all estimators available in the model library using cross-validation. The output of this function is a scoring grid with average cross-validated scores. Metrics evaluated during CV can be accessed using the `get_metrics` function. Custom metrics can be added or removed using the `add_metric` and `remove_metric` functions.

# In[26]:

best = compare_models()

# By default, `compare_models` uses all the estimators in the model library (all except models with `Turbo=False`). To see all available models, you can use the `models()` function.

# In[27]:

# check available models
models()

# You can use the `include` and `exclude` parameters of `compare_models` to train only selected models, or to exclude specific models from training, by passing the model IDs.

# In[28]:

compare_ts_models = compare_models(include = ['ets', 'arima', 'theta', 'naive', 'snaive', 'grand_means', 'polytrend'])

# In[29]:

compare_ts_models

# The function above returns the trained model object as its output. The scoring grid is only displayed and not returned. If you need access to the scoring grid, you can use the `pull` function to get it as a dataframe.
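# The cross-validated scores in the grid depend on how folds are cut from the training data. The sketch below shows one generic expanding-window scheme in plain Python. It is only an illustration of what `fold_strategy = 'expanding'` means, not PyCaret's implementation, and the exact window placement in the library may differ. `expanding_window_folds` is a hypothetical helper.

```python
# Generic expanding-window cross-validation in plain Python (illustrative only).
def expanding_window_folds(n_obs, fh, n_folds):
    """Return (train_indices, test_indices) pairs with a growing training window."""
    folds = []
    for k in range(n_folds, 0, -1):
        train_end = n_obs - k * fh  # the training window ends where this fold's test window starts
        if train_end <= 0:
            raise ValueError("not enough observations for the requested folds")
        train_idx = list(range(train_end))
        test_idx = list(range(train_end, train_end + fh))
        folds.append((train_idx, test_idx))
    return folds


# Assumed sizes: 141 training observations, horizon of 3, 3 folds.
folds = expanding_window_folds(n_obs=141, fh=3, n_folds=3)
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

# Each fold trains on everything before its test window, so later folds see more history than earlier ones.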
# In[30]:

compare_ts_models_results = pull()
compare_ts_models_results

# By default, `compare_models` returns the single best-performing model based on the metric defined in the `sort` parameter. Let's change our code to return the top 3 models based on `MAE`.

# In[31]:

best_mae_models_top3 = compare_models(sort = 'MAE', n_select = 3)

# In[32]:

# list of top 3 models by MAE
best_mae_models_top3

# Some other parameters that you might find very useful in `compare_models` are:
#
# - fold
# - cross_validation
# - budget_time
# - errors
# - parallel
# - engine
#
# You can check the docstring of the function for more info.

# In[33]:

# help(compare_models)

# ## ✅ Check Stats
# The `check_stats` function is used to get summary statistics and run statistical tests on the original data or model residuals.

# In[34]:

# check stats on original data
check_stats()

# In[35]:

# check_stats on residuals of best model
check_stats(estimator = best)

# ## ✅ Experiment Logging
# PyCaret integrates with many different types of experiment loggers (default = 'mlflow'). To turn on experiment tracking in PyCaret, set the `log_experiment` and `experiment_name` parameters. It will automatically track all the metrics, hyperparameters, and artifacts based on the defined logger.

# In[36]:

# from pycaret.time_series import *
# s = setup(data, fh = 3, session_id = 123, log_experiment='mlflow', experiment_name='airline_experiment')

# In[37]:

# compare models
# best = compare_models()

# In[38]:

# start mlflow server on localhost:5000
# !mlflow ui

# By default, PyCaret uses the `MLflow` logger, which can be changed using the `log_experiment` parameter. The following loggers are available:
#
# - mlflow
# - wandb
# - comet_ml
# - dagshub
#
# Other logging-related parameters that you may find useful are:
#
# - experiment_custom_tags
# - log_plots
# - log_data
# - log_profile
#
# For more information, check out the docstring of the `setup` function.
# In[39]:

# help(setup)

# ## ✅ Create Model
# This function trains and evaluates the performance of a given estimator using cross-validation. The output of this function is a scoring grid with CV scores by fold. Metrics evaluated during CV can be accessed using the `get_metrics` function. Custom metrics can be added or removed using the `add_metric` and `remove_metric` functions. All the available models can be listed using the `models` function.

# In[40]:

# check all the available models
models()

# In[41]:

# train ets with default fold=3
ets = create_model('ets')

# The function above returns the trained model object as its output. The scoring grid is only displayed and not returned. If you need access to the scoring grid, you can use the `pull` function to get it as a dataframe.

# In[42]:

ets_results = pull()
print(type(ets_results))
ets_results

# In[43]:

# train theta model with fold=5
theta = create_model('theta', fold=5)

# In[44]:

# train theta with specific model parameters
create_model('theta', deseasonalize = False, fold=5)

# Some other parameters that you might find very useful in `create_model` are:
#
# - cross_validation
# - engine
# - fit_kwargs
#
# You can check the docstring of the function for more info.

# In[45]:

# help(create_model)

# ## ✅ Tune Model
#
# The `tune_model` function tunes the hyperparameters of the model. The output of this function is a scoring grid with cross-validated scores by fold. The best model is selected based on the metric defined in the `optimize` parameter. Metrics evaluated during cross-validation can be accessed using the `get_metrics` function. Custom metrics can be added or removed using the `add_metric` and `remove_metric` functions.

# In[46]:

# train a dt model with default params
dt = create_model('dt_cds_dt')

# In[47]:

# tune hyperparameters of dt
tuned_dt = tune_model(dt)

# The metric to optimize can be defined in the `optimize` parameter (default = 'MASE'). A custom tuning grid can also be passed with the `custom_grid` parameter.
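# For reference, MASE (mean absolute scaled error) divides the forecast's mean absolute error by the in-sample mean absolute error of a naive forecast. A minimal plain-Python sketch, assuming a seasonal period `sp = 1` by default and a hypothetical helper name `mase`; PyCaret's own metric may handle seasonality and edge cases differently:

```python
# Plain-Python MASE, for illustration only (not PyCaret's implementation).
def mase(y_train, y_true, y_pred, sp=1):
    """Mean absolute error of the forecast, scaled by the in-sample naive MAE."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_train) <= sp:
        raise ValueError("y_train must be longer than the seasonal period")
    # In-sample errors of the (seasonal) naive forecast: y[t] predicted by y[t - sp].
    naive_errors = [abs(y_train[i] - y_train[i - sp]) for i in range(sp, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive forecast error is zero; MASE is undefined")
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / scale


# Toy numbers chosen so the arithmetic is easy to follow.
y_train = [10, 12, 14, 16, 18]   # naive in-sample MAE = 2
y_true = [20, 22]
y_pred = [19, 24]                # forecast MAE = 1.5
score = mase(y_train, y_true, y_pred)
print(score)  # 0.75
```

# Values below 1 mean the forecast has a lower average error than the naive benchmark had on the training data.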
# In[48]:

dt

# In[49]:

# define tuning grid
dt_grid = {'regressor__max_depth' : [None, 2, 4, 6, 8, 10, 12]}

# tune model with custom grid and metric = MAE
tuned_dt = tune_model(dt, custom_grid = dt_grid, optimize = 'MAE')

# In[50]:

# see tuned_dt params
tuned_dt

# In[51]:

# to access the tuner object you can set return_tuner = True
tuned_dt, tuner = tune_model(dt, return_tuner=True)

# In[52]:

# model object
tuned_dt

# In[53]:

# tuner object
tuner

# For more details on the available `search_library` and `search_algorithm` options, please check the docstring. Some other parameters that you might find very useful in `tune_model` are:
#
# - choose_better
# - custom_scorer
# - n_iter
# - search_algorithm
# - optimize
#
# You can check the docstring of the function for more info.

# In[54]:

# help(tune_model)

# ## ✅ Blend Models
# This function trains an `EnsembleForecaster` for the selected models passed in the `estimator_list` parameter. The output of this function is a scoring grid with CV scores by fold. Metrics evaluated during CV can be accessed using the `get_metrics` function. Custom metrics can be added or removed using the `add_metric` and `remove_metric` functions.

# In[55]:

# top 3 models based on mae
best_mae_models_top3

# In[56]:

# blend top 3 models
blend_models(best_mae_models_top3)

# Some other parameters that you might find very useful in `blend_models` are:
#
# - choose_better
# - method
# - weights
# - fit_kwargs
# - optimize
#
# You can check the docstring of the function for more info.

# In[57]:

# help(blend_models)

# ## ✅ Plot Model
# This function analyzes the performance of a trained model on the hold-out set. It may require re-training the model in certain cases.
# In[58]:

# plot forecast
plot_model(best, plot = 'forecast')

# In[59]:

# plot acf
# for certain plots you don't need a trained model
plot_model(plot = 'acf')

# In[60]:

# plot diagnostics
# for certain plots you don't need a trained model
plot_model(plot = 'diagnostics')

# Some other parameters that you might find very useful in `plot_model` are:
#
# - fig_kwargs
# - data_kwargs
# - display_format
# - return_fig
# - return_data
# - save
#
# You can check the docstring of the function for more info.

# In[61]:

# help(plot_model)

# ## ✅ Finalize Model
# This function trains a given model on the entire dataset, including the hold-out set.

# In[62]:

final_best = finalize_model(best)

# In[63]:

final_best

# ## ✅ Deploy Model
# This function deploys the entire ML pipeline on the cloud.
#
# **AWS:** When deploying a model on AWS S3, environment variables must be configured using the command-line interface. To configure AWS environment variables, type `aws configure` in a terminal. The following information is required and can be generated using the Identity and Access Management (IAM) portal of your Amazon console account:
#
# - AWS Access Key ID
# - AWS Secret Access Key
# - Default Region Name (can be seen under Global settings on your AWS console)
# - Default output format (must be left blank)
#
# **GCP:** To deploy a model on Google Cloud Platform ('gcp'), the project must be created using the command line or the GCP console. Once the project is created, you must create a service account and download the service account key as a JSON file to set environment variables in your local environment. Learn more about it: https://cloud.google.com/docs/authentication/production
#
# **Azure:** To deploy a model on Microsoft Azure ('azure'), environment variables for the connection string must be set in your local environment. Go to the settings of your storage account on the Azure portal to access the required connection string.
# AZURE_STORAGE_CONNECTION_STRING (required as environment variable)
#
# Learn more about it: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python?toc=%2Fpython%2Fazure%2FTOC.json

# In[64]:

# deploy model on aws s3
# deploy_model(best, model_name = 'my_first_platform_on_aws',
#              platform = 'aws', authentication = {'bucket' : 'pycaret-test'})

# In[65]:

# load model from aws s3
# loaded_from_aws = load_model(model_name = 'my_first_platform_on_aws', platform = 'aws',
#                              authentication = {'bucket' : 'pycaret-test'})
# loaded_from_aws

# ## ✅ Save / Load Model
# This function saves the transformation pipeline and a trained model object into the current working directory as a pickle file for later use.

# In[66]:

# save model
save_model(best, 'my_first_model')

# In[67]:

# load model
loaded_from_disk = load_model('my_first_model')
loaded_from_disk

# ## ✅ Save / Load Experiment
# This function saves all the experiment variables on disk, allowing you to resume later without rerunning the `setup` function.

# In[68]:

# save experiment
save_experiment('my_experiment')

# In[69]:

# load experiment from disk
exp_from_disk = load_experiment('my_experiment', data=data)
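
# The Save / Load Model section above notes that `save_model` writes a pickle file. The general round trip can be sketched with the standard library alone. The dictionary below is a hypothetical stand-in for a fitted pipeline, and PyCaret's actual serialization details are not shown here.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted pipeline object (not a real PyCaret model).
pipeline = {"model": "naive", "fh": 3, "last_value": 432}

# Write the object to disk and read it back, using a temporary directory
# so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_first_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == pipeline)
```

# Only load pickle files from sources you trust, since unpickling can execute arbitrary code.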