Last updated: 15 Feb 2023
PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle and makes you more productive.
Compared with other open-source machine learning libraries, PyCaret is a low-code alternative that can replace hundreds of lines of code with only a few. This makes experiments much faster and more efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few more.
The design and simplicity of PyCaret are inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen data scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more technical expertise.
PyCaret is tested and supported on the following 64-bit systems:
You can install PyCaret with Python's pip package manager:
pip install pycaret
PyCaret's default installation will not install all the extra dependencies automatically. For that you will have to install the full version:
pip install pycaret[full]
or, depending on your use case, you may install one of the following variants:
pip install pycaret[analysis]
pip install pycaret[models]
pip install pycaret[tuner]
pip install pycaret[mlops]
pip install pycaret[parallel]
pip install pycaret[test]
# check installed version
import pycaret
pycaret.__version__
'3.0.0'
PyCaret’s Anomaly Detection Module is an unsupervised machine learning module that is used for identifying rare items, events, or observations that raise suspicions by differing significantly from the majority of the data.
Typically, the anomalous items will translate to some kind of problem, such as bank fraud, a structural defect, medical problems, or errors.
PyCaret's Anomaly Detection module provides several pre-processing features to prepare the data for modeling through the setup function. It has over 10 ready-to-use algorithms and a few plots to analyze the performance of trained models.
A typical workflow in PyCaret's unsupervised module consists of the following 6 steps, in this order:
Setup ➡️ Create Model ➡️ Assign Labels ➡️ Analyze Model ➡️ Prediction ➡️ Save Model
# loading sample dataset from pycaret dataset module
from pycaret.datasets import get_data
data = get_data('anomaly')
| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 |
1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 |
2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 |
3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 |
4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 |
This function initializes the training environment and creates the transformation pipeline. The setup function must be called before executing any other function. It takes one mandatory parameter only: data. All the other parameters are optional.
# import pycaret anomaly and init setup
from pycaret.anomaly import *
s = setup(data, session_id = 123)
| | Description | Value |
---|---|---|
0 | Session id | 123 |
1 | Original data shape | (1000, 10) |
2 | Transformed data shape | (1000, 10) |
3 | Numeric features | 10 |
4 | Preprocess | True |
5 | Imputation type | simple |
6 | Numeric imputation | mean |
7 | Categorical imputation | mode |
8 | CPU Jobs | -1 |
9 | Use GPU | False |
10 | Log Experiment | False |
11 | Experiment Name | anomaly-default-name |
12 | USI | 7433 |
Once the setup has been successfully executed, it shows an information grid containing experiment-level information.
If no session_id is passed, a random number is automatically generated and distributed to all functions.
PyCaret has two sets of APIs that you can work with: (1) the Functional API (as seen above) and (2) the Object Oriented API.
With the Object Oriented API, instead of executing functions directly, you import a class and execute its methods.
# import AnomalyExperiment and init the class
from pycaret.anomaly import AnomalyExperiment
exp = AnomalyExperiment()
# check the type of exp
type(exp)
pycaret.anomaly.oop.AnomalyExperiment
# init setup on exp
exp.setup(data, session_id = 123)
| | Description | Value |
---|---|---|
0 | Session id | 123 |
1 | Original data shape | (1000, 10) |
2 | Transformed data shape | (1000, 10) |
3 | Numeric features | 10 |
4 | Preprocess | True |
5 | Imputation type | simple |
6 | Numeric imputation | mean |
7 | Categorical imputation | mode |
8 | CPU Jobs | -1 |
9 | Use GPU | False |
10 | Log Experiment | False |
11 | Experiment Name | anomaly-default-name |
12 | USI | 013f |
<pycaret.anomaly.oop.AnomalyExperiment at 0x21f932a96d0>
You can use either of the two methods, i.e. Functional or OOP, and even switch back and forth between the two sets of APIs. The choice of method does not impact the results and has been tested for consistency.
This function trains an unsupervised anomaly detection model. All the available models can be accessed using the models function.
# train iforest model
iforest = create_model('iforest')
iforest
IForest(behaviour='new', bootstrap=False, contamination=0.05, max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1, random_state=123, verbose=0)
# to check all the available models
models()
ID | Name | Reference |
---|---|---|
abod | Angle-base Outlier Detection | pyod.models.abod.ABOD |
cluster | Clustering-Based Local Outlier | pyod.models.cblof.CBLOF |
cof | Connectivity-Based Local Outlier | pycaret.internal.patches.pyod.COFPatched |
iforest | Isolation Forest | pyod.models.iforest.IForest |
histogram | Histogram-based Outlier Detection | pyod.models.hbos.HBOS |
knn | K-Nearest Neighbors Detector | pyod.models.knn.KNN |
lof | Local Outlier Factor | pyod.models.lof.LOF |
svm | One-class SVM detector | pyod.models.ocsvm.OCSVM |
pca | Principal Component Analysis | pyod.models.pca.PCA |
mcd | Minimum Covariance Determinant | pyod.models.mcd.MCD |
sod | Subspace Outlier Detection | pycaret.internal.patches.pyod.SODPatched |
sos | Stochastic Outlier Selection | pycaret.internal.patches.pyod.SOSPatched |
This function assigns anomaly labels to the training data, given a trained model.
iforest_anomalies = assign_model(iforest)
iforest_anomalies
| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Anomaly | Anomaly_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 | 0 | -0.035865 |
1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 | 0 | -0.084927 |
2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 | 1 | 0.025356 |
3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 | 1 | 0.042415 |
4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 | 0 | -0.023408 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 0.305055 | 0.656837 | 0.331665 | 0.822525 | 0.907127 | 0.882276 | 0.855732 | 0.584786 | 0.808640 | 0.242762 | 0 | -0.083981 |
996 | 0.812627 | 0.864258 | 0.616604 | 0.167966 | 0.811223 | 0.938071 | 0.418462 | 0.472306 | 0.348347 | 0.671129 | 0 | -0.075839 |
997 | 0.250967 | 0.138627 | 0.919703 | 0.461234 | 0.886555 | 0.869888 | 0.800908 | 0.530324 | 0.779433 | 0.234952 | 0 | -0.052903 |
998 | 0.502436 | 0.936820 | 0.580062 | 0.540773 | 0.151995 | 0.059452 | 0.225220 | 0.242755 | 0.279385 | 0.538755 | 0 | -0.075104 |
999 | 0.457991 | 0.017755 | 0.714113 | 0.125992 | 0.063316 | 0.154739 | 0.922974 | 0.692299 | 0.816777 | 0.307592 | 0 | -0.008665 |
1000 rows × 12 columns
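The Anomaly column can be read as a thresholded version of Anomaly_Score. The sketch below shows one simple way such labels could be derived, assuming that a higher score means "more anomalous" and that a fixed fraction of the points is flagged. The helper label_by_fraction and the sample scores are made up for illustration; how the underlying detectors actually compute scores and cut-offs is not covered here.

```python
def label_by_fraction(scores, fraction=0.05):
    """Flag the highest-scoring `fraction` of points as anomalies (1), the rest as 0."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    n_outliers = max(1, round(len(scores) * fraction))
    # the n-th largest score becomes the cut-off
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    labels = [1 if score >= threshold else 0 for score in scores]
    return labels, threshold


# illustrative scores only, not taken from a real model
scores = [-0.04, -0.08, 0.03, 0.04, -0.02, -0.08, -0.07, -0.05, -0.07, -0.01]
labels, threshold = label_by_fraction(scores, fraction=0.2)

print(labels)     # 1 marks a flagged point
print(threshold)  # cut-off learned from these scores
```

With 10 scores and a fraction of 0.2, the two largest scores are flagged and the smaller of the two becomes the threshold.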
You can use the plot_model function to analyze the performance of a trained model on the test set. It may require re-training the model in certain cases.
# tsne plot anomalies
plot_model(iforest, plot = 'tsne')
# check docstring to see available plots
# help(plot_model)
An alternative to the plot_model function is evaluate_model. It can only be used in a notebook since it uses ipywidgets.
evaluate_model(iforest)
The predict_model function returns the Anomaly and Anomaly_Score labels as new columns in the input dataframe. This step may or may not be needed depending on the use case. Sometimes anomaly detection models are trained for analysis purposes only, and the user is only interested in the labels assigned to the training dataset, which can be obtained using the assign_model function. predict_model is only useful when you want to obtain anomaly labels on unseen data (i.e. data that was not used during training the model).
# predict on test set
iforest_pred = predict_model(iforest, data=data)
iforest_pred
| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Anomaly | Anomaly_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 | 0 | -0.035865 |
1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 | 0 | -0.084927 |
2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 | 1 | 0.025356 |
3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 | 1 | 0.042415 |
4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 | 0 | -0.023408 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 0.305055 | 0.656837 | 0.331665 | 0.822525 | 0.907127 | 0.882276 | 0.855732 | 0.584786 | 0.808640 | 0.242762 | 0 | -0.083981 |
996 | 0.812627 | 0.864258 | 0.616604 | 0.167966 | 0.811223 | 0.938071 | 0.418462 | 0.472306 | 0.348347 | 0.671129 | 0 | -0.075839 |
997 | 0.250967 | 0.138627 | 0.919703 | 0.461234 | 0.886555 | 0.869888 | 0.800908 | 0.530324 | 0.779433 | 0.234952 | 0 | -0.052903 |
998 | 0.502436 | 0.936820 | 0.580062 | 0.540773 | 0.151995 | 0.059452 | 0.225220 | 0.242755 | 0.279385 | 0.538755 | 0 | -0.075104 |
999 | 0.457991 | 0.017755 | 0.714113 | 0.125992 | 0.063316 | 0.154739 | 0.922974 | 0.692299 | 0.816777 | 0.307592 | 0 | -0.008665 |
1000 rows × 12 columns
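The difference between labelling training data and scoring unseen data can be illustrated with a toy detector that learns its cut-off once and then reuses it. This is a deliberately simple sketch: MeanDistanceDetector is a made-up class that scores a value by its distance from the training mean. It is not Isolation Forest and not what predict_model does internally.

```python
import statistics


class MeanDistanceDetector:
    """Toy detector: score = absolute distance from the training mean."""

    def __init__(self, fraction=0.05):
        self.fraction = fraction

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        train_scores = sorted((abs(v - self.mean_) for v in values), reverse=True)
        n_outliers = max(1, round(len(values) * self.fraction))
        # cut-off learned from the training data only
        self.threshold_ = train_scores[n_outliers - 1]
        return self

    def predict(self, values):
        # reuse the stored mean and threshold; nothing is re-fitted here
        results = []
        for v in values:
            score = abs(v - self.mean_)
            results.append((1 if score >= self.threshold_ else 0, score))
        return results


train = [10, 11, 9, 10, 12, 8, 10, 11, 9, 30]
detector = MeanDistanceDetector(fraction=0.1).fit(train)

unseen = [10, 40, 12]
predictions = detector.predict(unseen)
unseen_labels = [label for label, _ in predictions]
print(unseen_labels)
```

The unseen values are judged against the mean and threshold stored at fit time, which is the property that makes scoring new data meaningful.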
The same function works for predicting labels on an unseen dataset: pass any data frame that has the same feature columns as the training data.
Finally, you can save the entire pipeline to disk for later use with PyCaret's save_model function.
# save pipeline
save_model(iforest, 'iforest_pipeline')
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=FastMemory(location=C:\Users\owner\AppData\Local\Temp\joblib), steps=[('numerical_imputer', TransformerWrapper(include=['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7', 'Col8', 'Col9', 'Col10'], transformer=SimpleImputer())), ('categorical_imputer', TransformerWrapper(include=[], transformer=SimpleImputer(strategy='most_frequent'))), ('trained_model', IForest(behaviour='new', bootstrap=False, contamination=0.05, max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1, random_state=123, verbose=0))]), 'iforest_pipeline.pkl')
# load pipeline
loaded_iforest_pipeline = load_model('iforest_pipeline')
loaded_iforest_pipeline
Transformation Pipeline and Model Successfully Loaded
Pipeline(memory=FastMemory(location=C:\Users\owner\AppData\Local\Temp\joblib), steps=[('numerical_imputer', TransformerWrapper(include=['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7', 'Col8', 'Col9', 'Col10'], transformer=SimpleImputer())), ('categorical_imputer', TransformerWrapper(include=[], transformer=SimpleImputer(strategy='most_frequent'))), ('trained_model', IForest(behaviour='new', bootstrap=False, contamination=0.05, max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1, random_state=123, verbose=0))])
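The save and load steps amount to a serialization round trip: the object is written to a .pkl file and later rebuilt from it. The sketch below shows that idea with the standard-library pickle module. The toy_pipeline dictionary and the file name are made up for illustration, and using pickle directly is an assumption of this sketch rather than a description of PyCaret's own saving code.

```python
import os
import pickle
import tempfile

# hypothetical stand-in for a fitted pipeline
toy_pipeline = {
    "steps": ["numerical_imputer", "categorical_imputer", "trained_model"],
    "params": {"contamination": 0.05, "random_state": 123},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")

    # save: serialize the object to disk
    with open(path, "wb") as fh:
        pickle.dump(toy_pipeline, fh)

    # load: rebuild an equal object from the file
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == toy_pipeline)
```

The restored object is equal to the original but is a separate copy. Pickle files can execute code when loaded, so only load files that come from a source you trust.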
This function initializes the training environment and creates the transformation pipeline. The setup function must be called before executing any other function. It takes one mandatory parameter only: data. All the other parameters are optional.
s = setup(data, session_id = 123)
| | Description | Value |
---|---|---|
0 | Session id | 123 |
1 | Original data shape | (1000, 10) |
2 | Transformed data shape | (1000, 10) |
3 | Numeric features | 10 |
4 | Preprocess | True |
5 | Imputation type | simple |
6 | Numeric imputation | mean |
7 | Categorical imputation | mode |
8 | CPU Jobs | -1 |
9 | Use GPU | False |
10 | Log Experiment | False |
11 | Experiment Name | anomaly-default-name |
12 | USI | f822 |
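The grid above reports mean imputation for numeric features and mode imputation for categorical ones. As a rough illustration of what those two strategies do to a column with missing values, here is a plain-Python sketch. The impute helper is hypothetical and uses None to mark a missing value; the actual pipeline uses its own imputer objects.

```python
import statistics


def impute(column, strategy):
    """Fill None entries with the mean or the mode of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError("strategy must be 'mean' or 'mode'")
    return [fill if v is None else v for v in column]


numeric_filled = impute([1.0, None, 3.0], "mean")
categorical_filled = impute(["a", "b", None, "a"], "mode")

print(numeric_filled)
print(categorical_filled)
```

The sample dataset in this tutorial has no missing values, which is consistent with the original and transformed shapes being identical in the grid.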
To access all the variables created by the setup function, such as the transformed dataset, random_state, etc., you can use the get_config function.
# check all available config
get_config()
{'USI', 'X', 'X_train', 'X_train_transformed', 'X_transformed', '_available_plots', '_ml_usecase', 'data', 'dataset', 'dataset_transformed', 'exp_id', 'exp_name_log', 'gpu_n_jobs_param', 'gpu_param', 'html_param', 'idx', 'is_multiclass', 'log_plots_param', 'logging_param', 'memory', 'n_jobs_param', 'pipeline', 'seed', 'train', 'train_transformed', 'variable_and_property_keys', 'variables'}
# lets access X_train_transformed
get_config('X_train_transformed')
| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 |
1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 |
2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 |
3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 |
4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 0.305055 | 0.656837 | 0.331665 | 0.822525 | 0.907127 | 0.882276 | 0.855732 | 0.584786 | 0.808640 | 0.242762 |
996 | 0.812627 | 0.864258 | 0.616604 | 0.167966 | 0.811223 | 0.938071 | 0.418462 | 0.472306 | 0.348347 | 0.671129 |
997 | 0.250967 | 0.138627 | 0.919703 | 0.461234 | 0.886555 | 0.869888 | 0.800908 | 0.530324 | 0.779433 | 0.234952 |
998 | 0.502436 | 0.936820 | 0.580062 | 0.540773 | 0.151995 | 0.059452 | 0.225220 | 0.242755 | 0.279385 | 0.538755 |
999 | 0.457991 | 0.017755 | 0.714113 | 0.125992 | 0.063316 | 0.154739 | 0.922974 | 0.692299 | 0.816777 | 0.307592 |
1000 rows × 10 columns
# another example: let's access seed
print("The current seed is: {}".format(get_config('seed')))
# now lets change it using set_config
set_config('seed', 786)
print("The new seed is: {}".format(get_config('seed')))
The current seed is: 123
The new seed is: 786
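The seed matters because any step that draws random numbers will repeat exactly when it starts from the same seed. The sketch below demonstrates that property with Python's random module. The function sample_with_seed is made up for illustration and says nothing about how PyCaret distributes the seed internally.

```python
import random


def sample_with_seed(seed, n=5):
    """Draw n pseudo-random numbers from a generator started at `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


run_a = sample_with_seed(123)
run_b = sample_with_seed(123)
run_c = sample_with_seed(786)

print(run_a == run_b)  # same seed, same draws
print(run_a == run_c)  # different seed, different draws
```

This is why changing the seed between runs can change results for models that rely on random sampling.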
All the preprocessing configurations and experiment settings/parameters are passed into the setup function. To see all available parameters, check the docstring:
# help(setup)
# init setup with bin_numeric_features
s = setup(data, session_id = 123,
          bin_numeric_features=['Col1'])
| | Description | Value |
---|---|---|
0 | Session id | 123 |
1 | Original data shape | (1000, 10) |
2 | Transformed data shape | (1000, 10) |
3 | Numeric features | 10 |
4 | Preprocess | True |
5 | Imputation type | simple |
6 | Numeric imputation | mean |
7 | Categorical imputation | mode |
8 | CPU Jobs | -1 |
9 | Use GPU | False |
10 | Log Experiment | False |
11 | Experiment Name | anomaly-default-name |
12 | USI | 9b1e |
# lets check the X_train_transformed to see effect of params passed
get_config('X_train_transformed')['Col1'].hist()
<AxesSubplot:>
Notice that Col1 was originally a numeric feature with a continuous distribution. After transformation, it has been converted into a categorical feature. We can also access the non-transformed values using get_config and then compare the differences.
get_config('X_train')['Col1'].hist()
<AxesSubplot:>
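To make the effect of binning concrete, the sketch below discretizes a continuous column into ordinal bin indices. It uses equal-width bins for simplicity, which is a swapped-in technique chosen for this illustration and may differ from the binning strategy the pipeline applies. The helper bin_equal_width and the sample values are made up.

```python
def bin_equal_width(values, n_bins=5):
    """Map each value to an ordinal bin index in the range 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # a constant column collapses into a single bin
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # clamp so the maximum value lands in the last bin rather than bin n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


col1 = [0.0, 0.1, 0.25, 0.5, 0.75, 0.99, 1.0]
binned = bin_equal_width(col1)
print(binned)
```

After binning, a histogram of the column shows a handful of discrete bars instead of a continuous spread.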
PyCaret integrates with many different types of experiment loggers (default = 'mlflow'). To turn on experiment tracking in PyCaret, set the log_experiment and experiment_name parameters. It will automatically track all the metrics, hyperparameters, and artifacts based on the defined logger.
# from pycaret.anomaly import *
# s = setup(data, session_id = 123, log_experiment='mlflow', experiment_name='anomaly_project')
# train iforest
# iforest = create_model('iforest')
# start mlflow server on localhost:5000
# !mlflow ui
By default, PyCaret uses the MLflow logger, which can be changed using the log_experiment parameter. The following loggers are available:
- mlflow
- wandb
- comet_ml
- dagshub
Other logging related parameters that you may find useful are:
For more information, check out the docstring of the setup function.
# help(setup)
This function trains an unsupervised anomaly detection model. All the available models can be accessed using the models function.
# check all the available models
models()
ID | Name | Reference |
---|---|---|
abod | Angle-base Outlier Detection | pyod.models.abod.ABOD |
cluster | Clustering-Based Local Outlier | pyod.models.cblof.CBLOF |
cof | Connectivity-Based Local Outlier | pycaret.internal.patches.pyod.COFPatched |
iforest | Isolation Forest | pyod.models.iforest.IForest |
histogram | Histogram-based Outlier Detection | pyod.models.hbos.HBOS |
knn | K-Nearest Neighbors Detector | pyod.models.knn.KNN |
lof | Local Outlier Factor | pyod.models.lof.LOF |
svm | One-class SVM detector | pyod.models.ocsvm.OCSVM |
pca | Principal Component Analysis | pyod.models.pca.PCA |
mcd | Minimum Covariance Determinant | pyod.models.mcd.MCD |
sod | Subspace Outlier Detection | pycaret.internal.patches.pyod.SODPatched |
sos | Stochastic Outlier Selection | pycaret.internal.patches.pyod.SOSPatched |
# train iforest model
iforest = create_model('iforest')
iforest
IForest(behaviour='new', bootstrap=False, contamination=0.05, max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1, random_state=123, verbose=0)
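# train iforest with a specific model parameter
# note: the output below still reports contamination=0.05; the proportion of
# outliers is controlled by create_model's fraction parameter (default 0.05)
create_model('iforest', contamination = 0.1)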
IForest(behaviour='new', bootstrap=False, contamination=0.05, max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1, random_state=123, verbose=0)
# help(create_model)
This function assigns anomaly labels to the dataset for a given model. (1 = outlier, 0 = inlier).
iforest_results = assign_model(iforest)
iforest_results
| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Anomaly | Anomaly_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 | 0 | -0.024763 |
1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 | 0 | -0.083637 |
2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 | 1 | 0.021481 |
3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 | 1 | 0.044031 |
4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 | 0 | -0.026150 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 0.305055 | 0.656837 | 0.331665 | 0.822525 | 0.907127 | 0.882276 | 0.855732 | 0.584786 | 0.808640 | 0.242762 | 0 | -0.076718 |
996 | 0.812627 | 0.864258 | 0.616604 | 0.167966 | 0.811223 | 0.938071 | 0.418462 | 0.472306 | 0.348347 | 0.671129 | 0 | -0.059275 |
997 | 0.250967 | 0.138627 | 0.919703 | 0.461234 | 0.886555 | 0.869888 | 0.800908 | 0.530324 | 0.779433 | 0.234952 | 0 | -0.058574 |
998 | 0.502436 | 0.936820 | 0.580062 | 0.540773 | 0.151995 | 0.059452 | 0.225220 | 0.242755 | 0.279385 | 0.538755 | 0 | -0.089169 |
999 | 0.457991 | 0.017755 | 0.714113 | 0.125992 | 0.063316 | 0.154739 | 0.922974 | 0.692299 | 0.816777 | 0.307592 | 0 | -0.008304 |
1000 rows × 12 columns
# help(assign_model)
# tsne plot of anomalies
plot_model(iforest, plot = 'tsne')
# umap plot of anomalies (you need to install umap library for this separately)
# plot_model(iforest, plot = 'umap')
# help(plot_model)
This function deploys the entire ML pipeline on the cloud.
AWS: When deploying a model on AWS S3, environment variables must be configured using the command-line interface. To configure AWS environment variables, type aws configure in a terminal. It prompts for the AWS Access Key ID, the AWS Secret Access Key, the default region name, and the default output format; the access keys can be generated using the Identity and Access Management (IAM) portal of your Amazon console account.
GCP: To deploy a model on Google Cloud Platform ('gcp'), the project must be created using the command-line or GCP console. Once the project is created, you must create a service account and download the service account key as a JSON file to set environment variables in your local environment. Learn more about it: https://cloud.google.com/docs/authentication/production
Azure: To deploy a model on Microsoft Azure ('azure'), environment variables for the connection string must be set in your local environment. Go to the settings of your storage account on the Azure portal to access the required connection string, and set it as the AZURE_STORAGE_CONNECTION_STRING environment variable. Learn more about it: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python?toc=%2Fpython%2Fazure%2FTOC.json
# deploy model on aws s3
# deploy_model(iforest, model_name = 'my_first_platform_on_aws',
# platform = 'aws', authentication = {'bucket' : 'pycaret-test'})
# load model from aws s3
# loaded_from_aws = load_model(model_name = 'my_first_platform_on_aws', platform = 'aws',
# authentication = {'bucket' : 'pycaret-test'})
# loaded_from_aws
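Since each platform depends on credentials being present in the environment, it can help to check for them before attempting a deployment. The sketch below is one possible pre-flight check using only the standard library. The helper missing_env_vars is hypothetical, and only the AZURE_STORAGE_CONNECTION_STRING name comes from the text above.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


required = ["AZURE_STORAGE_CONNECTION_STRING"]

# demo with explicit dictionaries instead of the real environment
unset_result = missing_env_vars(required, {})
empty_result = missing_env_vars(required, {"AZURE_STORAGE_CONNECTION_STRING": ""})
set_result = missing_env_vars(required, {"AZURE_STORAGE_CONNECTION_STRING": "placeholder"})

print(unset_result)
print(empty_result)
print(set_result)
```

Calling missing_env_vars(required) without a second argument checks the real process environment.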
This function saves the transformation pipeline and a trained model object into the current working directory as a pickle file for later use.
# save model
save_model(iforest, 'my_first_model')
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=FastMemory(location=C:\Users\owner\AppData\Local\Temp\joblib), steps=[('numerical_imputer', TransformerWrapper(include=['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7', 'Col8', 'Col9', 'Col10'], transformer=SimpleImputer())), ('categorical_imputer', TransformerWrapper(include=[], transformer=SimpleImputer(strategy='most_frequent'))), ('bin_numeric_features', TransformerWrapper(include=['Col1'], transformer=KBinsDiscretizer(encode='ordinal', strategy='kmeans'))), ('trained_model', IForest(behaviour='new', bootstrap=False, contamination=0.05, max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1, random_state=123, verbose=0))]), 'my_first_model.pkl')
# load model
loaded_from_disk = load_model('my_first_model')
loaded_from_disk
Transformation Pipeline and Model Successfully Loaded
Pipeline(memory=FastMemory(location=C:\Users\owner\AppData\Local\Temp\joblib), steps=[('numerical_imputer', TransformerWrapper(include=['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7', 'Col8', 'Col9', 'Col10'], transformer=SimpleImputer())), ('categorical_imputer', TransformerWrapper(include=[], transformer=SimpleImputer(strategy='most_frequent'))), ('bin_numeric_features', TransformerWrapper(include=['Col1'], transformer=KBinsDiscretizer(encode='ordinal', strategy='kmeans'))), ('trained_model', IForest(behaviour='new', bootstrap=False, contamination=0.05, max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1, random_state=123, verbose=0))])
This function saves all the experiment variables to disk, allowing you to resume later without rerunning the setup function.
# save experiment
save_experiment('my_experiment')
# load experiment from disk
exp_from_disk = load_experiment('my_experiment', data=data)
| | Description | Value |
---|---|---|
0 | Session id | 123 |
1 | Original data shape | (1000, 10) |
2 | Transformed data shape | (1000, 10) |
3 | Numeric features | 10 |
4 | Preprocess | True |
5 | Imputation type | simple |
6 | Numeric imputation | mean |
7 | Categorical imputation | mode |
8 | CPU Jobs | -1 |
9 | Use GPU | False |
10 | Log Experiment | False |
11 | Experiment Name | anomaly-default-name |
12 | USI | 40f2 |