Created using: PyCaret 2.2
Date Updated: November 25, 2020
Welcome to the Anomaly Detection Tutorial (ANO101). This tutorial assumes that you are new to PyCaret and are looking to get started with anomaly detection using the pycaret.anomaly module.
In this tutorial we will walk through the anomaly detection workflow in PyCaret: getting the data, setting up the environment, creating a model, assigning anomaly labels, analyzing the results with plots, and making predictions on unseen data.
Read Time: Approx. 25 Minutes
The first step in getting started with PyCaret is to install it. Installation is easy and takes only a few minutes. Follow the instructions below:
pip install pycaret
If you are installing from within a notebook, prefix the command with an exclamation mark:
!pip install pycaret
If you are running this notebook on Google Colab, run the following code at the top of your notebook to display interactive visuals.
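A minimal sketch of that cell, using the enable_colab helper available in PyCaret 2.x (run it only when you are actually on Colab):
# enable interactive plot rendering inside Google Colab
from pycaret.utils import enable_colab
enable_colab()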
Anomaly detection is the task of identifying rare items, events or observations which raise suspicions by differing significantly from the majority of the data. Typically, the anomalous items translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text. Three broad categories of anomaly detection techniques exist: unsupervised, semi-supervised and supervised anomaly detection.
The pycaret.anomaly module supports both unsupervised and supervised anomaly detection techniques. In this tutorial we will only cover the unsupervised technique.
PyCaret's anomaly detection module (pycaret.anomaly) is an unsupervised machine learning module that performs the task of identifying rare items, events or observations which raise suspicions by differing significantly from the majority of the data.
The module provides several pre-processing features that can be configured when initializing the setup through the setup() function. It has over 12 algorithms and a few plots to analyze the results of anomaly detection. PyCaret's anomaly detection module also implements a unique function, tune_model(), that allows you to tune the hyperparameters of an anomaly detection model to optimize a supervised learning objective such as AUC for classification or R2 for regression.
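As a rough, hypothetical sketch of how that could look once setup() has been run (the exact signature varies between PyCaret versions, so please confirm the parameter names against the tune_model() docstring):
# hypothetical sketch: tune an Isolation Forest against a supervised target column;
# 'class' is assumed to be a label column available in the dataset passed to setup()
tuned_iforest = tune_model(model = 'iforest', supervised_target = 'class')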
For this tutorial we will use a dataset from UCI called Mice Protein Expression. The dataset consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. The dataset contains a total of 1080 measurements per protein. Each measurement can be considered as an independent sample/mouse. Click Here to read more about the dataset.
Clara Higuera Department of Software Engineering and Artificial Intelligence, Faculty of Informatics and the Department of Biochemistry and Molecular Biology, Faculty of Chemistry, University Complutense, Madrid, Spain. Email: clarahiguera@ucm.es
Katheleen J. Gardiner, creator and owner of the protein expression data, is currently with the Linda Crnic Institute for Down Syndrome, Department of Pediatrics, Department of Biochemistry and Molecular Genetics, Human Medical Genetics and Genomics, and Neuroscience Programs, University of Colorado, School of Medicine, Aurora, Colorado, USA. Email: katheleen.gardiner@ucdenver.edu
Krzysztof J. Cios is currently with the Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, USA, and IITiS Polish Academy of Sciences, Poland. Email: kcios@vcu.edu
The original dataset and data dictionary can be found here.
You can download the data from the original source found here and load it using pandas (Learn How), or you can use PyCaret's data repository to load it using the get_data() function (this requires an internet connection). A pandas loading sketch is also shown after the sample output below.
from pycaret.datasets import get_data
dataset = get_data('mice')
MouseID | DYRK1A_N | ITSN1_N | BDNF_N | NR1_N | NR2A_N | pAKT_N | pBRAF_N | pCAMKII_N | pCREB_N | ... | pCFOS_N | SYP_N | H3AcK18_N | EGR1_N | H3MeK4_N | CaNA_N | Genotype | Treatment | Behavior | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 309_1 | 0.503644 | 0.747193 | 0.430175 | 2.816329 | 5.990152 | 0.218830 | 0.177565 | 2.373744 | 0.232224 | ... | 0.108336 | 0.427099 | 0.114783 | 0.131790 | 0.128186 | 1.675652 | Control | Memantine | C/S | c-CS-m |
1 | 309_2 | 0.514617 | 0.689064 | 0.411770 | 2.789514 | 5.685038 | 0.211636 | 0.172817 | 2.292150 | 0.226972 | ... | 0.104315 | 0.441581 | 0.111974 | 0.135103 | 0.131119 | 1.743610 | Control | Memantine | C/S | c-CS-m |
2 | 309_3 | 0.509183 | 0.730247 | 0.418309 | 2.687201 | 5.622059 | 0.209011 | 0.175722 | 2.283337 | 0.230247 | ... | 0.106219 | 0.435777 | 0.111883 | 0.133362 | 0.127431 | 1.926427 | Control | Memantine | C/S | c-CS-m |
3 | 309_4 | 0.442107 | 0.617076 | 0.358626 | 2.466947 | 4.979503 | 0.222886 | 0.176463 | 2.152301 | 0.207004 | ... | 0.111262 | 0.391691 | 0.130405 | 0.147444 | 0.146901 | 1.700563 | Control | Memantine | C/S | c-CS-m |
4 | 309_5 | 0.434940 | 0.617430 | 0.358802 | 2.365785 | 4.718679 | 0.213106 | 0.173627 | 2.134014 | 0.192158 | ... | 0.110694 | 0.434154 | 0.118481 | 0.140314 | 0.148380 | 1.839730 | Control | Memantine | C/S | c-CS-m |
5 rows × 82 columns
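If you prefer to load the downloaded file yourself, a minimal pandas sketch is shown below; the file name is a placeholder for wherever you saved a CSV export of the UCI data:
# placeholder path: point this at your local copy of the Mice Protein Expression data
import pandas as pd
dataset = pd.read_csv('mice_protein_expression.csv')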
#check the shape of data
dataset.shape
(1080, 82)
In order to demonstrate the predict_model()
function on unseen data, a sample of 5% (54 samples) are taken out from original dataset to be used for predictions at the end of experiment. This should not be confused with train/test split. This particular split is performed to simulate real life scenario. Another way to think about this is that these 54 samples are not available at the time when this experiment was performed.
data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index)
data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (1026, 82)
Unseen Data For Predictions: (54, 82)
The setup() function initializes the environment in PyCaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in PyCaret. It takes only one mandatory parameter: a pandas DataFrame. All other parameters are optional and are used to customize the pre-processing pipeline (we will see them in later tutorials).
When setup() is executed, PyCaret's inference algorithm automatically infers the data types of all features based on certain properties. Although the data types are inferred correctly most of the time, this is not always the case. Therefore, after setup() is executed, PyCaret displays a table containing the features and their inferred data types. At this stage you can inspect the table and press enter to continue if all data types are correctly inferred, or type quit to end the experiment. Identifying data types correctly is of fundamental importance in PyCaret because it automatically performs several pre-processing tasks that are imperative for any machine learning experiment, and these tasks are performed differently for each data type. As such, it is very important that data types are correctly configured.
In later tutorials we will learn how to overwrite PyCaret's inferred data types using the numeric_features and categorical_features parameters in setup(). An illustrative sketch of those overrides is shown below, followed by the setup() call actually used in this experiment.
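For illustration only, a hedged sketch of what such overrides could look like on this dataset (the actual setup call used in this experiment follows in the next cell):
# illustrative only: explicitly declare a few column types instead of relying on inference
exp_illustration = setup(data,
                         categorical_features = ['Genotype', 'Treatment', 'Behavior'],
                         numeric_features = ['DYRK1A_N', 'ITSN1_N'],
                         ignore_features = ['MouseID'],
                         session_id = 123)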
from pycaret.anomaly import *
exp_ano101 = setup(data, normalize = True,
ignore_features = ['MouseID'],
session_id = 123)
 | Description | Value |
---|---|---|
0 | session_id | 123 |
1 | Original Data | (1026, 82) |
2 | Missing Values | True |
3 | Numeric Features | 77 |
4 | Categorical Features | 4 |
5 | Ordinal Features | False |
6 | High Cardinality Features | False |
7 | High Cardinality Method | None |
8 | Transformed Data | (1026, 91) |
9 | CPU Jobs | -1 |
10 | Use GPU | False |
11 | Log Experiment | False |
12 | Experiment Name | anomaly-default-name |
13 | USI | 85e2 |
14 | Imputation Type | simple |
15 | Iterative Imputation Iteration | None |
16 | Numeric Imputer | mean |
17 | Iterative Imputation Numeric Model | None |
18 | Categorical Imputer | mode |
19 | Iterative Imputation Categorical Model | None |
20 | Unknown Categoricals Handling | least_frequent |
21 | Normalize | True |
22 | Normalize Method | zscore |
23 | Transformation | False |
24 | Transformation Method | None |
25 | PCA | False |
26 | PCA Method | None |
27 | PCA Components | None |
28 | Ignore Low Variance | False |
29 | Combine Rare Levels | False |
30 | Rare Level Threshold | None |
31 | Numeric Binning | False |
32 | Remove Outliers | False |
33 | Outliers Threshold | None |
34 | Remove Multicollinearity | False |
35 | Multicollinearity Threshold | None |
36 | Clustering | False |
37 | Clustering Iteration | None |
38 | Polynomial Features | False |
39 | Polynomial Degree | None |
40 | Trignometry Features | False |
41 | Polynomial Threshold | None |
42 | Group Features | False |
43 | Feature Selection | False |
44 | Features Selection Threshold | None |
45 | Feature Interaction | False |
46 | Feature Ratio | False |
47 | Interaction Threshold | None |
Once the setup is successfully executed, it prints an information grid that contains some important information. Much of it relates to the pre-processing pipeline constructed when setup() is executed and is out of scope for this tutorial. However, a few important things to note at this stage are:
session_id: if no session_id is passed, a random number is automatically generated and distributed to all functions. In this experiment, session_id is set to 123 for later reproducibility.
Notice how tasks such as missing value imputation and categorical encoding, which are imperative for modeling, are handled automatically. Most of the other parameters in setup() are optional and are used to customize the pre-processing pipeline. They are out of scope for this tutorial, but as you progress to the intermediate and expert levels we will cover them in much more detail.
Creating an anomaly detection model in PyCaret is simple and similar to how you would create a model in PyCaret's supervised modules. An anomaly detection model is created using the create_model() function, which takes one mandatory parameter: the name of the model as a string. This function returns a trained model object. See the example below:
iforest = create_model('iforest')
print(iforest)
IForest(behaviour='new', bootstrap=False, contamination=0.05, max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1, random_state=123, verbose=0)
We have created an Isolation Forest model using create_model(). Notice that the contamination parameter is set to 0.05, which is the default value used when you do not pass the fraction parameter to create_model(). The fraction parameter determines the proportion of outliers in the dataset. In the example below, we will create a One Class Support Vector Machine model with a fraction of 0.025.
svm = create_model('svm', fraction = 0.025)
print(svm)
Just by replacing iforest with svm inside create_model(), we have now created an OCSVM anomaly detection model. There are 12 models available ready-to-use in the pycaret.anomaly module. To see the complete list, please check the docstring or use the models() function.
models()
ID | Name | Reference |
---|---|---|
abod | Angle-base Outlier Detection | pyod.models.abod.ABOD |
cluster | Clustering-Based Local Outlier | pyod.models.cblof.CBLOF |
cof | Connectivity-Based Local Outlier | pyod.models.cof.COF |
iforest | Isolation Forest | pyod.models.iforest.IForest |
histogram | Histogram-based Outlier Detection | pyod.models.hbos.HBOS |
knn | K-Nearest Neighbors Detector | pyod.models.knn.KNN |
lof | Local Outlier Factor | pyod.models.lof.LOF |
svm | One-class SVM detector | pyod.models.ocsvm.OCSVM |
pca | Principal Component Analysis | pyod.models.pca.PCA |
mcd | Minimum Covariance Determinant | pyod.models.mcd.MCD |
sod | Subspace Outlier Detection | pyod.models.sod.SOD |
sos | Stochastic Outlier Selection | pyod.models.sos.SOS |
Now that we have created a model, we would like to assign the anomaly labels to our dataset (the 1026 samples passed to setup()) so we can analyze the results. We will achieve this using the assign_model() function. See the example below:
iforest_results = assign_model(iforest)
iforest_results.head()
MouseID | DYRK1A_N | ITSN1_N | BDNF_N | NR1_N | NR2A_N | pAKT_N | pBRAF_N | pCAMKII_N | pCREB_N | ... | H3AcK18_N | EGR1_N | H3MeK4_N | CaNA_N | Genotype | Treatment | Behavior | class | Anomaly | Anomaly_Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3501_12 | 0.344930 | 0.626194 | 0.383583 | 2.534561 | 4.097317 | 0.303547 | 0.222829 | 4.592769 | 0.239427 | ... | 0.252700 | 0.218868 | 0.249187 | 1.139493 | Ts65Dn | Memantine | S/C | t-SC-m | 0 | -0.014462 |
1 | 3520_5 | 0.630001 | 0.839187 | 0.357777 | 2.651229 | 4.261675 | 0.253184 | 0.185257 | 3.816673 | 0.204940 | ... | 0.155008 | 0.153219 | NaN | 1.642886 | Control | Memantine | C/S | c-CS-m | 0 | -0.070193 |
2 | 3414_13 | 0.555122 | 0.726229 | 0.278319 | 2.097249 | 2.897553 | 0.222222 | 0.174356 | 1.867880 | 0.203379 | ... | 0.136109 | 0.155530 | 0.185484 | 1.657670 | Ts65Dn | Memantine | C/S | t-CS-m | 0 | -0.070143 |
3 | 3488_8 | 0.275849 | 0.430764 | 0.285166 | 2.265254 | 3.250091 | 0.189258 | 0.157837 | 2.917611 | 0.202594 | ... | 0.127944 | 0.207671 | 0.175357 | 0.893598 | Control | Saline | S/C | c-SC-s | 0 | -0.080521 |
4 | 3501_7 | 0.304788 | 0.617299 | 0.335164 | 2.638236 | 4.876609 | 0.280590 | 0.199417 | 4.835421 | 0.236314 | ... | 0.245277 | 0.202171 | 0.240372 | 0.795637 | Ts65Dn | Memantine | S/C | t-SC-m | 0 | -0.064749 |
5 rows × 84 columns
Notice that two columns, Anomaly and Anomaly_Score, are added towards the end. 0 stands for inliers and 1 for outliers/anomalies. Anomaly_Score is the value computed by the algorithm; outliers are assigned larger anomaly scores. Notice that iforest_results also includes the MouseID feature that we dropped during setup(). It wasn't used by the model and is only appended to the dataset when you use assign_model(). In the next section we will see how to analyze the results of anomaly detection using plot_model().
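As a quick way to inspect the flagged observations (not part of the original notebook, but based on the Anomaly column shown in the output above), you can filter the results dataframe:
# rows labeled as outliers by the Isolation Forest model
anomalies = iforest_results[iforest_results['Anomaly'] == 1]
print(anomalies.shape)
anomalies.head()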
The plot_model() function can be used to analyze the anomaly detection model from different aspects. This function takes a trained model object and returns a plot. See the examples below:
plot_model(iforest)
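Besides the default t-SNE plot produced above, plot_model() also accepts a plot argument; for example, a UMAP projection can be requested as sketched below (this assumes the optional umap-learn dependency is installed):
# UMAP projection of the anomaly-labeled data (requires the umap-learn package)
plot_model(iforest, plot = 'umap')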