This is a tutorial for anomaly detection using Orion. Orion is a Python package for time series anomaly detection. It provides a suite of both statistical and machine learning models that enable efficient anomaly detection.
In this tutorial, we will learn how to set up Orion, train a machine learning model, and perform anomaly detection. We will delve into each part separately and then run the evaluation pipeline from beginning to end in order to compare multiple models against each other.
# general imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from utils import plot, plot_ts, plot_rws, plot_error, unroll_ts
In part one of the series, we explore a time series dataset, specifically the NYC taxi data. You can find the raw data on the NYC TLC website, or the processed version maintained by Numenta. We also explore what could possibly be causing the anomalies it contains.
There is a collection of data already available in Orion. To load a dataset, we use the load_signal function and pass the name of the signal we wish to obtain. Similarly, since this data is labeled, we use the load_anomalies function to get the known anomalies of the signal.
from orion.data import load_signal, load_anomalies
signal = 'nyc_taxi'
# load signal
df = load_signal(signal)
# load ground truth anomalies
known_anomalies = load_anomalies(signal)
df.head(5)
| | timestamp | value |
|---|---|---|
| 0 | 1404165600 | 10844.0 |
| 1 | 1404167400 | 8127.0 |
| 2 | 1404169200 | 6210.0 |
| 3 | 1404171000 | 4656.0 |
| 4 | 1404172800 | 3820.0 |
plot(df, known_anomalies)
In part two of the series, we look at anomaly detection through time series reconstruction, particularly using a GAN model. We go through a sequence of transformations and data preparation, as well as model training and prediction.
We will use Orion to perform this sequence of actions, emphasizing the usage of the TadGAN model, a time series anomaly detection model based on GANs. The pipeline is specified in a json file named tadgan.json that accompanies this notebook. There are more pipelines defined within the repository, including ARIMA, LSTM, etc.
The Orion API is a simple interface that allows you to interact with an anomaly detection pipeline. To train the model on the data, we simply use the fit method; to perform anomaly detection, we use the detect method. In our case, we want to fit the data and then perform detection, so we use the fit_detect method. This might take some time to run. Once it's done, we can visualize the results.
Note: the model might take some time to train. For experimentation purposes, you can reduce the number of epochs in the tadgan.json file to cut down the number of training iterations.
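For reference, the epochs setting lives under the TadGAN entry of the pipeline's hyperparameters. A sketch of what that part of tadgan.json could look like; the exact key names depend on the MLBlocks pipeline format and version, so treat this as an assumption rather than the file's literal contents:
{
    "primitives": [
        "...",
        "orion.primitives.tadgan.TadGAN",
        "..."
    ],
    "hyperparameters": {
        "orion.primitives.tadgan.TadGAN#1": {
            "epochs": 5
        }
    }
}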
from orion import Orion
orion = Orion(
pipeline='tadgan.json'
)
anomalies = orion.fit_detect(df)
Epoch: 1/5, Losses: {'cx_loss': -0.8796, 'cz_loss': -2.0101, 'eg_loss': 4.329}
Epoch: 2/5, Losses: {'cx_loss': -1.6189, 'cz_loss': 0.2361, 'eg_loss': -2.1686}
Epoch: 3/5, Losses: {'cx_loss': -1.1455, 'cz_loss': 0.1359, 'eg_loss': -2.5223}
Epoch: 4/5, Losses: {'cx_loss': -1.0701, 'cz_loss': 0.5063, 'eg_loss': -3.8923}
Epoch: 5/5, Losses: {'cx_loss': -0.6953, 'cz_loss': 3.0434, 'eg_loss': -8.6544}
Let's visualize the results.
plot(df, [anomalies, known_anomalies])
anomalies.head(5)
| | start | end | severity |
|---|---|---|---|
| 0 | 1404165600 | 1404372600 | 0.984310 |
| 1 | 1422097200 | 1422496800 | 0.150504 |
The red intervals depict the detected anomalies, the green intervals show the ground truth. Cool! The model was able to detect some anomalies. It also flagged some intervals that are not included in the ground truth labels; still, it is clear that they deviate in shape from the rest of the signal. Note: the results might differ between runs.
We might have jumped straight to the results but let's trace back and look at what the model actually did.
There is a series of transformations applied to the data in order to obtain the result you have just seen, from data preprocessing and model training to post-processing. We specify these functions, which we refer to as primitives, within the model's .json file. What are these primitives? If we were to look at the tadgan.json pipeline, we find these sequential primitives:
"primitives": [
"mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate”,
"sklearn.impute.SimpleImputer",
"sklearn.preprocessing.MinMaxScaler",
"mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences",
"orion.primitives.tadgan.TadGAN",
"orion.primitives.tadgan.score_anomalies",
"orion.primitives.timeseries_anomalies.find_anomalies"
]
Each primitive is responsible for a single task. We describe the procedure of each primitive in the remainder of this notebook.
The first primitive, time_segments_aggregate, adjusts the signal spacing to be of equal width across all times. There are two important parameters in this process: the interval, which determines the width of each segment, and the aggregation method. In addition, we pass the array of values and indicate which column of it holds the time values.
def time_segments_aggregate(X, interval, time_column, method=['mean']):
"""Aggregate values over given time span.
Args:
X (ndarray or pandas.DataFrame):
N-dimensional sequence of values.
interval (int):
Integer denoting time span to compute aggregation of.
time_column (int):
Column of X that contains time values.
method (str or list):
Optional. String describing aggregation method or list of strings describing multiple
aggregation methods. If not given, `mean` is used.
Returns:
ndarray, ndarray:
* Sequence of aggregated values, one column for each aggregation method.
* Sequence of index values (first index of each aggregated segment).
"""
if isinstance(X, np.ndarray):
X = pd.DataFrame(X)
X = X.sort_values(time_column).set_index(time_column)
if isinstance(method, str):
method = [method]
start_ts = X.index.values[0]
max_ts = X.index.values[-1]
values = list()
index = list()
while start_ts <= max_ts:
end_ts = start_ts + interval
subset = X.loc[start_ts:end_ts - 1]
aggregated = [
getattr(subset, agg)(skipna=True).values
for agg in method
]
values.append(np.concatenate(aggregated))
index.append(start_ts)
start_ts = end_ts
return np.asarray(values), np.asarray(index)
X, index = time_segments_aggregate(df, interval=1800, time_column='timestamp')
If we go back to the source of the NYC taxi data, we find that it records a value every 30 minutes, which is equivalent to 1800 seconds; therefore we set the interval to 1800. We also opt for the default aggregation method, which takes the mean value of each interval.
Technically speaking, in our example the data is perfectly spaced, so we can skip this preprocessing step. However, that is not always the case and so we include it as a preprocessing primitive in the general pipeline as you will see later on.
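As a quick sanity check, here is the same primitive applied to a tiny made-up series; the timestamps and values below are hypothetical, chosen so that each 60-second bin holds exactly two points:
# dummy series: a reading every 30 seconds, aggregated into 60-second bins
toy = pd.DataFrame({
    'timestamp': [0, 30, 60, 90, 120, 150],
    'value': [1.0, 3.0, 5.0, 7.0, 9.0, 11.0],
})
toy_values, toy_index = time_segments_aggregate(toy, interval=60, time_column='timestamp')
print(toy_values.ravel())  # [ 2.  6. 10.] -- the mean of each bin
print(toy_index)           # [  0  60 120] -- the first timestamp of each bin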
We impute missing values that appear within the signal using scikit-learn's SimpleImputer, which fills missing values with the mean value by default.
imp = SimpleImputer()
X = imp.fit_transform(X)
We then normalize the data to a specific range, using scikit-learn's MinMaxScaler to scale the data to [-1, 1].
scaler = MinMaxScaler(feature_range=(-1, 1))
X = scaler.fit_transform(X)
Notice how the y-axis changed after normalizing the data to [-1, 1].
plot_ts(X)
To prepare the data, we need to transform it into sequences that are ingestible by the machine learning model. We take the signal we're interested in analyzing and generate training examples; these training examples are mere snapshots of the signal at different times.
To do that, we adopt the sliding window approach: we choose a window of a pre-specified width and a particular step size, then divide the signal into segments, similar to what is depicted in the illustration below.
We use the rolling_window_sequences function to slice the data into windows; each training example contains an input sequence of window_size values, a target sequence of target_size values, and the first index value of each.
def rolling_window_sequences(X, index, window_size, target_size, step_size, target_column,
drop=None, drop_windows=False):
"""Create rolling window sequences out of time series data.
The function creates an array of input sequences and an array of target sequences by rolling
over the input sequence with a specified window.
Optionally, certain values can be dropped from the sequences.
Args:
X (ndarray):
N-dimensional sequence to iterate over.
index (ndarray):
Array containing the index values of X.
window_size (int):
Length of the input sequences.
target_size (int):
Length of the target sequences.
step_size (int):
Indicating the number of steps to move the window forward each round.
target_column (int):
Indicating which column of X is the target.
drop (ndarray or None or str or float or bool):
Optional. Array of boolean values indicating which values of X are invalid, or value
indicating which value should be dropped. If not given, `None` is used.
drop_windows (bool):
Optional. Indicates whether the dropping functionality should be enabled. If not
given, `False` is used.
Returns:
ndarray, ndarray, ndarray, ndarray:
* input sequences.
* target sequences.
* first index value of each input sequence.
* first index value of each target sequence.
"""
out_X = list()
out_y = list()
X_index = list()
y_index = list()
target = X[:, target_column]
if drop_windows:
if hasattr(drop, '__len__') and (not isinstance(drop, str)):
if len(drop) != len(X):
raise Exception('Arrays `drop` and `X` must be of the same length.')
else:
if isinstance(drop, float) and np.isnan(drop):
drop = np.isnan(X)
else:
drop = X == drop
start = 0
max_start = len(X) - window_size - target_size + 1
while start < max_start:
end = start + window_size
if drop_windows:
drop_window = drop[start:end + target_size]
to_drop = np.where(drop_window)[0]
if to_drop.size:
start += to_drop[-1] + 1
continue
out_X.append(X[start:end])
out_y.append(target[end:end + target_size])
X_index.append(index[start])
y_index.append(index[end])
start = start + step_size
return np.asarray(out_X), np.asarray(out_y), np.asarray(X_index), np.asarray(y_index)
X, y, X_index, y_index = rolling_window_sequences(X, index,
window_size=100,
target_size=1,
step_size=1,
target_column=0)
print("Training data input shape: {}".format(X.shape))
print("Training data index shape: {}".format(X_index.shape))
print("Training y shape: {}".format(y.shape))
print("Training y index shape: {}".format(y_index.shape))
Training data input shape: (10222, 100, 1)
Training data index shape: (10222,)
Training y shape: (10222, 1)
Training y index shape: (10222,)
plot_rws(X)
Here, X represents the input used to train the model. In the previous example, we see X has 10222 training data points, and 100 represents the window size. On the other hand, y is the real signal after processing, which we will use later on to calculate the error between the reconstructed and the real signal.
The architecture of the model requires four neural networks:
- encoder: maps X to its latent representation Z.
- generator: maps the latent variable Z back to X, which we will denote later on as X_hat.
- criticX: discriminates between X and generator(Z), i.e. X_hat.
- criticZ: discriminates between Z and encoder(X).
We detail the composition of each network in model.py.
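For illustration only, the four networks could be sketched in Keras roughly as follows; the layer choices below are assumptions made for this example, not the actual layers defined in model.py:
import tensorflow as tf
from tensorflow.keras import layers, models

window_size, latent_dim = 100, 20

# encoder: maps a (100, 1) window X to a 20-dimensional latent vector Z
encoder = models.Sequential([
    layers.Bidirectional(layers.LSTM(64), input_shape=(window_size, 1)),
    layers.Dense(latent_dim),
])
# generator: maps Z back to a reconstructed (100, 1) window X_hat
generator = models.Sequential([
    layers.Dense(window_size, input_shape=(latent_dim,)),
    layers.Reshape((window_size, 1)),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(1)),
])
# criticX: scores whether a window looks like real data (X) or a reconstruction (X_hat)
critic_x = models.Sequential([
    layers.Conv1D(64, kernel_size=5, activation='relu', input_shape=(window_size, 1)),
    layers.Flatten(),
    layers.Dense(1),
])
# criticZ: scores whether a latent vector comes from the prior (Z) or from encoder(X)
critic_z = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(latent_dim,)),
    layers.Dense(1),
])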
To use the TadGAN model, we specify a number of parameters, including the model layers (the structure of the previously mentioned neural networks). We also specify the input dimensions, the number of epochs, the learning rate, etc. All the parameters are listed below.
from model import hyperparameters
from orion.primitives.tadgan import TadGAN
hyperparameters["epochs"] = 5
hyperparameters["input_shape"] = (100, 1) # based on the window size
hyperparameters["optimizer"] = "keras.optimizers.Adam"
hyperparameters["learning_rate"] = 0.0005
hyperparameters["latent_dim"] = 20
hyperparameters["batch_size"] = 64
tgan = TadGAN(**hyperparameters)
tgan.fit(X)
Epoch: 1/5, Losses: {'cx_loss': array([-1.1872, -4.4056, 2.5281, 0.069 ]), 'cz_loss': array([-2.5995, -1.6567, -2.3138, 0.1371]), 'eg_loss': array([ 2.1651, -2.5728, 3.1177, 0.162 ])}
Epoch: 2/5, Losses: {'cx_loss': array([ -1.2376, -12.4034, 10.9903, 0.0175]), 'cz_loss': array([-2.1497, -3.525 , 1.0386, 0.0337]), 'eg_loss': array([-10.828 , -11.0536, -0.9568, 0.1182])}
Epoch: 3/5, Losses: {'cx_loss': array([-0.8272, -9.2085, 8.2797, 0.0102]), 'cz_loss': array([-2.3846, -4.2531, 1.6279, 0.0241]), 'eg_loss': array([-9.0235, -8.2781, -1.5691, 0.0824])}
Epoch: 4/5, Losses: {'cx_loss': array([-0.5854, -9.1109, 8.4216, 0.0104]), 'cz_loss': array([-2.6476, -3.9248, 1.005 , 0.0272]), 'eg_loss': array([-8.5022, -8.2868, -0.9191, 0.0704])}
Epoch: 5/5, Losses: {'cx_loss': array([-4.8460e-01, -8.2309e+00, 7.6643e+00, 8.2000e-03]), 'cz_loss': array([-2.4598, -3.8494, 1.144 , 0.0246]), 'eg_loss': array([-8.1353, -7.6542, -1.0893, 0.0608])}
# reconstruct
X_hat, critic = tgan.predict(X)
# visualize X_hat
plot_rws(X_hat)
To reassemble or "unroll" the predicted signal X_hat, we can choose different aggregation methods (e.g., mean, max, etc.). In our implementation, we chose the median value.
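Before calling the provided utility, here is a minimal sketch of what a median-based unroll could look like (unroll_ts in utils.py may differ in its details). With step_size=1, window i covers time points i through i + window_size - 1, so each time point collects one candidate value from every window that overlaps it:
def unroll_median(X_hat, step_size=1):
    """Assemble overlapping reconstructed windows into a single signal
    by taking the median of all values predicted for each time point."""
    num_windows, window_size, _ = X_hat.shape
    length = (num_windows - 1) * step_size + window_size
    buckets = [[] for _ in range(length)]
    for i in range(num_windows):
        for t in range(window_size):
            buckets[i * step_size + t].append(X_hat[i, t, 0])
    return np.asarray([np.median(b) for b in buckets])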
# flatten the predicted windows
y_hat = unroll_ts(X_hat)
# plot the time series
plot_ts([y, y_hat], labels=['original', 'reconstructed'])
We can see that the GAN model did really well at reconstructing the signal. We also see what it expected the signal to look like, in comparison to what it actually is. The discrepancies between the two signals will be used to calculate the error: the higher the error, the more likely the point is anomalous.
# pair-wise error calculation
error = np.zeros(shape=y.shape)
length = y.shape[0]
for i in range(length):
error[i] = abs(y_hat[i] - y[i])
# visualize the error curve
fig = plt.figure(figsize=(30, 3))
plt.plot(error)
plt.show()
In the TadGAN pipeline, we use tadgan.score_anomalies to perform the error calculation for us. It is a smoothed error function that uses a window-based method to smooth the curve, then uses one of area difference, point difference, or DTW as the measure of discrepancy:
- Area difference: captures the general shape of the original and reconstructed signals and then compares them.
- Point difference: applies a point-to-point comparison between the original and reconstructed signals. It is considered a strict approach that does not allow for many mistakes.
- Dynamic Time Warping (DTW): a more lenient yet very effective method. It compares the two signals using any pair-wise distance measure, but allows one signal to lag behind the other.
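To make DTW concrete, here is a minimal dynamic-programming implementation of the DTW distance between two short sequences; it is for illustration only, as score_anomalies uses its own implementation internally:
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute difference
    as the pair-wise cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # step in a only
                                 cost[i, j - 1],      # step in b only
                                 cost[i - 1, j - 1])  # step in both
    return cost[n, m]

# a lagged copy of a signal stays close under DTW even though
# its point-wise error is much larger
a = np.sin(np.linspace(0, 6, 50))
b = np.roll(a, 3)
print(dtw_distance(a, b), np.abs(a - b).sum())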
from orion.primitives.tadgan import score_anomalies
error, true_index, true, pred = score_anomalies(X, X_hat, critic, X_index, rec_error_type="dtw", comb="mult")
pred = np.array(pred).mean(axis=2)
# visualize the error curve
plot_error([[true, pred], error])
Now we can visually see where the error reaches a substantially high value. But how should we decide whether an error value indicates a potential anomaly? We could use a fixed threshold that says: if error > 10, classify the datapoint as anomalous.
# threshold
thresh = 10

intervals = list()

i = 0
max_start = len(error)
while i < max_start:
    j = i
    start = index[i]
    # extend the interval while the error stays above the threshold
    while i < len(error) and error[i] > thresh:
        i += 1
    end = index[min(i, len(index) - 1)]  # guard against running past the end
    if start != end:
        intervals.append((start, end, np.mean(error[j: i + 1])))
    i += 1

intervals
[(1404541800, 1404592200, 10.302447289165059), (1404621000, 1404631800, 10.043972688121197), (1419429600, 1419652800, 18.762294512542237), (1422221400, 1422451800, 31.012344987856537)]
anomalies = pd.DataFrame(intervals, columns=['start', 'end', 'score'])
plot(df, [anomalies, known_anomalies])
While a fixed threshold raised some correct anomalies, it missed out on others. If we look back at the error plot, we notice that some deviations are abnormal within their local region. So how can we incorporate this information into our thresholding technique? We can use window-based methods to detect anomalies with respect to their context.
We first define the window of errors that we want to analyze. We then find the anomalous sequences in that window by looking at the mean and standard deviation of the errors within it. We store the start/stop index pairs that correspond to each sequence, along with its score. We then move the window and repeat the procedure. Lastly, we combine overlapping or consecutive sequences, as sketched below.
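The actual find_anomalies primitive is more elaborate (it also scores and prunes the sequences), but the core idea can be sketched as follows; window_frac, step_frac, and k are illustrative parameter names, not Orion's:
def window_threshold(error, index, window_frac=0.33, step_frac=0.1, k=4):
    """Flag points whose error exceeds mean + k * std of their local window,
    then merge consecutive flagged points into (start, end) intervals."""
    error = np.asarray(error)
    window = max(1, int(len(error) * window_frac))
    step = max(1, int(len(error) * step_frac))
    flagged = np.zeros(len(error), dtype=bool)
    for s in range(0, len(error), step):
        chunk = error[s:s + window]
        flagged[s:s + window] |= chunk > chunk.mean() + k * chunk.std()
    # merge consecutive flagged points into intervals
    intervals, i = [], 0
    while i < len(flagged):
        if flagged[i]:
            j = i
            while j + 1 < len(flagged) and flagged[j + 1]:
                j += 1
            intervals.append((index[i], index[j]))
            i = j + 1
        else:
            i += 1
    return intervals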
from orion.primitives.timeseries_anomalies import find_anomalies
# find anomalies
intervals = find_anomalies(error, index,
window_size_portion=0.33,
window_step_size_portion=0.1,
fixed_threshold=True)
intervals
array([[1.40441940e+09, 1.40474700e+09, 4.80894965e-01],
       [1.40943600e+09, 1.40968620e+09, 3.01127507e-01],
       [1.41471900e+09, 1.41497820e+09, 2.81347902e-01],
       [1.41697800e+09, 1.41726420e+09, 6.90210631e-01],
       [1.41933960e+09, 1.41969600e+09, 1.51006912e+00],
       [1.42218720e+09, 1.42248420e+09, 1.56241660e+00]])
# visualize the result
anomalies = pd.DataFrame(intervals, columns=['start', 'end', 'score'])
plot(df, [anomalies, known_anomalies])
Cool! We now obtain the same result we saw previously. The red intervals depict the detected anomalies, and the green intervals show the ground truth. We also see that the pipeline detected some intervals that are not included in the ground truth labels.
Using the Orion API and pipelines, we simplified this process while allowing flexibility in pipeline configuration.
To configure a pipeline, we adjust the parameters of the primitive of interest within the pipeline.json file, or directly by passing a dictionary to the API.
In the following example, I changed the aggregation level as well as the number of epochs for training. These changes override the parameters specified in the .json file. To learn more about the API usage and primitive designs, please refer to the documentation.
from orion import Orion
hyperparameters = {
"mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate#1": {
"interval": 3600 # hour level
},
'orion.primitives.tadgan.TadGAN#1': {
'epochs': 5,
}
}
orion = Orion(
'tadgan.json',
hyperparameters
)
anomalies = orion.fit_detect(df)
Epoch: 1/5, Losses: {'cx_loss': -0.7711, 'cz_loss': 4.9931, 'eg_loss': -2.9485}
Epoch: 2/5, Losses: {'cx_loss': -3.3203, 'cz_loss': -27.3137, 'eg_loss': 37.1111}
Epoch: 3/5, Losses: {'cx_loss': -6.424, 'cz_loss': -12.0811, 'eg_loss': 7.5079}
Epoch: 4/5, Losses: {'cx_loss': -9.01, 'cz_loss': 1.2478, 'eg_loss': -39.4212}
Epoch: 5/5, Losses: {'cx_loss': -2.7106, 'cz_loss': 2.0483, 'eg_loss': -52.451}
plot(df, [anomalies, known_anomalies])
The anomalies detected in this run are a bit different from the earlier example, although it still succeeded in detecting anomalies. Maybe a 1-hour aggregate is not the appropriate value? Maybe we did not train the model enough times, or maybe too many times... How can we tell? One way is to look at the output of the model, as we have done previously.
You can use the visualization parameter of detect to return the intermediate outputs (primitive outputs) that we are interested in. For example, the tadgan.json file uses visualization to return the following variables:
- X: the output of the preprocessing steps from averaging, imputing, and scaling. These steps were showcased previously as steps (A, B, and C).
- X_hat: the "predicted" output of the TadGAN model without any processing. It represents the reconstructed window at each time point.
- es: the error calculated by capturing the discrepancies between the original and reconstructed signal.
We then use anomalies, viz = orion.detect(df, visualization=True), where viz will be a dictionary of these intermediate outputs.
Note: we will talk more about how to evaluate the detected anomalies with respect to the ground truth in part 3 of the tutorial.
In part three of the series, we look at evaluating anomaly detection pipelines end-to-end.
We compare the anomalies given to us as ground truth labels against the detected anomalies. But first, we look at the mechanisms we have for evaluation, namely the weighted segment and overlap segment approaches.
We will look at both approaches, but first let's construct a dummy dataset.
Let's assume that the signal starts at timestamp 1 and ends at timestamp 20. We can then see that the ground truth contains three anomalies, namely (5, 8), (12, 13), and (17, 18), where (i, j) expresses the starting timestamp i and ending timestamp j.
We can also see that we detected two anomalies, namely (5, 8) and (12, 15). So how can we compare both sets?
import numpy as np
# to reproduce the same dummy signal
np.random.seed(0)
# dummy data
start, end = (1, 20)
signal = np.random.rand(end - start, 1)
ground_truth = [
(5, 8),
(12, 13),
(17, 18)
]
anomalies = [
(5, 8),
(12, 15)
]
import matplotlib.pyplot as plt
time = range(start, end)
plt.plot(time, signal)
# ground truth
for t1, t2 in ground_truth:
    plt.axvspan(t1, t2 + 1, color="g", alpha=0.2, label="ground_truth")
# detected
for t1, t2 in anomalies:
    plt.axvspan(t1, t2 + 1, color="r", alpha=0.2, label="detected")
plt.title("Example")
plt.xlabel("Time")
plt.ylabel("value")
plt.show()
There are two approaches for comparing anomaly sets, as expressed earlier.
(1) Weighted Segment: a stricter method that is valuable when you want to give equal importance to detecting anomalies and normal instances.
Visually, this operation is summarized by the illustration below.
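A toy re-implementation of the idea (not Orion's actual code): expand both interval sets into one binary label per timestamp, then compare the two label vectors point by point:
def intervals_to_labels(intervals, start, end):
    """1 where a timestamp falls inside any interval (inclusive), else 0."""
    labels = np.zeros(end - start, dtype=int)
    for t1, t2 in intervals:
        labels[t1 - start:t2 - start + 1] = 1
    return labels

truth_labels = intervals_to_labels(ground_truth, start, end)
pred_labels = intervals_to_labels(anomalies, start, end)

# point-wise agreement over all 19 timestamps: 15 of 19 match
print("accuracy:", (truth_labels == pred_labels).mean())  # 0.789

# aliased to avoid clashing with the f1_score variable defined below
from sklearn.metrics import f1_score as sk_f1
print("f1:", sk_f1(truth_labels, pred_labels))  # 0.75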
We can use the orion.evaluation subpackage to compute multiple metrics using the weighted segment approach. For example, to compute the accuracy, we use contextual_accuracy(..., weighted=True). There are other metrics available; for reference, check out the orion.evaluation documentation.
from orion.evaluation.contextual import contextual_accuracy, contextual_f1_score
accuracy = contextual_accuracy(ground_truth, anomalies, start=start, end=end)
f1_score = contextual_f1_score(ground_truth, anomalies, start=start, end=end)
print("Accuracy score = {:0.3f}".format(accuracy))
print("F1 score = {:0.3f}".format(f1_score))
Accuracy score = 0.789
F1 score = 0.750
(2) Overlap Segment: a more lenient approach to evaluation. It rewards the system if it manages to alert the user to even a subset of an anomaly. More specifically, it records:
- TP, if a ground truth segment overlaps with a detected segment.
- FN, if a ground truth segment does not overlap any detected segments.
- FP, if a detected segment does not overlap any labeled anomalous region.
This can be summarized by the illustration below.
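As a toy sketch (again, not Orion's actual code), we count a TP for every ground truth segment that overlaps some detected segment, an FN for every ground truth segment that overlaps none, and an FP for every detected segment that overlaps no ground truth segment:
def overlaps(a, b):
    """True if intervals a = (s1, e1) and b = (s2, e2) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

tp = sum(any(overlaps(t, d) for d in anomalies) for t in ground_truth)
fn = len(ground_truth) - tp
fp = sum(not any(overlaps(d, t) for t in ground_truth) for d in anomalies)

precision = tp / (tp + fp)  # 2 / 2 = 1.0
recall = tp / (tp + fn)     # 2 / 3
print("F1:", 2 * precision * recall / (precision + recall))  # 0.8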
Similarly, we can use the same metric functions, but this time with the parameter weighted=False. Note: the overlap segment approach does not account for true negatives. The reason is that anomalies in time series data are rare, so "normal" instances would skew the value of the computed metric; therefore, with this approach we cannot compute metrics such as accuracy.
f1_score = contextual_f1_score(ground_truth, anomalies, start=start, end=end, weighted=False)
print("F1 score = {:0.3f}".format(f1_score))
F1 score = 0.800
We integrate the evaluation suite into the Orion API, so that you can evaluate a pipeline on a dataset (with its labels) end-to-end.
Following the introduction to the Orion API in part 2, we can create an orion instance and then use its evaluate method. The method accepts the following arguments:
- data: a pandas.DataFrame containing two columns: timestamp and value.
- truth: a pandas.DataFrame containing two columns: the start and end timestamps of the ground truth labels.
- fit: a flag denoting whether to train the pipeline before evaluating it.
- train_data: a pandas.DataFrame containing two columns: timestamp and value, used to train the pipeline; if not given, the pipeline is trained on data.
- metrics: a list of metrics used to evaluate the pipeline.
In the previous part we went through how to train a pipeline and use it for anomaly detection; the focus now is on defining metrics and evaluating the performance of the pipeline.
metrics is a list of function names, each of which compares the ground truth labels against the detected labels and returns a metric value. We have seen some functions of that sort, such as contextual_accuracy and contextual_f1_score. To construct our metrics list, we select some of the metrics predefined in Orion, such as f1, recall, and precision.
By default, we use the weighted segment approach; you can override the defined metrics by specifying a new metrics dictionary.
from orion.data import load_signal, load_anomalies
metrics = [
'f1',
'recall',
'precision',
]
signal = 'nyc_taxi'
# load signal
df = load_signal(signal)
# load ground truth anomalies
ground_truth = load_anomalies(signal)
scores = orion.evaluate(df, ground_truth, metrics=metrics)
scores
f1           0.238148
recall       0.209709
precision    0.275511
dtype: float64