Using Isolation Forests as kernels for SVM

This is a short example about using the isotree library for fitting an isolation forest model and using this (unsupervised) fitted model for calculating the similarities between each pair of observations, which in turn can be used as a kernel for SVM (support vector machines) for supervised learning tasks.

By default, the library calculates a distance metric between observations which is bounded between zero and one. Having these bounds, it can be easily turned into a similarity metric by simply calculating one minus this distance. This similarity metric satisfies the properties of a Hilbert space (, being possible to use it as a kernel for support vector machines or as a feature generator.

The library includes a function set_reference_points which can be used for repeated distance calculations against the same points. Note that this is however a typically very slow and memory-heavy operation, and as such is not recommended for large datasets.

The example uses the "splice scale" dataset, which can be downloaded from LibSVM dataset. Note that, despite being offered in a sparse matrix format, the data is actually dense.

Reading the data

In [1]:
from readsparse import read_sparse

splice = read_sparse("splice_scale.txt")
y = (splice["y"] == 1).astype("float64")
X = splice["X"].toarray()
(1000, 60)

Defining a kernel transformer from isolation forest

In [2]:
from isotree import IsolationForest
from sklearn.base import TransformerMixin, BaseEstimator

class IsoDistKernel(TransformerMixin, BaseEstimator):
    def __init__(self, isotree_params: dict = {}):
        self.isotree_params = isotree_params
    def fit(self, X, y=None, sample_weights=None):
        self.iso_ = IsolationForest(**self.isotree_params), with_distances=True)
        return self
    def transform(self, X):
        D = self.iso_.predict_distance(X, use_reference_points=True)
        return 1 - D

Evaluating results with this kernel

Note that while typically most SVM libraries manually calculate a set of predefined kernels, software such as LibSVM (and by extension scikit-learn which uses it behind the hood) or ThunderSVM allow passing a precomputed kernel as data instead of the original points, which is a square matrix with dimension equal to the number of observations.

The results here are evaluated by a randomized and stratified 5-fold cross-validation.

In [3]:
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from pprint import pprint

model = make_pipeline(
cv_res_iso = cross_validate(
    X=X, y=y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
print("Cross-validation results (distance isolation kernel):")
print("Mean CV AUROC: %.4f" % cv_res_iso["test_score"].mean())
Cross-validation results (distance isolation kernel):
{'fit_time': array([0.11507702, 0.10557008, 0.11134267, 0.1177721 , 0.0980866 ]),
 'score_time': array([0.02970052, 0.0232563 , 0.02420211, 0.02198601, 0.03312016]),
 'test_score': array([0.96694712, 0.97225561, 0.96546892, 0.98168352, 0.96957262])}
Mean CV AUROC: 0.9712

A natural question is how good was the addition of this kernel compared to something simpler. As will be seen, results are better with the isolation kernel than with the default Gaussian RBF kernel used by this library:

In [4]:
### Compare against a simpler kernel
cv_res_plain_kernel = cross_validate(
    X=X, y=y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
print("Cross-validation results (default RBF kernel):")
print("Mean CV AUROC: %.4f" % cv_res_plain_kernel["test_score"].mean())
Cross-validation results (default RBF kernel):
{'fit_time': array([0.03547263, 0.0420773 , 0.04024625, 0.03484583, 0.03834677]),
 'score_time': array([0.01111746, 0.01246071, 0.01298881, 0.0111289 , 0.01078558]),
 'test_score': array([0.95633013, 0.94991987, 0.92773496, 0.96226604, 0.91942748])}
Mean CV AUROC: 0.9431

More efficient calculation for fitted model

While the input to the model is a kernel matrix with number of columns corresponding to the number of observations to which it was fitted, in practice SVM models only end up using a fraction of the total observations in their prediction formula (these are the so-called "support vectors").

As such, once one has a fitted model and wants to make predictions on new data, it is not necessary (nor beneficial) to calculate distances from the new observations to every single point that was in the training data - only distances with respect to support vectors are needed.

The software used here (scikit-learn) unfortunately does not have any option for automatically telling the model methods to "shrink" their input requirements, but it is nevertheless easy to re-create the formula manually:

In [5]:
iso = IsolationForest().fit(X).set_reference_points(X, with_distances=True)
K = 1 - iso.predict_distance(X, use_reference_points=True)
svm = SVC(kernel="precomputed").fit(K, y)
p_auto = svm.decision_function(K[:10]).reshape(-1)
print("Prediction from automated call to 'decision_function':")
Prediction from automated call to 'decision_function':
[ 1.84911618  1.247814    1.47801342 -0.41053304 -0.94761292  0.44522017
  0.31118329  1.50615986 -0.13875399  0.72681926]
In [6]:
idx_used = svm.support_
print("Number of reference points picked: %d" % idx_used.shape[0])
iso.set_reference_points(X[idx_used], with_distances=True)
K_used = 1. - iso.predict_distance(X[:10], use_reference_points=True)
p_manual = + svm.intercept_[0]
print("Prediction from manual formula using only selected reference points:")
Number of reference points picked: 559
Prediction from manual formula using only selected reference points:
[ 1.84911618  1.247814    1.47801342 -0.41053304 -0.94761292  0.44522017
  0.31118329  1.50615986 -0.13875399  0.72681926]

Sub-sampled kernel

While SVM models typically involve efficient optimization routines for square kernel matrices which end up identifying the best reference points (support vectors) to use in the final prediction formula, it is also possible to use the trick with a plain generalized linear model such as logistic regression by instead supplying features that are the kernels with respect to randomly-sampled points within the data.

This is faster to calculate as a kernel, but typically the results are not as good quality as when using a full square matrix, since the support vectors are randomly-chosen.

In [7]:
from sklearn.linear_model import LogisticRegression
import numpy as np

class IsoSubSampledDistKernel(TransformerMixin, BaseEstimator):
    def __init__(self, isotree_params: dict = {}, n_samples=250, random_state=None):
        self.isotree_params = isotree_params
        self.n_samples = n_samples
        self.random_state = random_state
    def fit(self, X, y=None, sample_weights=None):
        self.iso_ = IsolationForest(**self.isotree_params)
        rng = np.random.default_rng(seed=self.random_state)
        idx_random = rng.choice(X.shape[0], size=self.n_samples)
        self.iso_.set_reference_points(X[idx_random], with_distances=True)
        return self
    def transform(self, X):
        D = self.iso_.predict_distance(X, use_reference_points=True)
        return 1 - D

model_subsampled = make_pipeline(
    n_samples = 250, random_state=456),
    LogisticRegression(solver="lbfgs", max_iter=10000)
cv_res_iso_subsampled = cross_validate(
    X=X, y=y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
print("Cross-validation results (randomly sub-sampled distance isolation kernel):")
print("Mean CV AUROC: %.4f" % cv_res_iso_subsampled["test_score"].mean())
Cross-validation results (randomly sub-sampled distance isolation kernel):
{'fit_time': array([0.06880903, 0.10956693, 0.10929084, 0.12208724, 0.17555356]),
 'score_time': array([0.01918626, 0.0191586 , 0.01897311, 0.02019763, 0.03826642]),
 'test_score': array([0.96604567, 0.92938702, 0.94334901, 0.96416775, 0.93063757])}
Mean CV AUROC: 0.9467