scikit-learn has a range of algorithms based on distances, including classifiers, regressors and clusterers. These can generally all be used with aeon distances in two ways:
1. Pass an aeon distance function as a callable to the metric or kernel parameter in the constructor, or
2. Set the metric or kernel parameter to "precomputed" in the constructor and pass a pairwise distance matrix to fit and predict.
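As a minimal preview of both options (a sketch only; the estimators are constructed but not fitted here, and full worked examples follow below):
from sklearn.neighbors import KNeighborsClassifier

from aeon.distances import dtw_distance

# Option 1: pass an aeon distance function as a callable metric
knn_callable = KNeighborsClassifier(metric=dtw_distance)
# Option 2: declare the metric as precomputed; fit and predict then take
# pairwise distance matrices rather than raw data
knn_precomputed = KNeighborsClassifier(metric="precomputed")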
from sklearn.neighbors import (
KNeighborsClassifier,
KNeighborsRegressor,
KNeighborsTransformer,
NearestCentroid,
)
If we have a univariate problem stored as a 2D numpy array of shape (n_cases, n_timepoints), we can apply these estimators directly, although they treat the data as vectors rather than time series. If we try to use them with an aeon style 3D numpy array of shape (n_cases, n_channels, n_timepoints), they will crash. See data_formats for details on data storage.
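To make the shapes concrete, here is a small sketch on a randomly generated array (not a real dataset): a univariate aeon-style 3D array can be flattened to the 2D shape sklearn expects by dropping the channel axis.
import numpy as np

X3d = np.random.random((10, 1, 50))  # aeon style (n_cases, n_channels, n_timepoints)
X2d = X3d.squeeze(axis=1)  # (n_cases, n_timepoints), usable by sklearn estimators
print(X3d.shape, "->", X2d.shape)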
from aeon.datasets import load_gunpoint
trainx1, trainy1 = load_gunpoint(split="TRAIN", return_type="numpy2D")
testx1, testy1 = load_gunpoint(split="TEST", return_type="numpy2D")
# Use directly on TSC data with standard scikit distances
knn = KNeighborsClassifier(metric="manhattan")
knn.fit(trainx1, trainy1)
print(
    "KNN with manhattan distance on 2D time series data first five ",
    knn.predict(testx1)[:5],
)
trainx2, trainy2 = load_gunpoint(split="TRAIN")
testx2, testy2 = load_gunpoint(split="TEST")
print("Shape of train = ", trainx2.shape, "sklearn will crash")
try:
    knn.fit(trainx2, trainy2)
except ValueError:
    print(
        "raises a ValueError: Found array with dim 3. "
        "KNeighborsClassifier expected <= 2."
    )
KNN with manhattan distance on 2D time series data first five  ['1' '2' '2' '1' '1']
Shape of train =  (50, 1, 150) sklearn will crash
raises a ValueError: Found array with dim 3. KNeighborsClassifier expected <= 2.
We can use KNeighborsClassifier with an aeon distance function as a callable metric, but the input must still be a 2D numpy array. We can also use aeon distances as callables in other sklearn.neighbors estimators.
from aeon.distances import (
dtw_distance,
edr_distance,
erp_distance,
msm_distance,
twe_distance,
)
# Use directly on TSC data with aeon distance function
knn = KNeighborsClassifier(metric=dtw_distance)
knn.fit(trainx1, trainy1)
print(
"KNN with DTW on 2D time series data first five predictions= ",
knn.predict(testx1)[:5],
)
try:
    knn.fit(trainx2, trainy2)
except ValueError:
    print(
        "raises a ValueError: Found array with dim 3. "
        "KNeighborsClassifier expected <= 2."
    )
nc = NearestCentroid(metric=erp_distance)
nc.fit(trainx1, trainy1)
print(
"nc with ERP on 2D time series data first five predictions= ",
nc.predict(testx1)[:5],
)
kt = KNeighborsTransformer(metric=edr_distance)
kt.fit(trainx1, trainy1)
print(
    "kt with EDR on 2D time series data transform shape = ", kt.transform(testx1).shape
)
KNN with DTW on 2D time series data first five predictions=  ['1' '2' '2' '1' '2']
raises a ValueError: Found array with dim 3. KNeighborsClassifier expected <= 2.
C:\Code\aeon\venv\lib\site-packages\sklearn\neighbors\_nearest_centroid.py:179: UserWarning: Averaging for metrics other than euclidean and manhattan not supported. The average is set to be the mean.
  warnings.warn(
nc with ERP on 2D time series data first five predictions=  ['1' '1' '2' '1' '1']
kt with EDR on 2D time series data transform shape =  (150, 50)
Also note that a callable metric will not work with some KNN algorithm options such as kd_tree, which raises the error "kd_tree does not support callable metric" (a sketch of this is shown below). Because of these problems, we have implemented a KNN classifier and regressor in aeon to use with our distance functions.
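For example, a minimal sketch of the kd_tree restriction, reusing trainx1, trainy1 and dtw_distance from above (the exact error message may vary between scikit-learn versions):
knn_kd = KNeighborsClassifier(metric=dtw_distance, algorithm="kd_tree")
try:
    knn_kd.fit(trainx1, trainy1)
except ValueError as e:
    print("kd_tree with a callable metric raises:", e)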
from aeon.classification.distance_based import KNeighborsTimeSeriesClassifier
from aeon.datasets import load_covid_3month # Regression problem
from aeon.regression.distance_based import KNeighborsTimeSeriesRegressor
knn1 = KNeighborsTimeSeriesClassifier(distance="msm")
knn1.fit(trainx2, trainy2)
print(
    "KNN classification with MSM 3D time series data first five predictions=",
    knn1.predict(testx2)[:5],
)
trainx3, trainy3 = load_covid_3month(split="train")
testx3, testy3 = load_covid_3month(split="test")
knn2 = KNeighborsTimeSeriesRegressor(distance="twe", n_neighbors=1)
knn2.fit(trainx3, trainy3)
print(
"aeon KNN regression with TWE first five predictions=",
knn2.predict(testx3)[:5],
)
knr = KNeighborsRegressor(metric=twe_distance, n_neighbors=1)
trainx4 = trainx3.squeeze()
testx4 = testx3.squeeze()
knr.fit(trainx4, trainy3)
print(
"sklearn KNN regression with TWE first five predictions=",
knr.predict(testx4)[:5],
)
KNN classification with MSM 3D time series data first five predictions= ['1' '2' '2' '1' '1']
aeon KNN regression with TWE first five predictions= [0.02558824 0.00877193 0.01960784 0.03533314 0.00805611]
sklearn KNN regression with TWE first five predictions= [0.02558824 0.00877193 0.01960784 0.03533314 0.00805611]
Another alternative is to set the metric to "precomputed" and pass in precomputed distance matrices. Note that this requires the calculation of the full pairwise distance matrices, and it still cannot be used with the tree-based algorithm options.
from aeon.distances import euclidean_pairwise_distance
train_dists = euclidean_pairwise_distance(trainx2)
test_dists = euclidean_pairwise_distance(testx2, trainx2)
knn = KNeighborsClassifier(metric="precomputed")
knn.fit(train_dists, trainy2)
print("KNN precomputed=", knn.predict(test_dists)[:5])
KNN precomputed= ['1' '2' '2' '1' '1']
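The same precomputed pattern works for regression. A sketch using the covid_3month arrays loaded above; the choice of msm_pairwise_distance here is arbitrary:
from sklearn.neighbors import KNeighborsRegressor

from aeon.distances import msm_pairwise_distance

train_d = msm_pairwise_distance(trainx3)
test_d = msm_pairwise_distance(testx3, trainx3)
knr_pre = KNeighborsRegressor(metric="precomputed", n_neighbors=1)
knr_pre.fit(train_d, trainy3)
print("KNN regression precomputed=", knr_pre.predict(test_d)[:5])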
from sklearn.svm import SVC, SVR, NuSVC, NuSVR
The SVM estimators in scikit-learn can be used with precomputed pairwise distance matrices. Note that not all elastic distance functions are valid kernels, although ideally they would be for use with an SVM: DTW is not even a metric (it does not satisfy the triangle inequality), whereas MSM and TWE are metrics.
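As a quick illustrative check of the metric property, here is a sketch on three randomly generated series, reusing dtw_distance and msm_distance from above (whether the triangle inequality holds for DTW depends on the particular series drawn):
import numpy as np

rng = np.random.default_rng(0)
a, b, c = rng.random((3, 1, 20))  # three random univariate series
# a metric must satisfy the triangle inequality d(a, c) <= d(a, b) + d(b, c)
for name, dist in [("DTW", dtw_distance), ("MSM", msm_distance)]:
    lhs = dist(a, c)
    rhs = dist(a, b) + dist(b, c)
    print(f"{name}: d(a,c)={lhs:.3f}, d(a,b)+d(b,c)={rhs:.3f}, holds: {lhs <= rhs}")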
from aeon.distances import (
    dtw_pairwise_distance,
    msm_pairwise_distance,
    twe_pairwise_distance,
)
svc = SVC(kernel="precomputed")
svr = SVR(kernel="precomputed")
nsvc = NuSVC(kernel="precomputed")
# a callable kernel must return the full matrix for a pair of collections,
# so the pairwise function is used rather than the single-pair distance
nsvr = NuSVR(kernel=twe_pairwise_distance)
train_m1 = twe_pairwise_distance(trainx1)
test_m1 = twe_pairwise_distance(testx1, trainx1)
svc.fit(train_m1, trainy1)
print("SVC with TWE first five predictions= ", svc.predict(test_m1)[:5])
train_m2 = msm_pairwise_distance(trainx2)
test_m2 = msm_pairwise_distance(testx2, trainx2)
nsvc.fit(train_m2, trainy2)
print("NuSVC with MSM first five predictions= ", nsvc.predict(test_m2)[:5])
train_m3 = dtw_pairwise_distance(trainx3)
test_m3 = dtw_pairwise_distance(testx3, trainx3)
svr.fit(train_m3, trainy3)
print("SVR with DTW first five predictions= ", svr.predict(test_m3)[:5])
SVC with TWE first five predictions=  ['1' '2' '1' '2' '2']
NuSVC with MSM first five predictions=  ['1' '2' '2' '1' '2']
SVR with DTW first five predictions=  [0.08823529 0.08823529 0.08823529 0.08823529 0.08823529]
from sklearn.cluster import DBSCAN
Some sklearn clustering algorithms accept callable distances or precomputed distance matrices, and these can be used with aeon distance functions.
Note that DBSCAN can only assign cluster labels to the training data, so it has no predict function, only fit_predict.
db1 = DBSCAN(eps=2.5)
preds1 = db1.fit_predict(trainx1)
print(preds1[:5])
db2 = DBSCAN(metric=msm_distance, eps=2.5)
db3 = DBSCAN(metric="precomputed", eps=2.5)
preds2 = db2.fit_predict(trainx1)
print(preds2[:5])
preds3 = db3.fit_predict(train_m2)
print(preds3[:5])
[-1 0 0 0 0]
[-1 0 0 0 0]
[-1 0 0 0 0]
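Other clusterers that accept a precomputed distance matrix can be used in the same way. For example, a sketch with AgglomerativeClustering, reusing the MSM distance matrix train_m2 from above (the parameter is named metric in recent scikit-learn releases and affinity in older ones):
from sklearn.cluster import AgglomerativeClustering

# linkage cannot be "ward" when the distances are precomputed
agg = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="average")
agg_labels = agg.fit_predict(train_m2)
print(agg_labels[:5])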
You can also use pairwise distance functions within the scikit-learn FunctionTransformer wrapper.
from sklearn.preprocessing import FunctionTransformer
from aeon.datasets import make_example_3d_numpy
from aeon.distances import msm_distance, msm_pairwise_distance
X = make_example_3d_numpy(n_cases=5, n_channels=1, n_timepoints=10)
ft = FunctionTransformer(msm_pairwise_distance)
X2 = ft.transform(X)
print(" Shape = ", X2.shape)
d = msm_distance(X[0], X[1])
print(f"These should be the same {d} and {X2[0][1]}")
Shape =  (5, 5)
These should be the same 10.479268600222673 and 10.479268600222673
This makes it easy to use distances as features in a scikit-learn pipeline. Whether it is a good idea to do this is a separate question!
from sklearn.pipeline import Pipeline
X, y = make_example_3d_numpy(n_cases=5, n_channels=1, n_timepoints=10, return_y=True)
X2 = make_example_3d_numpy(n_cases=5, n_channels=1, n_timepoints=10)
pipe = Pipeline(steps=[("FunctionTransformer", ft), ("SVM", SVC())])
pipe.fit(X, y)
pipe.predict(X2)
array([1, 1, 1, 1, 1])