#!/usr/bin/env python
# coding: utf-8

# `KDD2024 Tutorial / A Hands-On Introduction to Time Series Classification and Regression`
#
# # Feature-based Time Series Machine Learning in `aeon`
#
# Feature-based classifiers and regressors are a popular theme in time series classification (TSC) and regression (TSER). The feature-based learners we provide are simply pipelines of a transform and a classifier/regressor. They extract descriptive statistics as features from time series to be used in a base estimator. Several toolkits exist for extracting features from time series data; in the first half of this notebook we will introduce a few available in `aeon` and explore them using our EEG example dataset.
#
# In the second half of the notebook, we will introduce the pipelining utilities available in `aeon` and demonstrate how to build a learner from your own selection of feature extraction algorithm and estimator. Since `aeon` is a `scikit-learn`-compatible library, we will also show how the `scikit-learn` utilities can be used as a substitute for the `aeon` ones.

# Pipeline classifier.

# ## Table of Contents
#
# * [Load example data](#load-data)
# * [Simple summary statistics](#summary-statistics)
#     * [Transforming summary statistics](#summary-statistics-transform)
#     * [Classification and regression with summary statistics](#summary-statistics-learner)
# * [Catch22](#catch22)
#     * [Transforming Catch22](#catch22-transform)
#     * [Classification and regression with Catch22](#catch22-learner)
# * [TSFresh](#tsfresh)
#     * [Transforming TSFresh](#tsfresh-transform)
#     * [Classification and regression with TSFresh](#tsfresh-learner)
# * [Performance on the UCR univariate classification datasets](#evaluation)
# * [Composable pipelines](#pipelines)
#     * [`aeon` pipelines](#aeon-pipelines)
#     * [`scikit-learn` pipelines](#sklearn-pipelines)
# * [References](#references)

# In[ ]:

get_ipython().system('pip install aeon==0.11.0 tsfresh')
get_ipython().system('mkdir -p data')
get_ipython().system('wget -nc https://raw.githubusercontent.com/aeon-tutorials/KDD-2024/main/Notebooks/data/KDD_MTSC_TRAIN.ts -P data/')
get_ipython().system('wget -nc https://raw.githubusercontent.com/aeon-tutorials/KDD-2024/main/Notebooks/data/KDD_MTSC_TEST.ts -P data/')
get_ipython().system('wget -nc https://raw.githubusercontent.com/aeon-tutorials/KDD-2024/main/Notebooks/data/KDD_UTSC_TRAIN.ts -P data/')
get_ipython().system('wget -nc https://raw.githubusercontent.com/aeon-tutorials/KDD-2024/main/Notebooks/data/KDD_UTSC_TEST.ts -P data/')
get_ipython().system('wget -nc https://raw.githubusercontent.com/aeon-tutorials/KDD-2024/main/Notebooks/data/KDD_MTSER_TRAIN.ts -P data/')
get_ipython().system('wget -nc https://raw.githubusercontent.com/aeon-tutorials/KDD-2024/main/Notebooks/data/KDD_MTSER_TEST.ts -P data/')
get_ipython().system('wget -nc https://raw.githubusercontent.com/aeon-tutorials/KDD-2024/main/Notebooks/data/KDD_UTSER_TRAIN.ts -P data/')
get_ipython().system('wget -nc https://raw.githubusercontent.com/aeon-tutorials/KDD-2024/main/Notebooks/data/KDD_UTSER_TEST.ts -P data/')

# In[1]:

# There are some deprecation warnings present in the notebook; we will ignore them.
# Remove this cell if you are interested in finding out what is changing soon, for
# aeon there will be big changes in our v1.0.0 release!
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)

# In[2]:

from aeon.registry import all_estimators

all_estimators(
    "classifier", filter_tags={"algorithm_type": "feature"}, as_dataframe=True
)

# In[3]:

all_estimators(
    "regressor", filter_tags={"algorithm_type": "feature"}, as_dataframe=True
)

# ## Load example data

# In[4]:

from aeon.datasets import load_from_tsfile

X_train_c, y_train_c = load_from_tsfile("./data/KDD_MTSC_TRAIN.ts")
X_test_c, y_test_c = load_from_tsfile("./data/KDD_MTSC_TEST.ts")

print("Train shape:", X_train_c.shape)
print("Test shape:", X_test_c.shape)
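# `aeon` stores equal-length collections of series as 3D `numpy` arrays with shape
# `(n_cases, n_channels, n_timepoints)`. As a quick illustrative sanity check (a small
# addition using plain `numpy`, not part of the original loading utilities), we can
# confirm this layout and inspect the class balance of the training split.

# In[ ]:

import numpy as np

n_cases, n_channels, n_timepoints = X_train_c.shape
print(f"{n_cases} cases, {n_channels} channels, {n_timepoints} time points")

# Count how many training cases belong to each class label.
labels, counts = np.unique(y_train_c, return_counts=True)
print(dict(zip(labels, counts)))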
# In[5]:

from aeon.visualisation import plot_collection_by_class

plot_collection_by_class(X_train_c[:, 2, :], y_train_c)

# In[6]:

X_train_r, y_train_r = load_from_tsfile("./data/KDD_MTSER_TRAIN.ts")
X_test_r, y_test_r = load_from_tsfile("./data/KDD_MTSER_TEST.ts")

print("Train shape:", X_train_r.shape)
print("Test shape:", X_test_r.shape)

# In[7]:

from matplotlib import pyplot as plt

plt.plot(X_train_r[0].T)
plt.legend(["Dim 0", "Dim 1", "Dim 2", "Dim 3"])

# ## Simple summary statistics
#
# One of the simplest ways to transform time series data is to calculate summary statistics such as the mean, median, minimum and maximum. While unlikely to be the most effective method for classification or regression, simple statistics can be an efficient approach if they are all that is required to solve a problem. They also serve as a useful baseline against which to compare more complex methods.

# ### Transforming summary statistics
#
# There is a myriad of simple summary statistics to extract. `aeon` provides the `SevenNumberSummaryTransformer` to extract some common sets of them. By default, this will extract the mean, standard deviation, minimum and maximum, as well as the 25th, 50th and 75th percentiles of the series.

# In[8]:

from aeon.transformations.collection.feature_based import SevenNumberSummaryTransformer

sns = SevenNumberSummaryTransformer()
sns.fit_transform(X_train_c)[:5]

# ### Classification and regression with summary statistics
#
# The `SummaryClassifier` and `SummaryRegressor` `aeon` classes are wrappers for a pipeline of a `SevenNumberSummaryTransformer` transformation and, by default, a `scikit-learn` Random Forest.

# In[9]:

from aeon.classification.feature_based import SummaryClassifier
from sklearn.metrics import accuracy_score

sns_cls = SummaryClassifier(random_state=42)
sns_cls.fit(X_train_c, y_train_c)
sns_preds_c = sns_cls.predict(X_test_c)
accuracy_score(y_test_c, sns_preds_c)

# The summary feature set and the `scikit-learn` estimator are configurable.

# In[10]:

from sklearn.linear_model import RidgeClassifierCV

sns_cls = SummaryClassifier(
    summary_stats="bowley",
    estimator=RidgeClassifierCV(),
    random_state=42,
)
sns_cls.fit(X_train_c, y_train_c)
sns_preds_c = sns_cls.predict(X_test_c)
accuracy_score(y_test_c, sns_preds_c)

# Regression is run similarly to classification.

# In[11]:

from aeon.regression.feature_based import SummaryRegressor
from sklearn.metrics import mean_squared_error

sns_reg = SummaryRegressor(random_state=42)
sns_reg.fit(X_train_r, y_train_r)
sns_preds_r = sns_reg.predict(X_test_r)
mean_squared_error(y_test_r, sns_preds_r)

# In[12]:

from aeon.visualisation import plot_scatter_predictions

plot_scatter_predictions(y_test_r, sns_preds_r)
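# To make the transform above concrete, here is a rough `numpy` sketch of the default
# seven-number summary, computed independently per channel over the time axis. This is
# an illustrative re-implementation, not the `aeon` code; in particular, the exact
# column ordering of `SevenNumberSummaryTransformer` is an assumption here.

# In[ ]:

import numpy as np

def seven_number_summary(X):
    # X: (n_cases, n_channels, n_timepoints) -> (n_cases, n_channels * 7)
    stats = [
        np.mean(X, axis=-1),
        np.std(X, axis=-1),
        np.min(X, axis=-1),
        np.max(X, axis=-1),
        np.quantile(X, 0.25, axis=-1),
        np.quantile(X, 0.5, axis=-1),
        np.quantile(X, 0.75, axis=-1),
    ]
    # Stack into (n_cases, n_channels, 7), then flatten channels and statistics.
    return np.stack(stats, axis=-1).reshape(X.shape[0], -1)

seven_number_summary(X_train_c)[:5]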
# ## Catch22
#
# The highly comparative time-series analysis (hctsa) [[1,2]](#references) toolbox can create over 7700 features for exploratory time series analysis. Unfortunately, a complete implementation of these features is not currently available in Python; interested users can check out the relevant publication and the original MATLAB toolbox.
#
# The canonical time series characteristics (Catch22) [[3]](#references) are 22 hctsa features determined to be the most discriminatory of the full set.
#
# The Catch22 features were chosen by an evaluation on the UCR datasets [[4]](#references). The hctsa features were initially pruned, removing those which are sensitive to mean and variance and any which could not be calculated on over 80% of the UCR datasets. A feature evaluation was then performed based on predictive performance, and any features which performed below a threshold were removed. For the remaining features, a hierarchical clustering was performed on the correlation matrix to remove redundancy. From each of the 22 clusters formed, a single feature was selected, taking into account balanced accuracy, computational efficiency and interpretability. The Catch22 features cover a wide range of concepts such as basic statistics of the series values, linear correlations, and entropy.

# Catch22 extraction process.

# ### Transforming Catch22
#
# The `Catch22` class transforms time series into the 22 features.

# In[13]:

from aeon.transformations.collection.feature_based import Catch22

c22 = Catch22()
c22.fit_transform(X_train_c)[:5]

# The order of the columns matches the `feature_names` list below for each channel.

# In[14]:

from aeon.transformations.collection.feature_based._catch22 import feature_names

feature_names

# The transform is configurable, and you can select a subset of the `features` to extract. The `catch24` parameter will include the mean and standard deviation alongside the original 22 features.
#
# `use_pycatch22` will use the `pycatch22` C-wrapped library by the original authors [[5]](#references) to extract features if it is installed.

# In[15]:

c22 = Catch22(
    features=[
        "DN_HistogramMode_5",
        "CO_f1ecac",
        "SB_MotifThree_quantile_hh",
        "Mean",
        "StandardDeviation",
    ],
    catch24=True,
)
c22.fit_transform(X_train_c)[:5]

# ### Classification and regression with Catch22
#
# The `Catch22Classifier` and `Catch22Regressor` classes in `aeon` are simply convenient wrappers for a pipeline of a `Catch22` transformation and, by default, a `scikit-learn` Random Forest.

# In[16]:

from aeon.classification.feature_based import Catch22Classifier
from sklearn.metrics import accuracy_score

c22_cls = Catch22Classifier(random_state=42)
c22_cls.fit(X_train_c, y_train_c)
c22_preds_c = c22_cls.predict(X_test_c)
accuracy_score(y_test_c, c22_preds_c)

# These estimators are configurable in the same way as the `Catch22` transformation. The `estimator` parameter sets the estimator to use after the feature transformation.

# In[17]:

from aeon.classification.sklearn import RotationForestClassifier

c22_cls = Catch22Classifier(
    features=[
        "DN_HistogramMode_5",
        "CO_f1ecac",
        "SB_MotifThree_quantile_hh",
        "Mean",
        "StandardDeviation",
    ],
    estimator=RotationForestClassifier(),
    random_state=42,
)
c22_cls.fit(X_train_c, y_train_c)
c22_preds_c = c22_cls.predict(X_test_c)
accuracy_score(y_test_c, c22_preds_c)

# Regression is run similarly to classification.

# In[18]:

from aeon.regression.feature_based import Catch22Regressor
from sklearn.metrics import mean_squared_error

c22_reg = Catch22Regressor(random_state=42)
c22_reg.fit(X_train_r, y_train_r)
c22_preds_r = c22_reg.predict(X_test_r)
mean_squared_error(y_test_r, c22_preds_r)

# In[19]:

from aeon.visualisation import plot_scatter_predictions

plot_scatter_predictions(y_test_r, c22_preds_r)
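# One benefit of a small, named feature set is interpretability. The sketch below is a
# hedged example of inspecting which Catch22 features a Random Forest relies on: it
# fits the transform and the forest directly, then ranks features by impurity-based
# importance. It assumes the transformed columns are ordered as per-channel blocks of
# the 22 features in `feature_names` order; the `chX_` display names are ours, purely
# for illustration.

# In[ ]:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

Xt = Catch22(replace_nans=True).fit_transform(X_train_c)
rf = RandomForestClassifier(random_state=42)
rf.fit(Xt, y_train_c)

# Build display names assuming channel-major ordering of the transformed columns.
names_per_column = [
    f"ch{ch}_{f}" for ch in range(X_train_c.shape[1]) for f in feature_names
]
for i in np.argsort(rf.feature_importances_)[::-1][:5]:
    print(f"{names_per_column[i]}: {rf.feature_importances_[i]:.3f}")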
# ## TSFresh
#
# Time series feature extraction based on scalable hypothesis tests (TSFresh) [[6]](#references) is a collection of just under 800 features extracted from time series. The `aeon` implementation is a wrapper of the `tsfresh` package, and we recommend exploring the original package's documentation for more information on the feature extraction process [[7]](#references).
#
# __Note:__ You will need to `pip install tsfresh` to run this code.

# TSFresh logo.

# ### Transforming TSFresh
#
# There are two `aeon` transformation classes for TSFresh. `TSFreshFeatureExtractor` extracts all features, while `TSFreshRelevantFeatureExtractor` extracts only the features that are relevant to the target class, using the FRESH algorithm for selection.

# In[20]:

from aeon.transformations.collection.feature_based import TSFreshFeatureExtractor

tsf = TSFreshFeatureExtractor()
t = tsf.fit_transform(X_train_c)
t.shape

# In[21]:

t[:5]

# There are multiple feature sets to extract, and functionality to extract specific features from them. The available feature sets are: `"minimal"`, `"efficient"` and `"comprehensive"`.

# In[22]:

from aeon.transformations.collection.feature_based import TSFreshFeatureExtractor

tsf = TSFreshFeatureExtractor(default_fc_parameters="minimal")
t = tsf.fit_transform(X_train_c)
t.shape

# In[23]:

t[:5]

# The FRESH algorithm can be used for feature selection with `TSFreshRelevantFeatureExtractor`.

# In[24]:

from aeon.transformations.collection.feature_based import TSFreshRelevantFeatureExtractor

tsf = TSFreshRelevantFeatureExtractor()
t = tsf.fit_transform(X_train_c, y_train_c)
t.shape

# In[25]:

t[:5]

# ### Classification and regression with TSFresh
#
# Like the previous feature transforms, the `TSFreshClassifier` and `TSFreshRegressor` classes in `aeon` are wrappers for a pipeline of a `TSFreshFeatureExtractor` transformation and, by default, a `scikit-learn` Random Forest. The default setting is to use the `"efficient"` feature set with feature selection.

# In[26]:

from aeon.classification.feature_based import TSFreshClassifier
from sklearn.metrics import accuracy_score

tsf_cls = TSFreshClassifier(random_state=42)
tsf_cls.fit(X_train_c, y_train_c)
tsf_preds_c = tsf_cls.predict(X_test_c)
accuracy_score(y_test_c, tsf_preds_c)

# Feature selection and the feature sets to extract can be configured for the TSFresh transformer.

# In[27]:

tsf_cls = TSFreshClassifier(
    default_fc_parameters="minimal", relevant_feature_extractor=False, random_state=42
)
tsf_cls.fit(X_train_c, y_train_c)
tsf_preds_c = tsf_cls.predict(X_test_c)
accuracy_score(y_test_c, tsf_preds_c)

# Regression is run similarly to classification, with feature selection available for both learning tasks.

# In[28]:

from aeon.regression.feature_based import TSFreshRegressor

tsf_reg = TSFreshRegressor(random_state=42)
tsf_reg.fit(X_train_r, y_train_r)
tsf_preds_r = tsf_reg.predict(X_test_r)
mean_squared_error(y_test_r, tsf_preds_r)

# In[29]:

from aeon.visualisation import plot_scatter_predictions

plot_scatter_predictions(y_test_r, tsf_preds_r)
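# At its core, the FRESH selection step tests each extracted feature for relevance to
# the target and keeps those that pass a false discovery rate correction
# (Benjamini-Yekutieli). As a simplified, hedged stand-in for that idea, the sketch
# below applies `scikit-learn`'s univariate ANOVA test with Benjamini-Hochberg FDR
# control to the `"minimal"` feature set. The real FRESH procedure chooses its
# statistical test per feature and target type, so treat this only as an illustration.

# In[ ]:

from sklearn.feature_selection import SelectFdr, f_classif

Xt = TSFreshFeatureExtractor(default_fc_parameters="minimal").fit_transform(X_train_c)
selector = SelectFdr(f_classif, alpha=0.05)
selector.fit(Xt, y_train_c)
print("Kept", int(selector.get_support().sum()), "of", Xt.shape[1], "features")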
# An extensive comparison of feature-based pipelines [[8]](#references) found that TSFresh using its `"comprehensive"` feature set, followed by a Rotation Forest classifier [[9]](#references), was significantly more accurate than other combinations. This configuration is hard-coded into an `aeon` estimator called the FreshPRINCE.

# In[30]:

from aeon.classification.feature_based import FreshPRINCEClassifier
from sklearn.metrics import accuracy_score

fp = FreshPRINCEClassifier(n_estimators=100, random_state=42)
fp.fit(X_train_c, y_train_c)
fp_preds = fp.predict(X_test_c)
accuracy_score(y_test_c, fp_preds)

# ## Performance on the UCR univariate classification datasets
#
# Below we show the performance of the `Catch22Classifier`, `TSFreshClassifier` and `FreshPRINCEClassifier` pipelines on the UCR TSC archive datasets [[4]](#references), using results from a large-scale comparison of TSC algorithms [[10]](#references). The results files are stored on [timeseriesclassification.com](https://timeseriesclassification.com).

# In[31]:

from aeon.benchmarking import get_estimator_results_as_array
from aeon.datasets.tsc_datasets import univariate

names = ["Catch22", "TSFresh", "FreshPRINCE", "1NN-DTW"]

results, present_names = get_estimator_results_as_array(
    names, univariate, include_missing=False
)
results.shape

# In[32]:

import numpy as np

np.mean(results, axis=0)

# In[33]:

from aeon.visualisation import plot_critical_difference

plot_critical_difference(results, names)

# In[34]:

from aeon.visualisation import plot_boxplot_median

plot_boxplot_median(results, names, plot_type="boxplot")
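# The critical difference diagram and boxplot summarise ranks and accuracy
# distributions. A complementary view is a simple head-to-head tally over the
# datasets, which we can compute directly from the `results` array with `numpy`
# (its columns follow the order of `names`, as in the plots above).

# In[ ]:

import numpy as np

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        wins = int(np.sum(results[:, i] > results[:, j]))
        losses = int(np.sum(results[:, i] < results[:, j]))
        ties = results.shape[0] - wins - losses
        print(f"{names[i]} vs {names[j]}: {wins} wins, {ties} ties, {losses} losses")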
# ## Composable pipelines
#
# The majority of feature-based approaches (and all of the ones demonstrated here) take the form of a simple pipeline estimator. Both `aeon` and `scikit-learn` provide composable utilities for building these pipelines from selected transformations and learners.

# ### `aeon` pipelines
#
# `aeon` pipelines are built using the `make_pipeline` function, the same as in `scikit-learn`. This function takes a list of transformations and a final estimator. The transformations are applied in order to the input data, and the final estimator is trained on the transformed data.
#
# The following example z-normalises the time series, extracts Catch22 features and trains a Random Forest classifier.

# In[35]:

from aeon.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from aeon.transformations.collection.feature_based import Catch22
from aeon.transformations.collection.scaler import TimeSeriesScaler
from sklearn.metrics import accuracy_score

pipe = make_pipeline(
    TimeSeriesScaler(),
    Catch22(replace_nans=True),
    RandomForestClassifier(random_state=42),
)
pipe.fit(X_train_c, y_train_c)
pipe_preds = pipe.predict(X_test_c)
accuracy_score(y_test_c, pipe_preds)

# The same function can be applied for regression.

# In[36]:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

pipe = make_pipeline(
    TimeSeriesScaler(),
    Catch22(replace_nans=True),
    RandomForestRegressor(random_state=42),
)
pipe.fit(X_train_r, y_train_r)
pipe_preds = pipe.predict(X_test_r)
mean_squared_error(y_test_r, pipe_preds)

# ### `scikit-learn` pipelines
#
# The `scikit-learn` `make_pipeline` function can also be used. The following example extracts both the Catch22 features and the simple summary statistics using a `FeatureUnion`, then trains a Random Forest classifier.

# In[37]:

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from aeon.transformations.collection.feature_based import Catch22
from aeon.transformations.collection.feature_based import SevenNumberSummaryTransformer
from sklearn.pipeline import FeatureUnion

pipe = make_pipeline(
    FeatureUnion(
        [
            ("C22", Catch22(replace_nans=True)),
            ("SNS", SevenNumberSummaryTransformer()),
        ]
    ),
    RandomForestClassifier(random_state=42),
)
pipe.fit(X_train_c, y_train_c)
pipe_preds = pipe.predict(X_test_c)
accuracy_score(y_test_c, pipe_preds)

# ## References
#
# [1] Fulcher, Ben D., and Nick S. Jones. "hctsa: A computational framework for automated time-series phenotyping using massive feature extraction." Cell Systems 5.5 (2017): 527-531.
#
# [2] https://github.com/benfulcher/hctsa
#
# [3] Lubba, Carl H., et al. "catch22: CAnonical Time-series CHaracteristics: Selected through highly comparative time-series analysis." Data Mining and Knowledge Discovery 33.6 (2019): 1821-1852.
#
# [4] Dau, Hoang Anh, et al. "The UCR time series archive." IEEE/CAA Journal of Automatica Sinica 6.6 (2019): 1293-1305.
#
# [5] https://github.com/DynamicsAndNeuralSystems/pycatch22
#
# [6] Christ, Maximilian, et al. "Time series feature extraction on basis of scalable hypothesis tests (tsfresh – a Python package)." Neurocomputing 307 (2018): 72-77.
#
# [7] https://github.com/blue-yonder/tsfresh
#
# [8] Middlehurst, Matthew, and Anthony Bagnall. "The FreshPRINCE: A simple transformation based pipeline time series classifier." International Conference on Pattern Recognition and Artificial Intelligence. Cham: Springer International Publishing, 2022.
#
# [9] Rodriguez, Juan José, Ludmila I. Kuncheva, and Carlos J. Alonso. "Rotation forest: A new classifier ensemble method." IEEE Transactions on Pattern Analysis and Machine Intelligence 28.10 (2006): 1619-1630.
#
# [10] Middlehurst, Matthew, Patrick Schäfer, and Anthony Bagnall. "Bake off redux: a review and experimental evaluation of recent time series classification algorithms." Data Mining and Knowledge Discovery (2024): 1-74.
#
# [Return to Table of Contents](#toc)