This notebook shows simple examples of using feature selection transforms with the ETNA library.
import warnings
warnings.filterwarnings("ignore")
import time
from copy import deepcopy
import pandas as pd
from etna.analysis import plot_backtest
from etna.datasets import TSDataset
from etna.metrics import SMAPE
from etna.models import CatBoostPerSegmentModel
from etna.pipeline import Pipeline
from etna.transforms import DateFlagsTransform
from etna.transforms import FourierTransform
from etna.transforms import HolidayTransform
from etna.transforms import LagTransform
from etna.transforms import MeanTransform
from etna.transforms import TrendTransform
HORIZON = 60
We are going to work with the time series from Tabular Playground Series - Jan 2022. The dataset contains daily merchandise sales — mugs, hats, and stickers — at two imaginary store chains across three Nordic countries. As exogenous data, we will use the Finland, Norway, and Sweden Weather Data 2015-2019 dataset, which contains daily country-averaged precipitation, snow depth, and air temperature.
df = pd.read_csv("data/nordic_merch_sales.csv")
df = TSDataset.to_dataset(df)
ts = TSDataset(df, freq="D")
ts.plot(4)
Let's create features and build a pipeline on top of the dataset:
transforms = [
TrendTransform(in_column="target", out_column="trend"),
LagTransform(in_column="target", lags=range(HORIZON, 100), out_column="target_lag"),
DateFlagsTransform(
day_number_in_month=True, day_number_in_week=False, is_weekend=False, out_column="datetime_flag"
),
MeanTransform(in_column=f"target_lag_{HORIZON}", window=12, seasonality=7, out_column="mean_transform"),
FourierTransform(period=365, order=30, out_column="fourier_year"),
FourierTransform(period=30.4, order=10, out_column="fourier_month"),
HolidayTransform(iso_code="SWE", out_column="SWE_holidays"),
HolidayTransform(iso_code="NOR", out_column="NOR_holidays"),
HolidayTransform(iso_code="FIN", out_column="FIN_holidays"),
]
pipeline = Pipeline(model=CatBoostPerSegmentModel(), transforms=transforms, horizon=HORIZON)
start_time = time.perf_counter()
metrics_baseline, forecast_baseline, _ = pipeline.backtest(ts=ts, metrics=[SMAPE()], n_folds=1)
time_baseline = time.perf_counter() - start_time
[Parallel(n_jobs=1)]: Done 1 tasks | elapsed: 1.8min
[Parallel(n_jobs=1)]: Done 1 tasks | elapsed: 1.0s
[Parallel(n_jobs=1)]: Done 1 tasks | elapsed: 0.1s
plot_backtest(forecast_baseline, ts, history_len=60, segments=ts.segments[:4])
smape_baseline = metrics_baseline["SMAPE"].mean()
print(f"SMAPE baseline: {smape_baseline:.3f}")
print(f"Time baseline: {time_baseline:.3f}")
SMAPE baseline: 8.094
Time baseline: 106.122
Let's look at how many features are used when fitting the model:
temp_ts = deepcopy(ts)
temp_ts.fit_transform(transforms=transforms)
num_features = len(temp_ts.columns.get_level_values("feature").unique()) - 1
print(f"Num features: {num_features}")
Num features: 126
Now our pipeline has too many features, takes a lot of time in its fit and predict methods, and shows imperfect metrics. ETNA provides several methods to improve this by removing some of the features.
TreeFeatureSelectionTransform
This method uses a tree-based model to estimate the importance of each feature and then keeps the top_k best features. The class accepts the following parameters:
model: model for estimating feature importance; it must have a fit method and a feature_importances_ attribute;
top_k: number of features to select;
features_to_use: which features participate in feature selection;
return_features: indicates whether to save the removed features; set it to True only if you want them returned to the dataset during inverse_transform.
Let's optimize our baseline:
from sklearn.tree import DecisionTreeRegressor
from etna.transforms import TreeFeatureSelectionTransform
model_feature_selection = DecisionTreeRegressor(random_state=42)
feature_selector_transform = TreeFeatureSelectionTransform(model=model_feature_selection, top_k=20)
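The remaining parameters keep their defaults here. A minimal sketch of how they could be spelled out explicitly (this variant is illustrative only and is not used in the pipeline below):
# Illustrative variant with the optional parameters written out explicitly:
# features_to_use="all" considers every generated feature,
# return_features=True lets inverse_transform put the dropped columns back.
feature_selector_verbose = TreeFeatureSelectionTransform(
    model=model_feature_selection,
    top_k=20,
    features_to_use="all",
    return_features=True,
)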
pipeline = Pipeline(
model=CatBoostPerSegmentModel(), transforms=transforms + [feature_selector_transform], horizon=HORIZON
)
start_time = time.perf_counter()
metrics_tree_feature_selector, forecast_tree_feature_selector, _ = pipeline.backtest(
ts=ts, metrics=[SMAPE()], n_folds=1
)
time_tree_feature_selector = time.perf_counter() - start_time
[Parallel(n_jobs=1)]: Done 1 tasks | elapsed: 37.6s
[Parallel(n_jobs=1)]: Done 1 tasks | elapsed: 1.1s
[Parallel(n_jobs=1)]: Done 1 tasks | elapsed: 0.1s
plot_backtest(forecast_tree_feature_selector, ts, history_len=60, segments=ts.segments[:4])
Let's check whether the metrics have improved:
smape_tree_feature_selector = metrics_tree_feature_selector["SMAPE"].mean()
print(f"SMAPE with TreeFeatureSelectionTransform: {smape_tree_feature_selector:.3f}")
print(f"Time with TreeFeatureSelectionTransform: {time_tree_feature_selector:.3f}")
SMAPE with TreeFeatureSelectionTransform: 7.371
Time with TreeFeatureSelectionTransform: 38.784
Results: compared with the baseline, SMAPE dropped from 8.094 to 7.371 and the backtest time fell from about 106 to 39 seconds.
ETNA also provides methods to plot the importance of each feature:
from etna.analysis import ModelRelevanceTable
from etna.analysis import plot_feature_relevance
temp_ts = deepcopy(ts)
temp_ts.fit_transform(transforms=transforms)
plot_feature_relevance(
temp_ts,
relevance_table=ModelRelevanceTable(),
top_k=20,
relevance_params={"model": model_feature_selection},
)
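The same plot can also be built from the statistics-based relevance table introduced in the next section; a minimal sketch, assuming StatisticsRelevanceTable needs no extra relevance_params here:
from etna.analysis import StatisticsRelevanceTable

# Same plot, with relevance taken from tsfresh p-values instead of a tree model.
plot_feature_relevance(
    temp_ts,
    relevance_table=StatisticsRelevanceTable(),
    top_k=20,
)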
GaleShapleyFeatureSelectionTransform
Let's assume that we have a relevance table of size $N_{f} \times N_{s}$, where $N_{f}$ is the number of features and $N_{s}$ is the number of segments.
The filtering procedure consists of $\lceil \frac{k}{N_{s}} \rceil$ iterations, where $k$ is top_k. On each iteration the algorithm builds a matching between segments and features with the Gale-Shapley algorithm based on the relevance table, adds the matched features to the selected set, and removes them from further consideration, so that top_k features are selected in total.
The library provides two ways to build such a relevance table:
etna.analysis.StatisticsRelevanceTable: calculates p-value relevances using tsfresh;
etna.analysis.ModelRelevanceTable: calculates standard feature importance relevances using tree-based models.
In this example, let's try StatisticsRelevanceTable.
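As a quick sanity check of the iteration count formula above (a minimal sketch; top_k=20 as in the transform below, and the segment count is read from the dataset):
import math

# ceil(top_k / number of segments) Gale-Shapley iterations are needed
top_k = 20
n_segments = len(ts.segments)
print(f"segments: {n_segments}, iterations: {math.ceil(top_k / n_segments)}")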
from etna.analysis import StatisticsRelevanceTable
from etna.transforms import GaleShapleyFeatureSelectionTransform
rt = StatisticsRelevanceTable()
feature_selector_transform = GaleShapleyFeatureSelectionTransform(top_k=20, relevance_table=rt)
pipeline = Pipeline(
model=CatBoostPerSegmentModel(), transforms=transforms + [feature_selector_transform], horizon=HORIZON
)
start_time = time.perf_counter()
metrics_galeshapley_feature_selector, forecast_galeshapley_feature_selector, _ = pipeline.backtest(
ts=ts, metrics=[SMAPE()], n_folds=1
)
time_galeshapley_feature_selector = time.perf_counter() - start_time
[Parallel(n_jobs=1)]: Done 1 tasks | elapsed: 42.6s
[Parallel(n_jobs=1)]: Done 1 tasks | elapsed: 1.0s
[Parallel(n_jobs=1)]: Done 1 tasks | elapsed: 0.1s
plot_backtest(forecast_galeshapley_feature_selector, ts, history_len=60, segments=ts.segments[:4])
smape_galeshapley_feature_selector = metrics_galeshapley_feature_selector["SMAPE"].mean()
print(f"SMAPE with GaleShapleyFeatureSelectionTransform: {smape_galeshapley_feature_selector:.3f}")
print(f"Time with GaleShapleyFeatureSelectionTransform: {time_galeshapley_feature_selector:.3f}")
SMAPE with GaleShapleyFeatureSelectionTransform: 7.402
Time with GaleShapleyFeatureSelectionTransform: 43.729
Results: SMAPE improved to 7.402 versus the baseline 8.094, with the backtest taking about 44 seconds instead of 106.
MRMRFeatureSelectionTransform
The MRMR algorithm stands for "Maximum Relevance - Minimum Redundancy": the idea is to select features that are informative about the target and are not correlated with each other. At each iteration we select the single feature that has maximum relevance to the target and minimum redundancy with the already selected set of features.
To achieve this, we select the feature $f$ that maximizes the score: $$ score(f) = \frac{\operatorname{relevance}(f | target)}{\operatorname{redundancy}(f | prev)}, $$
where $prev$ are previously selected features.
We use: $$ \operatorname{relevance}(f | target) = \operatorname{F}(f, target), \\ \operatorname{redundancy}(f | prev) = \sum_{s \in prev} |\rho(f, s)| \cdot \frac{1}{i-1}, $$
where $\operatorname{F}$ is the F-statistic, $\rho$ is the Pearson correlation coefficient, and $i$ is the iteration number.
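To make the score concrete, here is a toy numeric sketch on synthetic data (illustrative only, not ETNA's internal implementation):
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
target = rng.normal(size=200)
candidate = target + rng.normal(scale=0.5, size=200)    # informative candidate feature
selected = [target + rng.normal(scale=0.3, size=200)]   # features chosen on previous iterations

# relevance: F-statistic of the candidate w.r.t. the target
relevance = f_regression(candidate.reshape(-1, 1), target)[0][0]
# redundancy: mean absolute Pearson correlation with already selected features
redundancy = np.mean([abs(pearsonr(candidate, s)[0]) for s in selected])
print(f"score = {relevance / redundancy:.2f}")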
Let's look at an example of usage. This time we are going to try ModelRelevanceTable.
from etna.transforms import MRMRFeatureSelectionTransform
rt = ModelRelevanceTable()
feature_selector_transform = MRMRFeatureSelectionTransform(
top_k=20,
relevance_table=rt,
fast_redundancy=True,
model=model_feature_selection,
)
pipeline = Pipeline(
model=CatBoostPerSegmentModel(), transforms=transforms + [feature_selector_transform], horizon=HORIZON
)
Here we set fast_redundancy=True to use the optimized version of the algorithm; it is not the default for backward compatibility reasons.
start_time = time.perf_counter()
metrics_mrmr_feature_selector, forecast_mrmr_feature_selector, _ = pipeline.backtest(
ts=ts, metrics=[SMAPE()], n_folds=1
)
time_mrmr_feature_selector = time.perf_counter() - start_time
[Parallel(n_jobs=1)]: Done 1 tasks | elapsed: 1.2min
[Parallel(n_jobs=1)]: Done 1 tasks | elapsed: 1.2s
[Parallel(n_jobs=1)]: Done 1 tasks | elapsed: 0.1s
plot_backtest(forecast_mrmr_feature_selector, ts, history_len=60, segments=ts.segments[:4])
smape_mrmr_feature_selector = metrics_mrmr_feature_selector["SMAPE"].mean()
print(f"SMAPE with MRMRFeatureSelectionTransform: {smape_mrmr_feature_selector:.3f}")
print(f"Time with MRMRFeatureSelectionTransform: {time_mrmr_feature_selector:.3f}")
SMAPE with MRMRFeatureSelectionTransform: 7.416
Time with MRMRFeatureSelectionTransform: 70.717
Results: all three feature selection transforms improve SMAPE over the baseline (8.094) and run the backtest noticeably faster; TreeFeatureSelectionTransform gives both the best metric (7.371) and the shortest runtime here.
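A small summary sketch, using only the variables already computed above, to collect the numbers in one table:
summary = pd.DataFrame(
    {
        "SMAPE": [
            smape_baseline,
            smape_tree_feature_selector,
            smape_galeshapley_feature_selector,
            smape_mrmr_feature_selector,
        ],
        "time, s": [
            time_baseline,
            time_tree_feature_selector,
            time_galeshapley_feature_selector,
            time_mrmr_feature_selector,
        ],
    },
    index=["baseline", "tree", "gale-shapley", "mrmr"],
)
print(summary.round(3))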
In this notebook we have examined several feature selection methods. As we have seen, they can both improve the metrics and reduce the runtime of your pipeline.