This notebook contains simple examples of a time series forecasting pipeline using the ETNA library.
import warnings
warnings.filterwarnings(action="ignore", message="Torchmetrics v0.9")
Let's load and look at the dataset
import pandas as pd
original_df = pd.read_csv("data/monthly-australian-wine-sales.csv")
original_df.head()
 | month | sales |
---|---|---|
0 | 1980-01-01 | 15136 |
1 | 1980-02-01 | 16733 |
2 | 1980-03-01 | 20016 |
3 | 1980-04-01 | 17708 |
4 | 1980-05-01 | 18019 |
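ETNA is strict about the data format:
- the column we want to predict must be called target
- the column with datetime data must be called timestamp
- because ETNA is always ready to work with multiple time series, the segment column is also compulsory

Our library works with the special data structure TSDataset. So, before starting anything, we need to convert the classical DataFrame to TSDataset.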
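As a quick sanity check before converting, the three column names listed above can be verified with a few lines of plain Python. This is only a sketch: the helper `missing_columns` is a hypothetical name and is not part of ETNA.

```python
# Column names required by the format described above.
REQUIRED_COLUMNS = {"timestamp", "target", "segment"}


def missing_columns(columns):
    """Return the required column names that are absent, sorted alphabetically."""
    return sorted(REQUIRED_COLUMNS - set(columns))


# The raw wine-sales frame only has "month" and "sales".
print(missing_columns(["month", "sales"]))
# A frame that already follows the naming convention passes the check.
print(missing_columns(["timestamp", "target", "segment"]))
```

With a real DataFrame, the same helper can be called as `missing_columns(original_df.columns)`.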
Let's rename the columns first
original_df["timestamp"] = pd.to_datetime(original_df["month"])
original_df["target"] = original_df["sales"]
original_df.drop(columns=["month", "sales"], inplace=True)
original_df["segment"] = "main"
original_df.head()
 | timestamp | target | segment |
---|---|---|---|
0 | 1980-01-01 | 15136 | main |
1 | 1980-02-01 | 16733 | main |
2 | 1980-03-01 | 20016 | main |
3 | 1980-04-01 | 17708 | main |
4 | 1980-05-01 | 18019 | main |
Time to convert to TSDataset!
To do this, we first need to convert the classical DataFrame to the special format that TSDataset expects.
from etna.datasets.tsdataset import TSDataset
df = TSDataset.to_dataset(original_df)
df.head()
segment | main |
---|---|
feature | target |
timestamp | |
1980-01-01 | 15136 |
1980-02-01 | 16733 |
1980-03-01 | 20016 |
1980-04-01 | 17708 |
1980-05-01 | 18019 |
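To build some intuition for this conversion, the sketch below reshapes a toy long-format frame into one column per segment with plain pandas. It is not how `TSDataset.to_dataset` is implemented, and it does not reproduce the extra feature level in the column header shown in the table above. The segment names `a` and `b` and all values are made up for illustration.

```python
import pandas as pd

# Toy long-format frame with two hypothetical segments.
long_df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2020-01-01", "2020-02-01"] * 2),
        "segment": ["a", "a", "b", "b"],
        "target": [1.0, 2.0, 10.0, 20.0],
    }
)

# One row per timestamp, one column per segment.
wide_df = long_df.pivot(index="timestamp", columns="segment", values="target")
print(wide_df)
```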
Now we can construct the TSDataset.
In addition to passing the dataframe, we should specify the frequency of our data. In this case, the data is monthly.
ts = TSDataset(df, freq="1M")
/Users/d.a.binin/Documents/tasks/etna-github/etna/datasets/tsdataset.py:124: UserWarning: You probably set wrong freq. Discovered freq in you data is MS, you set 1M warnings.warn(
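Oops. The timestamps fall on the first day of each month, so the matching pandas frequency alias is "MS" (month start), not "1M" (month end). Let's fix that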
ts = TSDataset(df, freq="MS")
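If you are unsure which alias fits your data, pandas can infer it from the timestamps. The sketch below uses the first five timestamps from the table above; ETNA's own frequency check may work differently.

```python
import pandas as pd

# The first five timestamps of the wine-sales dataset.
timestamps = pd.to_datetime(
    ["1980-01-01", "1980-02-01", "1980-03-01", "1980-04-01", "1980-05-01"]
)

# infer_freq needs at least three regularly spaced timestamps.
inferred = pd.infer_freq(timestamps)
print(inferred)
```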
We can look at the basic information about the dataset
ts.info()
<class 'etna.datasets.TSDataset'>
num_segments: 1
num_exogs: 0
num_regressors: 0
num_known_future: 0
freq: MS
         start_timestamp end_timestamp  length  num_missing
segments
main          1980-01-01    1994-08-01     176            0
Or in DataFrame format
ts.describe()
segments | start_timestamp | end_timestamp | length | num_missing | num_segments | num_exogs | num_regressors | num_known_future | freq |
---|---|---|---|---|---|---|---|---|---|
main | 1980-01-01 | 1994-08-01 | 176 | 0 | 1 | 0 | 0 | 0 | MS |
Let's take a look at the time series in the dataset
ts.plot()
Our library contains a wide range of different models for time series forecasting. Let's look at some of them.
Let's predict the monthly values in 1994 in our dataset using the NaiveModel
train_ts, test_ts = ts.train_test_split(
train_start="1980-01-01",
train_end="1993-12-01",
test_start="1994-01-01",
test_end="1994-08-01",
)
HORIZON = 8
from etna.models import NaiveModel
# Fit the model
model = NaiveModel(lag=12)
model.fit(train_ts)
# Make the forecast
future_ts = train_ts.make_future(future_steps=HORIZON, tail_steps=model.context_size)
forecast_ts = model.forecast(future_ts, prediction_size=HORIZON)
Here we pass the prediction_size parameter to forecast because the first few points of future_ts (the ones added via tail_steps) serve only as context for NaiveModel and are not forecast themselves.
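The idea behind a naive model with a lag can be sketched in plain Python: each future point repeats the value observed `lag` steps earlier, which is why the model needs the last `lag` points as context. The function `seasonal_naive` below is a hypothetical illustration, not ETNA's implementation.

```python
def seasonal_naive(history, lag, horizon):
    """Forecast each future point with the value observed `lag` steps earlier."""
    if len(history) < lag:
        raise ValueError("history must contain at least `lag` points")
    values = list(history)
    for _ in range(horizon):
        # values[-lag] sits exactly `lag` steps before the point being predicted;
        # once the horizon exceeds the lag, earlier forecasts are reused.
        values.append(values[-lag])
    return values[len(history):]


history = [10, 20, 30, 40, 11, 21, 31, 41]  # toy series with period 4
print(seasonal_naive(history, lag=4, horizon=6))  # [11, 21, 31, 41, 11, 21]
```

With lag=12 on monthly data, as in the cell above, this amounts to repeating the value from the same month of the previous year.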
Now let's look at a metric and plot the prediction. All of these methods are already built into etna.
from etna.metrics import SMAPE
smape = SMAPE()
smape(y_true=test_ts, y_pred=forecast_ts)
{'main': 11.492045838249387}
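For reference, the usual definition of SMAPE in percent is 200 / n * sum(|y - y_hat| / (|y| + |y_hat|)). The sketch below implements that formula in plain Python. The handling of a zero denominator is an assumption made here and may differ from etna.metrics.SMAPE.

```python
def smape(y_true, y_pred):
    """Symmetric MAPE in percent: 200 / n * sum(|y - y_hat| / (|y| + |y_hat|))."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        denominator = abs(actual) + abs(predicted)
        # Assumption: a point where both values are zero contributes no error.
        if denominator:
            total += abs(actual - predicted) / denominator
    return 200.0 * total / len(y_true)


# Toy values, unrelated to the wine-sales data.
print(smape([100, 200], [110, 180]))
```

Lower is better, and a perfect forecast gives 0.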
from etna.analysis import plot_forecast
plot_forecast(forecast_ts, test_ts, train_ts, n_train_samples=10)
Now let's try to improve the forecast and predict the values with Prophet.
from etna.models import ProphetModel
model = ProphetModel()
model.fit(train_ts)
# Make the forecast
future_ts = train_ts.make_future(HORIZON)
forecast_ts = model.forecast(future_ts)
smape(y_true=test_ts, y_pred=forecast_ts)
{'main': 10.510260655718435}
plot_forecast(forecast_ts, test_ts, train_ts, n_train_samples=10)
And finally, let's try the CatBoost model.
ETNA also has a wide range of transforms you may apply to your data.
Here is how it is done:
from etna.transforms import LagTransform
lags = LagTransform(in_column="target", lags=list(range(8, 24, 1)))
train_ts.fit_transform([lags])
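Note that the lags above start at 8, which equals HORIZON. Presumably this is so that every lag feature of every forecast point refers to an already observed value. The plain-Python sketch below illustrates the issue with a hypothetical helper `future_lag_features`; it is not ETNA's implementation of LagTransform.

```python
def future_lag_features(history, horizon, lags):
    """Lag features for the `horizon` points that follow `history`.

    A feature is None when it would point at a value that is not observed yet.
    """
    n = len(history)
    rows = []
    for step in range(horizon):
        position = n + step  # index of the future point being described
        row = {}
        for lag in lags:
            source = position - lag
            row[lag] = history[source] if 0 <= source < n else None
        rows.append(row)
    return rows


history = [1, 2, 3, 4, 5]
rows = future_lag_features(history, horizon=2, lags=[1, 2, 3])
print(rows[0])  # {1: 5, 2: 4, 3: 3}
print(rows[1])  # {1: None, 2: 5, 3: 4}
```

For the second future point, lag 1 would need the first future value, which is unknown, so only lags of at least the horizon are available at every step.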
from etna.models import CatBoostMultiSegmentModel
model = CatBoostMultiSegmentModel()
model.fit(train_ts)
future_ts = train_ts.make_future(HORIZON)
forecast_ts = model.forecast(future_ts)
from etna.metrics import SMAPE
smape = SMAPE()
smape(y_true=test_ts, y_pred=forecast_ts)
{'main': 10.715432057450386}
from etna.analysis import plot_forecast
train_ts.inverse_transform()
plot_forecast(forecast_ts, test_ts, train_ts, n_train_samples=10)
In this section you will see an example of how easily etna works with multiple time series and get acquainted with other transforms etna contains.
original_df = pd.read_csv("data/example_dataset.csv")
original_df.head()
 | timestamp | segment | target |
---|---|---|---|
0 | 2019-01-01 | segment_a | 170 |
1 | 2019-01-02 | segment_a | 243 |
2 | 2019-01-03 | segment_a | 267 |
3 | 2019-01-04 | segment_a | 287 |
4 | 2019-01-05 | segment_a | 279 |
df = TSDataset.to_dataset(original_df)
ts = TSDataset(df, freq="D")
ts.plot()
ts.info()
<class 'etna.datasets.TSDataset'>
num_segments: 4
num_exogs: 0
num_regressors: 0
num_known_future: 0
freq: D
          start_timestamp end_timestamp  length  num_missing
segments
segment_a      2019-01-01    2019-11-30     334            0
segment_b      2019-01-01    2019-11-30     334            0
segment_c      2019-01-01    2019-11-30     334            0
segment_d      2019-01-01    2019-11-30     334            0
import warnings
from etna.transforms import (
MeanTransform,
LagTransform,
LogTransform,
SegmentEncoderTransform,
DateFlagsTransform,
LinearTrendTransform,
)
warnings.filterwarnings("ignore")
log = LogTransform(in_column="target")
trend = LinearTrendTransform(in_column="target")
seg = SegmentEncoderTransform()
lags = LagTransform(in_column="target", lags=list(range(30, 96, 1)))
d_flags = DateFlagsTransform(
day_number_in_week=True,
day_number_in_month=True,
week_number_in_month=True,
week_number_in_year=True,
month_number_in_year=True,
year_number=True,
special_days_in_week=[5, 6],
)
mean30 = MeanTransform(in_column="target", window=30)
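To give a feel for what calendar features look like, here is a plain-Python sketch that computes a few of them for a single date. The dictionary keys mirror the DateFlagsTransform arguments above, but the column names and value conventions ETNA actually produces may differ. The Monday=0 weekday numbering is an assumption taken from Python's datetime, and the week-number features are left out.

```python
from datetime import date


def date_flags(day, special_days_in_week=(5, 6)):
    """Calendar features for a date; weekdays are numbered Monday=0 ... Sunday=6."""
    return {
        "day_number_in_week": day.weekday(),
        "day_number_in_month": day.day,
        "month_number_in_year": day.month,
        "year_number": day.year,
        # True when the weekday is one of the listed "special" days (here: weekend).
        "special_days_in_week": day.weekday() in special_days_in_week,
    }


flags = date_flags(date(2019, 11, 30))  # the last day of the dataset, a Saturday
print(flags)
```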
HORIZON = 30
train_ts, test_ts = ts.train_test_split(
train_start="2019-01-01",
train_end="2019-10-31",
test_start="2019-11-01",
test_end="2019-11-30",
)
train_ts.fit_transform([log, trend, lags, d_flags, seg, mean30])
test_ts.info()
<class 'etna.datasets.TSDataset'>
num_segments: 4
num_exogs: 0
num_regressors: 0
num_known_future: 0
freq: D
          start_timestamp end_timestamp  length  num_missing
segments
segment_a      2019-11-01    2019-11-30      30            0
segment_b      2019-11-01    2019-11-30      30            0
segment_c      2019-11-01    2019-11-30      30            0
segment_d      2019-11-01    2019-11-30      30            0
from etna.models.catboost import CatBoostMultiSegmentModel
model = CatBoostMultiSegmentModel()
model.fit(train_ts)
future_ts = train_ts.make_future(HORIZON)
forecast_ts = model.forecast(future_ts)
smape = SMAPE()
smape(y_true=test_ts, y_pred=forecast_ts)
{'segment_c': 11.729007773459314, 'segment_b': 4.210896545479218, 'segment_a': 6.059390208724575, 'segment_d': 4.98784059255331}
train_ts.inverse_transform()
plot_forecast(forecast_ts, test_ts, train_ts, n_train_samples=20)
Let's wrap everything into a pipeline to create an end-to-end model from the previous section.
from etna.pipeline import Pipeline
train_ts, test_ts = ts.train_test_split(
train_start="2019-01-01",
train_end="2019-10-31",
test_start="2019-11-01",
test_end="2019-11-30",
)
We put the model, the transforms and the horizon into a single object, which has an interface similar to that of the model (fit/forecast)
model = Pipeline(
model=CatBoostMultiSegmentModel(),
transforms=[log, trend, lags, d_flags, seg, mean30],
horizon=HORIZON,
)
model.fit(train_ts)
forecast_ts = model.forecast()
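The pattern behind such an object can be sketched in a few lines of plain Python: apply the transforms, fit the model, forecast, then undo the transforms in reverse order. All classes below (`ToyLogTransform`, `LastValueModel`, `TinyPipeline`) are hypothetical stand-ins written for this illustration; they are not ETNA classes and do not reflect how etna.pipeline.Pipeline is implemented.

```python
import math


class ToyLogTransform:
    """Stand-in transform: log(1 + x) forward, exp(y) - 1 backward."""

    def transform(self, values):
        return [math.log1p(v) for v in values]

    def inverse_transform(self, values):
        return [math.expm1(v) for v in values]


class LastValueModel:
    """Stand-in model: repeats the last training value."""

    def fit(self, values):
        self.last_ = values[-1]
        return self

    def forecast(self, horizon):
        return [self.last_] * horizon


class TinyPipeline:
    """Bundles model, transforms and horizon behind fit/forecast."""

    def __init__(self, model, transforms, horizon):
        self.model = model
        self.transforms = transforms
        self.horizon = horizon

    def fit(self, values):
        for transform in self.transforms:
            values = transform.transform(values)
        self.model.fit(values)
        return self

    def forecast(self):
        prediction = self.model.forecast(self.horizon)
        # Undo the transforms in reverse order to return to the original scale.
        for transform in reversed(self.transforms):
            prediction = transform.inverse_transform(prediction)
        return prediction


pipeline = TinyPipeline(LastValueModel(), [ToyLogTransform()], horizon=3)
pipeline.fit([10.0, 20.0, 40.0])
print(pipeline.forecast())  # three values, each approximately 40.0
```

Keeping the inverse step inside forecast is what lets the caller compare predictions with the test data directly, without transforming anything by hand.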
As in the previous section, let's calculate the metrics and plot the forecast
smape = SMAPE()
smape(y_true=test_ts, y_pred=forecast_ts)
{'segment_c': 11.729007773459314, 'segment_b': 4.210896545479218, 'segment_a': 6.059390208724575, 'segment_d': 4.98784059255331}
plot_forecast(forecast_ts, test_ts, train_ts, n_train_samples=20)