```python
!pip install -q datasets gluonts orjson
```
GluonTS comes with a pandas `DataFrame`-based dataset, so our strategy will be to read the CSV file and process it as a `PandasDataset`. We will then iterate over it and convert it to a 🤗 dataset with the appropriate schema for time series. So let's get started!
## PandasDataset

Suppose we are given multiple (10) time series stacked on top of each other in a dataframe with an `item_id` column that distinguishes the different series:
```python
import pandas as pd

url = (
    "https://gist.githubusercontent.com/rsnirwan/a8b424085c9f44ef2598da74ce43e7a3"
    "/raw/b6fdef21fe1f654787fa0493846c546b7f9c4df2/ts_long.csv"
)
df = pd.read_csv(url, index_col=0, parse_dates=True)
df.head()
```
| | target | item_id |
|---|---|---|
| 2021-01-01 00:00:00 | -1.3378 | A |
| 2021-01-01 01:00:00 | -1.6111 | A |
| 2021-01-01 02:00:00 | -1.9259 | A |
| 2021-01-01 03:00:00 | -1.9184 | A |
| 2021-01-01 04:00:00 | -1.9168 | A |
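If you would rather work offline, an equivalent long-format frame can be sketched locally. This is only a stand-in: the column names and hourly frequency mirror the CSV above, but the values are random and the series here are shorter (24 points each) than in the real file.

```python
import numpy as np
import pandas as pd

# ten hourly series stacked long, indexed by timestamp, with an
# `item_id` column distinguishing them (random stand-in data)
rng = np.random.default_rng(0)
frames = []
for item in list("ABCDEFGHIJ"):
    index = pd.date_range("2021-01-01", periods=24, freq="H")
    frames.append(
        pd.DataFrame(
            {"target": rng.normal(size=len(index)), "item_id": item},
            index=index,
        )
    )
df = pd.concat(frames)
```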
After reading the data into a `pd.DataFrame`, we can convert it into GluonTS's `PandasDataset`:
```python
from gluonts.dataset.pandas import PandasDataset

ds = PandasDataset.from_long_dataframe(df, target="target", item_id="item_id")
```
From here we have to map each entry's `start` field to a timestamp instead of a `pd.Period`. While we are at it, we also assign each series an incremental `feat_static_cat` id. We do this by defining the following class:
```python
class ProcessStartField:
    """Convert each entry's `start` from a pd.Period to a pd.Timestamp
    and attach an incremental `feat_static_cat` id."""

    ts_id = 0

    def __call__(self, data):
        data["start"] = data["start"].to_timestamp()
        data["feat_static_cat"] = [self.ts_id]
        self.ts_id += 1
        return data
```
```python
from gluonts.itertools import Map

process_start = ProcessStartField()
list_ds = list(Map(process_start, ds))
```
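The key conversion inside the class is `pd.Period.to_timestamp()`; in isolation it behaves like this (the concrete period value here is just an illustration):

```python
import pandas as pd

# GluonTS represents `start` as a pd.Period; the Arrow backend of
# 🤗 datasets expects a plain timestamp, hence the conversion
period = pd.Period("2021-01-01 00:00", freq="H")
stamp = period.to_timestamp()
print(type(stamp).__name__)  # Timestamp
```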
Next we need to define our schema features and create our dataset from this list via the `from_list` function:
```python
from datasets import Dataset, Features, Value, Sequence

features = Features(
    {
        "start": Value("timestamp[s]"),
        "target": Sequence(Value("float32")),
        "feat_static_cat": Sequence(Value("uint64")),
        # "feat_static_real": Sequence(Value("float32")),
        # "feat_dynamic_real": Sequence(Sequence(Value("uint64"))),
        # "feat_dynamic_cat": Sequence(Sequence(Value("uint64"))),
        "item_id": Value("string"),
    }
)

dataset = Dataset.from_list(list_ds, features=features)
```
We can now use this strategy to share the dataset on the 🤗 Hub.