```python
!pip install -q datasets gluonts orjson
```
GluonTS comes with a pandas `DataFrame`-based dataset, so our strategy will be to read the CSV file and process it as a `PandasDataset`. We will then iterate over it and convert it to a 🤗 dataset with the appropriate schema for time series. So let's get started!
## PandasDataset

Suppose we are given multiple (10) time series stacked on top of each other in a dataframe with an `item_id` column that distinguishes the different series:
```python
import pandas as pd

url = (
    "https://gist.githubusercontent.com/rsnirwan/a8b424085c9f44ef2598da74ce43e7a3"
    "/raw/b6fdef21fe1f654787fa0493846c546b7f9c4df2/ts_long.csv"
)
df = pd.read_csv(url, index_col=0, parse_dates=True)
df.head()
```
| | target | item_id |
|---|---|---|
| 2021-01-01 00:00:00 | -1.3378 | A |
| 2021-01-01 01:00:00 | -1.6111 | A |
| 2021-01-01 02:00:00 | -1.9259 | A |
| 2021-01-01 03:00:00 | -1.9184 | A |
| 2021-01-01 04:00:00 | -1.9168 | A |
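If you would rather work offline, an equivalent long-format frame can be sketched locally. This is only a stand-in: the column names and hourly frequency mirror the CSV above, but the values are random and the series here are shorter (24 points each) than in the real file.

```python
import numpy as np
import pandas as pd

# ten hourly series stacked long, indexed by timestamp, with an
# `item_id` column distinguishing them (random stand-in data)
rng = np.random.default_rng(0)
frames = []
for item in list("ABCDEFGHIJ"):
    index = pd.date_range("2021-01-01", periods=24, freq="H")
    frames.append(
        pd.DataFrame(
            {"target": rng.normal(size=len(index)), "item_id": item},
            index=index,
        )
    )
df = pd.concat(frames)
```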
After reading the data into a `pd.DataFrame`, we can convert it into GluonTS's `PandasDataset`:
```python
from gluonts.dataset.pandas import PandasDataset

ds = PandasDataset.from_long_dataframe(df, target="target", item_id="item_id")
```
From here we have to map each entry's `start` field to a timestamp instead of a `pd.Period`. While we are at it, we also assign each series an incremental `feat_static_cat` id. We do this by defining the following class:
```python
class ProcessStartField:
    """Convert each entry's `start` from a pd.Period to a pd.Timestamp
    and attach an incremental `feat_static_cat` id."""

    ts_id = 0

    def __call__(self, data):
        data["start"] = data["start"].to_timestamp()
        data["feat_static_cat"] = [self.ts_id]
        self.ts_id += 1
        return data
```
```python
from gluonts.itertools import Map

process_start = ProcessStartField()
list_ds = list(Map(process_start, ds))
```
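The key conversion inside the class is `pd.Period.to_timestamp()`; in isolation it behaves like this (the concrete period value here is just an illustration):

```python
import pandas as pd

# GluonTS represents `start` as a pd.Period; the Arrow backend of
# 🤗 datasets expects a plain timestamp, hence the conversion
period = pd.Period("2021-01-01 00:00", freq="H")
stamp = period.to_timestamp()
print(type(stamp).__name__)  # Timestamp
```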
Next we need to define our schema features and create our dataset from this list via the `from_list` function:
```python
from datasets import Dataset, Features, Value, Sequence

features = Features(
    {
        "start": Value("timestamp[s]"),
        "target": Sequence(Value("float32")),
        "feat_static_cat": Sequence(Value("uint64")),
        # "feat_static_real": Sequence(Value("float32")),
        # "feat_dynamic_real": Sequence(Sequence(Value("uint64"))),
        # "feat_dynamic_cat": Sequence(Sequence(Value("uint64"))),
        "item_id": Value("string"),
    }
)

dataset = Dataset.from_list(list_ds, features=features)
```
We can now use this strategy to share the dataset on the 🤗 Hub.