This notebook contains simple examples of time series clustering using the ETNA library.
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from etna.datasets import TSDataset
In this notebook we will work with a toy dataset generated from normal distributions. The time series naturally separate into clusters according to their mean value.
def gen_dataset():
    frames = []
    for i in range(1, 5):
        date_range = pd.date_range("2020-01-01", "2020-05-01")
        for j, sigma in enumerate([0.1, 0.3, 0.5, 0.8]):
            tmp = pd.DataFrame({"timestamp": date_range})
            tmp["segment"] = f"{2*i}{j}"
            tmp["target"] = np.random.normal(2 * i, sigma, len(tmp))
            frames.append(tmp)
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    df = pd.concat(frames, ignore_index=True)
    ts = TSDataset(df=df, freq="D")
    return ts
Let's take a look at the dataset
ts = gen_dataset()
ts.df.plot(figsize=(20, 10), legend=False);
As you can see, there are about four clusters here. In practice, however, it is usually not obvious how to separate time series into clusters, so clustering algorithms can help.
from etna.clustering import DTWDistance
from etna.clustering import EuclideanDistance
x1 = pd.Series(
    data=[0, 0, 1, 2, 3, 4, 5, 6, 7, 8],
    index=pd.date_range("2020-01-01", periods=10),
)
x2 = pd.Series(
    data=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    index=pd.date_range("2020-01-02", periods=10),
)
Distance calculation in the case of different timestamps can be performed in two modes:
- trim_series = True: calculate the distance only over the common part of the series. The common part is determined by the timestamp index.

distance = EuclideanDistance(trim_series=True)
distance(x1, x2) # Series are the same in the common part of the timestamps
0.0
- trim_series = False: calculate the distance over the whole series, ignoring the timestamps. This is equivalent to dropping the timestamp index and comparing the values by integer position, as with plain arrays.

distance = EuclideanDistance(trim_series=False)
distance(x1, x2)
3.0
For better understanding, take a look at the visualization below.
_ = plt.subplots(figsize=(15, 10))
img = plt.imread("./assets/clustering/trim_series.jpg")
plt.imshow(img);
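The two modes can also be reproduced by hand. The block below is a minimal sketch, not the library's implementation: `euclidean_sketch` is a hypothetical helper, and it assumes that trimming means intersecting the timestamp indexes and that the untrimmed mode compares values by position (which here requires series of equal length).

```python
import numpy as np
import pandas as pd


def euclidean_sketch(x1: pd.Series, x2: pd.Series, trim_series: bool) -> float:
    """Hypothetical helper mirroring the two modes described above."""
    if trim_series:
        # Keep only the timestamps that are present in both series.
        common = x1.index.intersection(x2.index)
        a, b = x1.loc[common].to_numpy(), x2.loc[common].to_numpy()
    else:
        # Ignore the timestamps and compare values position by position.
        a, b = x1.to_numpy(), x2.to_numpy()
    return float(np.sqrt(np.sum((a - b) ** 2)))


x1 = pd.Series([0, 0, 1, 2, 3, 4, 5, 6, 7, 8], index=pd.date_range("2020-01-01", periods=10))
x2 = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], index=pd.date_range("2020-01-02", periods=10))

print(euclidean_sketch(x1, x2, trim_series=True))   # 0.0
print(euclidean_sketch(x1, x2, trim_series=False))  # 3.0
```

In the untrimmed case nine of the ten positions differ by exactly one, so the distance is the square root of 9.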
Let's compute the different distances with each value of the trim_series parameter to see the difference.
distances = pd.DataFrame()
distances["Euclidean"] = [
    EuclideanDistance(trim_series=True)(x1, x2),
    EuclideanDistance(trim_series=False)(x1, x2),
]
distances["DTW"] = [
    DTWDistance(trim_series=True)(x1, x2),
    DTWDistance(trim_series=False)(x1, x2),
]
distances["trim_series"] = [True, False]
distances.set_index("trim_series", inplace=True)
distances
trim_series | Euclidean | DTW |
---|---|---|
True | 0.0 | 0.0 |
False | 3.0 | 1.0 |
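To see why the untrimmed DTW distance is smaller than the Euclidean one, here is a minimal sketch of classic dynamic time warping. It assumes the absolute difference as the pointwise cost and is written for illustration only; `dtw_sketch` is a hypothetical name, not the ETNA implementation.

```python
def dtw_sketch(a, b):
    """Classic O(n*m) dynamic time warping with |a_i - b_j| as the pointwise cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of the first i points of a with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


a = [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(dtw_sketch(a, b))           # 1.0
print(dtw_sketch(a[1:], b[:-1]))  # 0.0, the common part is identical
```

The warping path can match both leading zeros of the first series to the single zero of the second, so only the final value 9 remains unmatched and contributes a cost of 1.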
Our library provides a class for so-called hierarchical clustering, with two built-in implementations for different distances.
from etna.clustering import EuclideanClustering
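Hierarchical clustering consists of three stages:
1. Building the matrix of pairwise distances between the series (build_distance_matrix)
2. Setting up the clustering algorithm (build_clustering_algo)
3. Fitting the algorithm and predicting the clusters (fit_predict)

In this section you will find a step-by-step description of the clustering process.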
In the first step we need to build the so-called distance matrix, which contains the pairwise distances between the time series in the dataset. Note that this is the most time-consuming part of the clustering process: the number of pairs grows quadratically with the number of series, so building the distance matrix for a large dataset may take a long time.
Distance matrix for the Euclidean distance
model = EuclideanClustering()
model.build_distance_matrix(ts=ts)
sns.heatmap(model.distance_matrix.matrix);
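The cost of this step becomes clear from a minimal sketch of how a pairwise distance matrix can be filled in. The helper names `pairwise_matrix_sketch` and `euclidean` and the toy data are assumptions made for illustration; this is not how the library stores or computes its matrix.

```python
import numpy as np


def euclidean(a, b):
    """Euclidean distance between two equally long arrays."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))


def pairwise_matrix_sketch(series, distance):
    """Fill a symmetric matrix using n * (n - 1) / 2 distance evaluations."""
    n = len(series)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # The distance is symmetric, so each pair is evaluated only once.
            matrix[i, j] = matrix[j, i] = distance(series[i], series[j])
    return matrix


rng = np.random.default_rng(0)
# Four toy series: two around mean 2 and two around mean 8.
series = [rng.normal(loc=mean, scale=0.1, size=50) for mean in (2, 2, 8, 8)]
matrix = pairwise_matrix_sketch(series, euclidean)
print(matrix.round(2))
```

Series with the same mean end up with small mutual distances, which is what produces the block structure visible in the heatmap.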
The distance matrix is computed once and stored in the HierarchicalClustering instance. This makes it possible to change the clustering parameters without recalculating the distance matrix.
After computing the distance matrix, you need to set the parameters of the clustering algorithm, such as the number of clusters (n_clusters) and the linkage criterion (linkage).
As the distance matrix is built only once, you can experiment with different parameters of the clustering algorithm without wasting time on recomputing it.
model = EuclideanClustering()
model.build_distance_matrix(ts)
model.build_clustering_algo(n_clusters=4, linkage="average")
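To illustrate why a precomputed matrix makes such experiments cheap, the block below sketches a naive agglomerative clustering with average linkage that only reads the matrix. It is an illustrative assumption, not the algorithm the library runs internally; `average_linkage_sketch` and the toy points are hypothetical.

```python
import numpy as np


def average_linkage_sketch(matrix, n_clusters):
    """Naive agglomerative clustering on a precomputed distance matrix."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    matrix = np.asarray(matrix, dtype=float)
    clusters = [[i] for i in range(len(matrix))]  # start with one cluster per series
    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                # Average linkage: mean pairwise distance between the members of two clusters.
                d = matrix[np.ix_(clusters[p], clusters[q])].mean()
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge the closest pair
        del clusters[q]
    labels = np.empty(len(matrix), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Toy one-dimensional points; the matrix is computed once and reused below.
points = np.array([0.0, 1.0, 10.0, 11.0, 20.0])
toy_matrix = np.abs(points[:, None] - points[None, :])

labels3 = average_linkage_sketch(toy_matrix, n_clusters=3)
labels2 = average_linkage_sketch(toy_matrix, n_clusters=2)
print(labels3)  # [0 0 1 1 2]
print(labels2)  # [0 0 1 1 1]
```

Both calls reuse the same matrix, so changing the number of clusters costs no additional distance evaluations.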
The final step of the clustering process is cluster prediction. The output of fit_predict is a mapping from segment to cluster.
segment2cluster = model.fit_predict()
segment2cluster
{'20': 1, '21': 1, '22': 1, '23': 1, '40': 2, '41': 2, '42': 2, '43': 2, '60': 0, '61': 0, '62': 0, '63': 0, '80': 3, '81': 3, '82': 3, '83': 3}
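For inspection it is often handier to look at the mapping the other way around. The sketch below inverts the dictionary shown above using only the standard library; the name `cluster2segments` is introduced here for illustration.

```python
from collections import defaultdict

# The mapping printed above, copied verbatim.
segment2cluster = {
    "20": 1, "21": 1, "22": 1, "23": 1,
    "40": 2, "41": 2, "42": 2, "43": 2,
    "60": 0, "61": 0, "62": 0, "63": 0,
    "80": 3, "81": 3, "82": 3, "83": 3,
}

# Group the segment names by their cluster label.
cluster2segments = defaultdict(list)
for segment, cluster in segment2cluster.items():
    cluster2segments[cluster].append(segment)

for cluster in sorted(cluster2segments):
    print(cluster, cluster2segments[cluster])
```

Each cluster contains the four segments generated with the same mean, which confirms that the clustering recovered the structure of the toy dataset. The cluster labels themselves are arbitrary identifiers.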
Let's visualize the results
colores = ["r", "g", "b", "y"]
color = [colores[i] for i in segment2cluster.values()]
ts.df.plot(figsize=(20, 10), color=color, legend=False);
In addition, it is possible to get the cluster centroids, each of which is a typical member of the corresponding cluster.
from etna.analysis import plot_clusters
centroids = model.get_centroids()
centroids.head()
timestamp | cluster 0 (target) | cluster 1 (target) | cluster 2 (target) | cluster 3 (target) |
---|---|---|---|---|
2020-01-01 | 6.179792 | 2.070317 | 4.023297 | 8.044708 |
2020-01-02 | 6.318186 | 2.028775 | 4.045133 | 8.414893 |
2020-01-03 | 5.860081 | 2.140499 | 3.637370 | 7.706840 |
2020-01-04 | 5.519718 | 2.622720 | 3.842983 | 8.045943 |
2020-01-05 | 6.090256 | 2.007586 | 3.557813 | 7.947730 |
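One way to picture a centroid is as the per-timestamp average of the series in a cluster. The sketch below makes that assumption explicit (a mean centroid, which is natural for the Euclidean distance) and uses a small hypothetical wide table `wide` and a two-cluster mapping rather than the dataset above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = pd.date_range("2020-01-01", periods=5)

# Hypothetical wide table: one column per segment, one row per timestamp.
wide = pd.DataFrame(
    {
        "20": rng.normal(2, 0.1, len(index)),
        "21": rng.normal(2, 0.3, len(index)),
        "40": rng.normal(4, 0.1, len(index)),
        "41": rng.normal(4, 0.3, len(index)),
    },
    index=index,
)
segment2cluster = {"20": 0, "21": 0, "40": 1, "41": 1}

# For every cluster, average its member series at each timestamp.
centroids = pd.DataFrame(
    {
        cluster: wide[[s for s, c in segment2cluster.items() if c == cluster]].mean(axis=1)
        for cluster in sorted(set(segment2cluster.values()))
    }
)
print(centroids)
```

The resulting table has one column per cluster and one row per timestamp, the same layout as the centroid table above.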
Finally, we can plot clusters along with centroids for visual assessment
plot_clusters(ts, segment2cluster, centroids)
In addition to the built-in distances, you can implement your own. The example below shows how to implement the Distance interface for a custom distance.
from etna.clustering import Distance
class MyDistance(Distance):
    def __init__(self, trim_series: bool = False):
        super().__init__(trim_series=trim_series)

    def _compute_distance(self, x1: np.ndarray, x2: np.ndarray) -> float:
        """Compute distance between x1 and x2."""
        return np.max(np.abs(x1 - x2))

    def _get_average(self, ts: "TSDataset") -> pd.DataFrame:
        """Get series that minimizes squared distance to given ones according to the distance."""
        centroid = pd.DataFrame(
            {
                "timestamp": ts.index.values,
                "target": ts.df.median(axis=1).values,
            }
        )
        return centroid
distances = pd.DataFrame()
distances["MyDistance"] = [
    MyDistance(trim_series=True)(x1, x2),
    MyDistance(trim_series=False)(x1, x2),
]
distances["trim_series"] = [True, False]
distances.set_index("trim_series", inplace=True)
distances
trim_series | MyDistance |
---|---|
True | 0 |
False | 1 |
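The values in this table can be checked by hand, since the custom distance is the maximum absolute difference between two series (the Chebyshev distance). The sketch below uses plain NumPy arrays and assumes, as in the earlier example, that trimming keeps the nine overlapping timestamps.

```python
import numpy as np

a = np.array([0, 0, 1, 2, 3, 4, 5, 6, 7, 8])  # values of x1
b = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # values of x2

# Untrimmed: compare position by position.
untrimmed = np.max(np.abs(a - b))

# Trimmed: x1 starts one day earlier, so drop its first value and the last value of x2.
trimmed = np.max(np.abs(a[1:] - b[:-1]))

print(trimmed, untrimmed)  # 0 1
```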
To use a custom distance in the clustering algorithm, pass it to the base clustering class HierarchicalClustering:
from etna.clustering import HierarchicalClustering
model = HierarchicalClustering(distance=MyDistance())
model.build_distance_matrix(ts=ts)
sns.heatmap(model.distance_matrix.matrix);
The heatmap above visualizes the distance matrix for our custom distance.