Clustering

This notebook contains simple examples of time series clustering using the ETNA library.

Table of contents

Generating dataset
Distance
Clustering
Advanced: Custom Distance
- Custom Distance implementation
- Custom Distance in clustering

In [1]:

import warnings

warnings.filterwarnings("ignore")

In [2]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from etna.datasets import TSDataset

1. Generating dataset

In this notebook we will work with a toy dataset generated from the normal distribution. The time series are naturally separated into clusters by their mean value.

In [3]:

def gen_dataset():
    # Four groups of segments with means 2, 4, 6, 8 and four noise levels per group.
    date_range = pd.date_range("2020-01-01", "2020-05-01")
    frames = []
    for i in range(1, 5):
        for j, sigma in enumerate([0.1, 0.3, 0.5, 0.8]):
            tmp = pd.DataFrame({"timestamp": date_range})
            tmp["segment"] = f"{2*i}{j}"
            tmp["target"] = np.random.normal(2 * i, sigma, len(tmp))
            frames.append(tmp)
    # DataFrame.append was removed in pandas 2.0, so concatenate once instead.
    df = pd.concat(frames, ignore_index=True)
    ts = TSDataset(df=df, freq="D")
    return ts
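
As a quick sanity check on the claim that the series separate by mean, the sketch below rebuilds the same long-format frame with pandas only (no ETNA required) and compares each segment's mean with the mean it was drawn from. The seeded generator and the helper name `gen_long_frame` are assumptions added for reproducibility; they are not part of the notebook.

```python
import numpy as np
import pandas as pd


def gen_long_frame(seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: same layout as gen_dataset, but seeded and without TSDataset."""
    rng = np.random.default_rng(seed)
    date_range = pd.date_range("2020-01-01", "2020-05-01")
    frames = []
    for i in range(1, 5):
        for j, sigma in enumerate([0.1, 0.3, 0.5, 0.8]):
            frames.append(
                pd.DataFrame(
                    {
                        "timestamp": date_range,
                        "segment": f"{2 * i}{j}",
                        "target": rng.normal(2 * i, sigma, len(date_range)),
                    }
                )
            )
    return pd.concat(frames, ignore_index=True)


long_df = gen_long_frame()
# Mean of every segment; the first character of the segment name is the generating mean.
segment_means = long_df.groupby("segment")["target"].mean()
expected_means = segment_means.index.str[0].astype(int).to_numpy()
max_deviation = float(np.abs(segment_means.to_numpy() - expected_means).max())
print(max_deviation)
```

The deviation stays small compared with the gap of 2 between neighbouring group means, which is why the groups are visually separable.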

Let's take a look at the dataset

In [4]:

ts = gen_dataset()
ts.df.plot(figsize=(20, 10), legend=False);

As you can see, the series form about four clusters here. In practice, however, it is usually not obvious how to separate time series into clusters, so we might want to use clustering algorithms to help us.

2. Distance

To group series into clusters, we first need to define a distance function between them. Distances implement the Distance interface and can be computed for two pd.Series indexed by timestamp.

Our library provides implementations of the Euclidean and DTW Distances.

In [5]:

from etna.clustering import DTWDistance
from etna.clustering import EuclideanDistance

In [6]:

x1 = pd.Series(
    data=[0, 0, 1, 2, 3, 4, 5, 6, 7, 8],
    index=pd.date_range("2020-01-01", periods=10),
)
x2 = pd.Series(
    data=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    index=pd.date_range("2020-01-02", periods=10),
)

Distance calculation in the case of different timestamps can be performed in two modes:

trim_series = True - calculate the distance only over the common part of the series. Common part is defined based on the timestamp index.

In [7]:

distance = EuclideanDistance(trim_series=True)
distance(x1, x2)  # Series are the same in the common part of the timestamps

Out[7]:

0.0

trim_series = False - calculate the distance over the whole series, ignoring the timestamps. This is the same as dropping the timestamp index and comparing the values position by position, as with plain arrays.

In [8]:

distance = EuclideanDistance(trim_series=False)
distance(x1, x2)

Out[8]:

3.0
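
The two modes can be reproduced with pandas and NumPy alone. The sketch below is a stand-in written for illustration, not ETNA's implementation: the function name `euclidean_sketch` is hypothetical, and it assumes that the common part is the set of shared timestamps and that untrimmed series have equal length.

```python
import numpy as np
import pandas as pd

x1 = pd.Series(
    data=[0, 0, 1, 2, 3, 4, 5, 6, 7, 8],
    index=pd.date_range("2020-01-01", periods=10),
)
x2 = pd.Series(
    data=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    index=pd.date_range("2020-01-02", periods=10),
)


def euclidean_sketch(a: pd.Series, b: pd.Series, trim_series: bool) -> float:
    """Hypothetical stand-in for EuclideanDistance, using pandas and NumPy only."""
    if trim_series:
        # Keep only the timestamps present in both series.
        common = a.index.intersection(b.index)
        a, b = a.loc[common], b.loc[common]
    # From here on the timestamps are ignored and values are compared by position.
    diff = a.to_numpy(dtype=float) - b.to_numpy(dtype=float)
    return float(np.sqrt(np.sum(diff**2)))


trimmed = euclidean_sketch(x1, x2, trim_series=True)
untrimmed = euclidean_sketch(x1, x2, trim_series=False)
print(trimmed, untrimmed)  # 0.0 3.0
```

Without trimming, nine of the ten positions differ by exactly 1, so the distance is sqrt(9) = 3.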

For better understanding, take a look at the visualization below.

In [9]:

_ = plt.subplots(figsize=(15, 10))
img = plt.imread("./assets/clustering/trim_series.jpg")
plt.imshow(img);

Let's calculate both Distances with each value of the trim_series parameter to see the difference.

In [10]:

distances = pd.DataFrame()
distances["Euclidean"] = [
    EuclideanDistance(trim_series=True)(x1, x2),
    EuclideanDistance(trim_series=False)(x1, x2),
]
distances["DTW"] = [
    DTWDistance(trim_series=True)(x1, x2),
    DTWDistance(trim_series=False)(x1, x2),
]
distances["trim_series"] = [True, False]
distances.set_index("trim_series", inplace=True)
distances

Out[10]:

	Euclidean	DTW
trim_series
True	0.0	0.0
False	3.0	1.0
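
To see why DTW gives 1.0 where Euclidean gives 3.0, here is a minimal dynamic-programming sketch of classic DTW. It assumes the absolute difference as the pointwise cost and no window constraint; `dtw_sketch` is a hypothetical name, and this is not a copy of DTWDistance.

```python
import numpy as np


def dtw_sketch(a, b) -> float:
    """Classic DTW cost with absolute difference as the pointwise cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] is the cheapest alignment of a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


values_1 = [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]
values_2 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

dtw_whole = dtw_sketch(values_1, values_2)  # whole series, timestamps ignored
dtw_common = dtw_sketch(values_1[1:], values_2[:-1])  # values on the shared timestamps
print(dtw_common, dtw_whole)  # 0.0 1.0
```

DTW may align the two leading zeros of the first series with the single zero of the second, so only the final value 9 is left without an exact match, at a cost of 1.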

3. Clustering

Our library provides a class for hierarchical clustering with two built-in implementations, one for each of the Distances above.

In [11]:

from etna.clustering import EuclideanClustering

Hierarchical clustering consists of three stages:

Building a matrix of pairwise distances between series (build_distance_matrix)
Initializing the clustering algorithm (build_clustering_algo)
Training the clustering algorithm and predicting clusters (fit_predict)

This section describes the clustering process step by step.

3.1 Building Distance Matrix

In the first step we need to build the so-called Distance matrix, which contains the pairwise distances between the time series in the dataset. Note that this is the most time-consuming part of the clustering process: building a Distance matrix for a large dataset may take a long time.

Distance matrix for the Euclidean distance

In [12]:

model = EuclideanClustering()
model.build_distance_matrix(ts=ts)
sns.heatmap(model.distance_matrix.matrix);

The Distance matrix is computed once and saved in the HierarchicalClustering instance. This makes it possible to change the clustering parameters without recalculating the Distance matrix.
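
The cost comes from the number of pairs: n series require n(n-1)/2 distance evaluations. A minimal sketch of such a matrix, assuming a wide DataFrame with one column per segment (the helper `pairwise_matrix` and the toy data are hypothetical, not ETNA's API):

```python
import numpy as np
import pandas as pd


def pairwise_matrix(wide: pd.DataFrame, distance) -> pd.DataFrame:
    """Hypothetical helper: symmetric matrix of distances between the columns of `wide`."""
    segments = list(wide.columns)
    matrix = np.zeros((len(segments), len(segments)))
    for i in range(len(segments)):
        # Only the upper triangle is computed; the matrix is symmetric.
        for j in range(i + 1, len(segments)):
            d = distance(wide[segments[i]].to_numpy(), wide[segments[j]].to_numpy())
            matrix[i, j] = matrix[j, i] = d
    return pd.DataFrame(matrix, index=segments, columns=segments)


def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))


index = pd.date_range("2020-01-01", periods=4)
wide = pd.DataFrame(
    {"a": [0.0] * 4, "b": [3.0] * 4, "c": [0.0, 0.0, 0.0, 4.0]},
    index=index,
)
matrix = pairwise_matrix(wide, euclidean)
print(matrix)
```

Because the result is stored separately from the clustering step, it can be reused with any number of clusters or linkage rule.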

3.2 Building Clustering algorithm

After computing the Distance matrix, you need to set the parameters of the clustering algorithm, such as:

n_clusters - number of clusters
linkage - rule for distance computation for new clusters

As the Distance matrix is built only once, you can experiment with different parameters of the clustering algorithm without wasting time on recomputing it.

In [13]:

model = EuclideanClustering()
model.build_distance_matrix(ts)

In [14]:

model.build_clustering_algo(n_clusters=4, linkage="average")
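
To make the linkage parameter concrete, here is a small NumPy sketch of agglomerative clustering with average linkage on a precomputed distance matrix. It illustrates the general technique only; how ETNA implements this step internally is not shown in this notebook, and the function name `average_linkage` is hypothetical.

```python
import numpy as np


def average_linkage(dist: np.ndarray, n_clusters: int) -> np.ndarray:
    """Merge the two closest clusters until `n_clusters` remain.

    The distance between two clusters is the mean of all pairwise
    distances between their members (average linkage).
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a and drop b.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(dist), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Toy one-dimensional points forming three obvious groups.
points = np.array([0, 1, 10, 11, 20, 21], dtype=float)
dist = np.abs(points[:, None] - points[None, :])
labels = average_linkage(dist, n_clusters=3)
print(labels.tolist())  # [0, 0, 1, 1, 2, 2]
```

Other linkage rules differ only in how the cluster-to-cluster distance is derived from the pairwise distances, for example the minimum (single) or maximum (complete) instead of the mean.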

3.3 Predicting clusters

The final step of the clustering process is cluster prediction. The output of fit_predict is a mapping from segment to cluster.

In [15]:

segment2cluster = model.fit_predict()
segment2cluster

Out[15]:

{'20': 1,
 '21': 1,
 '22': 1,
 '23': 1,
 '40': 2,
 '41': 2,
 '42': 2,
 '43': 2,
 '60': 0,
 '61': 0,
 '62': 0,
 '63': 0,
 '80': 3,
 '81': 3,
 '82': 3,
 '83': 3}
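
The mapping is a plain dictionary, so it can be inverted with the standard library to list the segments in each cluster. The sketch below copies the output above; the name `cluster2segments` is hypothetical, and the cluster numbers themselves are arbitrary labels.

```python
from collections import defaultdict

segment2cluster = {
    "20": 1, "21": 1, "22": 1, "23": 1,
    "40": 2, "41": 2, "42": 2, "43": 2,
    "60": 0, "61": 0, "62": 0, "63": 0,
    "80": 3, "81": 3, "82": 3, "83": 3,
}

# Collect the segments that share a cluster label.
grouped = defaultdict(list)
for segment, cluster in segment2cluster.items():
    grouped[cluster].append(segment)
cluster2segments = dict(sorted(grouped.items()))

for cluster, segments in cluster2segments.items():
    print(cluster, segments)
```

Each cluster contains exactly the four segments generated with the same mean, which matches the way the dataset was built.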

Let's visualize the results

In [16]:

colors = ["r", "g", "b", "y"]
color = [colors[i] for i in segment2cluster.values()]
ts.df.plot(figsize=(20, 10), color=color, legend=False);

3.4 Getting centroids

In addition, it is possible to get the cluster centroids, each of which is a typical member of the corresponding cluster.

In [17]:

from etna.analysis import plot_clusters

In [18]:

centroids = model.get_centroids()
centroids.head()

Out[18]:

cluster	0	1	2	3
feature	target	target	target	target
timestamp
2020-01-01	6.179792	2.070317	4.023297	8.044708
2020-01-02	6.318186	2.028775	4.045133	8.414893
2020-01-03	5.860081	2.140499	3.637370	7.706840
2020-01-04	5.519718	2.622720	3.842983	8.045943
2020-01-05	6.090256	2.007586	3.557813	7.947730
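
One common way to define such a centroid is the pointwise mean of the series in a cluster. The pandas sketch below shows that idea on toy data; whether get_centroids computes exactly this for every Distance is an assumption, and the data and variable names are made up for illustration.

```python
import pandas as pd

index = pd.date_range("2020-01-01", periods=3)
# Toy wide frame: one column per segment.
wide = pd.DataFrame(
    {
        "20": [2.0, 2.0, 2.0],
        "21": [2.0, 4.0, 0.0],
        "40": [4.0, 4.0, 4.0],
        "41": [6.0, 2.0, 4.0],
    },
    index=index,
)
segment2cluster = {"20": 0, "21": 0, "40": 1, "41": 1}

# Group the segments by cluster label and average them at every timestamp.
labels = pd.Series(segment2cluster)
centroids = wide.T.groupby(labels).mean().T
print(centroids)
```

The result has one column per cluster and the same timestamp index as the input, which is the same layout as the table above.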

Finally, we can plot clusters along with centroids for visual assessment

In [19]:

plot_clusters(ts, segment2cluster, centroids)

4. Advanced: Custom Distance

In addition to the built-in Distances, you can implement your own. The example below shows how to implement the Distance interface for a custom Distance.

4.1 Custom Distance implementation

In [20]:

from etna.clustering import Distance


class MyDistance(Distance):
    def __init__(self, trim_series: bool = False):
        super().__init__(trim_series=trim_series)

    def _compute_distance(self, x1: np.ndarray, x2: np.ndarray) -> float:
        """Compute distance between x1 and x2."""
        return np.max(np.abs(x1 - x2))

    def _get_average(self, ts: "TSDataset") -> pd.DataFrame:
        """Get series that minimizes squared distance to given ones according to the distance."""
        centroid = pd.DataFrame(
            {
                "timestamp": ts.index.values,
                "target": ts.df.median(axis=1).values,
            }
        )
        return centroid

In [21]:

distances = pd.DataFrame()
distances["MyDistance"] = [
    MyDistance(trim_series=True)(x1, x2),
    MyDistance(trim_series=False)(x1, x2),
]
distances["trim_series"] = [True, False]
distances.set_index("trim_series", inplace=True)
distances

Out[21]:

	MyDistance
trim_series
True	0
False	1
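
MyDistance returns the largest absolute pointwise difference, also known as the Chebyshev distance. The numbers above can be checked with NumPy alone; the sketch assumes, as in the Euclidean example, that trimming keeps the values on the shared timestamps, and the function name `chebyshev` is hypothetical.

```python
import numpy as np

values_1 = np.array([0, 0, 1, 2, 3, 4, 5, 6, 7, 8])
values_2 = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


def chebyshev(a: np.ndarray, b: np.ndarray) -> float:
    """Largest absolute pointwise difference between two equal-length arrays."""
    return float(np.max(np.abs(a - b)))


# x1 starts one day earlier than x2, so the shared timestamps are x1[1:] and x2[:-1].
common = chebyshev(values_1[1:], values_2[:-1])
whole = chebyshev(values_1, values_2)
print(common, whole)  # 0.0 1.0
```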

4.2 Custom Distance in clustering

To use a custom Distance in the clustering algorithm, pass it to the clustering base class HierarchicalClustering.

In [22]:

from etna.clustering import HierarchicalClustering

In [23]:

model = HierarchicalClustering(distance=MyDistance())
model.build_distance_matrix(ts=ts)
sns.heatmap(model.distance_matrix.matrix);

We can see a visualization for our custom distance.
