This tutorial summarizes the findings of the Matrix Profile V paper and replicates the presented case studies.
Time series motifs, as described in the STUMPY Basics tutorial, are approximately repeated subsequences within a time series. While the concept of motif discovery is crucial to many fields and practical applications, the process of motif discovery isn't usually as clean in the real world as we would like it to be. In most practical uses of motif discovery, we find some motifs to be more desirable than others.
Consider, for instance, the Amazon Customer Reviews Dataset which contains several million customer reviews from Amazon's website. If we were to combine all of the textual reviews to reveal the most used words on the platform, we would, unsurprisingly, find the word "Amazon" to be around the top of the list, probably following a few other more commonly used words, such as "the", "and" and "is". While this result is obvious, it is certainly not useful. In order to produce more insightful results, we would need to filter the results to exclude less desirable, or "stop words", and make way for the more desirable ones.
When it comes to motif discovery, we use a similar approach. While we can't "filter" a time series, we can favor the identification of more interesting motifs and penalize the less interesting ones. This tutorial will provide a brief introduction to Annotation Vectors and explore their roles in Guided Motif Search.
Let’s import the packages that we’ll need to load, analyze, and plot the data.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from stumpy import stump
plt.style.use('stumpy.mplstyle')
An annotation vector contains real-valued numbers between 0 and 1 and can be used to distinguish between the important/unimportant subsequences within your time series (with 0 indicating undesirable and 1 indicating desirable). Annotation vectors for a time series can be combined with the matrix profile to formulate the Corrected Matrix Profile, which we can use to identify our desirable motifs. The corrected matrix profile can be acquired using the following formula:
$$ \begin{align*} CMP[i] &= MP[i] + (1 - AV[i]) \times max(MP) \end{align*} $$$CMP$ is the corrected matrix profile, $MP$ is the original matrix profile, $AV$ is the annotation vector and $max(MP)$ is the maximum value of the original matrix profile. Essentially, this formula transforms the original matrix profile by shifting the undesirable distances towards the maximum allowable matrix profile value, $max(MP)$ and thereby eliminating those corresponding subsequences as potential motifs.
In order to demonstrate guided motif search, we can construct an annotation vector based on desirable and undesirable time series subsequences. In many cases, the undesirability of some subsquences is absolute. In other words, we want to completely disallow some regions in the time series from being considered in our motif search. Thus, our annotation vector will be 1.0 at desirable regions and 0.0 during undesirable regions. There will be no in-between.
Consider the fNIRS brain imaging dataset produced by the UCLA Medical Center. fNIRS, or functional near-infrared specroscopy is an optical imaging technique that is used to measure Hemoglobin concentrations throughout the brain.
df = pd.read_csv('https://zenodo.org/record/5045218/files/hemoglobin.csv?download=1')
df = df[6000:14000]
df = df.reset_index(drop=True)
df.head()
Hemoglobin Concentration | Sensor Acceleration | |
---|---|---|
0 | 4578.0 | 7861.0 |
1 | 4579.0 | 8008.0 |
2 | 4580.0 | 7959.0 |
3 | 4581.0 | 7959.0 |
4 | 4582.0 | 7959.0 |
The dataset contains two columns: the Hemoglobin concentration and the sensor acceleration data, the latter of which we will get to later. We can visualize the Hemoglobin concentration time series first.
plt.plot(df['Hemoglobin Concentration'], color='lime', label='Hemoglobin Concentration')
plt.xlabel('Time')
plt.ylabel('690 nm intensity')
plt.title('Hemoglobin Concentration')
plt.show()
We can use the stump
function from STUMPY to compute the matrix profile of this time series and identify the nearest neighbor motifs.
def get_motif_data(data, matrix_profile, m):
motif_index = np.argmin(matrix_profile[:, 0])
motif_x = np.arange(motif_index, motif_index + m)
motif_y = data[motif_index:motif_index + m]
motif = (motif_index, motif_x, motif_y)
neighbor_index = matrix_profile[motif_index, 1]
neighbor_x = np.arange(neighbor_index, neighbor_index + m)
neighbor_y = data[neighbor_index:neighbor_index + m]
neighbor = (neighbor_index, neighbor_x, neighbor_y)
return motif, neighbor
m = 600
matrix_profile = stump(df['Hemoglobin Concentration'], m)
motif, neighbor = get_motif_data(df['Hemoglobin Concentration'], matrix_profile, m)
print(f'Minimum Z-Normalized Euclidean Distance Between 2 Subsequences: {matrix_profile[motif[0]][0]}')
print(f'Starting Indices of Motifs: {motif[0]}, {neighbor[0]}')
Minimum Z-Normalized Euclidean Distance Between 2 Subsequences: 4.1063461123005505 Starting Indices of Motifs: 3995, 5995
Now that we know the locations of the motifs, we can visualize them.
fig, axis = plt.subplots(2, sharex=True)
plt.suptitle('Hemoglobin Concentration Time Series Motif Discovery With Matrix Profile', fontsize='15')
axis[0].plot(df['Hemoglobin Concentration'], color='lime', label='Hemoglobin Concentration')
axis[0].plot(motif[1], motif[2], color='red', linewidth=6, alpha=0.5, label='Motif')
axis[0].plot(neighbor[1], neighbor[2], color='mediumblue', linewidth=6, alpha=0.5, label='Nearest Neighbour')
axis[0].legend()
axis[0].axvline(x=motif[0], color='grey', alpha=0.5, linestyle='--')
axis[0].axvline(x=neighbor[0], color='grey', alpha=0.5, linestyle='--')
axis[0].set_ylabel('690 nm intensity')
axis[0].set_title('Hemoglobin Concentration')
axis[1].plot(matrix_profile[:, 0], color='darkorange', label='Matrix Profile')
axis[1].axvline(x=motif[0], color='grey', alpha=0.5, linestyle='--')
axis[1].axvline(x=neighbor[0], color='grey', alpha=0.5, linestyle='--')
axis[1].set_xlabel('Time')
axis[1].set_ylabel('Z-Normalized Euclidean Distance')
axis[1].set_title('Matrix Profile')
plt.show()
Without further context, the results look like exactly what we want. The two subsequences identified are, in fact, the closest matches.
However, it turns out that the nearest neighbor motifs identified belong to a section that has no medical significance. What the brain imaging sensor was picking up during this interval was simply the variation in infrared radiation caused by the physical movement of the test subject and not by the Hemoglobin concentrations of the brain. This can be verified by visualizing the sensor acceleration data which reveals rapid movement between the 4000 and 7000 marks.
fig, axis = plt.subplots(2, sharex=True)
plt.suptitle('Sensor Acceleration Variation With fNIRS Data', fontsize='15')
axis[0].plot(df['Hemoglobin Concentration'], color='lime', label='Hemoglobin Concentration')
axis[0].set_ylabel('690 nm intensity')
axis[0].set_title('Hemoglobin Concentration')
axis[1].plot(df['Sensor Acceleration'], color='mediumturquoise', label='Sensor Acceleration')
axis[1].set_xlabel('Time')
axis[1].set_title('Sensor Acceleration')
plt.show()
So, given this data, how do we rule out these intervals? How do we produce an annotation vector from this time series?
We do this by monitoring the variations in the sensor acceleration. We start by computing the sliding standard deviation of the sensor acceleration data. This gives us the standard deviation vector of the subsequences of this time series. We then compute the overall mean of the standard deviation vector and compare the sliding standard deviation to the mean at different parts of the time series.
If at any point in the time series, the local standard deviation is greater than the mean of the standard deviation vector, this indicates a region of physical movement, and so the annotation vector of this region should be 0. Similarly, if the local standard deviation is less than the mean of the standard deviation vector, this indicates a region that's possibly free from physical movement, so we annotate this region to a 1 in the annotation vector.
acc_stdev = pd.Series(df['Sensor Acceleration']).rolling(m).std()
acc_stdev = np.array(acc_stdev[m - 1:])
acc_stdev_mean = np.mean(acc_stdev)
acc_stdev_mean_series = np.full(np.shape(acc_stdev)[0], acc_stdev_mean)
annotation_vector = (acc_stdev < acc_stdev_mean).astype(np.uint8)
fig, axis = plt.subplots(3, sharex=True)
plt.suptitle('Annotation Vector Generation From Sensor Acceleration Data', fontsize='15')
axis[0].plot(df['Sensor Acceleration'], color='mediumturquoise', label='Sensor Acceleration')
axis[0].set_title('Sensor Acceleration')
axis[1].plot(acc_stdev, color='orchid', label='Sliding Standard Deviation')
axis[1].plot(acc_stdev_mean_series, color='lightsteelblue', label='Mean of Sliding Standard Deviation')
axis[1].legend()
axis[1].set_title('Sensor Acceleration Sliding Standard Deviation')
axis[2].plot(annotation_vector, color='slateblue', label='Annotation Vector')
axis[2].set_xlabel('Time')
axis[2].set_title('Annotation Vector')
plt.show()
Now we have an annotation vector that we can use to rule out the regions of physical movement. We can use the formula introduced earlier to compute the corrected matrix profile.
def get_corrected_matrix_profile(matrix_profile, annotation_vector):
corrected_matrix_profile = matrix_profile.copy()
corrected_matrix_profile[:, 0] = matrix_profile[:, 0] + ((1 - annotation_vector) * np.max(matrix_profile[:, 0]))
return corrected_matrix_profile
corrected_matrix_profile = get_corrected_matrix_profile(matrix_profile, annotation_vector)
motif, neighbor = get_motif_data(df['Hemoglobin Concentration'], corrected_matrix_profile, m)
fig, axis = plt.subplots(2, sharex=True)
plt.suptitle('Matrix Profile Correction Using Annotation Vector', fontsize='15')
axis[0].plot(matrix_profile[:, 0], color='darkorange', linewidth=5, alpha=0.3, label='Matrix Profile')
axis[0].plot(corrected_matrix_profile[:, 0], color='mediumvioletred', label='Corrected Matrix Profile')
axis[0].set_ylabel('Z-Normalized Euclidean Distance')
axis[0].set_title('Corrected & Original Matrix Profile')
axis[0].legend(loc ='upper left')
axis[1].plot(annotation_vector, color='slateblue', label='Annotation Vector')
axis[1].set_xlabel('Time')
axis[1].set_title('Annotation Vector')
plt.show()
As we can see, the corrected matrix profile is elevated around the region of physical movement. This is done to penalize this region. The lowest point in the matrix profile is no longer in the interval where the test subject physically moves. Now we can test the performance of this corrected matrix profile in locating our desirable motifs.
fig, axis = plt.subplots(2, sharex=True)
plt.suptitle('Hemoglobin Concentration Time Series Motif Discovery With Corrected Matrix Profile', fontsize='15')
axis[0].plot(df['Hemoglobin Concentration'], color='lime', label='Hemoglobin Concentration')
axis[0].plot(motif[1], motif[2], color='red', linewidth=6, alpha=0.5, label='Motif')
axis[0].plot(neighbor[1], neighbor[2], color='mediumblue', linewidth=6, alpha=0.5, label='Nearest Neighbour')
axis[0].legend()
axis[0].axvline(x=motif[0], color='grey', alpha=0.5, linestyle='--')
axis[0].axvline(x=neighbor[0], color='grey', alpha=0.5, linestyle='--')
axis[0].set_ylabel('690 nm intensity')
axis[0].set_title('Hemoglobin Concentration')
axis[1].plot(corrected_matrix_profile[:, 0], color='mediumvioletred', label='Matrix Profile')
axis[1].axvline(x=motif[0], color='grey', alpha=0.5, linestyle='--')
axis[1].axvline(x=neighbor[0], color='grey', alpha=0.5, linestyle='--')
axis[1].set_xlabel('Time')
axis[1].set_ylabel('Z-Normalized Euclidean Distance')
axis[1].set_title('Corrected Matrix Profile')
plt.show()
As we can see, the nearest neighbor pairs have moved towards the left near the smaller peaks that correspond to more desirable data. We have successfully excluded the region of physical movement from our motif search.
In this example we have produced an annotation vector that completely neglects our undesirable region. There are cases, however, when we don't want to completely remove a region from the motif search, but merely "discourage" it. The next example explores this case.
Consider the EOG dataset containing electric potential data of the left-eye, sampled at 50 Hz. EOG, or Electrooculography, is a technique for measring the potential that exists between the front and back of the human eye. We can download and visualize the data to better understand its trend.
df = pd.read_csv('https://zenodo.org/record/5045252/files/eog.csv?download=1')
df.head()
Electric Potential | |
---|---|
0 | -0.980 |
1 | 2.941 |
2 | 2.941 |
3 | 2.941 |
4 | 2.941 |
plt.plot(df['Electric Potential'], color='dodgerblue', label='Electric Potential')
plt.xlabel('Time')
plt.ylabel('Voltage')
plt.title('Electric Potential')
plt.show()
You should notice that some points in the plot have flat tops or bottoms. Medically, this is not reflecting reality. The flat tops and bottoms are simply clipping regions, since the measured value exceeded the measurement range of a floating point number. We can locate these regions for visualization.
max_val = np.max(df['Electric Potential'])
min_val = np.min(df['Electric Potential'])
flat_values = np.zeros(len(df['Electric Potential']))
for i in range(len(df['Electric Potential'])):
if df['Electric Potential'][i] == max_val or df['Electric Potential'][i] == min_val:
flat_values[i] = df['Electric Potential'][i]
else:
flat_values[i] = np.nan
plt.plot(df['Electric Potential'], color='dodgerblue', label='Electric Potential')
plt.plot(flat_values, color='firebrick', alpha=0.5, linewidth=8, label='Clipping Regions')
plt.legend(loc ='upper right')
plt.xlabel('Time')
plt.ylabel('Voltage')
plt.title('Electric Potential')
plt.show()
We can compute the matrix profile and locate the best motif.
m = 450
matrix_profile = stump(df['Electric Potential'], m)
motif, neighbor = get_motif_data(df['Electric Potential'], matrix_profile, m)
fig, axis = plt.subplots(2, sharex=True)
plt.suptitle('Electric Potential Time Series Motif Discovery With Matrix Profile', fontsize='15')
axis[0].plot(df['Electric Potential'], color='dodgerblue', label='Electric Potential')
axis[0].plot(motif[1], motif[2], color='red', linewidth=6, alpha=0.5, label='Motif')
axis[0].plot(neighbor[1], neighbor[2], color='mediumblue', linewidth=6, alpha=0.5, label='Nearest Neighbour')
axis[0].legend(loc='upper right')
axis[0].axvline(x=motif[0], color='grey', alpha=0.5, linestyle='--')
axis[0].axvline(x=neighbor[0], color='grey', alpha=0.5, linestyle='--')
axis[0].set_ylabel('Voltage')
axis[0].set_title('Electric Potential')
axis[1].plot(matrix_profile[:, 0], color='darkorange', label='Matrix Profile')
axis[1].axvline(x=motif[0], color='grey', alpha=0.5, linestyle='--')
axis[1].axvline(x=neighbor[0], color='grey', alpha=0.5, linestyle='--')
axis[1].set_xlabel('Time')
axis[1].set_ylabel('Z-Normalized Euclidean Distance')
axis[1].set_title('Matrix Profile')
plt.show()
Unsurprisingly, the nearest neighbors contain the biggest flat top of the time series. This makes sense, since the distance between two flat tops is very low.
Of course, because the flat tops or bottoms are areas where the actual value of the electric potential is unclear, we want to make this a less desirable motif. But the difference between the previous case with the Hemoglobin concentrations data and this, is that in this case we don't want to completely dismiss the clipping regions. Rather, we want to slightly penalize them.
We do this by constructing an annotation vector that is real-valued, as opposed to the previous example where it could only be 0 or 1. For every subsequence, we count the number of values that are equal to the global minimum or maximum. The more that number is, the lower we set the annotation vector value for that subsequence, and vice versa.
annotation_vector = np.zeros(len(df['Electric Potential']) - m + 1)
for i in range(len(df['Electric Potential']) - m + 1):
window = df['Electric Potential'][i:i + m]
annotation_vector[i] = np.count_nonzero(window == max_val) + np.count_nonzero(window == min_val)
annotation_vector = annotation_vector - np.min(annotation_vector)
annotation_vector = annotation_vector / np.max(annotation_vector)
annotation_vector = 1 - annotation_vector
fig, axis = plt.subplots(2, sharex=True)
plt.suptitle('Electric Potential & Annotation Vector', fontsize='15')
axis[0].plot(df['Electric Potential'], color='dodgerblue', label='Electric Potential')
axis[0].set_ylabel('Voltage')
axis[0].set_title('Electric Potential')
axis[1].plot(annotation_vector, color='slateblue', label='Annotation Vector')
axis[1].set_xlabel('Time')
axis[1].set_title('Annotation Vector')
plt.show()
Now we have an annotation vector that is lower at subsequences that contain flat tops or bottoms, and higher where there are no flat tops or bottoms. In contrast to the annotation vector generated for the Hemoglobin concentrations example, this annotation vector can take any real-valued number between 0 and 1. We use this to generate the corrected matrix profile.
corrected_matrix_profile = get_corrected_matrix_profile(matrix_profile, annotation_vector)
motif, neighbor = get_motif_data(df['Electric Potential'], corrected_matrix_profile, m)
fig, axis = plt.subplots(2, sharex=True)
plt.suptitle('Matrix Profile Correction Using Annotation Vector', fontsize='15')
axis[0].plot(matrix_profile[:, 0], color='darkorange', linewidth=5, alpha=0.3, label='Matrix Profile')
axis[0].plot(corrected_matrix_profile[:, 0], color='mediumvioletred', label='Corrected Matrix Profile')
axis[0].set_ylabel('Z-Normalized Euclidean Distance')
axis[0].set_title('Corrected & Original Matrix Profile')
axis[0].legend(loc ='upper left')
axis[1].plot(annotation_vector, color='slateblue', label='Annotation Vector')
axis[1].set_xlabel('Time')
axis[1].set_title('Annotation Vector')
plt.show()
Clearly, the subsequences with the lowest matrix profile value has changed. Now we compute the motif again.
fig, axis = plt.subplots(2, sharex=True)
plt.suptitle('Electric Potential Time Series Motif Discovery With Corrected Matrix Profile', fontsize='15')
axis[0].plot(df['Electric Potential'], color='dodgerblue', label='Electric Potential')
axis[0].plot(motif[1], motif[2], color='red', linewidth=6, alpha=0.5, label='Motif')
axis[0].plot(neighbor[1], neighbor[2], color='mediumblue', linewidth=6, alpha=0.5, label='Nearest Neighbour')
axis[0].legend()
axis[0].axvline(x=motif[0], color='grey', alpha=0.5, linestyle='--')
axis[0].axvline(x=neighbor[0], color='grey', alpha=0.5, linestyle='--')
axis[0].set_ylabel('Voltage')
axis[0].set_title('Electric Potential')
axis[1].plot(corrected_matrix_profile[:, 0], color='mediumvioletred', label='Corrected Matrix Profile')
axis[1].axvline(x=motif[0], color='grey', alpha=0.5, linestyle='--')
axis[1].axvline(x=neighbor[0], color='grey', alpha=0.5, linestyle='--')
axis[1].set_xlabel('Time')
axis[1].set_ylabel('Z-Normalized Euclidean Distance')
axis[1].set_title('Corrected Matrix Profile')
plt.show()
As we can see, the new motifs are no longer at flat top or bottom regions. Once again, we've redirected the algorithm to find the more interesting motif.
And that's it! You've now learnt the basics of guided motif search and annotation vectors, and the use of annotation vectors in solving simple cases of time series motif search biases.