🚩 Create a free WhyLabs account to get more value out of whylogs!
Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!
High dimensional embedding spaces can be difficult to understand because we often rely on our own subjective judgement of clusters in the space. Often, data scientists try to find issues solely by hovering over individual data points and noting trends in which ones feel out of place.
In whylogs, you can profile embedding values by comparing them to reference data points. These references can be completely determined by users (helpful when they represent prototypical "ideal" representations of a cluster or scenario) or chosen programmatically.
For convenience, we include helper functions to select reference data points for comparing new embedding vectors against. To follow this notebook in full, install the embeddings extra (for the helper functions) and the viz extra (for visualizing drift) when installing whylogs.
# Note: you may need to restart the kernel to use updated packages.
%pip install --upgrade whylogs[embeddings,viz]
import os
import pickle
from sklearn.datasets import fetch_openml
if os.path.exists("mnist_784_X_y.pkl"):
    X, y = pickle.load(open("mnist_784_X_y.pkl", 'rb'))
else:
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    pickle.dump((X, y), open("mnist_784_X_y.pkl", 'wb'))  # cache the download for future runs
Instead of training a model, we'll use scikit-learn's train/test split functionality to divide our dataset into an original training dataset and the data we'll see on our first day of production.
from sklearn.model_selection import train_test_split
X_train, X_prod, y_train, y_prod = train_test_split(X, y, test_size=0.1)
We would like to compare incoming embeddings against up to 30 predefined references. These can be chosen by the user either manually or algorithmically. Both of the reference selection algorithms provided operate on the raw data, but only for the purpose of finding the references themselves.
If we had prototypical examples of digits that we wanted to compare our incoming data against, we would collect those data points now.
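For example, if we had one hand-picked "prototypical" image per digit, the references would simply be an array of vectors plus a matching list of labels. A minimal sketch, using the first occurrence of each digit in the training set as a stand-in for a hand-picked example (this selection rule is illustrative, not part of the notebook):
import numpy as np
# Illustrative manual selection: take the first training image of each digit as its reference.
manual_labels = [str(digit) for digit in range(10)]
manual_references = np.stack([X_train[y_train == label][0] for label in manual_labels])
print(manual_references.shape)  # (10, 784): one 784-dimensional reference vector per digit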
If we have labels for our data, selecting the centroid of the cluster for each label makes sense. We provide a helper class, PCACentroidsSelector, that finds the centroids in PCA space before converting back to the raw 784-dimensional space.
Let's utilize the labels available in the dataset for determining our references.
from whylogs.experimental.preprocess.embeddings.selectors import PCACentroidsSelector
references, labels = PCACentroidsSelector(n_components=20).calculate_references(X_train, y_train)
If we don't have labels for our data, we can instead choose centroids found by unsupervised clustering. We provide a helper class, PCAKMeansSelector, that finds k-means centroids in PCA space before converting back to the raw space.
We'll also calculate these but will elect to use the supervised version for the rest of the notebook.
from whylogs.experimental.preprocess.embeddings.selectors import PCAKMeansSelector
unsup_references, unsup_labels = PCAKMeansSelector(n_clusters=8, n_components=20).calculate_references(X_train, y_train)
As with other advanced features, we can create a DeclarativeSchema to tell whylogs to resolve columns of a certain name to the EmbeddingMetric that we want to use.
We must pass our references, labels, and preferred distance function (either cosine or Euclidean distance) as parameters to EmbeddingConfig, then log as normal.
import whylogs as why
from whylogs.core.resolvers import MetricSpec, ResolverSpec
from whylogs.core.schema import DeclarativeSchema
from whylogs.experimental.extras.embedding_metric import (
    DistanceFunction,
    EmbeddingConfig,
    EmbeddingMetric,
)
config = EmbeddingConfig(
    references=references,
    labels=labels,
    distance_fn=DistanceFunction.euclidean,
)
schema = DeclarativeSchema([ResolverSpec(column_name="pixel_values", metrics=[MetricSpec(EmbeddingMetric, config)])])
train_profile = why.log(row={"pixel_values": X_train}, schema=schema)
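If cosine distance suits your embeddings better, only the distance function changes. A minimal sketch of the alternative configuration (assuming the DistanceFunction enum in your whylogs version exposes a cosine member, as the cosine/Euclidean choice above implies):
# Sketch: same schema construction as above, but with cosine distance instead of Euclidean.
cosine_config = EmbeddingConfig(
    references=references,
    labels=labels,
    distance_fn=DistanceFunction.cosine,  # assumed enum member; swap back to euclidean if unavailable
)
cosine_schema = DeclarativeSchema([ResolverSpec(column_name="pixel_values", metrics=[MetricSpec(EmbeddingMetric, cosine_config)])])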
Let's confirm that our profile measures the distribution of embedding distances relative to the references we've provided.
train_profile_view = train_profile.view()
column = train_profile_view.get_column("pixel_values")
summary = column.to_summary_dict()
for digit in [str(i) for i in range(10)]:
    mean = summary[f'embedding/{digit}_distance:distribution/mean']
    stddev = summary[f'embedding/{digit}_distance:distribution/stddev']
    print(f"{digit} distance: mean {mean} stddev {stddev}")
This distance-based approach can be powerful for measuring drift across new batches of embeddings programmatically, using drift metrics as well as the WhyLabs Observability Platform.
We'll look at a single example where an engineer introduces a change to reduce the amount of unnecessary processing by filtering out images where more than 90% of pixels are zeros. This is a realistic cleaning step that might be added to an ML pipeline, but will have a detrimental impact on our incoming data, especially the 1s.
# Keep only images where at most 90% of pixels are zero (drop mostly-empty images)
not_empty_mask = (X_prod == 0).sum(axis=1) <= (0.9 * 784)
X_prod = X_prod[not_empty_mask]
y_prod = y_prod[not_empty_mask]
# Log production digits using the same schema
prod_profile_view = why.log(row={"pixel_values": X_prod}, schema=schema).profile().view()
Let's look at this directly inside the whylogs profile view objects:
train_profile_summary = train_profile_view.get_column("pixel_values").to_summary_dict()
prod_profile_summary = prod_profile_view.get_column("pixel_values").to_summary_dict()
for digit in [str(i) for i in range(10)]:
    mean_diff = train_profile_summary[f'embedding/{digit}_distance:distribution/mean'] - prod_profile_summary[f'embedding/{digit}_distance:distribution/mean']
    stddev_diff = train_profile_summary[f'embedding/{digit}_distance:distribution/stddev'] - prod_profile_summary[f'embedding/{digit}_distance:distribution/stddev']
    print(f"{digit} distance difference (target-prod): mean {mean_diff} stddev {stddev_diff}")
This particular drift has shown up in the distances to our reference data points as we'd expect. In particular, the 1s seem most affected by our rule.
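Since we installed the viz extra, we can also compare the two profiles visually. A minimal sketch using whylogs' notebook visualizer (the report renders inline in Jupyter; how much detail it shows for the embedding submetrics depends on your whylogs version):
from whylogs.viz import NotebookProfileVisualizer
# Compare the production profile (target) against the training profile (reference).
visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=prod_profile_view, reference_profile_view=train_profile_view)
visualization.summary_drift_report()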
See the example notebook on monitoring your profiles continuously with the WhyLabs Observability Platform.
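As a rough sketch of what that looks like (the org ID, model ID, and API key below are placeholders; see the linked notebook for the full setup):
import os
# Placeholder credentials: substitute your own WhyLabs org ID, model ID, and API key.
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-0"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-0"
os.environ["WHYLABS_API_KEY"] = "YOUR-API-KEY"
# Write the production profile to WhyLabs for ongoing monitoring.
why.log(row={"pixel_values": X_prod}, schema=schema).writer("whylabs").write()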
Consider comparing this profile to different transformations and subsets of our MNIST dataset: randomly selected subsets of the data, normalized values, missing one or more labels, sorted values, and more.
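For instance, a normalized copy of the production data can be profiled with the exact same schema and compared in the same way. A quick sketch (the scaling constant assumes the original 0-255 pixel range):
# Profile a normalized copy of the production images with the same embedding schema.
X_prod_normalized = X_prod / 255.0  # MNIST pixel values range from 0 to 255
normalized_profile_view = why.log(row={"pixel_values": X_prod_normalized}, schema=schema).profile().view()
normalized_summary = normalized_profile_view.get_column("pixel_values").to_summary_dict()
print(normalized_summary['embedding/1_distance:distribution/mean'])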
Go to the examples page for the complete list of examples!