debug_events
The whylogs.log statement accepts a new debug_event parameter that defines debug information to be stored as JSON in WhyLabs. This information can be correlated with whylogs profiles through a common trace_id value.
Consider a scenario where you want to record additional information about your dataset, and you also have private data that you store separately in your own systems, such as a database or filesystem. The raw private data should not be stored in the dataset profile, but you still need a way to trace back to a record when debugging alerts or constraint failures related to this data. With DebugEvents you can store a small amount of arbitrary JSON alongside dataset profiles, and these debug events can be correlated with profiles or other external data through a shared, user-supplied trace_id. In practice this should be something like a UUID, URL, or other unique string that you can later use to look up information stored in your environment.
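As a rough sketch of the pattern described above, the private record can stay in your own store, keyed by the same trace_id that accompanies the profile. The in-memory SQLite table and the helper names save_private_record and lookup_private_record are hypothetical stand-ins for whatever database or filesystem you use; they are not part of whylogs.

```python
import json
import sqlite3
from uuid import uuid4

# Hypothetical private store: an in-memory SQLite table keyed by trace_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE private_records (trace_id TEXT PRIMARY KEY, payload TEXT)"
)


def save_private_record(trace_id: str, record: dict) -> None:
    """Keep the raw record in our own system, never in the profile."""
    conn.execute(
        "INSERT INTO private_records (trace_id, payload) VALUES (?, ?)",
        (trace_id, json.dumps(record)),
    )
    conn.commit()


def lookup_private_record(trace_id: str):
    """Return the raw record for a trace_id, or None if it is unknown."""
    row = conn.execute(
        "SELECT payload FROM private_records WHERE trace_id = ?", (trace_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


example_trace_id = str(uuid4())
save_private_record(example_trace_id, {"user_email": "someone@example.com"})

# Later, when an alert surfaces this trace_id, look the record up again.
recovered = lookup_private_record(example_trace_id)
print(recovered)
```

Only the trace_id needs to travel with the profile and the debug event; the sensitive payload never leaves your environment.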
The following toy example profiles a single test message with whylogs and attaches a debug event to it. You can also supply segment key-value pairs to help partition the debug data, and tags (for example, your environment name) to make searching and filtering easier when debugging.
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs[embeddings,viz,image]==1.3.5.dev0 -q # or version 1.3.8 or higher when released
from uuid import uuid4
import random
import whylogs as why
from whylogs.core.segmentation_partition import segment_on_column
from whylogs.core.schema import DatasetSchema
test_message = {"col1": "Green", "col2": 0.007}
trace_id = str(uuid4())
print(f"Running example with explicitly specified trace_id: {trace_id}")
test_debug_event = {
    "custom_field": 1,
    "custom_nested_field": {
        "subA": random.random(),
        "subB": random.random() + random.random()
    },
    "debug_notes": "Sometimes you might want to record a longer string and not lose the full value such as this verbose example."
}
debug_tags = ["dev", "demo_test"]
Running example with explicitly specified trace_id: d992c9fa-ac0d-4b12-8da8-fb4a534f984c
The debug_event parameter can be passed to why.log along with the message to be profiled. When a debug_event dictionary is supplied, that data is sent to WhyLabs, so the environment variables that identify where it should be logged must be defined.
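Before running the cell below, it can help to confirm that the three variables are actually set. This is a minimal sketch; the helper name missing_whylabs_env is hypothetical and not part of whylogs, and the variable names are simply the ones used in this example.

```python
import os

# The three environment variable names used in this example.
REQUIRED_ENV_VARS = (
    "WHYLABS_DEFAULT_DATASET_ID",
    "WHYLABS_DEFAULT_ORG_ID",
    "WHYLABS_API_KEY",
)


def missing_whylabs_env(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


missing = missing_whylabs_env()
if missing:
    print(f"Set these before logging a debug event: {', '.join(missing)}")
```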
import os
# Replace the empty strings on the right-hand side of the os.environ lines with your information,
# or, if you already have these env variables defined, comment out the next three lines
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = ""
os.environ["WHYLABS_DEFAULT_ORG_ID"] = ""
os.environ["WHYLABS_API_KEY"] = ""
# the call to why.log returns the results for the test_message only,
# so this profile is a single message profile. The WhyLabs side configuration
# will determine if this profile is preserved for individual profile download
results = why.log(
    test_message,
    trace_id=trace_id,
    tags=debug_tags,
    debug_event=test_debug_event
)
Next we still need to write the results, which contain the statistical profile (summarization) of the message test_message. We could also pass a pandas DataFrame as the first parameter to why.log.
Let's look at what the results contain before uploading to WhyLabs with a write call. Note that the trace_id we specified earlier matches this 'whylabs.traceId' field in the metadata. That is something we can later use to query for both this profile and the debug event.
results.metadata
{'whylabs.traceId': 'd992c9fa-ac0d-4b12-8da8-fb4a534f984c', 'whylogs.creationTimestamp': '1695414037772', 'whylogs.user.tags': '["demo_test", "dev"]'}
Note that the above metadata is not the full debug_event by design. The profile results metadata is only the portion of data we need to correlate this profile with debug_events if there are any. All metadata keys and values need to be of type string, and non-string values such as timestamps will be converted to string.
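Because every metadata value is a string, structured values have to be decoded before use. The sketch below works on a copy of the metadata shown above rather than on a live results object. Treating the creation timestamp as milliseconds since the Unix epoch is an assumption based on its 13-digit value, not something this page states.

```python
import json
from datetime import datetime, timezone

# Metadata copied from the output shown above; every value is a string.
metadata = {
    "whylabs.traceId": "d992c9fa-ac0d-4b12-8da8-fb4a534f984c",
    "whylogs.creationTimestamp": "1695414037772",
    "whylogs.user.tags": '["demo_test", "dev"]',
}

# The trace id is already a plain string and can be used directly as a key.
trace_id_from_profile = metadata["whylabs.traceId"]

# The tags are stored as a JSON-encoded list, so decode them back to a list.
tags = json.loads(metadata["whylogs.user.tags"])

# Assumption: the timestamp is milliseconds since the Unix epoch.
created_at = datetime.fromtimestamp(
    int(metadata["whylogs.creationTimestamp"]) / 1000, tz=timezone.utc
)

print(trace_id_from_profile, tags, created_at.isoformat())
```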
Also note that the values of the optional segment_key_values parameter are stored in a different part of the results so that they can be used to partition the data on the platform side.
results.writer("whylabs").write()
[(True, 'log-6tpJJelNhOgPdJsp')]
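The write call returns a list of tuples, as the output above shows. A minimal sketch of checking it follows, using a copy of that output rather than a live call. Reading the first element as a success flag and the second as a reference string is an assumption drawn from the shape of the output, not from documented behavior.

```python
# Result copied from the output shown above.
write_results = [(True, "log-6tpJJelNhOgPdJsp")]

# Assumption: each tuple is (success_flag, reference_string).
failed = [ref for ok, ref in write_results if not ok]
all_ok = not failed

if all_ok:
    print("All writes reported success:", [ref for _, ref in write_results])
else:
    print("Some writes did not succeed:", failed)
```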