whylogs profiles contain summarized information about our data. Profiling is therefore a lossy process: once we have the profiles, we no longer have access to the complete dataset.
This makes some types of constraints impossible to build from standard metrics alone. For example, suppose you need to check every row of a column to make sure no text matches a credit card number or an email address. Or maybe you want to ensure that there are no even numbers in a certain column. How do we do that without access to the complete data?
The answer is to define a Condition Count Metric to be tracked before logging your data. This metric counts the number of times the values of a given column meet a user-defined condition. When the profile is generated, you'll have that information to check against the constraints you create.
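To make the idea concrete, here is a minimal, library-free sketch of what a condition count amounts to: a running tally of how many values met each condition, kept alongside the total number of values checked. The class `ConditionCounter` and its fields are hypothetical names chosen for illustration. They are not part of whylogs, whose own implementation may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ConditionCounter:
    """Hypothetical stand-in for a condition count metric (not the whylogs class)."""

    conditions: Dict[str, Callable[[Any], bool]]
    total: int = 0
    matches: Dict[str, int] = field(default_factory=dict)

    def update(self, value: Any) -> None:
        # Each value is checked once, at logging time, while the raw data is still available.
        self.total += 1
        for name, check in self.conditions.items():
            self.matches[name] = self.matches.get(name, 0) + int(check(value))


counter = ConditionCounter({"is_odd": lambda v: v % 2 == 1})
for v in [1, 3, 4, 7]:
    counter.update(v)

print(counter.matches, counter.total)
```

Only the two tallies survive after logging, which is why the condition has to be declared before the data is profiled.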
In this example, you'll learn how to:
If you want more information on Condition Count Metrics, you can see this example and also the documentation for Data Validation.
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
Let's assume we have a DataFrame for which we wish to log standard metrics through whylogs' default logging process. But additionally, we want specific information on two columns:
url: Regex pattern validation. The values in this column should always start with https://www.mydomain.com/profile
subscription_date: Date format validation. The values in this column should be strings with a date format of %Y-%m-%d
In addition, we consider these cases critical, so we want certain actions to be taken whenever a condition fails. In this example we will:
Send a Slack alert whenever subscription_date fails the condition
Send a Slack alert and pull an andon cord whenever url is not from the domain we expect
Let's first create a simple DataFrame to demonstrate:
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charles"],
    "age": [31, 0, 25],
    "url": ["https://www.mydomain.com/profile/123", "www.wrongdomain.com", "http://mydomain.com/unsecure"],
    "subscription_date": ["2021-12-28", "2019-29-11", "04/08/2021"],
}
df = pd.DataFrame(data)
In this case, both url and subscription_date have 2 values out of 3 that are not what we expect.
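As a quick sanity check of that claim, the two rules can be applied to the sample values with the standard library alone. This is only an illustrative sketch: the helper `is_ymd` is a hypothetical name, and the check runs outside whylogs.

```python
import re
from datetime import datetime

urls = [
    "https://www.mydomain.com/profile/123",
    "www.wrongdomain.com",
    "http://mydomain.com/unsecure",
]
dates = ["2021-12-28", "2019-29-11", "04/08/2021"]


def is_ymd(value: str) -> bool:
    # strptime raises ValueError when the string does not follow the format
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


url_ok = [bool(re.match(r"^https://www\.mydomain\.com/profile", u)) for u in urls]
date_ok = [is_ymd(d) for d in dates]

print(url_ok)   # one flag per URL
print(date_ok)  # one flag per date string
```

Only the first value of each column passes, so each condition should be counted once out of three.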
Let's first define the relations that will check whether a value passes our constraint. For the date format validation, we'll use the datetime module in a user-defined function. For the regex pattern matching, we'll use whylogs' Predicate along with a regular expression, which lets us build simple relations intuitively.
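import datetime
from typing import Any

from whylogs.core.relations import Predicate


def date_format(x: Any) -> bool:
    """Return True if x is a string in %Y-%m-%d format."""
    fmt = "%Y-%m-%d"  # renamed so it no longer shadows the function name
    try:
        datetime.datetime.strptime(x, fmt)
        return True
    except (ValueError, TypeError):
        # ValueError: wrong format; TypeError: x is not a string
        return False


# matches() accepts a regular expression; the raw string keeps the backslashes intact
matches_domain_url = Predicate().matches(r"^https:\/\/www\.mydomain\.com\/profile")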
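A note on the pattern itself: it is an ordinary regular expression, so an unescaped `.` matches any character, not just a literal dot. The sketch below uses Python's `re` module directly (not whylogs) to show the difference. The look-alike URL is a made-up example.

```python
import re

loose = re.compile("^https://www.mydomain.com/profile")      # '.' matches any character
strict = re.compile(r"^https://www\.mydomain\.com/profile")  # '\.' matches a literal dot

lookalike = "https://wwwXmydomain.com/profile/123"  # hypothetical look-alike URL
genuine = "https://www.mydomain.com/profile/123"

loose_results = (bool(loose.match(lookalike)), bool(loose.match(genuine)))
strict_results = (bool(strict.match(lookalike)), bool(strict.match(genuine)))

print(loose_results)
print(strict_results)
```

The loose pattern accepts both URLs, while the strict one rejects the look-alike, so escaping the dots is the safer choice for a domain check.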
Next, we need to define the actions that will be triggered whenever the conditions fail.
We will define two placeholder functions that, in a real scenario, would execute the defined actions.
from typing import Any


def pull_andon_cord(validator_name, condition_name: str, value: Any):
    print("Validator: {}\n Condition name {} failed for value {}".format(validator_name, condition_name, value))
    print(" Pulling andon cord....")
    # Do something here to respond to the constraint violation


def send_slack_alert(validator_name, condition_name: str, value: Any):
    print("Validator: {}\n Condition name {} failed for value {}".format(validator_name, condition_name, value))
    print(" Sending slack alert....")
    # Do something here to respond to the constraint violation
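Conditions are defined by the combination of a relation and a set of actions. Now that we have both relations and actions, we can create our sets of conditions. In this example, each set contains a single condition, but a set could hold several. Besides the two conditions described above, we also define a third one for the age column, with no actions attached, that counts the values equal to zero.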
from whylogs.core.metrics.condition_count_metric import Condition

has_date_format = {
    "Y-m-d format": Condition(date_format, actions=[send_slack_alert]),
}
regex_conditions = {
    "url_matches_domain": Condition(matches_domain_url, actions=[pull_andon_cord, send_slack_alert]),
}
ints_conditions = {
    "integer_zeros": Condition(Predicate().equals(0)),
}
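To illustrate how a relation and its actions fit together, the following library-free sketch evaluates a relation for each value and calls every action when the relation returns False. The helper `check_with_actions` and the recording action `record` are hypothetical. They mimic the `(validator_name, condition_name, value)` callback signature used above, and whylogs' internal dispatch may differ.

```python
from typing import Any, Callable, List, Tuple

Action = Callable[[str, str, Any], None]


def check_with_actions(
    validator_name: str,
    condition_name: str,
    relation: Callable[[Any], bool],
    actions: List[Action],
    value: Any,
) -> bool:
    """Hypothetical helper: evaluate a relation and fire every action when it fails."""
    passed = relation(value)
    if not passed:
        for action in actions:
            action(validator_name, condition_name, value)
    return passed


fired: List[Tuple[str, Any]] = []


def record(validator_name: str, condition_name: str, value: Any) -> None:
    # Stand-in for a real alert: just remember which value failed.
    fired.append((condition_name, value))


results = [
    check_with_actions("sketch", "is_zero", lambda v: v == 0, [record], v)
    for v in [31, 0, 25]
]

print(results)
print(fired)
```

A condition without actions, such as integer_zeros, corresponds to passing an empty action list: the values are still counted, but nothing is triggered on failure.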
Now, we need to make the logger aware of our Conditions. This can be done by creating a custom schema object that will be passed to why.log().
To create the schema object, we will use the Declarative Schema, an auxiliary class that lets us create a schema in a simple way.
In this case, we want our schema to start with the default behavior (standard metrics for the default datatypes). Then, we want to add condition count metrics based on the conditions we defined earlier and the name of the column each one should be bound to. We can do so by calling the schema's add_resolver_spec method:
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.schema import DeclarativeSchema
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="subscription_date", metrics=[ConditionCountMetricSpec(has_date_format)])
schema.add_resolver_spec(column_name="url", metrics=[ConditionCountMetricSpec(regex_conditions)])
schema.add_resolver_spec(column_name="age", metrics=[ConditionCountMetricSpec(ints_conditions)])
Now, let's pass the schema to why.log() and start logging our data:
import whylogs as why
profile_view = why.log(df, schema=schema).profile().view()
Validator: condition_count
 Condition name url_matches_domain failed for value www.wrongdomain.com
 Pulling andon cord....
Validator: condition_count
 Condition name url_matches_domain failed for value www.wrongdomain.com
 Sending slack alert....
Validator: condition_count
 Condition name url_matches_domain failed for value http://mydomain.com/unsecure
 Pulling andon cord....
Validator: condition_count
 Condition name url_matches_domain failed for value http://mydomain.com/unsecure
 Sending slack alert....
Validator: condition_count
 Condition name Y-m-d format failed for value 2019-29-11
 Sending slack alert....
Validator: condition_count
 Condition name Y-m-d format failed for value 04/08/2021
 Sending slack alert....
You can see that during the logging process, our actions were triggered whenever the condition failed. We can see the name of the failed condition and the specific value that triggered it.
We see the actions were triggered, but we also expect the Condition Count Metrics to be generated. Let's see if this is the case:
profile_view.to_pandas()
column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | condition_count/integer_zeros | condition_count/total | counts/inf | counts/n | counts/nan | counts/null | distribution/max | distribution/mean | distribution/median | distribution/min | distribution/n | distribution/q_01 | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | distribution/stddev | frequent_items/frequent_strings | ints/max | ints/min | type | types/boolean | types/fractional | types/integral | types/object | types/string | types/tensor | condition_count/Y-m-d format | condition_count/url_matches_domain |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | 3.0 | 3.0 | 3.00015 | 1.0 | 3.0 | 0 | 3 | 0 | 0 | 31.0 | 18.666667 | 25.0 | 0.0 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 31.0 | 31.0 | 31.0 | 31.0 | 16.441817 | [FrequentItem(value='25', est=1, upper=1, lowe... | 31.0 | 0.0 | SummaryType.COLUMN | 0 | 0 | 3 | 0 | 0 | 0 | NaN | NaN |
name | 3.0 | 3.0 | 3.00015 | NaN | NaN | 0 | 3 | 0 | 0 | NaN | 0.000000 | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | [FrequentItem(value='Alice', est=1, upper=1, l... | NaN | NaN | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 3 | 0 | NaN | NaN |
subscription_date | 3.0 | 3.0 | 3.00015 | NaN | 3.0 | 0 | 3 | 0 | 0 | NaN | 0.000000 | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | [FrequentItem(value='2019-29-11', est=1, upper... | NaN | NaN | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 3 | 0 | 1.0 | NaN |
url | 3.0 | 3.0 | 3.00015 | NaN | 3.0 | 0 | 3 | 0 | 0 | NaN | 0.000000 | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | [FrequentItem(value='www.wrongdomain.com', est... | NaN | NaN | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 3 | 0 | NaN | 1.0 |
At the far right of our summary dataframe, you can find the Condition Count Metrics: the Y-m-d format condition was met only once out of a total of 3. The same is true for url_matches_domain. Note that for columns where a condition was not defined, NaN is displayed.
So far, we have created Condition Count Metrics for the desired conditions. During the logging process, the set of actions defined for each condition was triggered whenever that condition failed to be met.
Now, we wish to create Metric Constraints on top of the Condition Count Metrics, so we can generate a Constraints Report. This can be done with helper constraints such as condition_meets, condition_never_meets and condition_count_below. You only need to specify the column name and the name of the condition you want to check (plus a maximum count for condition_count_below):
from whylogs.core.constraints.factories import condition_meets, condition_never_meets, condition_count_below
from whylogs.core.constraints import ConstraintsBuilder
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(condition_meets(column_name="subscription_date", condition_name="Y-m-d format"))
builder.add_constraint(condition_never_meets(column_name="url", condition_name="url_matches_domain"))
builder.add_constraint(condition_count_below(column_name="age", condition_name="integer_zeros", max_count=1))
constraints = builder.build()
constraints.generate_constraints_report()
[ReportResult(name='subscription_date meets condition Y-m-d format', passed=0, failed=1, summary=None),
 ReportResult(name='url never meets condition url_matches_domain', passed=0, failed=1, summary=None),
 ReportResult(name='age.integer_zeros lower than or equal to 1', passed=1, failed=0, summary=None)]
The condition_meets constraint will fail if the condition is not met for at least one value. In other words, it fails if condition_count/condition_name is smaller than condition_count/total.
The condition_never_meets constraint will fail if the condition is met at least once. In other words, it fails if condition_count/condition_name is greater than 0.
The condition_count_below constraint will fail if the condition is met more than a specified number of times.
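The three rules can be written out as plain comparisons on the counts shown in the summary dataframe. The functions below are hypothetical stand-ins that follow the descriptions above. They are not the whylogs implementations, and the inclusive `<=` bound is an assumption inferred from the "lower than or equal to 1" wording in the report.

```python
from typing import Dict


def meets(counts: Dict[str, int], name: str) -> bool:
    # Passes only if every checked value met the condition.
    return counts[name] == counts["total"]


def never_meets(counts: Dict[str, int], name: str) -> bool:
    # Passes only if no value met the condition.
    return counts[name] == 0


def count_below(counts: Dict[str, int], name: str, max_count: int) -> bool:
    # Passes if the condition was met at most max_count times (assumed inclusive).
    return counts[name] <= max_count


# Counts copied from the summary dataframe above.
subscription_date = {"Y-m-d format": 1, "total": 3}
url = {"url_matches_domain": 1, "total": 3}
age = {"integer_zeros": 1, "total": 3}

outcomes = (
    meets(subscription_date, "Y-m-d format"),
    never_meets(url, "url_matches_domain"),
    count_below(age, "integer_zeros", max_count=1),
)
print(outcomes)
```

The resulting pass/fail pattern agrees with the report: the first two constraints fail and the third passes.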
You can visualize the Constraints Report as usual by calling NotebookProfileVisualizer's constraints_report:
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
By hovering on the status, you can view the number of times the condition failed, and the total number of times the condition was checked.