With Condition Validators, you can evaluate conditions on individual values in real-time scenarios. These checks are done while data is being logged, and they can trigger one or more actions when a condition is not met. This lets you define actions for cases where an immediate response is required, such as emitting an alert to key stakeholders, logging specific failures, or throwing exceptions. Validators are designed with flexibility in mind, so you are free to customize both the actions and the conditions that trigger them.
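In this example, we will cover how to define conditions, define actions to be triggered when conditions fail, assemble a Condition Validator from those conditions and actions, bind validators to columns, apply them while logging data, and debug failed conditions.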
Showing the different types of conditions is NOT the focus of this example. If you wish to see the different types of conditions you can define, please refer to Condition Count Metrics.
Unlike metrics, validators will not log properties into profiles. They are meant only to evaluate conditions and trigger actions while logging is under way.
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
In this simple scenario, we want to make sure two things happen:
- Each value in the emails column is a valid email address (nothing else)
- No credit card number appears in the transcriptions column
We'll use the following sample dataframe to evaluate on:
import pandas as pd
text_data = {
"emails": [
"my email is my_email_1989@gmail.com",
"invalidEmail@xyz.toolong",
"this.is.ok@hotmail.com",
"not an email",
],
"transcriptions": [
"Bob's credit card number is 4000000000000",
"Alice's credit card is XXXXXXXXXXXXX",
"Hi, my name is Bob",
"Hi, I'm Alice",
],
}
df = pd.DataFrame(data=text_data)
Let's translate these conditions into regular expressions.
Conditions are usually expected to evaluate to True. When something goes wrong, the condition should evaluate to False, triggering an action in the process. This is why we negate the credit card condition (because matching the pattern is bad) and require a full match for the email condition (because not finding an email is bad).
from whylogs.core.relations import Not, Predicate
X = Predicate()
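# Raw strings keep backslashes such as \w from being read as string escapes.
# Passes when the credit card pattern is NOT found in the value.
credit_card_conditions = {"noCreditCard": Not(X.matches(r".*4[0-9]{12}(?:[0-9]{3})?"))}
# Passes only when the whole value is an email address.
email_conditions = {"hasEmail": X.fullmatch(r"[\w.]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}")}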
Note: The regular expressions are for demonstration purposes only. They are not general: some valid emails and credit card numbers will not match these patterns.
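To see what these two patterns accept before wiring them into a validator, the sketch below checks them against the sample values using only Python's standard `re` module. It assumes that `X.matches` behaves like `re.match` (anchored at the start of the string) and that `X.fullmatch` behaves like `re.fullmatch`. That correspondence is an assumption for illustration, and the variable names are not part of whylogs.

```python
import re

# Same patterns as in the conditions, written as raw strings.
CREDIT_CARD_PATTERN = r".*4[0-9]{12}(?:[0-9]{3})?"
EMAIL_PATTERN = r"[\w.]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}"

emails = [
    "my email is my_email_1989@gmail.com",
    "invalidEmail@xyz.toolong",
    "this.is.ok@hotmail.com",
    "not an email",
]
transcriptions = [
    "Bob's credit card number is 4000000000000",
    "Alice's credit card is XXXXXXXXXXXXX",
    "Hi, my name is Bob",
    "Hi, I'm Alice",
]

# noCreditCard: True (passes) when the pattern is NOT found.
no_credit_card = [re.match(CREDIT_CARD_PATTERN, t) is None for t in transcriptions]
# hasEmail: True (passes) only when the whole string is an email address.
has_email = [re.fullmatch(EMAIL_PATTERN, e) is not None for e in emails]

print("noCreditCard:", no_credit_card)
print("hasEmail:", has_email)
```

Under these assumptions, one transcription fails the credit card condition and three of the four email values fail the email condition.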
The action to be triggered when a condition fails is created by simply defining a regular function.
The function must accept the arguments validator_name, condition_name and value. You can use these values to help with logging and debugging the failures.
from typing import Any
def do_something_important(validator_name: str, condition_name: str, value: Any):
    # Called once for every value that fails a condition.
    print("Validator: {}\n Condition name {} failed for value {}".format(validator_name, condition_name, value))
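Because an action is just a function, it can do more than print. As a minimal sketch, the action below emits a warning through the standard `logging` module and keeps the failures in a list for later inspection. The names `log_failure`, `failed_values` and the logger name are hypothetical and chosen only for this illustration.

```python
import logging
from typing import Any, List, Tuple

logger = logging.getLogger("validator_actions")  # hypothetical logger name

# Hypothetical in-memory store of failures, useful for later inspection.
failed_values: List[Tuple[str, str, Any]] = []


def log_failure(validator_name: str, condition_name: str, value: Any) -> None:
    """Record the failure and emit a warning through the logging module."""
    failed_values.append((validator_name, condition_name, value))
    logger.warning(
        "%s: condition %s failed for value %r", validator_name, condition_name, value
    )


# Actions are plain functions, so they can be called directly to try them out.
log_failure("has_emails", "hasEmail", "not an email")
print(failed_values)
```

Calling the action by hand like this is a quick way to check its behavior before attaching it to a validator.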
To create a Condition Validator, we need a name, a set of conditions, and a list of actions.
Let's make a Validator for the credit card column and another Validator for the email column. Each validator has a single condition to be evaluated, and also a single action.
Note that for a single validator, we could have multiple conditions defined and also multiple actions to be triggered.
from whylogs.core.validators import ConditionValidator
credit_card_validator = ConditionValidator(
name="no_credit_cards",
conditions=credit_card_conditions,
actions=[do_something_important],
)
email_validator = ConditionValidator(
name="has_emails",
conditions=email_conditions,
actions=[do_something_important],
)
Each validator instance should be mapped to a single column, but each column can have multiple validators attached to it.
Assigning the same instance to multiple columns leads to undefined behavior.
In our case, we have only one validator for each of the columns:
validators = {
"emails": [email_validator],
"transcriptions": [credit_card_validator]}
Now, we only need to pass our set of validators to our DatasetSchema.
The validators are then applied while data is being logged. Actions are triggered immediately when a condition fails, not only once logging is done.
from whylogs.core.schema import DatasetSchema
import whylogs as why
schema = DatasetSchema(validators=validators)
profile = why.log(df, schema=schema).profile()
Validator: has_emails
 Condition name hasEmail failed for value my email is my_email_1989@gmail.com
Validator: has_emails
 Condition name hasEmail failed for value invalidEmail@xyz.toolong
Validator: has_emails
 Condition name hasEmail failed for value not an email
Validator: no_credit_cards
 Condition name noCreditCard failed for value Bob's credit card number is 4000000000000
We can see in the results above that our has_emails validator failed three times. The first value has extra text around the email, the second has an invalid email address, and the third does not contain an email at all.
The no_credit_cards validator failed once, because the pattern was found once.
We can also access a simple summary with the total number of evaluations, the number of total failures and the number of failures per condition present in the validator:
email_validator.to_summary_dict()
{'total_evaluations': 4, 'hasEmail': 3}
credit_card_validator.to_summary_dict()
{'total_evaluations': 4, 'noCreditCard': 1}
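The summary is a plain dictionary, so it is easy to post-process. As a small sketch, the snippet below turns the failure counts into failure rates. It assumes the layout shown above, where every key other than total_evaluations is a condition name. The variable names are illustrative.

```python
# Summary copied from the email validator output.
summary = {"total_evaluations": 4, "hasEmail": 3}

total = summary["total_evaluations"]
# Every key except total_evaluations is treated as a condition name.
failure_rates = {
    name: (count / total if total else 0.0)
    for name, count in summary.items()
    if name != "total_evaluations"
}
print(failure_rates)
```

A rate like this could be compared against a threshold to decide whether a batch needs attention.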
The validator retains contextual information about the data that failed the conditions. You can access it with the validator's get_samples method.
email_validator.get_samples()
['my email is my_email_1989@gmail.com', 'invalidEmail@xyz.toolong', 'not an email']
Note that the samples are stored in the validator instance, but they are not logged into the profile.
By default, the ConditionValidator will sample 10 rows that failed the condition by using a Reservoir Sampler. You can change this by setting the validator_sample_size in the ConditionValidatorConfig.
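For readers unfamiliar with reservoir sampling, here is a generic sketch of the classic technique (Algorithm R), which keeps a fixed-size uniform sample from a stream of unknown length. This is not the whylogs implementation, whose internal sampler may work differently. The function name and its parameters are hypothetical.

```python
import random
from typing import Any, Iterable, List, Optional


def reservoir_sample(stream: Iterable[Any], k: int, seed: Optional[int] = None) -> List[Any]:
    """Keep a uniform random sample of at most k items from a stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[Any] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so j < k has probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


failed_rows = ["id_0", "id_1", "id_3"]
sample = reservoir_sample(failed_rows, k=2, seed=0)
print(sample)
```

The memory used is bounded by the sample size, which is why this family of techniques suits validation done while data is streaming through the logger.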
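If you want, you can also assign an identity_column to the validator. You can use the identity column for two purposes:
- Make the identity value of the failing row available to your action, by declaring it as a fourth argument of the action function
- Have the validator's samples return the ids of the rows that failed the condition
Let's see how this works. First, let's create a dataframe again. This time, we have a column that contains the ids for each row: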
import pandas as pd
text_data = {
"emails": [
"my email is my_email_1989@gmail.com",
"invalidEmail@xyz.toolong",
"this.is.ok@hotmail.com",
"not an email",
],
"ids": [
"id_0",
"id_1",
"id_2",
"id_3",
],
}
df = pd.DataFrame(data=text_data)
We will only use the email validator for this example.
Notice that now we are defining a column that contains our ids. We want to access those values in both our actions and in our sampling.
Let's define the validator again, but now with an identity column.
In the following block, there are two main differences:
- The action now takes a fourth argument, row_id, which receives the identity column value of the row that failed.
- We set enable_sampling=True when instantiating the validator. This is True by default, but we set it explicitly for demonstration purposes. If you set it to False, the validator won't sample the failed rows.
from typing import Any
from whylogs.core.validators import ConditionValidator
def do_something_important(validator_name: str, condition_name: str, value: Any, row_id: Any = None):
    # row_id carries the identity column value of the failing row.
    print("Validator: {}\n Condition name {} failed for value {} and row id {}".format(validator_name, condition_name, value, row_id))
email_validator = ConditionValidator(
name="has_emails",
conditions=email_conditions,
actions=[do_something_important],
enable_sampling=True,
sample_size=2,
)
validators = {
"emails": [email_validator],
}
Now, we need to let whylogs know which column is our identity column. We do this by setting the identity_column in our MetricConfig:
from whylogs.core.schema import DatasetSchema
import whylogs as why
from whylogs.core.metrics import MetricConfig
condition_count_config = MetricConfig(identity_column="ids")
schema = DatasetSchema(validators=validators, default_configs=condition_count_config)
profile = why.log(df, schema=schema).profile()
samples = email_validator.get_samples()
print(f"Samples of failed rows: {samples}")
Validator: has_emails
 Condition name hasEmail failed for value my email is my_email_1989@gmail.com and row id id_0
Validator: has_emails
 Condition name hasEmail failed for value invalidEmail@xyz.toolong and row id id_1
Validator: has_emails
 Condition name hasEmail failed for value not an email and row id id_3
Samples of failed rows: ['id_3', 'id_0']
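Since the samples now hold ids instead of values, a natural next step is to look the failing rows up again. The sketch below does this with a plain dictionary, so it does not depend on pandas or whylogs. The sample ids are copied from the output above; because sampling is random, a real run may return a different pair. The names `emails_by_id` and `failed_emails` are illustrative.

```python
# Same data as the dataframe above, kept as a plain dict for this sketch.
text_data = {
    "emails": [
        "my email is my_email_1989@gmail.com",
        "invalidEmail@xyz.toolong",
        "this.is.ok@hotmail.com",
        "not an email",
    ],
    "ids": ["id_0", "id_1", "id_2", "id_3"],
}

# Ids as printed above; the selection is random and may differ between runs.
samples = ["id_3", "id_0"]

# Map each id to its email value, then look up the sampled failures.
emails_by_id = dict(zip(text_data["ids"], text_data["emails"]))
failed_emails = {row_id: emails_by_id[row_id] for row_id in samples}
print(failed_emails)
```

With a dataframe, the same lookup can be done by filtering on the ids column.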