🚩 Create a free WhyLabs account to get more value out of whylogs!
Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!
By default, whylogs tracks several metrics, such as type counts, distribution metrics, cardinality, and frequent items. These are general metrics that are useful for many use cases, but often we need metrics tailored to our application.
Condition Count Metrics give you the flexibility to define your own customized metrics. They return results as counters: the number of times a condition was met for a given column. With them, you can define conditions such as regex matching for strings, equalities or inequalities for numerical features, and even your own function to check for any given condition.
In this example, we will cover how to create Condition Count Metrics for:
- Regex matching for string columns
- Relations and logical operators for numerical columns
- Custom functions

First, let's install whylogs, if you haven't already:
%pip install whylogs
# Note: you may need to restart the kernel to use updated packages.
Let's import all the dependencies for this example upfront:
import pandas as pd
from typing import Any
import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.datatypes import Fractional, Integral
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Not, Predicate
from whylogs.core.schema import DeclarativeSchema
Suppose we have textual columns in our data in which we want to make sure certain elements are present / not present.
For example, for privacy and security reasons, we might be interested in tracking the number of times a credit card number appears in a given column, or whether we have sensitive email information in another column.
With whylogs, we can define metrics that count the number of times a certain regex pattern is matched in a given column.
Let's create a simple dataframe. In this scenario, the `emails` column should contain only a valid email, nothing else. As for the `transcriptions` column, we want to make sure any existing credit card number was properly masked or removed.
data = {
"emails": ["my email is my_email_1989@gmail.com","invalidEmail@xyz.toolong","this.is.ok@hotmail.com","not an email"],
"transcriptions": ["Bob's credit card number is 4000000000000", "Alice's credit card is XXXXXXXXXXXXX", "Hi, my name is Bob", "Hi, I'm Alice"],
}
df = pd.DataFrame(data=data)
The conditions are defined through whylogs' `Condition` object. There are several different ways of assembling a condition. In the following example, we will define two different regex patterns, one for each column. Since we can define multiple conditions for a single column, we'll assemble the conditions into dictionaries, where the key is the condition name. Each dictionary will later be attached to the relevant column.
emails_conditions = {
"containsEmail": Condition(Predicate().fullmatch("[\w.]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}")),
}
transcriptions_conditions = {
"containsCreditCard": Condition(Predicate().matches(".*4[0-9]{12}(?:[0-9]{3})?"))
}
whylogs must be aware of those conditions while profiling the data. We can do that by creating a standard schema and then adding the conditions to it with `add_resolver_spec`. That way, we can pass our enhanced schema when calling `why.log()` later.
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="emails", metrics=[ConditionCountMetricSpec(emails_conditions)])
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(transcriptions_conditions)])
Note: The regex expressions are for demonstration purposes only. These expressions are not fully general: there will be emails and credit card numbers whose patterns are not matched by them.
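Before logging, we can sanity-check the patterns outside whylogs with Python's built-in `re` module (just an illustration; as noted later, the predicates follow `re`'s match/fullmatch semantics):
import re
email_pattern = r"[\w.]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}"
card_pattern = r".*4[0-9]{12}(?:[0-9]{3})?"
print([bool(re.fullmatch(email_pattern, s)) for s in data["emails"]])
# [False, False, True, False] -> one match expected for containsEmail
print([bool(re.match(card_pattern, s)) for s in data["transcriptions"]])
# [True, False, False, False] -> one match expected for containsCreditCard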
Now, we only need to pass our schema when logging our data. Let's also take a look at the metrics to make sure everything was tracked correctly:
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/containsEmail', 'condition_count/containsCreditCard', 'condition_count/total']]
| column | condition_count/containsEmail | condition_count/containsCreditCard | condition_count/total |
|---|---|---|---|
| emails | 1.0 | NaN | 4 |
| transcriptions | NaN | 1.0 | 4 |
Let's check the numbers:
For the `emails` feature, only one occurrence of `containsEmail` was counted. That is expected, because the only valid row is the third one ("this.is.ok@hotmail.com"). The others either don't contain an email, contain an invalid email, or have extra text that is not part of an email (note we're using `fullmatch` as the predicate for the email condition).
For the `transcriptions` column, we also have only one match. That is correct, since only the first row matches the given pattern; the others either don't have a credit card number or have it properly masked. Note that in this case we want to check for the pattern inside a broader text, so we add `.*` before the pattern so the text doesn't have to start with it (whylogs' `Predicate.matches` uses Python's `re.compile().match()` under the hood).
The available relations for regex matching are the ones used in this example:
- `matches`
- `fullmatch`
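To make the distinction concrete, here's a tiny sketch with Python's `re` module, which mirrors the semantics described above (`match` anchors only at the start of the string; `fullmatch` must consume the whole string):
import re
text = "card 4000000000000 on file"
print(bool(re.match(r".*4[0-9]{12}", text)))    # True: the .* prefix lets the pattern occur anywhere
print(bool(re.fullmatch(r"4[0-9]{12}", text)))  # False: the whole string would have to match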
Now let's look at numerical relations. For this, let's create integer and float columns:
data = {
"ints_column": [1,12,42,4],
"floats_column": [1.2, 12.3, 42.2, 4.8]
}
df = pd.DataFrame(data=data)
As before, we will create our set of conditions for each column and pass both to our schema:
ints_conditions = {
"equals42": Condition(Predicate().equals(42)),
"lessthan5": Condition(Predicate().less_than(5)),
"morethan40": Condition(Predicate().greater_than(40)),
}
floats_conditions = {
"equals42.2": Condition(Predicate().equals(42.2)),
"lessthan5": Condition(Predicate().less_than(5)),
"morethan40": Condition(Predicate().greater_than(40)),
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_type=Integral, metrics=[ConditionCountMetricSpec(ints_conditions)])
schema.add_resolver_spec(column_type=Fractional, metrics=[ConditionCountMetricSpec(floats_conditions)])
Let's log and check the metrics:
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['types/fractional','types/integral','condition_count/lessthan5', 'condition_count/morethan40','condition_count/equals42','condition_count/equals42.2', 'condition_count/total']]
| column | types/fractional | types/integral | condition_count/lessthan5 | condition_count/morethan40 | condition_count/equals42 | condition_count/equals42.2 | condition_count/total |
|---|---|---|---|---|---|---|---|
| floats_column | 4 | 0 | 2 | 1 | NaN | 1.0 | 4 |
| ints_column | 0 | 4 | 2 | 1 | 1.0 | NaN | 4 |
We can simply check the original data to verify that the metrics are correct.
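As a quick cross-check (plain pandas, outside whylogs), we can recompute the counters directly on the DataFrame:
print((df["ints_column"] < 5).sum())    # 2 -> lessthan5
print((df["ints_column"] > 40).sum())   # 1 -> morethan40
print((df["ints_column"] == 42).sum())  # 1 -> equals42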
We used `equals`, `less_than`, and `greater_than` in this example, but here's the complete list of available relations:
- `equals` - equal to
- `less_than` - less than
- `less_or_equals` - less than or equal to
- `greater_than` - greater than
- `greater_or_equals` - greater than or equal to
- `not_equal` - not equal to
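The remaining relations follow the same pattern. Here's a minimal sketch (the condition names are just illustrative) attaching them to the integer column from above:
more_conditions = {
"atMost42": Condition(Predicate().less_or_equals(42)),
"atLeast4": Condition(Predicate().greater_or_equals(4)),
"not12": Condition(Predicate().not_equal(12)),
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(more_conditions)])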
You can also combine relations with logical operators such as AND, OR, and NOT. Let's stick with the numerical features to show how you can combine relations into compound conditions:
conditions = {
"between10and50": Condition(Predicate().greater_than(10).and_(Predicate().less_than(50))),
"outside10and50": Condition(Predicate().less_than(10).or_(Predicate().greater_than(50))),
"not_42": Condition(Not(Predicate().equals(42))), # could also use X.not_equal(42) or X.not_.equals(42)
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(conditions)])
schema.add_resolver_spec(column_name="floats_column", metrics=[ConditionCountMetricSpec(conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/between10and50', 'condition_count/outside10and50', 'condition_count/not_42', 'condition_count/total']]
| column | condition_count/between10and50 | condition_count/outside10and50 | condition_count/not_42 | condition_count/total |
|---|---|---|---|---|
| floats_column | 2 | 2 | 4 | 4 |
| ints_column | 2 | 2 | 3 | 4 |
Available logical operators are:
- `and_`
- `or_`
- `not_`
- `Not`
Note that `and_`, `or_`, and `not_` are methods called on a `Predicate` and passed another `Predicate`, while `Not` is a function that takes a single `Predicate` argument.
Even though we showed these operators with numerical features, this also works with regex matching conditions shown previously.
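To make the negation forms concrete, here's a small sketch showing two equivalent ways (per the list above) to count values different from 42; the condition names are just illustrative:
not_conditions = {
"not42_wrapped": Condition(Not(Predicate().equals(42))),  # Not() wraps a whole Predicate
"not42_relation": Condition(Predicate().not_equal(42)),   # dedicated not_equal relation
}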
If none of the previous conditions suit your use case, you are free to define your own custom function to create your own metrics.
Let's see a simple example: suppose we want to check whether a certain number is even.
We can define an `even` predicate function as simply as:
def even(x: Any) -> bool:
    return x % 2 == 0
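Since the predicate is plain Python, we can sanity-check it before handing it to whylogs (a quick illustration, not part of the whylogs API):
assert even(42) and not even(7)
assert even(4.0)  # the modulo check also works for floats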
And then we proceed as usual, defining our condition and adding it to the schema. We only have to pass the function to `Predicate().is_()` inside a `Condition` object, like below:
conditions = {
"isEven": Condition(Predicate().is_(even)),
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(conditions)])
schema.add_resolver_spec(column_name="floats_column", metrics=[ConditionCountMetricSpec(conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/isEven', 'condition_count/total']]
| column | condition_count/isEven | condition_count/total |
|---|---|---|
| floats_column | 0 | 4 |
| ints_column | 3 | 4 |
For user-defined functions, the sky's the limit for what you can do.
Let's think of another simple scenario for NLP. Suppose our model assumes text to be preprocessed in a certain way. Maybe it was trained on text that:
- contains only lowercase letters and spaces (no digits or uppercase characters)
- doesn't contain any stopwords (such as "i", "me", or "myself")

Let's check these conditions for the data below:
data = {
"transcriptions": ["I AM BOB AND I LIKE TO SCREAM","i am bob","am alice and am xx years old","am bob and am 42 years old"],
"ints": [0,1,2,3],
}
df = pd.DataFrame(data=data)
Once again, let's define our function:
def preprocessed(x: Any) -> bool:
    stopwords = ["i", "me", "myself"]
    if not isinstance(x, str):
        return False
    # should have only lowercase letters and spaces (no digits)
    if not all(c.islower() or c.isspace() for c in x):
        return False
    # should not contain any words in our stopwords list
    if any(word in stopwords for word in x.split()):
        return False
    return True
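Because `preprocessed` is plain Python, we can preview which of our sample transcriptions pass before profiling (just an illustration, outside whylogs):
for text in data["transcriptions"]:
    print(preprocessed(text), "->", text)
# expected: False, False, True, False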
Since this is an example, our `stopwords` list is only a placeholder for the real thing.
The rest is the same as before:
conditions = {
"isPreprocessed": Condition(Predicate().is_(preprocessed)),
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/isPreprocessed', 'condition_count/total']]
| column | condition_count/isPreprocessed | condition_count/total |
|---|---|---|
| ints | NaN | NaN |
| transcriptions | 1.0 | 4.0 |
For the `transcriptions` feature, we can see that only the third row is properly preprocessed ("am alice and am xx years old"). The first one contained uppercase characters, the second contained a stopword, and the last one contained digits. For the `ints` column, no condition counts were tracked, since we didn't attach any conditions to it.
You can combine this example with other whylogs features to cover even more scenarios. Here are some pointers for some possible use cases:
- Metric Constraints: you can define constraints on top of condition count metrics (for example, `containsCreditCardNumber` should always be 0). Check the Metric Constraints with Condition Count Metrics example!

Here are the complete code snippets - just to make it easier to copy/paste!
import pandas as pd
import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DeclarativeSchema
data = {
"emails": ["my email is my_email_1989@gmail.com","invalidEmail@xyz.toolong","this.is.ok@hotmail.com","not an email"],
"transcriptions": ["Bob's credit card number is 4000000000000", "Alice's credit card is XXXXXXXXXXXXX", "Hi, my name is Bob", "Hi, I'm Alice"],
}
df = pd.DataFrame(data=data)
emails_conditions = {
"containsEmail": Condition(Predicate().fullmatch("[\w.]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}")),
}
transcriptions_conditions = {
"containsCreditCard": Condition(Predicate().matches(".*4[0-9]{12}(?:[0-9]{3})?"))
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="emails", metrics=[ConditionCountMetricSpec(emails_conditions)])
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(transcriptions_conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/containsEmail', 'condition_count/containsCreditCard', 'condition_count/total']]
| column | condition_count/containsEmail | condition_count/containsCreditCard | condition_count/total |
|---|---|---|---|
| emails | 1.0 | NaN | 4 |
| transcriptions | NaN | 1.0 | 4 |
import pandas as pd
import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.datatypes import Fractional, Integral
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DeclarativeSchema
data = {
"ints_column": [1,12,42,4],
"floats_column": [1.2, 12.3, 42.2, 4.8]
}
df = pd.DataFrame(data=data)
ints_conditions = {
"equals42": Condition(Predicate().equals(42)),
"lessthan5": Condition(Predicate().less_than(5)),
"morethan40": Condition(Predicate().greater_than(40)),
}
floats_conditions = {
"equals42.2": Condition(Predicate().equals(42.2)),
"lessthan5": Condition(Predicate().less_than(5)),
"morethan40": Condition(Predicate().greater_than(40)),
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_type=Integral, metrics=[ConditionCountMetricSpec(ints_conditions)])
schema.add_resolver_spec(column_type=Fractional, metrics=[ConditionCountMetricSpec(floats_conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['types/fractional','types/integral','condition_count/lessthan5', 'condition_count/morethan40','condition_count/equals42','condition_count/equals42.2', 'condition_count/total']]
| column | types/fractional | types/integral | condition_count/lessthan5 | condition_count/morethan40 | condition_count/equals42 | condition_count/equals42.2 | condition_count/total |
|---|---|---|---|---|---|---|---|
| floats_column | 4 | 0 | 2 | 1 | NaN | 1.0 | 4 |
| ints_column | 0 | 4 | 2 | 1 | 1.0 | NaN | 4 |
import pandas as pd
import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Not, Predicate
from whylogs.core.schema import DeclarativeSchema
data = {
"ints_column": [1,12,42,4],
"floats_column": [1.2, 12.3, 42.2, 4.8]
}
df = pd.DataFrame(data=data)
conditions = {
"between10and50": Condition(Predicate().greater_than(10).and_(Predicate().less_than(50))),
"outside10and50": Condition(Predicate().less_than(10).or_(Predicate().greater_than(50))),
"not_42": Condition(Not(Predicate().equals(42))), # could also use X.not_equal(42) or X.not_.equals(42)
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(conditions)])
schema.add_resolver_spec(column_name="floats_column", metrics=[ConditionCountMetricSpec(conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/between10and50', 'condition_count/outside10and50', 'condition_count/not_42', 'condition_count/total']]
| column | condition_count/between10and50 | condition_count/outside10and50 | condition_count/not_42 | condition_count/total |
|---|---|---|---|---|
| floats_column | 2 | 2 | 4 | 4 |
| ints_column | 2 | 2 | 3 | 4 |
import pandas as pd
from typing import Any
import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DeclarativeSchema
def even(x: Any) -> bool:
    return x % 2 == 0
def preprocessed(x: Any) -> bool:
    stopwords = ["i", "me", "myself"]
    if not isinstance(x, str):
        return False
    # should have only lowercase letters and spaces (no digits)
    if not all(c.islower() or c.isspace() for c in x):
        return False
    # should not contain any words in our stopwords list
    if any(word in stopwords for word in x.split()):
        return False
    return True
data = {
"transcriptions": ["I AM BOB AND I LIKE TO SCREAM","i am bob","am alice and am xx years old","am bob and am 42 years old"],
"ints_column": [1,12,42,4],
"floats_column": [1.2, 12.3, 42.2, 4.8]
}
df = pd.DataFrame(data=data)
transcriptions_conditions = {
"isPreprocessed": Condition(Predicate().is_(preprocessed)),
}
numerical_conditions = {
"isEven": Condition(Predicate().is_(even)),
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(numerical_conditions)])
schema.add_resolver_spec(column_name="floats_column", metrics=[ConditionCountMetricSpec(numerical_conditions)])
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(transcriptions_conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/isPreprocessed','condition_count/isEven', 'condition_count/total']]
| column | condition_count/isPreprocessed | condition_count/isEven | condition_count/total |
|---|---|---|---|
| floats_column | NaN | 0.0 | 4 |
| ints_column | NaN | 3.0 | 4 |
| transcriptions | 1.0 | NaN | 4 |