🚩 Create a free WhyLabs account to get more value out of whylogs!
Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!
When logging data, whylogs outputs certain metrics according to the column type. While whylogs provides a sensible default behavior, you can configure it to track only the metrics that are important to you.
In this example, we'll see how you can configure a dataset-level Schema to control which metrics are calculated. We'll see how to specify metrics:
Per data type
Per column name
But first, let's talk briefly about whylogs' data types and basic metrics.
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
Installing collected packages: whylogs-sketching, types-urllib3, types-requests, whylabs-client, whylogs Successfully installed types-requests-2.31.0.2 types-urllib3-1.26.25.14 whylabs-client-0.5.4 whylogs-1.3.0 whylogs-sketching-3.4.1.dev3
whylogs maps different Python data types, like numpy arrays, lists, integers, etc., to specific whylogs data types. The three most important whylogs data types are:

- Integral
- Fractional
- String

Anything that doesn't end up matching the above types will have an AnyType type.
To check which type a certain Python type is mapped to in whylogs, you can use the StandardTypeMapper:
from whylogs.core.datatypes import StandardTypeMapper
type_mapper = StandardTypeMapper()
type_mapper(list)
<whylogs.core.datatypes.AnyType at 0x7dde641a70d0>
The standard metrics available in whylogs are grouped in namespaces. Among them are:

- counts
- types
- distribution
- ints
- cardinality
- frequent_items
- unicode_range
- condition_count
Now, let's see how we can control which metrics are tracked according to the column's type or column name.
Let's assume you're not interested in every metric listed above, and you have a performance-critical application, so you'd like to do as few calculations as possible.
For example, you might only be interested in:

- counts and types metrics for every data type
- distribution metrics for Fractional columns
- frequent_items metrics for Integral columns
Let's see how we can configure our Schema to track only the above metrics for the related types.
Let's create a sample dataframe to illustrate:
# Install pandas if you don't have it already
%pip install pandas
import pandas as pd
d = {"col1": [1, 2, 3], "col2": [3.0, 4.0, 5.0], "col3": ["a", "b", "c"], "col4": [3.0, 4.0, 5.0]}
df = pd.DataFrame(data=d)
whylogs uses Resolvers to define how a column name or data type gets mapped to different metrics. We will create a custom Resolver class in order to customize this behavior.
from whylogs.core.resolvers import Resolver
from whylogs.core.datatypes import DataType, Fractional, Integral
from typing import Dict, List
from whylogs.core.metrics import StandardMetric
from whylogs.core.metrics.metrics import Metric
class MyCustomResolver(Resolver):
    """Resolver that keeps distribution metrics for Fractional and frequent items for Integral, and counts and types metrics for all data types."""

    def resolve(self, name: str, why_type: DataType, column_schema) -> Dict[str, Metric]:
        metrics: List[StandardMetric] = [StandardMetric.counts, StandardMetric.types]
        if isinstance(why_type, Fractional):
            metrics.append(StandardMetric.distribution)
        if isinstance(why_type, Integral):
            metrics.append(StandardMetric.frequent_items)

        result: Dict[str, Metric] = {}
        for m in metrics:
            result[m.name] = m.zero(column_schema.cfg)
        return result
In the case above, the name parameter is not used, since the column name is not relevant for mapping the metrics, only the why_type. We initialize metrics with the counts and types namespaces regardless of the data type. Then, we check the whylogs data type to add the desired metric namespace (distribution for Fractional columns and frequent_items for Integral columns).
Now we can proceed with the normal process of logging a dataframe. Resolvers are passed to whylogs through a Dataset Schema, so we can pass a DatasetSchema object to log's schema parameter as follows:
import whylogs as why
from whylogs.core import DatasetSchema
result = why.log(df, schema=DatasetSchema(resolvers=MyCustomResolver()))
prof = result.profile()
prof_view = prof.view()
pd.set_option("display.max_columns", None)
prof_view.to_pandas()
WARNING:whylogs.api.whylabs.session.session_manager:No session found. Call whylogs.init() to initialize a session and authenticate. See https://docs.whylabs.ai/docs/whylabs-whylogs-init for more information.
counts/inf | counts/n | counts/nan | counts/null | frequent_items/frequent_strings | type | types/boolean | types/fractional | types/integral | types/object | types/string | types/tensor | distribution/max | distribution/mean | distribution/median | distribution/min | distribution/n | distribution/q_01 | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | distribution/stddev | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
column | ||||||||||||||||||||||||||
col1 | 0 | 3 | 0 | 0 | [FrequentItem(value='1', est=1, upper=1, lower... | SummaryType.COLUMN | 0 | 0 | 3 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
col2 | 0 | 3 | 0 | 0 | NaN | SummaryType.COLUMN | 0 | 3 | 0 | 0 | 0 | 0 | 5.0 | 4.0 | 4.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 1.0 |
col3 | 0 | 3 | 0 | 0 | NaN | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 3 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
col4 | 0 | 3 | 0 | 0 | NaN | SummaryType.COLUMN | 0 | 3 | 0 | 0 | 0 | 0 | 5.0 | 4.0 | 4.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 1.0 |
Notice we have counts and types metrics for every column, distribution metrics only for col2 and col4 (fractional), and frequent_items only for col1 (integral). That's precisely what we wanted.
Now, suppose we don't want to specify the tracked metrics per data type, but rather per specific column.
For example, we might want to track:

- counts metrics for col1
- distribution metrics for col2
- cardinality metrics for col3
- distribution and cardinality metrics for col4
The process is similar to the previous case. We only need to change the if clauses to check for the name instead of the why_type, like this:
from whylogs.core.resolvers import Resolver
from whylogs.core.datatypes import DataType, Fractional, Integral
from typing import Dict, List
from whylogs.core.metrics import StandardMetric
from whylogs.core.metrics.metrics import Metric
class MyCustomResolver(Resolver):
    """Resolver that tracks metrics per column name: counts for col1, distribution for col2, cardinality for col3, and distribution and cardinality for col4."""

    def resolve(self, name: str, why_type: DataType, column_schema) -> Dict[str, Metric]:
        metrics = []
        if name == "col1":
            metrics.append(StandardMetric.counts)
        if name == "col2":
            metrics.append(StandardMetric.distribution)
        if name == "col3":
            metrics.append(StandardMetric.cardinality)
        if name == "col4":
            metrics.append(StandardMetric.distribution)
            metrics.append(StandardMetric.cardinality)

        result: Dict[str, Metric] = {}
        for m in metrics:
            result[m.name] = m.zero(column_schema.cfg)
        return result
Since there are no metrics common to all columns, we initialize metrics as an empty list and then append the relevant metrics for each column.
Now, we create a custom schema, just like before:
import whylogs as why
from whylogs.core import DatasetSchema
df['col5'] = 0
result = why.log(df, schema=DatasetSchema(resolvers=MyCustomResolver()))
prof = result.profile()
prof_view = prof.view()
pd.set_option("display.max_columns", None)
prof_view.to_pandas()
counts/inf | counts/n | counts/nan | counts/null | type | distribution/max | distribution/mean | distribution/median | distribution/min | distribution/n | distribution/q_01 | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | distribution/stddev | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
column | ||||||||||||||||||||||
col1 | 0.0 | 3.0 | 0.0 | 0.0 | SummaryType.COLUMN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
col2 | NaN | NaN | NaN | NaN | SummaryType.COLUMN | 5.0 | 4.0 | 4.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 1.0 | NaN | NaN | NaN |
col3 | NaN | NaN | NaN | NaN | SummaryType.COLUMN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | 3.0 | 3.00015 |
col4 | NaN | NaN | NaN | NaN | SummaryType.COLUMN | 5.0 | 4.0 | 4.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 1.0 | 3.0 | 3.0 | 3.00015 |
col5 | NaN | NaN | NaN | NaN | SummaryType.COLUMN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Note that existing columns that are not specified in your custom resolver won't have any metrics tracked. In the example above, we added a col5 column, but since we didn't link any metrics to it, all of its metrics are NaNs.
In the previous section, we created subclasses of Resolver and implemented its resolve() method using control flow. The DeclarativeSchema allows us to customize the metrics present in a column by simply listing the metrics we want by data type or column name, without implementing a Resolver subclass.

A ResolverSpec specifies a list of metrics to use for columns that match it. We can match columns by name or by type; the column name takes precedence if both are given. Each ResolverSpec has a list of MetricSpecs that specify the Metrics (and optionally custom configurations) to apply to matching columns. For example:
from whylogs.core.metrics.condition_count_metric import (
    Condition,
    ConditionCountConfig,
    ConditionCountMetric,
)
from whylogs.core.relations import Predicate
from whylogs.core.resolvers import COLUMN_METRICS, MetricSpec, ResolverSpec
from whylogs.core.schema import DeclarativeSchema
from whylogs.core.datatypes import AnyType, DataType, Fractional, Integral, String
X = Predicate()
schema = DeclarativeSchema(
    [
        ResolverSpec(
            column_name="col1",
            metrics=[
                MetricSpec(StandardMetric.distribution.value),
                MetricSpec(
                    ConditionCountMetric,
                    ConditionCountConfig(
                        conditions={
                            "below 42": Condition(lambda x: x < 42),
                            "above 42": Condition(lambda x: x > 42),
                        }
                    ),
                ),
            ],
        ),
        ResolverSpec(
            column_type=String,
            metrics=[
                MetricSpec(StandardMetric.frequent_items.value),
                MetricSpec(
                    ConditionCountMetric,
                    ConditionCountConfig(
                        conditions={
                            "alpha": Condition(X.matches("[a-zA-Z]+")),
                            "digit": Condition(X.matches("[0-9]+")),
                        }
                    ),
                ),
            ],
        ),
    ]
)
d = {"col1": [1, 2, 3], "col2": [3.0, 4.0, 5.0], "col3": ["a", "b", "c"], "col4": [3.0, 4.0, 5.0]}
df = pd.DataFrame(data=d)
result = why.log(df, schema=schema)
prof_view = result.profile().view()
prof_view.to_pandas()
condition_count/above 42 | condition_count/below 42 | condition_count/total | distribution/max | distribution/mean | distribution/median | distribution/min | distribution/n | distribution/q_01 | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | distribution/stddev | type | condition_count/alpha | condition_count/digit | frequent_items/frequent_strings | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
column | |||||||||||||||||||||
col1 | 0.0 | 3.0 | 3.0 | 3.0 | 2.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.0 | SummaryType.COLUMN | NaN | NaN | NaN |
col2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | SummaryType.COLUMN | NaN | NaN | NaN |
col3 | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | SummaryType.COLUMN | 3.0 | 0.0 | [FrequentItem(value='c', est=1, upper=1, lower... |
col4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | SummaryType.COLUMN | NaN | NaN | NaN |
We can now pass schema to why.log() to log data according to the schema. Note that we pass the Metric class to the MetricSpec constructor, not an instance. In this example, col1 will have a ConditionCountMetric that tracks how often the column entries are above or below 42, and any String column will track how many entries are alphabetic and how many are numeric.
whylogs.core.resolvers.COLUMN_METRICS is a list of MetricSpecs for the metrics WhyLabs expects in each column. There are also some predefined ResolverSpec lists to cover common use cases. For example, STANDARD_RESOLVER specifies the same metrics as the StandardResolver:
STANDARD_RESOLVER = [
    ResolverSpec(
        column_type=Integral,
        metrics=COLUMN_METRICS
        + [
            MetricSpec(StandardMetric.distribution.value),
            MetricSpec(StandardMetric.ints.value),
            MetricSpec(StandardMetric.cardinality.value),
            MetricSpec(StandardMetric.frequent_items.value),
        ],
    ),
    ResolverSpec(
        column_type=Fractional,
        metrics=COLUMN_METRICS
        + [
            MetricSpec(StandardMetric.distribution.value),
            MetricSpec(StandardMetric.cardinality.value),
        ],
    ),
    ResolverSpec(
        column_type=String,
        metrics=COLUMN_METRICS
        + [
            MetricSpec(StandardMetric.unicode_range.value),
            MetricSpec(StandardMetric.distribution.value),
            MetricSpec(StandardMetric.cardinality.value),
            MetricSpec(StandardMetric.frequent_items.value),
        ],
    ),
    ResolverSpec(column_type=AnyType, metrics=COLUMN_METRICS),
]
There are also declarations for:

- LIMITED_TRACKING_RESOLVER, which just tracks the metrics required by WhyLabs, plus the distribution metric for numeric columns.
- NO_FI_RESOLVER, which is the same as STANDARD_RESOLVER but omits the frequent items metrics.
- HISTOGRAM_COUNTING_TRACKING_RESOLVER, which tracks only the distribution metric for each column.

These provide handy starting places if we just want to add one or two metrics to one of these standard schemas using the add_resolver() method:
from whylogs.core.resolvers import STANDARD_RESOLVER
schema = DeclarativeSchema(STANDARD_RESOLVER)
extra_metric = ResolverSpec(
    column_name="col1",
    metrics=[
        MetricSpec(StandardMetric.distribution.value),
        MetricSpec(
            ConditionCountMetric,
            ConditionCountConfig(
                conditions={
                    "below 42": Condition(lambda x: x < 42),
                    "above 42": Condition(lambda x: x > 42),
                }
            ),
        ),
    ],
)
schema.add_resolver(extra_metric)
result = why.log(df, schema=schema)
prof_view = result.profile().view()
prof_view.to_pandas()
WARNING:whylogs.core.resolvers:Conflicting resolvers for distribution metric in column 'col1' of type int
cardinality/est | cardinality/lower_1 | cardinality/upper_1 | condition_count/above 42 | condition_count/below 42 | condition_count/total | counts/inf | counts/n | counts/nan | counts/null | distribution/max | distribution/mean | distribution/median | distribution/min | distribution/n | distribution/q_01 | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | distribution/stddev | frequent_items/frequent_strings | ints/max | ints/min | type | types/boolean | types/fractional | types/integral | types/object | types/string | types/tensor | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
column | ||||||||||||||||||||||||||||||||||
col1 | 3.0 | 3.0 | 3.00015 | 0.0 | 3.0 | 3.0 | 0 | 3 | 0 | 0 | 3.0 | 2.0 | 2.0 | 1.0 | 3 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.0 | [FrequentItem(value='1', est=1, upper=1, lower... | 3.0 | 1.0 | SummaryType.COLUMN | 0 | 0 | 3 | 0 | 0 | 0 |
col2 | 3.0 | 3.0 | 3.00015 | NaN | NaN | NaN | 0 | 3 | 0 | 0 | 5.0 | 4.0 | 4.0 | 3.0 | 3 | 3.0 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 1.0 | NaN | NaN | NaN | SummaryType.COLUMN | 0 | 3 | 0 | 0 | 0 | 0 |
col3 | 3.0 | 3.0 | 3.00015 | NaN | NaN | NaN | 0 | 3 | 0 | 0 | NaN | 0.0 | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | [FrequentItem(value='c', est=1, upper=1, lower... | NaN | NaN | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 3 | 0 |
col4 | 3.0 | 3.0 | 3.00015 | NaN | NaN | NaN | 0 | 3 | 0 | 0 | 5.0 | 4.0 | 4.0 | 3.0 | 3 | 3.0 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 1.0 | NaN | NaN | NaN | SummaryType.COLUMN | 0 | 3 | 0 | 0 | 0 | 0 |
This example adds a condition count metric to col1 in addition to the usual default metrics.
If you instantiate a DeclarativeResolver without passing it a list of ResolverSpecs, it will use the value of the variable whylogs.core.resolvers.DEFAULT_RESOLVER. Initially this has the value of STANDARD_RESOLVER, which matches whylogs' default behavior. You can set it to one of the other pre-defined resolver lists, or to your own custom resolver list, to customize the default resolving behavior.
Similarly, there is a whylogs.experimental.core.metrics.udf_metric.DEFAULT_UDF_RESOLVER variable that specifies the default resolvers for the submetrics in a UdfMetric.
The ResolverSpec has an exclude field. If this is set to True, the metrics listed in the ResolverSpec are excluded from columns that match it. This can be handy for preventing sensitive information from "leaking" via a frequent items metric:
from whylogs.core.resolvers import DEFAULT_RESOLVER
data = pd.DataFrame({"Sensitive": ["private", "secret"], "Boring": ["normal", "stuff"]})
schema = DeclarativeSchema(
    DEFAULT_RESOLVER
    + [
        ResolverSpec(
            column_name="Sensitive",
            metrics=[MetricSpec(StandardMetric.frequent_items.value)],
            exclude=True,
        )
    ]
)
result = why.log(data, schema=schema)
result.profile().view().to_pandas()["frequent_items/frequent_strings"]
column
Boring       [FrequentItem(value='normal', est=1, upper=1, ...
Sensitive                                                  NaN
Name: frequent_items/frequent_strings, dtype: object
The frequent items metric has been excluded from the Sensitive column without affecting the DEFAULT_RESOLVER's treatment of other columns.