When logging data, whylogs outputs certain metrics according to the column type. While whylogs provides a sensible default behavior, you can configure it so that only the metrics that matter to you are tracked.
In this example, we'll see how you can configure the schema at the dataset level to control which metrics you want to calculate. We'll see how to specify metrics:
- Per data type
- Per column name
But first, let's talk briefly about whylogs' data types and basic metrics.
whylogs maps different data types, like numpy arrays, lists, integers, etc., to specific whylogs data types. The three most important whylogs data types are:
- Integral
- Fractional
- String

Anything that doesn't match one of the types above will have an AnyType type.
If you want to check which whylogs type a certain Python type is mapped to, you can use the `StandardTypeMapper`:
from whylogs.core.datatypes import StandardTypeMapper
type_mapper = StandardTypeMapper()
type_mapper(list)
<whylogs.core.datatypes.AnyType at 0x7efe0fb48ca0>
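You can check a few more built-in types the same way. The exact mappings may vary with your whylogs version, but int, float, and str are expected to resolve to Integral, Fractional, and String respectively (a quick illustrative check, reusing the type_mapper defined above):

print(type_mapper(int))    # expected: an Integral instance
print(type_mapper(float))  # expected: a Fractional instance
print(type_mapper(str))    # expected: a String instance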
The standard metrics available in whylogs are grouped in namespaces, such as `counts`, `types`, `distribution`, `cardinality`, and `frequent_items`.
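If you want to see every namespace available in your version of whylogs, one quick way is to enumerate the StandardMetric enum yourself (a small sketch, assuming its members mirror the metric namespaces, as they do in the examples below):

from whylogs.core.metrics import StandardMetric

# Each StandardMetric member corresponds to a metric namespace (counts, types, distribution, ...)
for metric in StandardMetric:
    print(metric.name)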
Now, let's see how we can control which metrics are tracked according to the column's type or column name.
Let's assume you're not interested in every metric listed above, and you have a performance-critical application, so you'd like to do as few calculations as possible.
For example, you might only be interested in:
- `counts` and `types` metrics for every data type
- `distribution` metrics for Fractional columns
- `frequent_items` metrics for Integral columns

Let's see how we can configure our schema to track only the metrics above for the related types.
Let's create a sample dataframe to illustrate:
import pandas as pd
d = {"col1": [1, 2, 3], "col2": [3.0, 4.0, 5.0], "col3": ["a", "b", "c"], "col4": [3.0, 4.0, 5.0]}
df = pd.DataFrame(data=d)
whylogs uses Resolvers to define how a column name or data type gets mapped to different metrics. We will need to create a custom Resolver class in order to customize that mapping.
from whylogs.core.resolvers import Resolver
from whylogs.core.datatypes import DataType, Fractional, Integral
from typing import Dict, List
from whylogs.core.metrics import StandardMetric
from whylogs.core.metrics.metrics import Metric


class MyCustomResolver(Resolver):
    """Resolver that keeps distribution metrics for Fractional and frequent items for Integral, and counts and types metrics for all data types."""

    def resolve(self, name: str, why_type: DataType, column_schema) -> Dict[str, Metric]:
        metrics: List[StandardMetric] = [StandardMetric.counts, StandardMetric.types]
        if isinstance(why_type, Fractional):
            metrics.append(StandardMetric.distribution)
        if isinstance(why_type, Integral):
            metrics.append(StandardMetric.frequent_items)

        result: Dict[str, Metric] = {}
        for m in metrics:
            result[m.name] = m.zero(column_schema)
        return result
In the case above, the `name` parameter is not used, since the column name is not relevant for mapping the metrics, only the `why_type`. We initialize `metrics` with the `counts` and `types` namespaces regardless of the data type. Then, we check the whylogs data type in order to append the desired metric namespace (`distribution` for Fractional columns and `frequent_items` for Integral columns).
Resolvers are passed to whylogs through a `DatasetSchema`, so we'll have to create a custom schema as well. In this case, since we're only interested in the resolvers, we can create a custom schema as follows:
from whylogs.core import DatasetSchema
class MyCustomSchema(DatasetSchema):
    resolvers = MyCustomResolver()
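As a side note, subclassing is not strictly required: the resolver can usually be passed straight to the schema constructor instead (a minimal sketch, assuming your whylogs version accepts resolvers as a DatasetSchema argument):

# Equivalent alternative to the subclass above (assumed constructor argument)
schema = DatasetSchema(resolvers=MyCustomResolver())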
Now we can proceed with the normal process of logging a dataframe, remembering to pass our schema to the `log` call:
import whylogs as why
result = why.log(df, schema=MyCustomSchema())
prof = result.profile()
prof_view = prof.view()
pd.set_option("display.max_columns", None)
prof_view.to_pandas()
column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | distribution/mean | distribution/stddev | distribution/n | distribution/max | distribution/min | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | type | frequent_items/frequent_strings
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
col2 | 3 | 0 | 0 | 3 | 0 | 0 | 0 | 4.0 | 1.0 | 3.0 | 5.0 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 | 5.0 | SummaryType.COLUMN | NaN
col4 | 3 | 0 | 0 | 3 | 0 | 0 | 0 | 4.0 | 1.0 | 3.0 | 5.0 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 | 5.0 | SummaryType.COLUMN | NaN
col3 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | SummaryType.COLUMN | NaN
col1 | 3 | 0 | 3 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | SummaryType.COLUMN | [FrequentItem(value='1.000000', est=1, upper=1...
Notice we have `counts` and `types` metrics for every type, `distribution` metrics only for `col2` and `col4` (floats), and `frequent_items` only for `col1` (ints). That's precisely what we wanted.
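If you prefer to verify this programmatically rather than by reading the table, a small pandas check over the profile summary works (a minimal sketch; it only inspects which metric namespaces produced non-null values for each column):

summary = prof_view.to_pandas()

# For every profiled column, list the metric namespaces that actually produced values
for col, row in summary.iterrows():
    namespaces = sorted({name.split("/")[0] for name in row.index[row.notna()] if "/" in name})
    print(col, namespaces)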
Now, suppose we don't want to specify the tracked metrics per data type, but rather per specific column.
For example, we might want to track:
- `counts` metrics for `col1`
- `distribution` metrics for `col2`
- `cardinality` metrics for `col3`
- `distribution` and `cardinality` metrics for `col4`
The process is similar to the previous case. We only need to change the if clauses to check for `name` instead of `why_type`, like this:
from whylogs.core.resolvers import Resolver
from whylogs.core.datatypes import DataType, Fractional, Integral
from typing import Dict, List
from whylogs.core.metrics import StandardMetric
from whylogs.core.metrics.metrics import Metric


class MyCustomResolver(Resolver):
    """Resolver that keeps counts metrics for col1, distribution metrics for col2, cardinality metrics for col3, and distribution and cardinality metrics for col4."""

    def resolve(self, name: str, why_type: DataType, column_schema) -> Dict[str, Metric]:
        metrics: List[StandardMetric] = []
        if name == "col1":
            metrics.append(StandardMetric.counts)
        if name == "col2":
            metrics.append(StandardMetric.distribution)
        if name == "col3":
            metrics.append(StandardMetric.cardinality)
        if name == "col4":
            metrics.append(StandardMetric.distribution)
            metrics.append(StandardMetric.cardinality)

        result: Dict[str, Metric] = {}
        for m in metrics:
            result[m.name] = m.zero(column_schema)
        return result
Since there are no metrics common to all columns, we can initialize `metrics` as an empty list and then append the relevant metrics for each column.
Now, we create a custom schema, just like before:
class MyCustomSchema(DatasetSchema):
    resolvers = MyCustomResolver()
import whylogs as why
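# Add a column that is not covered by our custom resolver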
df['col5'] = 0
result = why.log(df, schema=MyCustomSchema())
prof = result.profile()
prof_view = prof.view()
pd.set_option("display.max_columns", None)
prof_view.to_pandas()
column | counts/n | counts/null | type | distribution/mean | distribution/stddev | distribution/n | distribution/max | distribution/min | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | cardinality/est | cardinality/upper_1 | cardinality/lower_1
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
col1 | 3.0 | 0.0 | SummaryType.COLUMN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
col5 | NaN | NaN | SummaryType.COLUMN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
col4 | NaN | NaN | SummaryType.COLUMN | 4.0 | 1.0 | 3.0 | 5.0 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 | 5.0 | 3.0 | 3.00015 | 3.0
col3 | NaN | NaN | SummaryType.COLUMN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | 3.00015 | 3.0
col2 | NaN | NaN | SummaryType.COLUMN | 4.0 | 1.0 | 3.0 | 5.0 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 | 5.0 | NaN | NaN | NaN
Note that existing columns that are not specified in your custom resolver won't have any metrics tracked. In the example above, we added a `col5` column, but since we didn't link any metrics to it, all of its metrics are NaNs.
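If you'd rather give unanticipated columns a fallback instead of leaving them empty, you can add a default branch to the resolver. The sketch below keeps the same per-column mapping as above; the else clause and the choice of counts as the fallback are our own addition, not part of whylogs' default behavior:

class MyCustomResolverWithDefault(Resolver):
    """Same per-column mapping as before, but columns not listed fall back to counts metrics."""

    def resolve(self, name: str, why_type: DataType, column_schema) -> Dict[str, Metric]:
        metrics: List[StandardMetric] = []
        if name == "col1":
            metrics.append(StandardMetric.counts)
        elif name == "col2":
            metrics.append(StandardMetric.distribution)
        elif name == "col3":
            metrics.append(StandardMetric.cardinality)
        elif name == "col4":
            metrics.append(StandardMetric.distribution)
            metrics.append(StandardMetric.cardinality)
        else:
            # Fallback: any column not listed above (like col5) still gets counts metrics
            metrics.append(StandardMetric.counts)

        result: Dict[str, Metric] = {}
        for m in metrics:
            result[m.name] = m.zero(column_schema)
        return result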