🚩 Create a free WhyLabs account to get more value out of whylogs!
Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!
By default, columns of type `str` are tracked with metrics such as counts, types, cardinality, and frequent items when logged with whylogs.
In this example, we'll see how you can track additional metrics for string columns. We will do that by counting, for each string record, the number of characters that fall within a given unicode range, and then generating distribution metrics such as `mean`, `stddev`, and quantile values based on these counts. We'll also apply the same approach to the overall string length.
In this example, we're interested in tracking two specific ranges of characters:

- digits: code points 48-57 ("0" through "9")
- lowercase latin letters: code points 97-122 ("a" through "z")

For more info on the unicode list of characters, check this Wikipedia article.
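The range boundaries used below (48-57 and 97-122) are unicode code points. As a quick sanity check with plain Python, the built-in `ord` shows which characters fall inside each range (the `in_range` helper is just for illustration, it's not part of whylogs):

```python
# Code points for the two ranges we'll configure later:
# "digits" covers 48-57 and "alpha" covers 97-122.
print(ord("0"), ord("9"))  # 48 57
print(ord("a"), ord("z"))  # 97 122

# A character belongs to a range if its code point falls between the bounds:
def in_range(ch: str, lo: int, hi: int) -> bool:
    return lo <= ord(ch) <= hi

print(in_range("5", 48, 57))   # True
print(in_range("a", 97, 122))  # True
print(in_range("-", 97, 122))  # False
```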
If you haven't already, install whylogs:
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
Let's create a simple dataframe to demonstrate. To better visualize how the metrics work, we'll create 3 columns:

- `onlyDigits`: column of strings that contain only digit characters
- `onlyAlpha`: column of strings that contain only latin letters (no digits)
- `mixed`: column of strings that contain digits, letters, and other types of characters, like punctuation and symbols

import whylogs as why
import pandas as pd
data = {
"onlyDigits": ["12", "83", "1", "992", "7"],
"onlyAlpha": ["Alice", "Bob", "Chelsea", "Danny", "Eddie"],
"mixed": ["my_email_1989@gmail.com","ADK-1171","Copacabana 272 - Rio de Janeiro","21º C Friday - Sao Paulo, Brasil","18127819ASW"]
}
df = pd.DataFrame(data)
df.head()
|  | onlyDigits | onlyAlpha | mixed |
|---|---|---|---|
| 0 | 12 | Alice | my_email_1989@gmail.com |
| 1 | 83 | Bob | ADK-1171 |
| 2 | 1 | Chelsea | Copacabana 272 - Rio de Janeiro |
| 3 | 992 | Danny | 21º C Friday - Sao Paulo, Brasil |
| 4 | 7 | Eddie | 18127819ASW |
whylogs uses Resolvers to define the set of metrics tracked for a column name or data type. In this case, we'll create a custom Resolver to apply the UnicodeRangeMetric to all of the columns.

If you're interested in seeing how you can add or remove different metrics according to the column type or column name, please refer to this example on Schema Configuration.
from whylogs.core.schema import ColumnSchema, DatasetSchema
from whylogs.core.metrics.unicode_range import UnicodeRangeMetric
from whylogs.core.resolvers import Resolver
from whylogs.core.datatypes import DataType
from typing import Dict
from whylogs.core.metrics import Metric, MetricConfig
class UnicodeResolver(Resolver):
    def resolve(self, name: str, why_type: DataType, column_schema: ColumnSchema) -> Dict[str, Metric]:
        return {UnicodeRangeMetric.get_namespace(): UnicodeRangeMetric.zero(column_schema.cfg)}
Resolvers are passed to whylogs through a DatasetSchema, so we'll have to create a custom Schema as well. We'll just have to define a MetricConfig with our desired unicode ranges, and pass it, along with our UnicodeResolver, to the DatasetSchema:
config = MetricConfig(unicode_ranges={"digits": (48, 57), "alpha": (97, 122)})
schema = DatasetSchema(resolvers=UnicodeResolver(), default_configs=config)
If a MetricConfig is not passed, whylogs falls back to its default unicode ranges, which track ranges such as emoticons, control characters, and extended latin characters.
We can now log the dataframe, passing our schema when calling `log`:
import whylogs as why

prof_results = why.log(df, schema=schema)
prof = prof_results.profile()
Let's take a look at the Profile View:
profile_view_df = prof.view().to_pandas()
profile_view_df
| column | type | unicode_range/UNKNOWN:cardinality/est | unicode_range/UNKNOWN:cardinality/lower_1 | unicode_range/UNKNOWN:cardinality/upper_1 | unicode_range/UNKNOWN:counts/n | unicode_range/UNKNOWN:counts/null | unicode_range/UNKNOWN:distribution/max | unicode_range/UNKNOWN:distribution/mean | unicode_range/UNKNOWN:distribution/median | unicode_range/UNKNOWN:distribution/min | ... | unicode_range/string_length:distribution/q_95 | unicode_range/string_length:distribution/q_99 | unicode_range/string_length:distribution/stddev | unicode_range/string_length:ints/max | unicode_range/string_length:ints/min | unicode_range/string_length:types/boolean | unicode_range/string_length:types/fractional | unicode_range/string_length:types/integral | unicode_range/string_length:types/object | unicode_range/string_length:types/string |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mixed | SummaryType.COLUMN | 5.0 | 5.0 | 5.00025 | 0 | 0 | 9.0 | 4.0 | 4.0 | 0.0 | ... | 32.0 | 32.0 | 9.939819 | -9223372036854775807 | 9223372036854775807 | 0 | 0 | 0 | 0 | 0 |
| onlyAlpha | SummaryType.COLUMN | 1.0 | 1.0 | 1.00005 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 7.0 | 7.0 | 1.264911 | -9223372036854775807 | 9223372036854775807 | 0 | 0 | 0 | 0 | 0 |
| onlyDigits | SummaryType.COLUMN | 1.0 | 1.0 | 1.00005 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 3.0 | 3.0 | 0.748331 | -9223372036854775807 | 9223372036854775807 | 0 | 0 | 0 | 0 | 0 |

3 rows × 105 columns
You can see there are a lot of different metrics for each of the original dataframe's columns. Within the `unicode_range` metric, we have additional submetrics. In this case, we have metrics for:

- `digits`: characters in the range we named digits (48-57)
- `alpha`: characters in the range we named alpha (97-122)
- `UNKNOWN`: characters that don't fall in any of the defined ranges
- `string_length`: the overall length of each string

For each of these submetrics, we have metric components such as `cardinality`, `counts`, `distribution`, `types`, and `ints`.
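The profile's column names follow a `metric/submetric:component/value` pattern, as seen in the table above. As an illustrative sketch (the parsing helper below is hypothetical, not part of whylogs), a name can be split into its parts like this:

```python
# Hypothetical helper to break a profile column name into its parts.
# Names follow the pattern: <metric>/<submetric>:<component>/<value>
def parse_metric_name(name: str):
    metric, rest = name.split("/", 1)
    submetric, rest = rest.split(":", 1)
    component, value = rest.split("/", 1)
    return metric, submetric, component, value

print(parse_metric_name("unicode_range/alpha:distribution/mean"))
# ('unicode_range', 'alpha', 'distribution', 'mean')
```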
For instance, let's check the mean for `alpha`:

profile_view_df['unicode_range/alpha:distribution/mean']

column
mixed         12.8
onlyAlpha      5.0
onlyDigits     0.0
Name: unicode_range/alpha:distribution/mean, dtype: float64
The above values show a mean of 0 for `onlyDigits` - which is expected, since we don't have any letters in this column, only digits. We also have a mean of 5 for `onlyAlpha`, which coincides with the mean string length for the same column, since we only have letter characters in this column. For `mixed`, the mean is 12.8, and we can indeed see that this column has a higher count of letter characters than the previous columns.

You might notice that, even though we defined the range for only lowercase letters, uppercase characters are also included when calculating the metrics. That happens because the strings are all lowercased during preprocessing, before the metrics are tracked.
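We can reproduce the 12.8 mean by hand with plain Python. The snippet below is a sketch of the counting logic (not whylogs' actual implementation): lowercase each string, count the characters whose code points fall in the alpha range (97-122), and average the counts:

```python
mixed = [
    "my_email_1989@gmail.com",
    "ADK-1171",
    "Copacabana 272 - Rio de Janeiro",
    "21º C Friday - Sao Paulo, Brasil",
    "18127819ASW",
]

# Count characters in the "alpha" range (code points 97-122),
# lowercasing first, just as whylogs does during preprocessing
counts = [sum(97 <= ord(ch) <= 122 for ch in s.lower()) for s in mixed]
mean = sum(counts) / len(counts)
print(counts, mean)  # [15, 3, 22, 21, 3] 12.8
```

Note that "ADK-1171" contributes 3 letter characters, even though all of its letters are uppercase - the lowercasing step is what makes that happen.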
Let's now check the `UNKNOWN` namespace:

profile_view_df['unicode_range/UNKNOWN:distribution/mean']

column
mixed         4.0
onlyAlpha     0.0
onlyDigits    0.0
Name: unicode_range/UNKNOWN:distribution/mean, dtype: float64
Since we have only digits and letters in `onlyDigits` and `onlyAlpha`, there are no characters outside of the defined ranges, yielding means of 0. In the `mixed` column, however, this value is non-zero, since there are characters such as `.`, `-`, `º`, and whitespace, that are not in any of the defined ranges.
The last namespace, `string_length`, contains metrics for the string's length:

profile_view_df['unicode_range/string_length:distribution/min']

column
mixed         8.0
onlyAlpha     3.0
onlyDigits    1.0
Name: unicode_range/string_length:distribution/min, dtype: float64
The `string_length` submetric doesn't take into account any particular range. It contains aggregate metrics for the overall string length of each column. In this case, we're seeing the minimum value for the 3 columns: 1 for `onlyDigits`, 3 for `onlyAlpha`, and 8 for `mixed`. Since the dataframe used here is very small, we can easily check the original data and verify that these metrics are indeed correct.
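As a quick verification in plain Python, using the same data dictionary defined earlier, the minimum string length per column matches the profile's values:

```python
data = {
    "onlyDigits": ["12", "83", "1", "992", "7"],
    "onlyAlpha": ["Alice", "Bob", "Chelsea", "Danny", "Eddie"],
    "mixed": [
        "my_email_1989@gmail.com",
        "ADK-1171",
        "Copacabana 272 - Rio de Janeiro",
        "21º C Friday - Sao Paulo, Brasil",
        "18127819ASW",
    ],
}

# Minimum string length per column, mirroring
# unicode_range/string_length:distribution/min
mins = {col: min(len(s) for s in values) for col, values in data.items()}
print(mins)  # {'onlyDigits': 1, 'onlyAlpha': 3, 'mixed': 8}
```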
Feel free to define your own ranges of interest and combine the UnicodeRange metrics with other standard metrics as you see fit!
The resulting profiles can be stored, visualized, and monitored, or used for other purposes, such as data validation and drift detection.
Be sure to check the other examples at whylogs' Documentation!