🚩 Create a free WhyLabs account to get more value out of whylogs!
Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!
By default, columns of type `str` are tracked with metrics such as counts, types, cardinality, and frequent items when logged with whylogs.
In this example, we'll see how you can track additional metrics for string columns. We will do that by counting, for each string record, the number of characters that fall within a given unicode range, and then generating distribution metrics such as `mean`, `stddev`, and quantile values based on these counts. We'll also apply the same approach to the overall string length.
In this example, we're interested in tracking two specific ranges of characters:

- digits: code points 48-57 ("0" through "9")
- lowercase latin letters: code points 97-122 ("a" through "z")

For more info on the unicode list of characters, check this Wikipedia article.
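The range boundaries used below (48-57 and 97-122) are unicode code points. As a quick sanity check with plain Python, the built-in `ord` shows which characters fall inside each range (the `in_range` helper is just for illustration, it's not part of whylogs):

```python
# Code points for the two ranges we'll configure later:
# "digits" covers 48-57 and "alpha" covers 97-122.
print(ord("0"), ord("9"))  # 48 57
print(ord("a"), ord("z"))  # 97 122

# A character belongs to a range if its code point falls between the bounds:
def in_range(ch: str, lo: int, hi: int) -> bool:
    return lo <= ord(ch) <= hi

print(in_range("5", 48, 57))   # True
print(in_range("a", 97, 122))  # True
print(in_range("-", 97, 122))  # False
```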
If you haven't already, install whylogs:
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
Let's create a simple dataframe to demonstrate. To better visualize how the metrics work, we'll create 3 columns:

- `onlyDigits`: column of strings that contain only digit characters
- `onlyAlpha`: column of strings that contain only latin letters (no digits)
- `mixed`: column of strings that contain digits, letters, and other types of characters, like punctuation and symbols

import whylogs as why
import pandas as pd
data = {
"onlyDigits": ["12", "83", "1", "992", "7"],
"onlyAlpha": ["Alice", "Bob", "Chelsea", "Danny", "Eddie"],
"mixed": ["my_email_1989@gmail.com","ADK-1171","Copacabana 272 - Rio de Janeiro","21º C Friday - Sao Paulo, Brasil","18127819ASW"]
}
df = pd.DataFrame(data)
df.head()
|  | onlyDigits | onlyAlpha | mixed |
|---|---|---|---|
| 0 | 12 | Alice | my_email_1989@gmail.com |
| 1 | 83 | Bob | ADK-1171 |
| 2 | 1 | Chelsea | Copacabana 272 - Rio de Janeiro |
| 3 | 992 | Danny | 21º C Friday - Sao Paulo, Brasil |
| 4 | 7 | Eddie | 18127819ASW |
whylogs uses Resolvers to define the set of metrics tracked for a column name or data type. In this case, we'll create a custom Resolver to apply the UnicodeRangeMetric to all of the columns.

If you're interested in seeing how you can add or remove different metrics according to the column type or column name, please refer to this example on Schema Configuration.
from whylogs.core.schema import ColumnSchema, DatasetSchema
from whylogs.core.metrics.unicode_range import UnicodeRangeMetric
from whylogs.core.resolvers import Resolver
from whylogs.core.datatypes import DataType
from typing import Dict
from whylogs.core.metrics import Metric, MetricConfig
class UnicodeResolver(Resolver):
    def resolve(self, name: str, why_type: DataType, column_schema: ColumnSchema) -> Dict[str, Metric]:
        return {UnicodeRangeMetric.get_namespace(): UnicodeRangeMetric.zero(column_schema.cfg)}
Resolvers are passed to whylogs through a DatasetSchema, so we'll have to create a custom Schema as well. We'll just have to define a MetricConfig with our desired unicode ranges, and pass it, along with our UnicodeResolver, to the DatasetSchema:
config = MetricConfig(unicode_ranges={"digits": (48, 57), "alpha": (97, 122)})
schema = DatasetSchema(resolvers=UnicodeResolver(), default_configs=config)
If a MetricConfig is not passed, whylogs falls back to its default unicode ranges, which track ranges such as emoticons, control characters, and extended latin characters.
We can now log the dataframe, passing our schema when calling `log`:
import whylogs as why

prof_results = why.log(df, schema=schema)
prof = prof_results.profile()
Let's take a look at the Profile View:
profile_view_df = prof.view().to_pandas()
profile_view_df
| column | type | unicode_range/UNKNOWN:cardinality/est | unicode_range/UNKNOWN:cardinality/lower_1 | unicode_range/UNKNOWN:cardinality/upper_1 | unicode_range/UNKNOWN:counts/n | unicode_range/UNKNOWN:counts/null | unicode_range/UNKNOWN:distribution/max | unicode_range/UNKNOWN:distribution/mean | unicode_range/UNKNOWN:distribution/median | unicode_range/UNKNOWN:distribution/min | ... | unicode_range/string_length:distribution/q_95 | unicode_range/string_length:distribution/q_99 | unicode_range/string_length:distribution/stddev | unicode_range/string_length:ints/max | unicode_range/string_length:ints/min | unicode_range/string_length:types/boolean | unicode_range/string_length:types/fractional | unicode_range/string_length:types/integral | unicode_range/string_length:types/object | unicode_range/string_length:types/string |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mixed | SummaryType.COLUMN | 5.0 | 5.0 | 5.00025 | 0 | 0 | 9.0 | 4.0 | 4.0 | 0.0 | ... | 32.0 | 32.0 | 9.939819 | -9223372036854775807 | 9223372036854775807 | 0 | 0 | 0 | 0 | 0 |
| onlyAlpha | SummaryType.COLUMN | 1.0 | 1.0 | 1.00005 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 7.0 | 7.0 | 1.264911 | -9223372036854775807 | 9223372036854775807 | 0 | 0 | 0 | 0 | 0 |
| onlyDigits | SummaryType.COLUMN | 1.0 | 1.0 | 1.00005 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 3.0 | 3.0 | 0.748331 | -9223372036854775807 | 9223372036854775807 | 0 | 0 | 0 | 0 | 0 |

3 rows × 105 columns
You can see there are a lot of different metrics for each of the original dataframe's columns. Within the `unicode_range` metric, we have additional submetrics. In this case, we have metrics for:

- `digits`: characters in the range we named digits (48-57)
- `alpha`: characters in the range we named alpha (97-122)
- `UNKNOWN`: characters that don't fall in any of the defined ranges
- `string_length`: the overall length of each string

For each of these submetrics, we have metric components such as `cardinality`, `counts`, `distribution`, `types`, and `ints`.
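The profile's column names follow a `metric/submetric:component/value` pattern, as seen in the table above. As an illustrative sketch (the parsing helper below is hypothetical, not part of whylogs), a name can be split into its parts like this:

```python
# Hypothetical helper to break a profile column name into its parts.
# Names follow the pattern: <metric>/<submetric>:<component>/<value>
def parse_metric_name(name: str):
    metric, rest = name.split("/", 1)
    submetric, rest = rest.split(":", 1)
    component, value = rest.split("/", 1)
    return metric, submetric, component, value

print(parse_metric_name("unicode_range/alpha:distribution/mean"))
# ('unicode_range', 'alpha', 'distribution', 'mean')
```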
For instance, let's check the mean for `alpha`:

profile_view_df['unicode_range/alpha:distribution/mean']

column
mixed         12.8
onlyAlpha      5.0
onlyDigits     0.0
Name: unicode_range/alpha:distribution/mean, dtype: float64
The above values show a mean of 0 for `onlyDigits` - which is expected, since we don't have any letters in this column, only digits. We also have a mean of 5 for `onlyAlpha`, which coincides with the mean string length for the same column, since we only have letter characters in this column. For `mixed`, the mean is 12.8, and we can indeed see that this column has a higher count of letter characters than the previous columns.

You might notice that, even though we defined the range for only lowercase letters, uppercase characters are also included when calculating the metrics. That happens because the strings are all lowercased during preprocessing, before the metrics are tracked.
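We can reproduce the 12.8 mean by hand with plain Python. The snippet below is a sketch of the counting logic (not whylogs' actual implementation): lowercase each string, count the characters whose code points fall in the alpha range (97-122), and average the counts:

```python
mixed = [
    "my_email_1989@gmail.com",
    "ADK-1171",
    "Copacabana 272 - Rio de Janeiro",
    "21º C Friday - Sao Paulo, Brasil",
    "18127819ASW",
]

# Count characters in the "alpha" range (code points 97-122),
# lowercasing first, just as whylogs does during preprocessing
counts = [sum(97 <= ord(ch) <= 122 for ch in s.lower()) for s in mixed]
mean = sum(counts) / len(counts)
print(counts, mean)  # [15, 3, 22, 21, 3] 12.8
```

Note that "ADK-1171" contributes 3 letter characters, even though all of its letters are uppercase - the lowercasing step is what makes that happen.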
Let's now check the `UNKNOWN` namespace:

profile_view_df['unicode_range/UNKNOWN:distribution/mean']

column
mixed         4.0
onlyAlpha     0.0
onlyDigits    0.0
Name: unicode_range/UNKNOWN:distribution/mean, dtype: float64
Since we have only digits and letters in `onlyDigits` and `onlyAlpha`, there are no characters outside of the defined ranges, yielding means of 0. In the `mixed` column, however, this value is non-zero, since there are characters such as `.`, `-`, `º`, and whitespace, that are not in any of the defined ranges.
The last namespace, `string_length`, contains metrics for the string's length:

profile_view_df['unicode_range/string_length:distribution/min']

column
mixed         8.0
onlyAlpha     3.0
onlyDigits    1.0
Name: unicode_range/string_length:distribution/min, dtype: float64
The `string_length` submetric doesn't take into account any particular range. It contains aggregate metrics for the overall string length of each column. In this case, we're seeing the minimum value for the 3 columns: 1 for `onlyDigits`, 3 for `onlyAlpha`, and 8 for `mixed`. Since the dataframe used here is very small, we can easily check the original data and verify that these metrics are indeed correct.
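As a quick verification in plain Python, using the same data dictionary defined earlier, the minimum string length per column matches the profile's values:

```python
data = {
    "onlyDigits": ["12", "83", "1", "992", "7"],
    "onlyAlpha": ["Alice", "Bob", "Chelsea", "Danny", "Eddie"],
    "mixed": [
        "my_email_1989@gmail.com",
        "ADK-1171",
        "Copacabana 272 - Rio de Janeiro",
        "21º C Friday - Sao Paulo, Brasil",
        "18127819ASW",
    ],
}

# Minimum string length per column, mirroring
# unicode_range/string_length:distribution/min
mins = {col: min(len(s) for s in values) for col, values in data.items()}
print(mins)  # {'onlyDigits': 1, 'onlyAlpha': 3, 'mixed': 8}
```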
Feel free to define your own ranges of interest and combine the UnicodeRange metrics with other standard metrics as you see fit!
The resulting profiles can be stored, visualized, and monitored, or used for other purposes, such as data validation and drift detection.
Be sure to check the other examples at whylogs' Documentation!