🚩 Create a free WhyLabs account to get more value out of whylogs!

Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!

Inspecting Profiles¶

In this notebook, we'll show how you can use whylogs' profile view (profile.view()) to find useful statistics in a dataset.

This includes:

Counters, such as number of samples and null values
Inferred types, such as integral, fractional, boolean, and strings
Estimated cardinality
Frequent items
Distribution metrics: min, max, mean, median, standard deviation, and quantile values

Setup¶

We'll need the whylogs and pandas libraries for this example.

We'll also populate a dataframe with some data to inspect.

In [1]:

# install whylogs & pandas if needed
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
%pip install pandas 

In [2]:

# import whylogs and pandas
import whylogs as why
import pandas as pd

# Set to show all columns in dataframe
pd.set_option("display.max_columns", None)

In [3]:

# create a simple test dataset
data = {
    "animal": ["lion", "shark", "cat", "bear", "jellyfish", "kangaroo",
               "jellyfish", "jellyfish", "fish"],
    "legs": [4, 0, 4, 4.0, None, 2, None, None, "fins"],
    "weight": [14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2],
}

# Create dataframe with test dataset
df = pd.DataFrame(data)

Log data with whylogs, create a profile, and view statistics:¶

In [4]:

# Log data with whylogs & create profile
results = why.log(pandas=df)
profile = results.profile()

# Create profile view dataframe
prof_view = profile.view()
prof_df = prof_view.to_pandas()

In [4]:

# View Profile dataframe for dataset statistics
prof_df

Out[4]:

	counts/n	counts/null	types/integral	types/fractional	types/boolean	types/string	types/object	cardinality/est	cardinality/upper_1	cardinality/lower_1	frequent_items/frequent_strings	type	distribution/mean	distribution/stddev	distribution/n	distribution/max	distribution/min	distribution/q_10	distribution/q_25	distribution/median	distribution/q_75	distribution/q_90
column
legs	9	3	4	1	0	1	0	4.0	4.00020	4.0	[FrequentItem(value='4.000000', est=3, upper=3...	SummaryType.COLUMN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
weight	9	0	0	9	0	0	0	9.0	9.00045	9.0	NaN	SummaryType.COLUMN	20.955556	38.29749	9.0	120.0	1.2	1.2	2.2	4.3	14.3	120.0
animal	9	0	0	0	0	9	0	7.0	7.00035	7.0	[FrequentItem(value='jellyfish', est=3, upper=...	SummaryType.COLUMN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

The profile dataframe has one row for each column of the logged data. Each column of the profile dataframe holds one component of a given metric, such as counts/n or distribution/mean.

Taking a quick look at the generated statistics:

animal¶

The animal row shows there are 9 entries (counts/n), all of them strings (types/string). The cardinality estimate indicates 7 distinct animals in the dataset. Frequent items show jellyfish appearing the most.

weight¶

Our weight data contains 9 entries, all of them fractional values. Cardinality shows that all 9 values are estimated to be unique. Since all entries are numerical, the distribution statistics are generated.

legs¶

We can see that there are 9 entries for legs, spread across several data types: 3 nulls, 4 integrals, 1 fractional, and 1 string. Cardinality estimates 4 unique values (cardinality/est). The most frequent number of legs in the dataset is 4.

Selecting a single value¶

A single cell can be selected to see full results if needed.

In [64]:

# Select a single statistic by feature and row
prof_df['frequent_items/frequent_strings']['animal']

Out[64]:

[FrequentItem(value='jellyfish', est=3, upper=3, lower=3),
 FrequentItem(value='cat', est=1, upper=1, lower=1),
 FrequentItem(value='lion', est=1, upper=1, lower=1),
 FrequentItem(value='fish', est=1, upper=1, lower=1),
 FrequentItem(value='shark', est=1, upper=1, lower=1),
 FrequentItem(value='kangaroo', est=1, upper=1, lower=1),
 FrequentItem(value='bear', est=1, upper=1, lower=1)]
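
The chained lookup above picks the metric column first and then the row. As a minimal sketch, the same cell can also be read in one step with pandas' .loc[row, column]. The stand_in dataframe below is a hypothetical, hand-built substitute for prof_df (values copied from the table above), used only so the snippet runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for prof_df: rows are logged columns,
# columns are metric components (values copied from the table above).
stand_in = pd.DataFrame(
    {"counts/n": [9, 9, 9], "cardinality/est": [4.0, 9.0, 7.0]},
    index=pd.Index(["legs", "weight", "animal"], name="column"),
)

# Chained indexing, as in the cell above: metric column first, then row.
chained = stand_in["cardinality/est"]["animal"]

# Single-step lookup: row label first, then metric column.
direct = stand_in.loc["animal", "cardinality/est"]

# Selecting only the row returns every statistic for one logged column.
animal_stats = stand_in.loc["animal"]

print(chained, direct)
print(animal_stats)
```

Both forms return the same value; .loc simply does the lookup in a single indexing operation.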

Understanding The whylogs Profile Statistics¶

By default, whylogs automatically generates these metrics based on the inferred data types.

The standard metrics available in whylogs are grouped in namespaces. They are:

Counts & Inferred Data Types:¶

Counts and inferred data types track how many entries exist and what type of data they contain.

counts/n - the total number of entries in a feature
counts/null - the number of null values
types/integral - the number of values consisting of an integral (whole number)
types/fractional - the number of values consisting of a fractional value (float)
types/boolean - the number of values consisting of a boolean
types/string - the number of values consisting of a string
types/object - the number of values consisting of an object. Data that matches none of the previous types is counted as an object
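
To make the counters concrete, here is a minimal pure-Python sketch that tallies the legs values from the test dataset into the same buckets. The infer_type helper is hypothetical and written only for illustration; it is not how whylogs infers types internally.

```python
from collections import Counter

# The "legs" values from the test dataset above.
legs = [4, 0, 4, 4.0, None, 2, None, None, "fins"]


def infer_type(value):
    """Map a Python value to a bucket named after the counts/types columns."""
    if value is None:
        return "null"
    # bool is a subclass of int in Python, so it has to be checked first.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integral"
    if isinstance(value, float):
        return "fractional"
    if isinstance(value, str):
        return "string"
    return "object"


tally = Counter(infer_type(v) for v in legs)

print("counts/n:", len(legs))
print(dict(tally))
```

The tally agrees with the legs row of the profile: 9 entries, of which 3 are null, 4 integral, 1 fractional, and 1 string.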

Cardinality¶

Cardinality tracks the approximate number of unique values for each feature.

cardinality/est - the estimated unique values for each feature
cardinality/upper_1 - upper bound for the cardinality estimate, taken at one standard deviation. The actual cardinality is expected to be at or below this number.
cardinality/lower_1 - lower bound for the cardinality estimate, taken at one standard deviation. The actual cardinality is expected to be at or above this number.
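
For a dataset this small, the estimate can be checked against an exact count. The sketch below uses plain Python sets and the bounds copied from the table above; it is only a sanity check, not the estimation method whylogs uses.

```python
# Columns from the test dataset above.
animals = ["lion", "shark", "cat", "bear", "jellyfish", "kangaroo",
           "jellyfish", "jellyfish", "fish"]
weights = [14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2]

# Exact number of distinct values per column.
exact = {"animal": len(set(animals)), "weight": len(set(weights))}

# (cardinality/lower_1, cardinality/upper_1) copied from the profile table.
bounds = {"animal": (7.0, 7.00035), "weight": (9.0, 9.00045)}

# For each column, record whether the exact count lies inside the bounds.
within = {}
for name, count in exact.items():
    lower, upper = bounds[name]
    within[name] = lower <= count <= upper
    print(name, count, within[name])
```

Exact counting needs memory proportional to the number of distinct values, which is why an approximate estimate with bounds is the more practical choice for large datasets.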

Frequent Items:¶

Frequent items track which items show up the most.

frequent_items/frequent_strings - the most frequent items
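
As a rough illustration of what this metric reports, the animal column can be counted exactly with collections.Counter. This is a sketch for intuition only: the FrequentItem entries shown earlier carry an estimate with upper and lower bounds, whereas Counter gives exact counts.

```python
from collections import Counter

# The "animal" values from the test dataset above.
animals = ["lion", "shark", "cat", "bear", "jellyfish", "kangaroo",
           "jellyfish", "jellyfish", "fish"]

# Exact occurrence count for every distinct value.
counts = Counter(animals)

# The single most frequent value and its count.
top = counts.most_common(1)

print(top)
```

Here jellyfish comes out on top with 3 occurrences, in line with the first FrequentItem in the profile.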

Distribution:¶

Distribution statistics are generated when a feature contains numerical data.

distribution/mean - the calculated mean of the feature data
distribution/stddev - the calculated standard deviation of the feature data
distribution/n - the number of rows belonging to the feature
distribution/max - the highest (max) value in the feature
distribution/min - the smallest (min) value in the feature
distribution/median - the median value of the feature data
distribution/q_xx - the xx-th quantile value of the data's distribution
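
The weight row of the profile can be cross-checked with Python's statistics module. This is a sketch under two assumptions: the reported standard deviation is the sample standard deviation (n - 1 in the denominator), which is what reproduces the table value, and rank_quantile is a hypothetical helper using a simple rank rule. That rule happens to reproduce the table's quantiles for this small dataset, but it is not how whylogs computes them.

```python
import statistics

# The "weight" values from the test dataset above.
weights = [14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2]

mean = statistics.mean(weights)
stddev = statistics.stdev(weights)  # sample standard deviation (n - 1)
median = statistics.median(weights)


def rank_quantile(values, percent):
    """Return the value at rank floor(percent * n / 100) in sorted order."""
    ordered = sorted(values)
    # Integer arithmetic avoids floating-point surprises in the rank.
    index = min((percent * len(ordered)) // 100, len(ordered) - 1)
    return ordered[index]


quantiles = {f"q_{p}": rank_quantile(weights, p) for p in (10, 25, 75, 90)}

print(round(mean, 6), round(stddev, 5), median)
print(min(weights), max(weights), quantiles)
```

The large gap between the mean (about 20.96) and the median (4.3) comes from the single 120.0 entry, which is a good reason to read both statistics together.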

Data Types and Metrics¶

whylogs maps different data types, such as numpy arrays, lists, and integers, to specific whylogs data types. The three most important whylogs data types are:

Integral
Fractional
String

By default, whylogs will track the following metrics according to the column's inferred data type:

Integral:
- counts
- types
- distribution
- ints
- cardinality
- frequent_items
Fractional:
- counts
- types
- cardinality
- distribution
String:
- counts
- types
- cardinality
- frequent_items
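
The defaults listed above can be written down as a plain dictionary to see what the types share and where they differ. DEFAULT_METRICS is a hypothetical name used only in this sketch; it restates the list above and is not a whylogs object.

```python
# Default metric namespaces per inferred data type, as listed above.
DEFAULT_METRICS = {
    "integral": ["counts", "types", "distribution", "ints",
                 "cardinality", "frequent_items"],
    "fractional": ["counts", "types", "cardinality", "distribution"],
    "string": ["counts", "types", "cardinality", "frequent_items"],
}

# Namespaces tracked for every one of the three data types.
shared = set.intersection(*(set(m) for m in DEFAULT_METRICS.values()))

# Namespaces a data type tracks beyond the shared ones.
extras = {
    dtype: sorted(set(metrics) - shared)
    for dtype, metrics in DEFAULT_METRICS.items()
}

print(sorted(shared))
print(extras)
```

This lines up with the profile table shown earlier: the fractional weight column has distribution values but no frequent items, while the string animal column has frequent items but no distribution values.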

If you want to customize this configuration by selecting metrics according to data type or column name, see the Schema Configuration example.

That's it! If you want to know more about whylogs, check our documentation.
