In this notebook, we'll show how you can use the whylogs profile view (profile.view()) to find useful statistics in a dataset.
This includes counts, inferred data types, cardinality, frequent items, and distribution statistics.
We'll need the whylogs and pandas libraries for this example.
We'll also populate a dataframe with some data to inspect.
# install whylogs & pandas if needed
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
%pip install pandas
# import whylogs and pandas
import whylogs as why
import pandas as pd
# Set to show all columns in dataframe
pd.set_option("display.max_columns", None)
# create a simple test dataset
data = {
    "animal": ["lion", "shark", "cat", "bear", "jellyfish", "kangaroo",
               "jellyfish", "jellyfish", "fish"],
    "legs": [4, 0, 4, 4.0, None, 2, None, None, "fins"],
    "weight": [14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2],
}
# Create dataframe with test dataset
df = pd.DataFrame(data)
# Log data with whylogs & create profile
results = why.log(pandas=df)
profile = results.profile()
# Create profile view dataframe
prof_view = profile.view()
prof_df = prof_view.to_pandas()
# View Profile dataframe for dataset statistics
prof_df
column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | cardinality/est | cardinality/upper_1 | cardinality/lower_1 | frequent_items/frequent_strings | type | distribution/mean | distribution/stddev | distribution/n | distribution/max | distribution/min | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
legs | 9 | 3 | 4 | 1 | 0 | 1 | 0 | 4.0 | 4.00020 | 4.0 | [FrequentItem(value='4.000000', est=3, upper=3... | SummaryType.COLUMN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
weight | 9 | 0 | 0 | 9 | 0 | 0 | 0 | 9.0 | 9.00045 | 9.0 | NaN | SummaryType.COLUMN | 20.955556 | 38.29749 | 9.0 | 120.0 | 1.2 | 1.2 | 2.2 | 4.3 | 14.3 | 120.0 |
animal | 9 | 0 | 0 | 0 | 0 | 9 | 0 | 7.0 | 7.00035 | 7.0 | [FrequentItem(value='jellyfish', est=3, upper=... | SummaryType.COLUMN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
The profile dataframe has one row for each column in the logged data. Each column of the profile dataframe holds one component of a metric, named as namespace/component (for example, counts/n).
Taking a quick look at the generated statistics:
The animal row shows there are 9 entries (counts/n). All of the values are strings. Cardinality estimates that 7 different animal types are in the dataset. Frequent items show jellyfish appearing the most.
The weight column contains 9 entries, all of them fractional values. Cardinality estimates that all 9 values are unique. Since all entries are numerical, distribution statistics are generated.
The legs column also has 9 entries, but they span several data types: 3 null, 4 integral, 1 fractional (float), and 1 string. Cardinality estimates 4 unique values. The most frequent number of legs in the dataset is 4.
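The figures discussed above can be double-checked without whylogs. The following is a plain-Python sketch that recounts the legs and weight columns by hand. It is not how whylogs computes its metrics (whylogs reports estimates for values such as cardinality), and the helper name count_types is made up for this illustration.

```python
import statistics

# Same values as the "legs" and "weight" columns in the test dataset
legs = [4, 0, 4, 4.0, None, 2, None, None, "fins"]
weight = [14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2]


def count_types(values):
    """Hypothetical helper: tally entries by Python type, mirroring counts/* and types/*."""
    tally = {"n": len(values), "null": 0, "integral": 0,
             "fractional": 0, "boolean": 0, "string": 0, "object": 0}
    for value in values:
        if value is None:
            tally["null"] += 1
        elif isinstance(value, bool):  # check bool first: bool is a subclass of int
            tally["boolean"] += 1
        elif isinstance(value, int):
            tally["integral"] += 1
        elif isinstance(value, float):
            tally["fractional"] += 1
        elif isinstance(value, str):
            tally["string"] += 1
        else:
            tally["object"] += 1
    return tally


legs_tally = count_types(legs)

# Exact distinct count of non-null values; 4 and 4.0 compare equal, so they count once
legs_distinct = len({value for value in legs if value is not None})

# Exact statistics for the all-numeric weight column
weight_stats = {
    "mean": statistics.mean(weight),
    "stddev": statistics.stdev(weight),  # sample standard deviation (n - 1)
    "median": statistics.median(weight),
    "min": min(weight),
    "max": max(weight),
}

print(legs_tally)
print(legs_distinct)
print(weight_stats)
```

For this dataset the exact counts agree with the profile: 3 null, 4 integral, 1 fractional, and 1 string entry for legs, 4 distinct leg values, and a weight mean of about 20.96 with a sample standard deviation of about 38.30. The quantile columns are left out of the comparison, since different quantile methods can return different values on such a small sample.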
A single cell can be selected to see full results if needed.
# Select a single statistic by feature and row
prof_df['frequent_items/frequent_strings']['animal']
[FrequentItem(value='jellyfish', est=3, upper=3, lower=3), FrequentItem(value='cat', est=1, upper=1, lower=1), FrequentItem(value='lion', est=1, upper=1, lower=1), FrequentItem(value='fish', est=1, upper=1, lower=1), FrequentItem(value='shark', est=1, upper=1, lower=1), FrequentItem(value='kangaroo', est=1, upper=1, lower=1), FrequentItem(value='bear', est=1, upper=1, lower=1)]
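As a rough cross-check, exact frequencies for the animal column can be tallied with Python's collections.Counter. This is only a sketch of what the frequent items metric summarizes: whylogs reports each item with an estimate and upper and lower bounds (est, upper, lower), whereas a Counter gives exact counts.

```python
from collections import Counter

# Same values as the "animal" column in the test dataset
animals = ["lion", "shark", "cat", "bear", "jellyfish", "kangaroo",
           "jellyfish", "jellyfish", "fish"]

# Exact count of each distinct value
animal_counts = Counter(animals)

# Most frequent item and how often it appears
top_item, top_count = animal_counts.most_common(1)[0]

# Exact number of distinct animals, comparable to cardinality/est
distinct_animals = len(animal_counts)

print(top_item, top_count)
print(distinct_animals)
```

Here the exact tally matches the profile: jellyfish appears 3 times, every other animal appears once, and there are 7 distinct animals.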
By default, whylogs automatically generates these metrics based on each column's data type.
The standard metrics available in whylogs are grouped in namespaces. They are:
Counts and inferred data types track how many entries exist and what type of data they contain.
- counts/n - the total number of entries in a feature
- counts/null - the number of null values
- types/integral - the number of integral (whole number) values
- types/fractional - the number of fractional (float) values
- types/boolean - the number of boolean values
- types/string - the number of string values
- types/object - the number of object values. Data that is not of any of the previous types is treated as an object.
Cardinality tracks the approximate number of unique values for each feature.
- cardinality/est - the estimated number of unique values for each feature
- cardinality/upper_1 - the upper bound of the cardinality estimate. The actual cardinality is expected to be at or below this number.
- cardinality/lower_1 - the lower bound of the cardinality estimate. The actual cardinality is expected to be at or above this number.
Frequent items track which items show up the most.
- frequent_items/frequent_strings - the most frequent items
Distribution statistics are generated when a feature contains numerical data.
- distribution/mean - the calculated mean of the feature data
- distribution/stddev - the calculated standard deviation of the feature data
- distribution/n - the number of rows belonging to the feature
- distribution/max - the highest (max) value in the feature
- distribution/min - the smallest (min) value in the feature
- distribution/median - the median value of the feature data
- distribution/q_xx - the value at the xx-th percentile of the data's distribution
whylogs maps different data types, such as numpy arrays, lists, and integers, to specific whylogs data types. The three most important whylogs data types are integral, fractional, and string.
By default, whylogs will track the following metrics according to the column's inferred data type:
Data type | Metrics |
---|---|
integral | counts, types, distribution, ints, cardinality, frequent_items |
fractional | counts, types, cardinality, distribution |
string | counts, types, cardinality, frequent_items |
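The default mapping above can be written down as a small lookup table. This is a sketch only: the dictionary, the function names default_namespaces and is_tracked, and the type labels integral, fractional, and string (taken from the types/* metric names) are assumptions for illustration and are not part of the whylogs API.

```python
# Hypothetical lookup table built from the default metrics listed above
DEFAULT_NAMESPACES = {
    "integral": ["counts", "types", "distribution", "ints",
                 "cardinality", "frequent_items"],
    "fractional": ["counts", "types", "cardinality", "distribution"],
    "string": ["counts", "types", "cardinality", "frequent_items"],
}


def default_namespaces(data_type):
    """Return the metric namespaces listed by default for a type label."""
    try:
        return DEFAULT_NAMESPACES[data_type]
    except KeyError:
        raise ValueError(f"no default metrics listed for {data_type!r}") from None


def is_tracked(data_type, namespace):
    """Check whether a namespace is among the defaults for a type label."""
    return namespace in default_namespaces(data_type)


print(is_tracked("fractional", "distribution"))
print(is_tracked("string", "distribution"))
```

Read this way, the NaN cells in the profile dataframe make sense: weight is a fractional column, so it has distribution values but no frequent items, while animal is a string column, so it has frequent items but no distribution values.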
If you want to know how to customize this configuration by selecting metrics according to data type or column name, see the Schema Configuration example.
That's it! If you want to know more about whylogs, check our documentation.