🚩 Create a free WhyLabs account to get more value out of whylogs!
Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!
This is a
whylogs v1
example. For the analog feature inv0
, please refer to this example
In this example, we'll show how to define a number of simple constraints and examples on how to use them. For the basics on how to build your own set of constraints, see the example - Data Validation with Metric Constraints.
The constraints are listed according to the metric namespace used when defining them. For each category, we will create helper functions for simple and popular constraints. Each helper function has a brief explanation in its docstring. After defining the helper functions, we'll show a simple example on how to build the constraints out of the functions and visualize them as a report with the visualization module.
Note: The constraints shown here are still experimental and subject to further changes. Stay tuned for upgrades!
constraint | parameters | semantic | metric |
---|---|---|---|
no_missing_values | column name | Checks that are no missing values in the column | Counts |
null_values_below_number | column name, number | Number of null values must be below given number. | Counts |
null_percentage_below_number | column name, number | Percentage of null values must be below given number. | Counts |
constraint | parameters | semantic | metric |
---|---|---|---|
greater_than_number | column name | Minimum value of given column must be above defined number. | Distribution |
smaller_than_number | column name, number | Maximum value of given column must be below defined number. | Distribution |
is_in_range | column name, lower, upper | Checks that all of column's values are in defined range (inclusive). | Distribution |
is_non_negative | column name | Checks if a column is non negative. | Distribution |
n_most_common_items_in_set | column name, reference set | Checks if the top n most common items appear in the dataset. | Frequent Items |
frequent_strings_in_reference_set | column name, reference set | Checks if a set of variables appear in the frequent strings for a string column. | Frequent Items |
count_below_number | column name, number | Checks if elements in a column are below given number. | Counts |
distinct_number_in_range | column name, lower, upper | Checks if number of distinct categories is between lower and upper values (inclusive). | Cardinality |
column_is_nullable_integral | column name | Check if column contains only records of specific datatype. | Types |
column_is_nullable_boolean | column name | Check if column contains only records of specific datatype. | Types |
column_is_nullable_fractional | column name | Check if column contains only records of specific datatype. | Types |
column_is_nullable_object | column name | Check if column contains only records of specific datatype. | Types |
column_is_nullable_string | column name | Check if column contains only records of specific datatype. | Types |
Please refer to the example Metric Constraints with Condition Count Metrics for examples on how to use these constraints.
constraint | parameters | semantic | metric |
---|---|---|---|
condition_meets | column name, condition_name | Fails if condition not met at least once. | Condition Count |
condition_never_meets | column name, condition_name | Fails if condition is met at least once | Condition Count |
condition_count_below | column name, condition_name, max_count | Fails if condition is met more than max count | Condition Count |
constraint | parameters | semantic | metric |
---|---|---|---|
mean_between_range | column name, lower, upper | Mean must be between range defined by lower and upper bounds. | Distribution |
stddev_between_range | column name, lower, upper | Standard deviarion must be between range defined by lower and upper bounds. | Distribution |
quantile_between_range | column name, quantile, lower, upper | Q-th quantile value must be withing the range defined by lower and upper boundaries. | Distribution |
If you haven't already, install whylogs:
# Note: you may need to restart the kernel to use updated packages.
%pip install 'whylogs[viz]'
Then, let's import the helper functions needed to define the constraints:
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import (
greater_than_number,
is_in_range,
is_non_negative,
mean_between_range,
smaller_than_number,
stddev_between_range,
quantile_between_range
)
import whylogs as why
import pandas as pd
data = {
"animal": ["cat", "hawk", "snake", "cat", "mosquito"],
"legs": [4, 2, 0, 4, 6],
"weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
}
results = why.log(pd.DataFrame(data))
profile_view = results.view()
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(greater_than_number(column_name="weight", number=0.14))
builder.add_constraint(mean_between_range(column_name="weight", lower=2, upper=3))
builder.add_constraint(smaller_than_number(column_name="weight", number=20.5))
builder.add_constraint(stddev_between_range(column_name="weight", lower=1, upper=3))
builder.add_constraint(quantile_between_range(column_name="weight", quantile=0.5, lower=1.5, upper=2.0))
builder.add_constraint(is_in_range(column_name="weight", lower=1.1, upper=3.2))
builder.add_constraint(is_in_range(column_name="legs", lower=0, upper=6))
builder.add_constraint(is_non_negative(column_name="legs"))
# animal has missing distribution metrics. this will pass if skip_missing = True and fail otherwise.
builder.add_constraint(
quantile_between_range(
column_name="animal",
quantile=0.5,
lower=1.5,
upper=2.0,
skip_missing=False
)
)
constraints = builder.build()
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
from whylogs.core.constraints.factories import n_most_common_items_in_set, frequent_strings_in_reference_set
import whylogs as why
import pandas as pd
data = {
"animal": ["cat", "snake", "snake", "cat", "mosquito"],
"legs": [0, 1, 2, 3, 4],
"weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
}
results = why.log(pd.DataFrame(data))
profile_view = results.view()
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
reference_set = {"cat","snake"}
builder.add_constraint(frequent_strings_in_reference_set(column_name="animal", reference_set=reference_set))
builder.add_constraint(n_most_common_items_in_set(column_name="animal",n=2,reference_set=reference_set))
constraints = builder.build()
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
from whylogs.core.constraints.factories import no_missing_values, count_below_number, null_percentage_below_number, null_values_below_number
import whylogs as why
import pandas as pd
data = {
"animal": ["cat", "snake", "snake", "cat", "mosquito"],
"legs": [4, 2, 0, None, 6],
"weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
}
results = why.log(pd.DataFrame(data))
profile_view = results.view()
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(count_below_number(column_name="legs", number=10))
builder.add_constraint(null_percentage_below_number(column_name="legs", number=0.05))
builder.add_constraint(null_values_below_number(column_name="legs", number=1))
builder.add_constraint(no_missing_values(column_name="legs"))
builder.add_constraint(no_missing_values(column_name="animal"))
constraints = builder.build()
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
from whylogs.core.constraints.factories import distinct_number_in_range
import whylogs as why
import pandas as pd
data = {
"animal": ["cat", "snake", "snake", "cat", "mosquito"],
"legs": [4, 2, 0, None, 6],
"weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
}
results = why.log(pd.DataFrame(data))
profile_view = results.view()
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(distinct_number_in_range(column_name = "animal", lower = 3, upper = 6))
constraints = builder.build()
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
import whylogs as why
import pandas as pd
data = {
"animal": ["cat", "snake", "snake", "cat", "mosquito"],
"legs": [4, 2, 0, None, 6],
"weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
"flies": [False, False, "False", False, True],
"obj": [{"a":1}, None, {"a":1}, {"a":1}, {"a":1}]
}
df = pd.DataFrame(data)
results = why.log(df)
profile_view = results.view()
from whylogs.core.constraints.factories import (
column_is_nullable_integral,
column_is_nullable_boolean,
column_is_nullable_fractional,
column_is_nullable_object,
column_is_nullable_string,
)
from whylogs.core.constraints import ConstraintsBuilder
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(column_is_nullable_string(column_name="animal"))
builder.add_constraint(column_is_nullable_integral(column_name="legs"))
builder.add_constraint(column_is_nullable_fractional(column_name="weight"))
builder.add_constraint(column_is_nullable_boolean(column_name="flies"))
builder.add_constraint(column_is_nullable_object(column_name="obj"))
constraints = builder.build()
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
The constraints above will pass if all values are of a given type. Null values are accepted.
Note that for legs
, the constraints failed. That is because whylogs leverages pandas' dtypes when it is available, and when a None
is present, the column is considered to be fractional
, even though the remaining values were originally integers.
To create a constraint that checks for a non-nullable type, we combine two separate constraints:
column is nullable datatype
null values below 1
import whylogs as why
import pandas as pd
data = {
"animal": ["cat", "snake", "snake", "cat", "mosquito"],
"legs": [4, 2, 0, None, 6],
"weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
"flies": [False, False, "False", False, True],
"obj": [{"a":1}, None, {"a":1}, {"a":1}, {"a":1}]
}
df = pd.DataFrame(data)
results = why.log(df)
profile_view = results.view()
from whylogs.core.constraints.factories import (
column_is_nullable_integral,
column_is_nullable_boolean,
column_is_nullable_fractional,
column_is_nullable_object,
column_is_nullable_string,
null_values_below_number,
)
from whylogs.core.constraints import ConstraintsBuilder
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(column_is_nullable_string(column_name="animal"))
builder.add_constraint(null_values_below_number(column_name="animal",number=1))
# The combination of these metrics makes a check of non-nullable integral
builder.add_constraint(column_is_nullable_integral(column_name="legs"))
builder.add_constraint(null_values_below_number(column_name="legs",number=1))
# The combination of these metrics makes a check of non-nullable fractional
builder.add_constraint(column_is_nullable_fractional(column_name="weight"))
builder.add_constraint(null_values_below_number(column_name="weight",number=1))
# The combination of these metrics makes a check of non-nullable boolean
builder.add_constraint(column_is_nullable_boolean(column_name="flies"))
builder.add_constraint(null_values_below_number(column_name="flies",number=1))
# The combination of these metrics makes a check of non-nullable object
builder.add_constraint(column_is_nullable_object(column_name="obj"))
builder.add_constraint(null_values_below_number(column_name="obj",number=1))
constraints = builder.build()
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)