In whylogs, you can calculate drift scores and generate a summary drift report between two profiles, as shown in the Notebook Profile Visualizer example.
In this example, we will show you how to run the drift calculation with the default algorithm selection, and how to customize it in two ways: by choosing which algorithm is used for each column and by changing the algorithm's internal parameters and drift-detection thresholds. We will also show you how to calculate drift scores in a standalone manner, without generating a visualization with the summary report.
Currently, whylogs supports the following drift algorithms: the Kolmogorov-Smirnov (KS) test, the chi-square test, and the Hellinger distance. Stay tuned for more algorithms to be added in the future!
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
First, we will generate two profiles, one as the target and one as the reference.
We will use those profiles in order to calculate drift scores for each column in both profiles.
import whylogs as why
import pandas as pd
data = {
    "animal": ["cat", "hawk", "snake", "cat"],
    "legs": [4, 2, 0, 4],
    "weight": [4.3, 1.8, None, 4.1],
}
df = pd.DataFrame(data)
data2 = {
    "animal": ["cat", "hawk", "snake", "cat"],
    "legs": [13, 34, 99, 123],
    "weight": [4.9, 13.3, None, 232.3],
}
df2 = pd.DataFrame(data2)
target_view = why.log(df).profile().view()
ref_view = why.log(df2).profile().view()
You can calculate drift scores between your profiles in two ways. The first is to use calculate_drift_scores, which will give you a dictionary of drift scores for each feature.
The second is to view the scores integrated into the Notebook Profile Visualizer by calling summary_drift_report. This will give you a drift summary report as either an in-notebook visualization or a downloadable HTML file.
Let's see both cases for the default behavior scenario - we won't specify any drift algorithms or parameters.
To get a dictionary with the drift scores, you can use the calculate_drift_scores function:
from whylogs.viz.drift.column_drift_algorithms import calculate_drift_scores
scores = calculate_drift_scores(target_view=target_view, reference_view=ref_view, with_thresholds=True)
scores
{'animal': {'algorithm': 'chi-square', 'pvalue': 1.0, 'statistic': 0.0, 'thresholds': {'NO_DRIFT': (0.15, 1), 'POSSIBLE_DRIFT': (0.05, 0.15), 'DRIFT': (0, 0.05)}, 'drift_category': 'NO_DRIFT'}, 'legs': {'algorithm': 'ks', 'pvalue': 0.0, 'statistic': 1.0, 'thresholds': {'NO_DRIFT': (0.15, 1), 'POSSIBLE_DRIFT': (0.05, 0.15), 'DRIFT': (0, 0.05)}, 'drift_category': 'DRIFT'}, 'weight': {'algorithm': 'ks', 'pvalue': 0.0, 'statistic': 1.0, 'thresholds': {'NO_DRIFT': (0.15, 1), 'POSSIBLE_DRIFT': (0.05, 0.15), 'DRIFT': (0, 0.05)}, 'drift_category': 'DRIFT'}}
The scores object is a dictionary with the drift scores for each feature, along with additional metadata.
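Because the result is a plain dictionary, it can be filtered with ordinary Python. The sketch below works on a trimmed copy of the output shown above (only the fields it needs are kept). The helper columns_in_category is hypothetical and not part of whylogs, and skipping None entries is an assumption about how columns without a drift value might appear.

```python
# Trimmed copy of the scores output shown above; only the fields used here are kept.
scores = {
    "animal": {"algorithm": "chi-square", "drift_category": "NO_DRIFT"},
    "legs": {"algorithm": "ks", "drift_category": "DRIFT"},
    "weight": {"algorithm": "ks", "drift_category": "DRIFT"},
}


def columns_in_category(scores, category):
    """Return the sorted column names whose drift_category equals `category`.

    Hypothetical helper for illustration; not part of the whylogs API.
    """
    return sorted(
        column
        for column, result in scores.items()
        # Assumption: a column without a drift value may map to None, so skip it.
        if result is not None and result.get("drift_category") == category
    )


print(columns_in_category(scores, "DRIFT"))
```

A helper like this could feed an alert or a test assertion in a pipeline where no visual report is generated.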
We can see that the KS test was applied to both weight and legs, and chi-square was applied to animal. The default behavior for choosing which drift algorithm to use is the following: KS is calculated if a distribution metric exists for the column. If not, chi-square is calculated if the frequent items, cardinality, and count metrics exist. If neither condition holds, no drift value is associated with the column.
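The selection rule described above can be sketched as a small function. This is only an illustration of the stated logic, not the whylogs implementation: the function name is hypothetical, and the metric names ("distribution", "frequent_items", "cardinality", "counts") are assumptions about how the metrics are labeled in a column profile.

```python
def select_default_algorithm(metric_names):
    """Pick a drift algorithm name from the metrics available for a column.

    Hypothetical helper mirroring the rule described in the text.
    The metric names used here are assumptions, not confirmed whylogs identifiers.
    """
    names = set(metric_names)
    # KS takes priority whenever a distribution metric is present.
    if "distribution" in names:
        return "ks"
    # Otherwise chi-square needs all three of these metrics.
    if {"frequent_items", "cardinality", "counts"} <= names:
        return "chi-square"
    # No suitable metrics: no drift value for this column.
    return None


# A numeric column typically has a distribution metric, a string column does not.
print(select_default_algorithm({"counts", "cardinality", "distribution", "frequent_items"}))
print(select_default_algorithm({"counts", "cardinality", "frequent_items"}))
print(select_default_algorithm({"counts"}))
```

Note that the distribution check comes first, which is consistent with the integer column legs being scored with KS even though frequent items could also be available for it.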
We can also see the thresholds defined by default for each algorithm. Each drift category contains a tuple defining a range: if the measure falls within the range, then the drift category is assigned to the column. For each range, the lower bound is inclusive, while the upper bound is exclusive, except for the maximum upper bound, which is inclusive.
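To make the boundary rules concrete, here is a minimal sketch of how a measure could be mapped to a category. The categorize function is hypothetical and written only from the description above (inclusive lower bound, exclusive upper bound, inclusive overall maximum); whylogs' own code may be organized differently.

```python
def categorize(measure, thresholds):
    """Map a measure to a drift category using (lower, upper) ranges.

    Hypothetical helper: lower bounds are inclusive, upper bounds are exclusive,
    except for the largest upper bound, which is inclusive.
    """
    max_upper = max(upper for _, upper in thresholds.values())
    for category, (lower, upper) in thresholds.items():
        if lower <= measure < upper:
            return category
        # Only the overall maximum upper bound is treated as inclusive.
        if measure == upper == max_upper:
            return category
    return None


# Default p-value thresholds as shown in the output above.
pvalue_thresholds = {
    "NO_DRIFT": (0.15, 1),
    "POSSIBLE_DRIFT": (0.05, 0.15),
    "DRIFT": (0, 0.05),
}

print(categorize(1.0, pvalue_thresholds))   # NO_DRIFT (maximum upper bound is inclusive)
print(categorize(0.15, pvalue_thresholds))  # NO_DRIFT (lower bound is inclusive)
print(categorize(0.0, pvalue_thresholds))   # DRIFT
```

The same function works with any number of categories, as long as the ranges do not overlap.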
The drift categorization will use either the pvalue or the statistic value, depending on the algorithm. Both the KS and chi-square tests compare the pvalue against the thresholds, while the Hellinger distance compares the statistic value against the thresholds.
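The Hellinger distance is categorized by its statistic because it is a distance between two distributions rather than a hypothesis test, so it has no p-value. It is bounded between 0 (identical distributions) and 1 (no overlap), which is why its thresholds span that range. The sketch below computes the standard Hellinger distance between two histograms over the same bins. The function name and the example bin counts are illustrative, and how whylogs builds the histograms from its profiles is not shown here.

```python
import math


def hellinger_distance(p_counts, q_counts):
    """Hellinger distance between two histograms defined over the same bins.

    Illustrative helper: counts are normalized to probabilities first, then
    H = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i)) ** 2)).
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("both histograms must use the same bins")
    p_total, q_total = sum(p_counts), sum(q_counts)
    if p_total == 0 or q_total == 0:
        raise ValueError("histograms must not be empty")
    squared = sum(
        (math.sqrt(p / p_total) - math.sqrt(q / q_total)) ** 2
        for p, q in zip(p_counts, q_counts)
    )
    return math.sqrt(squared / 2)


# Made-up bin counts: identical shapes, no overlap, and partial overlap.
print(hellinger_distance([2, 2], [5, 5]))
print(hellinger_distance([4, 0], [0, 4]))
print(hellinger_distance([3, 1], [1, 3]))
```

Larger values mean more drift, which is the opposite direction from a p-value, where smaller values indicate drift.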
We can also visualize this information with the NotebookProfileVisualizer's summary_drift_report method:
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=target_view, reference_profile_view=ref_view)
visualization.summary_drift_report()
Feel free to explore the dashboard! You can search by column name and filter by drift category. You can also click on each column's drift category to check the thresholds that were used for the categorization.
You can override the default algorithm selection logic by explicitly stating which algorithms you want to run.
Suppose now we want Hellinger to be used for the weight column, and chi-square to be used for the legs column. We can do this by passing a dictionary with the column names as keys and the drift algorithms as values.
from whylogs.viz.drift.column_drift_algorithms import Hellinger, ChiSquare
drift_map = {"weight": Hellinger(), "legs": ChiSquare()}
scores = calculate_drift_scores(target_view=target_view, reference_view=ref_view, with_thresholds=True, drift_map=drift_map)
scores
{'animal': {'algorithm': 'chi-square', 'pvalue': 1.0, 'statistic': 0.0, 'thresholds': {'NO_DRIFT': (0.15, 1), 'POSSIBLE_DRIFT': (0.05, 0.15), 'DRIFT': (0, 0.05)}, 'drift_category': 'NO_DRIFT'}, 'legs': {'algorithm': 'chi-square', 'pvalue': 0.0, 'statistic': inf, 'thresholds': {'NO_DRIFT': (0.15, 1), 'POSSIBLE_DRIFT': (0.05, 0.15), 'DRIFT': (0, 0.05)}, 'drift_category': 'DRIFT'}, 'weight': {'algorithm': 'hellinger', 'pvalue': None, 'statistic': 0.4283729905961321, 'thresholds': {'NO_DRIFT': (0, 0.15), 'POSSIBLE_DRIFT': (0.15, 0.4), 'DRIFT': (0.4, 1)}, 'drift_category': 'DRIFT'}}
Note that we didn't specify an algorithm for animal, and we got a drift score nonetheless. If you don't specify an algorithm for a column, the default algorithm selection logic will be used.
In the Visualizer's case, you can choose the algorithms by using the add_drift_config method.
The cell below will also define Hellinger for weight and chi-square for legs:
visualization.add_drift_config(column_names=["weight"], algorithm=Hellinger())
visualization.add_drift_config(column_names=["legs"], algorithm=ChiSquare())
visualization.summary_drift_report()
In addition to selecting which algorithms to use, you can also customize the internal parameters of each algorithm.
For example, suppose we want to change the thresholds for the Hellinger algorithm, making it less sensitive to drift. We can do this by passing a parameter_config object when instantiating the Hellinger algorithm. One of the parameters in the parameter_config object is a DriftThresholds object, which contains the thresholds for the drift categorization.
We might also want to change the KS algorithm. In whylogs, the quantiles are split into 100 bins by default. If you want to use a different set of quantiles, you can create a KSTestConfig object with your own value for quantiles and pass it to the KS algorithm.
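To give some intuition for what the quantiles control, here is a rough sketch of a two-sample KS statistic that compares the two empirical CDFs only at the values found at the chosen quantile fractions. This is an assumption-laden illustration on raw samples: whylogs works on profile sketches rather than raw data, and its exact procedure may differ. All function names here are hypothetical.

```python
import math
from bisect import bisect_right


def ecdf(sorted_sample, x):
    """Fraction of the sample that is less than or equal to x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def quantile_values(sorted_sample, fractions):
    """Nearest-rank sample values at the given quantile fractions."""
    n = len(sorted_sample)
    return [
        sorted_sample[min(n - 1, max(0, math.ceil(q * n) - 1))]
        for q in fractions
    ]


def ks_statistic(target, reference, fractions):
    """Largest CDF gap, checked only at the quantile values of both samples."""
    t, r = sorted(target), sorted(reference)
    points = quantile_values(t, fractions) + quantile_values(r, fractions)
    return max(abs(ecdf(t, x) - ecdf(r, x)) for x in points)


quantiles = [0.0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0]
# The legs columns of the two DataFrames defined earlier.
print(ks_statistic([4, 2, 0, 4], [13, 34, 99, 123], quantiles))
```

In a scheme like this, fewer quantiles mean fewer points at which the two CDFs are compared, which is cheaper but can miss where the largest gap occurs. For the legs data the two samples do not overlap at all, so the statistic reaches its maximum of 1.0, in line with the scores shown earlier.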
Finally, suppose we don't want the chi-square algorithm to categorize columns into three different classes. We want only a binary categorization, where the column is either drifted or not. We can do this by passing a ChiSquareConfig object with a DriftThresholds object that defines only two ranges.
Let's see how to create those config objects and how to pass them to either calculate_drift_scores or summary_drift_report:
from whylogs.viz.drift.configs import KSTestConfig, HellingerConfig, ChiSquareConfig, DriftThresholds
hellingerconfig = HellingerConfig(thresholds=DriftThresholds(NO_DRIFT=(0, 0.15), POSSIBLE_DRIFT=(0.15,0.5), DRIFT=(0.5, 1)))
quantiles = [0.0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0]
ksconfig = KSTestConfig(quantiles=quantiles)
chisquareconfig = ChiSquareConfig(thresholds=DriftThresholds(DRIFT=(0, 0.1), NO_DRIFT=(0.1, 1)))
With the configs at hand, we can now pass them to the Drift Algorithms:
from whylogs.viz.drift.column_drift_algorithms import Hellinger, ChiSquare, KS
drift_map = {
    "weight": Hellinger(hellingerconfig),
    "animal": ChiSquare(chisquareconfig),
    "legs": KS(ksconfig),
}
scores = calculate_drift_scores(target_view=target_view, reference_view=ref_view, with_thresholds=True, drift_map=drift_map)
scores
{'animal': {'algorithm': 'chi-square', 'pvalue': 1.0, 'statistic': 0.0, 'thresholds': {'NO_DRIFT': (0.1, 1), 'DRIFT': (0, 0.1)}, 'drift_category': 'NO_DRIFT'}, 'legs': {'algorithm': 'ks', 'pvalue': 0.0, 'statistic': 1.0, 'thresholds': {'NO_DRIFT': (0.15, 1), 'POSSIBLE_DRIFT': (0.05, 0.15), 'DRIFT': (0, 0.05)}, 'drift_category': 'DRIFT'}, 'weight': {'algorithm': 'hellinger', 'pvalue': None, 'statistic': 0.4283729905961321, 'thresholds': {'NO_DRIFT': (0, 0.15), 'POSSIBLE_DRIFT': (0.15, 0.5), 'DRIFT': (0.5, 1)}, 'drift_category': 'POSSIBLE_DRIFT'}}
visualization.add_drift_config(column_names=["weight"], algorithm=Hellinger(hellingerconfig))
visualization.add_drift_config(column_names=["animal"], algorithm=ChiSquare(chisquareconfig))
visualization.add_drift_config(column_names=["legs"], algorithm=KS(ksconfig))
visualization.summary_drift_report()
Overwriting existing drift configuration for column weight. Overwriting existing drift configuration for column legs.
You can check in either the scores object or the drift report that the changes were indeed applied.
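One way to make that check explicit is to compare the thresholds reported in the scores dictionary with the ones that were configured. The sketch below uses a trimmed copy of the last scores output shown above. The expected dictionary and the mismatched list are illustrative names, not whylogs features.

```python
# Trimmed copy of the last scores output shown above (thresholds only).
scores = {
    "animal": {"thresholds": {"NO_DRIFT": (0.1, 1), "DRIFT": (0, 0.1)}},
    "weight": {
        "thresholds": {
            "NO_DRIFT": (0, 0.15),
            "POSSIBLE_DRIFT": (0.15, 0.5),
            "DRIFT": (0.5, 1),
        }
    },
}

# Thresholds that were passed through ChiSquareConfig and HellingerConfig.
expected = {
    "animal": {"DRIFT": (0, 0.1), "NO_DRIFT": (0.1, 1)},
    "weight": {
        "NO_DRIFT": (0, 0.15),
        "POSSIBLE_DRIFT": (0.15, 0.5),
        "DRIFT": (0.5, 1),
    },
}

# Dictionary comparison ignores key order, so only the ranges have to match.
mismatched = [
    column
    for column, thresholds in expected.items()
    if scores[column]["thresholds"] != thresholds
]
print(mismatched)
```

An empty list means every configured threshold shows up in the result. The legs column is left out of the check because its KS configuration changed only the quantiles, so its thresholds stay at the defaults.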
Stay tuned for more Drift Algorithms to be added to whylogs!