This is a whylogs v1 example. For the analogous feature in v0, please refer to this example.
In this notebook, we'll show how you can use whylogs' Notebook Profile Visualizer to compare two different sets of the same data.
To demonstrate the Profile Visualizer, let's use UCI's Wine Quality Dataset, frequently used for learning purposes. Classification is one possible task, where we predict the wine's quality based on its features, like pH, density and percent alcohol content.
In this example, we will split the available dataset into two groups: wines with alcohol content (the alcohol feature) at or below 11, and above 11. The first group is considered our baseline (or reference) dataset, while the second will be our target dataset. The goal here is to induce a case of Sample Selection Bias, where the training sample is not representative of the population.
The example used here was inspired by the article A Primer on Data Drift. If you're interested in more information on this use case, or the theory behind Data Drift, it's a great read!
To use the Profile Visualizer, we'll install whylogs with the extra package viz:
# Note: you may need to restart the kernel to use updated packages.
%pip install 'whylogs[viz]'
import pandas as pd
pd.options.mode.chained_assignment = None # Disabling false positive warning
# this is the same data as encountered in http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/WineQuality/winequality-red.csv"
wine = pd.read_csv(url)
wine.head()
|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
wine.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
We'll split the wines into two groups. The ones with alcohol at or below 11 will form our reference sample, and the ones above 11 will form our target dataset.
cond_reference = (wine['alcohol']<=11)
wine_reference = wine.loc[cond_reference]
cond_target = (wine['alcohol']>11)
wine_target = wine.loc[cond_target]
Let's also add some missing values to citric acid, to see how this is reflected in the Profile Visualizer later on.
ixs = wine_target.iloc[100:110].index
wine_target.loc[ixs,'citric acid'] = None
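To see what this assignment does, here's a minimal sketch on a toy dataframe (hypothetical values, standing in for wine_target): assigning None through .loc with index labels taken from a positional slice produces NaN values that pandas counts as missing.

```python
import pandas as pd

# Toy frame standing in for wine_target (hypothetical values)
df = pd.DataFrame({"citric acid": [0.1, 0.2, 0.3, 0.4, 0.5]})

# Grab the index labels of a positional slice, then null those rows out
ixs = df.iloc[1:3].index
df.loc[ixs, "citric acid"] = None

print(df["citric acid"].isna().sum())  # 2 missing values
```

The same pattern above injects 10 missing values (positions 100 to 110) into the target dataset.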
The quality feature is a numerical one, representing the wine's quality. Let's transform it into a categorical feature, where each wine is classified as good or bad: anything above 6.5 is a good wine; otherwise, it's bad.
import pandas as pd
bins = (2, 6.5, 8)
group_names = ['bad', 'good']
wine_reference['quality'] = pd.cut(wine_reference['quality'], bins = bins, labels = group_names)
wine_target['quality'] = pd.cut(wine_target['quality'], bins = bins, labels = group_names)
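A quick sketch of how pd.cut applies these bins (using a few illustrative quality scores, not the actual data): the intervals are half-open, so (2, 6.5] maps to "bad" and (6.5, 8] maps to "good".

```python
import pandas as pd

bins = (2, 6.5, 8)
group_names = ["bad", "good"]

# Illustrative quality scores; pd.cut assigns each to (2, 6.5] or (6.5, 8]
quality = pd.Series([3, 5, 6, 7, 8])
labels = pd.cut(quality, bins=bins, labels=group_names)

print(list(labels))  # ['bad', 'bad', 'bad', 'good', 'good']
```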
Now, we can profile our dataframes with whylogs.
The NotebookProfileVisualizer accepts profile_views as arguments. Profile views are obtained from the profiles, and are used for visualization and merging purposes.
import whylogs as why
result = why.log(pandas=wine_target)
prof_view = result.view()
result_ref = why.log(pandas=wine_reference)
prof_view_ref = result_ref.view()
Let's instantiate NotebookProfileVisualizer and set the reference and target profile views:
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=prof_view, reference_profile_view=prof_view_ref)
Now, we're able to generate all sorts of plots and reports. Let's take a look at some of them.
Profile Summary brings you a summary for a single profile. It requires only a target_profile_view. The report shows simple histograms for each feature, along with key statistics, such as the number of missing values, mean, minimum, and maximum values.
visualization.profile_summary()
We can also compare two different profiles. With summary_drift_report, we get overview statistics, such as the number of observations and missing cells, as well as a comparison between both profiles' features: each feature's distribution, plus drift calculations for numerical or categorical features.
The report also displays alerts related to each of the feature's drift severity.
You can also search for a specific feature, or filter by drift severity.
visualization.summary_drift_report()
The drift values are calculated in different ways, depending on which metrics exist for each column. The Kolmogorov-Smirnov test is used if distribution metrics exist for the column. If not, the Chi-Squared test is used if frequent items, cardinality, and count metrics exist. If neither set of metrics is present, no drift value is associated with the column.
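As an independent illustration of the Kolmogorov-Smirnov case (this is scipy, not whylogs' internal implementation), two samples with non-overlapping supports, mimicking our alcohol split at 11, give the maximal KS statistic of 1.0 and a vanishing p-value:

```python
import numpy as np
from scipy import stats

# Two non-overlapping samples, mimicking the alcohol split at 11
rng = np.random.default_rng(0)
below = rng.uniform(9.0, 11.0, size=500)
above = rng.uniform(11.0, 13.0, size=500)

res = stats.ks_2samp(below, above)
print(res.statistic)       # 1.0 -- the empirical CDFs never overlap
print(res.pvalue < 0.001)  # True
```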
For alcohol, there's an alert of severe drift, with a calculated p-value of 0.00. That makes sense, since both distributions are mutually exclusive.
We can also conclude something just by visually inspecting the distributions: the "good-to-bad" ratio changes significantly between the profiles, which in itself is a good indicator that the alcohol content might be correlated with the wine's quality.
The drift value is also relevant for a number of other features. For example, density is also flagged with significant drift. Let's look at this feature in particular.
Now that we have a general idea of both profiles, let's take a look at some of the individual features.
First, let's use double_histogram to check on the density feature.
visualization.double_histogram(feature_name="density")
We can visually assess that there does indeed seem to be drift between the distributions. Maybe the alcohol content plays a significant role in the wine's density.
As with the alcohol content, our potential model would see density values in production that it never saw in the training set. We can certainly expect performance degradation in production.
We can also pass a list of feature names to double_histogram to generate multiple plots. Let's check the alcohol and chlorides features. For the alcohol feature, there's a clear separation between the distributions at the value of 11.
visualization.double_histogram(feature_name=["alcohol","chlorides"])
In addition to the histograms, we can also plot distribution charts for categorical variables, like the quality feature.
visualization.distribution_chart(feature_name="quality")
distribution_chart also accepts multiple feature names, but in this case we have a single categorical feature.
We can also look at the difference between distributions:
visualization.difference_distribution_chart(feature_name="quality")
We can see that there are roughly 800 more "bad" wines in the reference profile, and roughly 50 more "good" wines in the target profile.
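The difference chart is essentially a per-category subtraction of counts. Here's a minimal sketch of the same idea with plain pandas, on toy category data (not the actual profiles): subtracting one value_counts from another aligns on the category labels.

```python
import pandas as pd

# Toy category data standing in for the two quality profiles
target = pd.Series(["bad"] * 3 + ["good"] * 2)
reference = pd.Series(["bad"] * 5 + ["good"] * 1)

# Counts align on category labels; negative means more in the reference
diff = target.value_counts() - reference.value_counts()
print(diff.to_dict())  # {'bad': -2, 'good': 1}
```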
With feature_statistics, we have access to very useful statistics by passing the feature and profile name.
As with the previous reports (double_histogram and distribution_chart), you can pass a string or a list of strings through feature_name. Let's take a look at the summary statistics for some of our features:
visualization.feature_statistics(feature_name=["density","alcohol","chlorides"], profile="target")
Looks like we have 72 distinct values for citric acid, ranging from 0 to 0.79. We can also see the 10 missing values injected earlier.
All of the previous visualizations can be downloaded in HTML format for further inspection. Just run:
import os
os.getcwd()
visualization.write(
rendered_html=visualization.profile_summary(),
html_file_name=os.getcwd() + "/example",
)
We're downloading the profile summary here, but you can simply replace it with your preferred visualization.