Notebook

Copyright 2020 The TensorFlow Authors.¶

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

TensorFlow Data Validation¶

*An Example of a Key Component of TensorFlow Extended*

Note: We recommend running this tutorial in a Colab notebook, with no setup required! Just click "Run in Google Colab".

View on TensorFlow.org

Run in Google Colab

View source on GitHub

Download notebook

This example colab notebook illustrates how TensorFlow Data Validation (TFDV) can be used to investigate and visualize your dataset. That includes looking at descriptive statistics, inferring a schema, checking for and fixing anomalies, and checking for drift and skew in our dataset. It's important to understand your dataset's characteristics, including how it might change over time in your production pipeline. It's also important to look for anomalies in your data, and to compare your training, evaluation, and serving datasets to make sure that they're consistent.

We'll use data from the Taxi Trips dataset released by the City of Chicago.

Note: This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one’s own risk.

Read more about the dataset in Google BigQuery. Explore the full dataset in the BigQuery UI.

Key Point: As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is a feature relevant to the problem you want to solve or will it introduce bias? For more information, read about ML fairness.

The columns in the dataset are:

pickup_community_area	fare	trip_start_month
trip_start_hour	trip_start_day	trip_start_timestamp
pickup_latitude	pickup_longitude	dropoff_latitude
dropoff_longitude	trip_miles	pickup_census_tract
dropoff_census_tract	payment_type	company
trip_seconds	dropoff_community_area	tips

Install and import packages¶

Install the packages for TensorFlow Data Validation.

Upgrade Pip¶

To avoid upgrading Pip in a system when running locally, check to make sure that we're running in Colab. Local systems can of course be upgraded separately.

In [ ]:

try:
  import colab
  !pip install --upgrade pip
except:
  pass

Install Data Validation packages¶

Install the TensorFlow Data Validation packages and dependencies, which takes a few minutes. You may see warnings and errors regarding incompatible dependency versions, which you will resolve in the next section.

In [ ]:

print('Installing TensorFlow Data Validation')
!pip install --upgrade 'tensorflow_data_validation[visualization]<2'

Import TensorFlow and reload updated packages¶

The prior step updates the default packages in the Gooogle Colab environment, so you must reload the package resources to resolve the new dependencies.

Note: This step resolves the dependency error from the installation. If you are still experiencing code execution problems after running this code, restart the runtime (Runtime > Restart runtime ...).

In [ ]:

import pkg_resources
import importlib
importlib.reload(pkg_resources)

Check the versions of TensorFlow and the Data Validation before proceeding.

In [ ]:

import tensorflow as tf
import tensorflow_data_validation as tfdv
print('TF version:', tf.__version__)
print('TFDV version:', tfdv.version.__version__)

Load the dataset¶

We will download our dataset from Google Cloud Storage.

In [ ]:

import os
import tempfile, urllib, zipfile

# Set up some globals for our file paths
BASE_DIR = tempfile.mkdtemp()
DATA_DIR = os.path.join(BASE_DIR, 'data')
OUTPUT_DIR = os.path.join(BASE_DIR, 'chicago_taxi_output')
TRAIN_DATA = os.path.join(DATA_DIR, 'train', 'data.csv')
EVAL_DATA = os.path.join(DATA_DIR, 'eval', 'data.csv')
SERVING_DATA = os.path.join(DATA_DIR, 'serving', 'data.csv')

# Download the zip file from GCP and unzip it
zip, headers = urllib.request.urlretrieve('https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/chicago_data.zip')
zipfile.ZipFile(zip).extractall(BASE_DIR)
zipfile.ZipFile(zip).close()

print("Here's what we downloaded:")
!ls -R {os.path.join(BASE_DIR, 'data')}

Compute and visualize statistics¶

First we'll use tfdv.generate_statistics_from_csv to compute statistics for our training data. (ignore the snappy warnings)

TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions.

Internally, TFDV uses Apache Beam's data-parallel processing framework to scale the computation of statistics over large datasets. For applications that wish to integrate deeper with TFDV (e.g., attach statistics generation at the end of a data-generation pipeline), the API also exposes a Beam PTransform for statistics generation.

In [ ]:

train_stats = tfdv.generate_statistics_from_csv(data_location=TRAIN_DATA)

Now let's use tfdv.visualize_statistics, which uses Facets to create a succinct visualization of our training data:

Notice that numeric features and catagorical features are visualized separately, and that charts are displayed showing the distributions for each feature.
Notice that features with missing or zero values display a percentage in red as a visual indicator that there may be issues with examples in those features. The percentage is the percentage of examples that have missing or zero values for that feature.
Notice that there are no examples with values for pickup_census_tract. This is an opportunity for dimensionality reduction!
Try clicking "expand" above the charts to change the display
Try hovering over bars in the charts to display bucket ranges and counts
Try switching between the log and linear scales, and notice how the log scale reveals much more detail about the payment_type categorical feature
Try selecting "quantiles" from the "Chart to show" menu, and hover over the markers to show the quantile percentages

In [ ]:

# docs-infra: no-execute
tfdv.visualize_statistics(train_stats)

Infer a schema¶

Now let's use tfdv.infer_schema to create a schema for our data. A schema defines constraints for the data that are relevant for ML. Example constraints include the data type of each feature, whether it's numerical or categorical, or the frequency of its presence in the data. For categorical features the schema also defines the domain - the list of acceptable values. Since writing a schema can be a tedious task, especially for datasets with lots of features, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics.

Getting the schema right is important because the rest of our production pipeline will be relying on the schema that TFDV generates to be correct. The schema also provides documentation for the data, and so is useful when different developers work on the same data. Let's use tfdv.display_schema to display the inferred schema so that we can review it.

In [ ]:

schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

Check evaluation data for errors¶

So far we've only been looking at the training data. It's important that our evaluation data is consistent with our training data, including that it uses the same schema. It's also important that the evaluation data includes examples of roughly the same ranges of values for our numerical features as our training data, so that our coverage of the loss surface during evaluation is roughly the same as during training. The same is true for categorical features. Otherwise, we may have training issues that are not identified during evaluation, because we didn't evaluate part of our loss surface.

Notice that each feature now includes statistics for both the training and evaluation datasets.
Notice that the charts now have both the training and evaluation datasets overlaid, making it easy to compare them.
Notice that the charts now include a percentages view, which can be combined with log or the default linear scales.
Notice that the mean and median for trip_miles are different for the training versus the evaluation datasets. Will that cause problems?
Wow, the max tips is very different for the training versus the evaluation datasets. Will that cause problems?
Click expand on the Numeric Features chart, and select the log scale. Review the trip_seconds feature, and notice the difference in the max. Will evaluation miss parts of the loss surface?

In [ ]:

# Compute stats for evaluation data
eval_stats = tfdv.generate_statistics_from_csv(data_location=EVAL_DATA)

In [ ]:

# docs-infra: no-execute
# Compare evaluation data with training data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')

Check for evaluation anomalies¶

Does our evaluation dataset match the schema from our training dataset? This is especially important for categorical features, where we want to identify the range of acceptable values.

Key Point: What would happen if we tried to evaluate using data with categorical feature values that were not in our training dataset? What about numeric features that are outside the ranges in our training dataset?

In [ ]:

# Check eval data for errors by validating the eval data stats using the previously inferred schema.
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)

Fix evaluation anomalies in the schema¶

Oops! It looks like we have some new values for company in our evaluation data, that we didn't have in our training data. We also have a new value for payment_type. These should be considered anomalies, but what we decide to do about them depends on our domain knowledge of the data. If an anomaly truly indicates a data error, then the underlying data should be fixed. Otherwise, we can simply update the schema to include the values in the eval dataset.

Key Point: How would our evaluation results be affected if we did not fix these problems?

Unless we change our evaluation dataset we can't fix everything, but we can fix things in the schema that we're comfortable accepting. That includes relaxing our view of what is and what is not an anomaly for particular features, as well as updating our schema to include missing values for categorical features. TFDV has enabled us to discover what we need to fix.

Let's make those fixes now, and then review one more time.

In [ ]:

# Relax the minimum fraction of values that must come from the domain for feature company.
company = tfdv.get_feature(schema, 'company')
company.distribution_constraints.min_domain_mass = 0.9

# Add new value to the domain of feature payment_type.
payment_type_domain = tfdv.get_domain(schema, 'payment_type')
payment_type_domain.value.append('Prcard')

# Validate eval stats after updating the schema 
updated_anomalies = tfdv.validate_statistics(eval_stats, schema)
tfdv.display_anomalies(updated_anomalies)

Hey, look at that! We verified that the training and evaluation data are now consistent! Thanks TFDV ;)

Schema Environments¶

We also split off a 'serving' dataset for this example, so we should check that too. By default all datasets in a pipeline should use the same schema, but there are often exceptions. For example, in supervised learning we need to include labels in our dataset, but when we serve the model for inference the labels will not be included. In some cases introducing slight schema variations is necessary.

Environments can be used to express such requirements. In particular, features in schema can be associated with a set of environments using default_environment, in_environment and not_in_environment.

For example, in this dataset the tips feature is included as the label for training, but it's missing in the serving data. Without environment specified, it will show up as an anomaly.

In [ ]:

serving_stats = tfdv.generate_statistics_from_csv(SERVING_DATA)
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)

tfdv.display_anomalies(serving_anomalies)

We'll deal with the tips feature below. We also have an INT value in our trip seconds, where our schema expected a FLOAT. By making us aware of that difference, TFDV helps uncover inconsistencies in the way the data is generated for training and serving. It's very easy to be unaware of problems like that until model performance suffers, sometimes catastrophically. It may or may not be a significant issue, but in any case this should be cause for further investigation.

In this case, we can safely convert INT values to FLOATs, so we want to tell TFDV to use our schema to infer the type. Let's do that now.

In [ ]:

options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True)
serving_stats = tfdv.generate_statistics_from_csv(SERVING_DATA, stats_options=options)
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)

tfdv.display_anomalies(serving_anomalies)

Now we just have the tips feature (which is our label) showing up as an anomaly ('Column dropped'). Of course we don't expect to have labels in our serving data, so let's tell TFDV to ignore that.

In [ ]:

# All features are by default in both TRAINING and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

# Specify that 'tips' feature is not in SERVING environment.
tfdv.get_feature(schema, 'tips').not_in_environment.append('SERVING')

serving_anomalies_with_env = tfdv.validate_statistics(
    serving_stats, schema, environment='SERVING')

tfdv.display_anomalies(serving_anomalies_with_env)

Check for drift and skew¶

In addition to checking whether a dataset conforms to the expectations set in the schema, TFDV also provides functionalities to detect drift and skew. TFDV performs this check by comparing the statistics of the different datasets based on the drift/skew comparators specified in the schema.

Drift¶

Drift detection is supported for categorical features and between consecutive spans of data (i.e., between span N and span N+1), such as between different days of training data. We express drift in terms of L-infinity distance, and you can set the threshold distance so that you receive warnings when the drift is higher than is acceptable. Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation.

Skew¶

TFDV can detect three different kinds of skew in your data - schema skew, feature skew, and distribution skew.

Schema Skew¶

Schema skew occurs when the training and serving data do not conform to the same schema. Both training and serving data are expected to adhere to the same schema. Any expected deviations between the two (such as the label feature being only present in the training data but not in serving) should be specified through environments field in the schema.

Feature Skew¶

Feature skew occurs when the feature values that a model trains on are different from the feature values that it sees at serving time. For example, this can happen when:

A data source that provides some feature values is modified between training and serving time
There is different logic for generating features between training and serving. For example, if you apply some transformation only in one of the two code paths.

Distribution Skew¶

Distribution skew occurs when the distribution of the training dataset is significantly different from the distribution of the serving dataset. One of the key causes for distribution skew is using different code or different data sources to generate the training dataset. Another reason is a faulty sampling mechanism that chooses a non-representative subsample of the serving data to train on.

In [ ]:

# Add skew comparator for 'payment_type' feature.
payment_type = tfdv.get_feature(schema, 'payment_type')
payment_type.skew_comparator.infinity_norm.threshold = 0.01

# Add drift comparator for 'company' feature.
company=tfdv.get_feature(schema, 'company')
company.drift_comparator.infinity_norm.threshold = 0.001

skew_anomalies = tfdv.validate_statistics(train_stats, schema,
                                          previous_statistics=eval_stats,
                                          serving_statistics=serving_stats)

tfdv.display_anomalies(skew_anomalies)

In this example we do see some drift, but it is well below the threshold that we've set.

Freeze the schema¶

Now that the schema has been reviewed and curated, we will store it in a file to reflect its "frozen" state.

In [ ]:

from tensorflow.python.lib.io import file_io
from google.protobuf import text_format

file_io.recursive_create_dir(OUTPUT_DIR)
schema_file = os.path.join(OUTPUT_DIR, 'schema.pbtxt')
tfdv.write_schema_text(schema, schema_file)

!cat {schema_file}

When to use TFDV¶

It's easy to think of TFDV as only applying to the start of your training pipeline, as we did here, but in fact it has many uses. Here's a few more:

Validating new data for inference to make sure that we haven't suddenly started receiving bad features
Validating new data for inference to make sure that our model has trained on that part of the decision surface
Validating our data after we've transformed it and done feature engineering (probably using TensorFlow Transform) to make sure we haven't done something wrong