#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
*The Feature Engineering Component of TensorFlow Extended (TFX)*
Note: We recommend running this tutorial in a Colab notebook, with no setup required! Just click "Run in Google Colab".
This Colab notebook provides a very simple example of how TensorFlow Transform (tf.Transform) can be used to preprocess data using exactly the same code for both training a model and serving inferences in production.
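TensorFlow Transform is a library for preprocessing input data for TensorFlow, including creating features that require a full pass over the training dataset. For example, using TensorFlow Transform you could center or scale an input value using statistics computed over the whole dataset, or convert strings to integers by generating a vocabulary over all of the input values.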
TensorFlow has built-in support for manipulations on a single example or a batch of examples. tf.Transform extends these capabilities to support full passes over the entire training dataset.
The output of tf.Transform is exported as a TensorFlow graph which you can use for both training and serving. Using the same graph for both training and serving can prevent skew, since the same transformations are applied in both stages.
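The full-pass idea can be illustrated without the library. The following is a minimal plain-Python sketch, not tf.Transform itself: the helper names analyze_mean and center are hypothetical, and the serving value is made up for illustration. An analysis phase makes one pass over the training data to compute a constant, and that same constant is then applied to both training and serving inputs.

```python
# Plain-Python illustration of the analyze-then-transform pattern.
# analyze_mean and center are hypothetical helpers, not tf.Transform APIs.

def analyze_mean(values):
  """Full pass over the training data: compute a dataset-wide constant."""
  return sum(values) / len(values)

def center(values, mean):
  """Per-example transformation that reuses the precomputed constant."""
  return [v - mean for v in values]

training_x = [1.0, 2.0, 3.0]
train_mean = analyze_mean(training_x)  # computed once, at training time

train_features = center(training_x, train_mean)
# Serving reuses the training constant instead of recomputing it.
serving_features = center([4.0], train_mean)

print(train_features)    # [-1.0, 0.0, 1.0]
print(serving_features)  # [2.0]
```

Because both calls to center share train_mean, training and serving inputs go through an identical transformation, which is the property that helps prevent skew.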
To avoid upgrading Pip in a system when running locally, check to make sure that we're running in Colab. Local systems can of course be upgraded separately.
try:
  import colab
  !pip install --upgrade pip
except:
  pass
Note: In Google Colab, because of package updates, the first time you run this cell you may need to restart the runtime (Runtime > Restart runtime ...).
!pip install -q -U tensorflow_transform==0.24.1
If you are using Google Colab, the first time that you run the cell above, you must restart the runtime (Runtime > Restart runtime ...). This is because of the way that Colab loads packages.
import pprint
import tempfile
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import schema_utils
We'll create some simple dummy data for our simple example:
raw_data is the initial raw data that we're going to preprocess
raw_data_metadata contains the schema that tells us the types of each of the columns in raw_data. In this case, it's very simple.
raw_data = [
    {'x': 1, 'y': 1, 's': 'hello'},
    {'x': 2, 'y': 2, 's': 'world'},
    {'x': 3, 'y': 3, 's': 'hello'}
]
raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'y': tf.io.FixedLenFeature([], tf.float32),
        'x': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string),
    }))
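The preprocessing function is the most important concept of tf.Transform. It is where the transformation of the dataset really happens. It accepts and returns a dictionary of tensors, where a tensor means a Tensor or SparseTensor. Two main groups of API calls typically form the heart of a preprocessing function: ordinary TensorFlow ops, which accept and return tensors (such as the subtraction and multiplication in the function below), and tf.Transform analyzers together with the mappers built on them (such as tft.mean and tft.scale_to_0_1), which run a full pass over the training dataset to compute the constants they use.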
Caution: When you apply your preprocessing function to serving inferences, the constants that were created by analyzers during training do not change. If your data has trend or seasonality components, plan accordingly.
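The caution above can be made concrete with a small plain-Python sketch. It assumes simple min-max scaling with constants frozen at training time; fit_min_max and scale are hypothetical helpers, the drifted serving values are invented, and the sketch makes no claim about how tft.scale_to_0_1 itself treats out-of-range inputs.

```python
# Hypothetical helpers illustrating frozen analyzer constants.
# This is plain min-max scaling, not the tf.Transform implementation.

def fit_min_max(values):
  """Analysis step: one pass over the training data."""
  return min(values), max(values)

def scale(values, lo, hi):
  """Apply min-max scaling with constants fixed at training time."""
  return [(v - lo) / (hi - lo) for v in values]

train_y = [1.0, 2.0, 3.0]
lo, hi = fit_min_max(train_y)  # frozen after training

scaled_train = scale(train_y, lo, hi)
print(scaled_train)  # [0.0, 0.5, 1.0]

# Invented serving values that have drifted above the training range.
drifted_y = [3.0, 5.0]
scaled_drifted = scale(drifted_y, lo, hi)
print(scaled_drifted)  # [1.0, 2.0]
```

In this sketch the value 5.0 maps to 2.0, outside the intended [0, 1] range, because the minimum and maximum still describe the training data. Rerunning the analysis on fresher data is one way to refresh such constants.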
Note: The preprocessing_fn is not directly callable. This means that calling preprocessing_fn(raw_data) will not work. Instead, it must be passed to the Transform Beam API as shown in the following cells.
def preprocessing_fn(inputs):
  """Preprocess input columns into transformed columns."""
  x = inputs['x']
  y = inputs['y']
  s = inputs['s']
  x_centered = x - tft.mean(x)
  y_normalized = tft.scale_to_0_1(y)
  s_integerized = tft.compute_and_apply_vocabulary(s)
  x_centered_times_y_normalized = (x_centered * y_normalized)
  return {
      'x_centered': x_centered,
      'y_normalized': y_normalized,
      's_integerized': s_integerized,
      'x_centered_times_y_normalized': x_centered_times_y_normalized,
  }
Now we're ready to transform our data. We'll use Apache Beam with a direct runner, and supply three inputs:
raw_data - The raw input data that we created above
raw_data_metadata - The schema for the raw data
preprocessing_fn - The function that we created to do our transformation
def main():
  # Ignore the warnings
  with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
        (raw_data, raw_data_metadata) | tft_beam.AnalyzeAndTransformDataset(
            preprocessing_fn))
  transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable
  print('\nRaw data:\n{}\n'.format(pprint.pformat(raw_data)))
  print('Transformed data:\n{}'.format(pprint.pformat(transformed_data)))
if __name__ == '__main__':
  main()
Previously, we used tf.Transform to do this:
x_centered = x - tft.mean(x)
y_normalized = tft.scale_to_0_1(y)
s_integerized = tft.compute_and_apply_vocabulary(s)
x_centered_times_y_normalized = (x_centered * y_normalized)
With input of [1, 2, 3] the mean of x is 2, and we subtract it from x to center our x values at 0. So our result of [-1.0, 0.0, 1.0] is correct.
We wanted to scale our y values between 0 and 1. Our input was [1, 2, 3] so our result of [0.0, 0.5, 1.0] is correct.
We wanted to map our strings to indexes in a vocabulary, and there were only 2 words in our vocabulary ("hello" and "world"). So with input of ["hello", "world", "hello"] our result of [0, 1, 0] is correct. Since "hello" occurs most frequently in this data, it will be the first entry in the vocabulary.
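The frequency ordering described here can be reproduced with a short plain-Python sketch. The helpers build_vocabulary and apply_vocabulary are hypothetical and are not the tf.Transform implementation; in particular, the tie-breaking rule (first seen wins) and the lack of out-of-vocabulary handling are assumptions of this sketch only.

```python
from collections import Counter

# Hypothetical helpers; not the tf.Transform implementation.

def build_vocabulary(tokens):
  """Order distinct tokens by descending frequency (ties: first seen first)."""
  counts = Counter(tokens)
  return [token for token, _ in counts.most_common()]

def apply_vocabulary(tokens, vocabulary):
  """Replace each token with its index in the vocabulary."""
  index = {token: i for i, token in enumerate(vocabulary)}
  # A token missing from the vocabulary would raise KeyError in this sketch.
  return [index[token] for token in tokens]

s = ['hello', 'world', 'hello']
vocabulary = build_vocabulary(s)
s_integerized = apply_vocabulary(s, vocabulary)

print(vocabulary)     # ['hello', 'world']
print(s_integerized)  # [0, 1, 0]
```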
We wanted to create a new feature by crossing x_centered and y_normalized using multiplication. Note that this multiplies the results, not the original values, and our new result of [-0.0, 0.0, 1.0] is correct.
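As a quick check of that arithmetic, here is a plain-Python sketch that multiplies the transformed values from above element by element. It uses ordinary Python floats rather than tensors, so it only illustrates the numbers, including where the -0.0 comes from.

```python
import math

# Transformed values taken from the results discussed above.
x_centered = [-1.0, 0.0, 1.0]
y_normalized = [0.0, 0.5, 1.0]

crossed = [xc * yn for xc, yn in zip(x_centered, y_normalized)]
print(crossed)  # [-0.0, 0.0, 1.0]

# The first entry is -1.0 * 0.0, which is negative zero in IEEE 754
# floating point. It carries a negative sign bit but compares equal to 0.0.
print(crossed[0] == 0.0)                 # True
print(math.copysign(1.0, crossed[0]))    # -1.0

# For contrast, multiplying the original x and y values gives something else.
raw_product = [x * y for x, y in zip([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])]
print(raw_product)  # [1.0, 4.0, 9.0]
```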