#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
This guide demonstrates how to perform basic training on Tensor Processing Units (TPUs) and TPU Pods, which are collections of TPU devices connected by dedicated high-speed network interfaces, with tf.keras and custom training loops.
TPUs are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. They are available through Google Colab, the TPU Research Cloud, and Cloud TPU.
Before you run this Colab notebook, make sure that your hardware accelerator is a TPU by checking your notebook settings: Runtime > Change runtime type > Hardware accelerator > TPU v2.
Import some necessary libraries, including TensorFlow Datasets:
import tensorflow as tf
import os
import tensorflow_datasets as tfds
TPUs are typically Cloud TPU workers, which are different from the local process running the user's Python program. Thus, you need to do some initialization work to connect to the remote cluster and initialize the TPUs. Note that the tpu argument to tf.distribute.cluster_resolver.TPUClusterResolver is a special address just for Colab. If you are running your code on Google Compute Engine (GCE), you should instead pass in the name of your Cloud TPU.
Note: The TPU initialization code has to be at the beginning of your program.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
After the TPU is initialized, you can use manual device placement to place the computation on a single TPU device:
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
with tf.device('/TPU:0'):
  c = tf.matmul(a, b)
print("c device: ", c.device)
print(c)
Usually, you run your model on multiple TPUs in a data-parallel way. To distribute your model on multiple TPUs (as well as multiple GPUs or multiple machines), TensorFlow offers the tf.distribute.Strategy API. You can replace your distribution strategy and the model will run on any given (TPU) device. Learn more in the Distributed training with TensorFlow guide.
Using the tf.distribute.TPUStrategy option implements synchronous distributed training. TPUs provide their own implementation of efficient all-reduce and other collective operations across multiple TPU cores, which are used in TPUStrategy.
To demonstrate this, create a tf.distribute.TPUStrategy object:
strategy = tf.distribute.TPUStrategy(resolver)
To replicate a computation so it can run on all TPU cores, you can pass it into the Strategy.run API. Below is an example that shows all cores receiving the same inputs (a, b) and performing matrix multiplication on each core independently. The outputs will be the values from all the replicas.
@tf.function
def matmul_fn(x, y):
  z = tf.matmul(x, y)
  return z
z = strategy.run(matmul_fn, args=(a, b))
print(z)
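As a rough illustration of what the replicated call produces, the sketch below simulates it in plain Python rather than TensorFlow. The helpers matmul and run_on_replicas and the replica count of 8 are assumptions made for illustration only; the real Strategy.run returns a per-replica value container rather than a Python list.

```python
# Plain-Python simulation of replicating one computation across TPU cores.
# Not TensorFlow: `matmul` and `run_on_replicas` are hypothetical helpers.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(xv * yv for xv, yv in zip(row, col)) for col in zip(*y)]
            for row in x]

def run_on_replicas(fn, args, num_replicas):
    """Call `fn` once per simulated replica with the same arguments."""
    return [fn(*args) for _ in range(num_replicas)]

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Assumed replica count for this sketch.
per_replica = run_on_replicas(matmul, (a, b), num_replicas=8)
print(len(per_replica))  # 8
print(per_replica[0])    # [[22.0, 28.0], [49.0, 64.0]]
```

Every simulated replica holds the same 2x2 product because each one received identical inputs, which matches the behavior described above.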
Having covered the basic concepts, consider a more concrete example. This section demonstrates how to use the distribution strategy tf.distribute.TPUStrategy to train a Keras model on a Cloud TPU.
Start with a definition of a Sequential Keras model for image classification on the MNIST dataset. It's no different from what you would use if you were training on CPUs or GPUs. Note that Keras model creation needs to be inside the Strategy.scope, so the variables can be created on each TPU device. Other parts of the code do not need to be inside the Strategy scope.
def create_model():
  regularizer = tf.keras.regularizers.L2(1e-5)
  return tf.keras.Sequential(
      [tf.keras.layers.Conv2D(256, 3, input_shape=(28, 28, 1),
                              activation='relu',
                              kernel_regularizer=regularizer),
       tf.keras.layers.Conv2D(256, 3,
                              activation='relu',
                              kernel_regularizer=regularizer),
       tf.keras.layers.Flatten(),
       tf.keras.layers.Dense(256,
                             activation='relu',
                             kernel_regularizer=regularizer),
       tf.keras.layers.Dense(128,
                             activation='relu',
                             kernel_regularizer=regularizer),
       tf.keras.layers.Dense(10,
                             kernel_regularizer=regularizer)])
This model puts L2 regularization terms on the weights of each layer, so that the custom training loop below can show how you pick them up from Model.losses.
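To make the role of Model.losses more concrete, here is a minimal plain-Python sketch of the penalty an L2 regularizer contributes for one weight tensor: the regularization factor times the sum of the squared weights. The function name l2_penalty and the sample weights are made up for illustration; in the real model, Keras computes one such term per regularized layer.

```python
def l2_penalty(weights, factor=1e-5):
    """Return factor * sum(w**2), the L2 penalty for a flat list of weights."""
    return factor * sum(w * w for w in weights)

# Hypothetical weights for two tiny "layers".
layer_weights = [[1.0, -2.0, 3.0], [0.5, 0.5]]
penalties = [l2_penalty(w) for w in layer_weights]
total_penalty = sum(penalties)
print(penalties, total_penalty)
```

The list of per-layer penalties plays the role of Model.losses here, and their sum is what a training loop would add to the data loss.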
Efficient use of the tf.data.Dataset API is critical when using a Cloud TPU. You can learn more about dataset performance in the Input pipeline performance guide.
If you are using TPU Nodes, you need to store all data files read by the TensorFlow Dataset in Google Cloud Storage (GCS) buckets. If you are using TPU VMs, you can store data wherever you like. For more information on TPU Nodes and TPU VMs, refer to the TPU System Architecture documentation.
For most use cases, it is recommended to convert your data into the TFRecord format and use a tf.data.TFRecordDataset to read it. Check the TFRecord and tf.Example tutorial for details on how to do this. It is not a hard requirement and you can use other dataset readers, such as tf.data.FixedLengthRecordDataset or tf.data.TextLineDataset.
You can load entire small datasets into memory using tf.data.Dataset.cache.
Regardless of the data format used, it is strongly recommended that you use large files, on the order of 100 MB each. This is especially important in this networked setting, where the overhead of opening a file is significantly higher than on a local disk.
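As a back-of-the-envelope aid for the file-size recommendation, the following sketch estimates how many files to split a dataset into. The helper num_shards and the 10 GiB dataset size are hypothetical, and the 100 MB target is treated as 100 MiB for simplicity.

```python
import math

MIB = 1024 * 1024

def num_shards(total_bytes, target_shard_bytes=100 * MIB):
    """Return the number of files needed so each is at most the target size."""
    return max(1, math.ceil(total_bytes / target_shard_bytes))

dataset_bytes = 10 * 1024 * MIB  # a hypothetical 10 GiB dataset
shards = num_shards(dataset_bytes)
print(shards)  # 103
```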
As shown in the code below, use the TensorFlow Datasets tfds.load function to get a copy of the MNIST training and test data. Note that try_gcs is specified to use a copy that is available in a public GCS bucket. If you don't specify this, the TPU will not be able to access the downloaded data.
def get_dataset(batch_size, is_training=True):
  split = 'train' if is_training else 'test'
  dataset, info = tfds.load(name='mnist', split=split, with_info=True,
                            as_supervised=True, try_gcs=True)
  # Normalize the input data.
  def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255.0
    return image, label
  dataset = dataset.map(scale)
  # Only shuffle and repeat the dataset in training. The advantage of having an
  # infinite dataset for training is to avoid the potential last partial batch
  # in each epoch, so that you don't need to think about scaling the gradients
  # based on the actual batch size.
  if is_training:
    dataset = dataset.shuffle(10000)
    dataset = dataset.repeat()
  dataset = dataset.batch(batch_size)
  return dataset
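The comment about the last partial batch can be checked with simple arithmetic. The sketch below is plain Python, not tf.data; batch_sizes is a hypothetical helper that lists the batch sizes one pass over a finite dataset would yield. With the batch size of 200 used later in this guide, the 60,000 MNIST training examples divide evenly, but other batch sizes leave a smaller final batch unless the dataset is repeated before batching.

```python
def batch_sizes(num_examples, batch_size):
    """Sizes of the batches produced by one pass over a finite dataset."""
    full, remainder = divmod(num_examples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])

even = batch_sizes(60000, 200)
uneven = batch_sizes(60000, 256)
print(len(even), even[-1])      # 300 200
print(len(uneven), uneven[-1])  # 235 96
```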
You can train your model with the Keras Model.fit and Model.compile APIs. There is nothing TPU-specific in this step: you write the code as if you were using multiple GPUs and a MirroredStrategy instead of the TPUStrategy. You can learn more in the Distributed training with Keras tutorial.
with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])
batch_size = 200
steps_per_epoch = 60000 // batch_size
validation_steps = 10000 // batch_size
train_dataset = get_dataset(batch_size, is_training=True)
test_dataset = get_dataset(batch_size, is_training=False)
model.fit(train_dataset,
          epochs=5,
          steps_per_epoch=steps_per_epoch,
          validation_data=test_dataset,
          validation_steps=validation_steps)
To reduce Python overhead and maximize the performance of your TPU, pass in the steps_per_execution argument to Keras Model.compile. In this example, it increases throughput by about 50%:
with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                # Anything between 2 and `steps_per_epoch` could help here.
                steps_per_execution=50,
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])
model.fit(train_dataset,
          epochs=5,
          steps_per_epoch=steps_per_epoch,
          validation_data=test_dataset,
          validation_steps=validation_steps)
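To see why this helps, it can be useful to count how often the Python host has to launch work on the TPU. The sketch below is plain arithmetic under the assumption that each execution runs steps_per_execution batches and any remainder forms one shorter execution; executions_per_epoch is a hypothetical helper, not a Keras API.

```python
import math

def executions_per_epoch(steps_per_epoch, steps_per_execution=1):
    """Number of host-side launches needed to cover one epoch."""
    return math.ceil(steps_per_epoch / steps_per_execution)

steps_per_epoch = 60000 // 200  # 300 steps, as in the example above
print(executions_per_epoch(steps_per_epoch))      # 300
print(executions_per_epoch(steps_per_epoch, 50))  # 6
```

Under these assumptions, the host launches 6 executions per epoch instead of 300, which is where the reduced Python overhead comes from.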
You can also create and train your model using tf.function and tf.distribute APIs directly. You can use the Strategy.distribute_datasets_from_function API to distribute the tf.data.Dataset given a dataset function. Note that in the example below the batch size passed into the Dataset is the per-replica batch size instead of the global batch size. To learn more, check out the Custom training with tf.distribute.Strategy tutorial.
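The relationship between the global and per-replica batch sizes is plain integer arithmetic. In this sketch the replica count of 8 is an assumption (in real code it comes from strategy.num_replicas_in_sync), and split_batch is a hypothetical helper.

```python
def split_batch(global_batch_size, num_replicas):
    """Return (per_replica_batch_size, effective_global_batch_size)."""
    per_replica = global_batch_size // num_replicas
    return per_replica, per_replica * num_replicas

print(split_batch(200, 8))  # (25, 200)
# If the global batch size is not divisible by the replica count, integer
# division silently shrinks the effective global batch.
print(split_batch(200, 6))  # (33, 198)
```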
First, create the model, datasets and tf.functions:
# Create the model, optimizer and metrics inside the `tf.distribute.Strategy`
# scope, so that the variables can be mirrored on each device.
with strategy.scope():
  model = create_model()
  optimizer = tf.keras.optimizers.Adam()
  training_loss = tf.keras.metrics.Mean('training_loss', dtype=tf.float32)
  training_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      'training_accuracy', dtype=tf.float32)
# Calculate per replica batch size, and distribute the `tf.data.Dataset`s
# on each TPU worker.
per_replica_batch_size = batch_size // strategy.num_replicas_in_sync
train_dataset = strategy.distribute_datasets_from_function(
    lambda _: get_dataset(per_replica_batch_size, is_training=True))

@tf.function
def train_step(iterator):
  """The step function for one training step."""
  def step_fn(inputs):
    """The computation to run on each TPU device."""
    images, labels = inputs
    with tf.GradientTape() as tape:
      logits = model(images, training=True)
      per_example_loss = tf.keras.losses.sparse_categorical_crossentropy(
          labels, logits, from_logits=True)
      loss = tf.nn.compute_average_loss(per_example_loss)
      model_losses = model.losses
      if model_losses:
        loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(list(zip(grads, model.trainable_variables)))
    training_loss.update_state(loss * strategy.num_replicas_in_sync)
    training_accuracy.update_state(labels, logits)
  strategy.run(step_fn, args=(next(iterator),))
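The loss scaling in step_fn can be easier to follow with numbers. The sketch below re-implements the idea in plain Python under the assumption that every replica sees the same number of examples: each replica divides the sum of its per-example losses by the global batch size, so adding the replica results gives the global mean, and each replica contributes an equal share of the regularization term. The function names mirror the TensorFlow ones but are simplified stand-ins, and the loss values are made up.

```python
def compute_average_loss(per_example_loss, num_replicas):
    """Sum of this replica's losses divided by the global batch size."""
    global_batch_size = len(per_example_loss) * num_replicas
    return sum(per_example_loss) / global_batch_size

def scale_regularization_loss(regularization_loss, num_replicas):
    """Each replica contributes an equal share of the regularization term."""
    return regularization_loss / num_replicas

# Made-up per-example losses on two simulated replicas.
replica_losses = [[0.5, 1.5], [1.0, 3.0]]
num_replicas = len(replica_losses)
reg = 0.25  # made-up total regularization penalty

per_replica_totals = [
    compute_average_loss(losses, num_replicas)
    + scale_regularization_loss(reg, num_replicas)
    for losses in replica_losses
]
summed = sum(per_replica_totals)
global_mean = sum(sum(l) for l in replica_losses) / 4
print(summed, global_mean + reg)  # 1.75 1.75
```

In this sketch, multiplying one replica's value by the replica count, as the training_loss update does, recovers that replica's own mean loss plus the full regularization term (0.625 * 2 = 1.25 for the first replica).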
Then, run the training loop:
steps_per_eval = 10000 // batch_size
train_iterator = iter(train_dataset)
for epoch in range(5):
  print('Epoch: {}/5'.format(epoch))
  for step in range(steps_per_epoch):
    train_step(train_iterator)
  print('Current step: {}, training loss: {}, training accuracy: {}%'.format(
      optimizer.iterations.numpy(),
      round(float(training_loss.result()), 4),
      round(float(training_accuracy.result()) * 100, 2)))
  training_loss.reset_states()
  training_accuracy.reset_states()
You can improve the performance by running multiple steps within a tf.function. This is achieved by wrapping the Strategy.run call with a tf.range inside tf.function, and AutoGraph will convert it to a tf.while_loop on the TPU worker. You can learn more about tf.functions in the Better performance with tf.function guide.
Despite the improved performance, there are tradeoffs with this method compared to running a single step inside a tf.function. Running multiple steps in a tf.function is less flexible: you cannot run things eagerly or run arbitrary Python code within the steps.
@tf.function
def train_multiple_steps(iterator, steps):
  """Runs `steps` training steps inside a single `tf.function` call."""
  def step_fn(inputs):
    """The computation to run on each TPU device."""
    images, labels = inputs
    with tf.GradientTape() as tape:
      logits = model(images, training=True)
      per_example_loss = tf.keras.losses.sparse_categorical_crossentropy(
          labels, logits, from_logits=True)
      loss = tf.nn.compute_average_loss(per_example_loss)
      model_losses = model.losses
      if model_losses:
        loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(list(zip(grads, model.trainable_variables)))
    training_loss.update_state(loss * strategy.num_replicas_in_sync)
    training_accuracy.update_state(labels, logits)
  for _ in tf.range(steps):
    strategy.run(step_fn, args=(next(iterator),))

# Convert `steps_per_epoch` to `tf.Tensor` so the `tf.function` won't get
# retraced if the value changes.
train_multiple_steps(train_iterator, tf.convert_to_tensor(steps_per_epoch))
print('Current step: {}, training loss: {}, training accuracy: {}%'.format(
    optimizer.iterations.numpy(),
    round(float(training_loss.result()), 4),
    round(float(training_accuracy.result()) * 100, 2)))
To learn more about Cloud TPUs and how to use them:
- tf.distribute.TPUStrategy, with examples showing best practices.
- tf.tpu.experimental.embedding for TPU embeddings. In addition, TensorFlow Recommenders has tfrs.layers.embedding.TPUEmbedding. Embeddings provide efficient and dense representations, capturing complex similarities and relationships between features. TensorFlow's TPU-specific embedding support allows you to train embeddings that are larger than the memory of a single TPU device, and to use sparse and ragged inputs on TPUs.