Most of this notebook is designed to be run on a Colab TPU. To access a TPU on Colab, go to Runtime -> Change runtime type and choose TPU. Some parts of the code may need to be changed when running on a Google Cloud TPU VM or TPU Node; we have indicated in the code where these changes may be necessary.
At busy times, you may find that there's a lot of competition for TPUs and it can be hard to get access to a free one on Colab. Keep trying!
This notebook is focused on usable code, but if you'd like a more high-level explanation of how to work with TPUs, please check out our associated TPU tutorial.
First, install up-to-date versions of transformers and datasets if you don't have them already.
!pip install --upgrade transformers datasets
We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.
from transformers.utils import send_example_telemetry
send_example_telemetry("tpu_notebook", framework="tensorflow")
This next block will need to be modified depending on how you're accessing the TPU. For Colab, this code should work fine. When running on a TPU VM, pass the argument tpu="local" to the TPUClusterResolver. When running on a non-Colab TPU Node, you'll need to pass the address of the TPU resource. When debugging on CPU/GPU, skip this block.
import tensorflow as tf
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
# On TPU VMs use this line instead:
# resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
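# On a non-Colab TPU Node, pass the name or gRPC address of your TPU resource instead.
# The address below is just a placeholder for illustration:
# resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="grpc://10.0.0.2:8470")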
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
Strategy
In TensorFlow, a Strategy object determines how models and data should be distributed across workers. There is a TPUStrategy specifically for TPU. However, when debugging, we recommend starting with the simplest OneDeviceStrategy to make sure your code works on CPU, and then swapping it for the TPUStrategy once you're sure it's bug-free.
import tensorflow as tf
strategy = tf.distribute.TPUStrategy(resolver)
# For testing without a TPU use this line instead:
# strategy = tf.distribute.OneDeviceStrategy("/cpu:0")
In order for TPU training to work, you must create the model used for training inside the Strategy.scope(). However, other things like Hugging Face tokenizers and the Dataset do not need to be created in this scope.
For this example we will use CoLA, which is a small and simple binary text classification dataset from the GLUE benchmark. We also pad all samples to a fixed maximum length, firstly to make them easy to load as a single array, and secondly because this avoids shape-related issues with XLA later. For more information on XLA compilation and TPUs, see the associated TPU tutorial.
from transformers import AutoTokenizer
from datasets import load_dataset
import numpy as np
model_checkpoint = "distilbert-base-cased"
dataset = load_dataset("glue", "cola")["train"]
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# For simplicity, let's just tokenize our dataset as NumPy arrays
# padded to the maximum length. We discuss other options below!
train_data = tokenizer(
    dataset["sentence"],
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="np",
)
train_data = dict(train_data) # Because the tokenizer returns a dict subclass
train_labels = np.array(dataset["label"])
While preprocessing data you can operate outside of the strategy.scope(), but model creation must take place inside it.
from transformers import TFAutoModelForSequenceClassification
with strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint)
    # You can compile with jit_compile=True when debugging on CPU or GPU to check
    # that XLA compilation works. Remember to take it out when actually running
    # on TPU, though - XLA compilation will be handled for you when running with a
    # TPUStrategy!
    model.compile(optimizer="adam")
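    # For example, when testing on CPU/GPU only (jit_compile is a standard Keras compile
    # argument - remove it again before switching to TPUStrategy):
    # model.compile(optimizer="adam", jit_compile=True)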
tf.data.Dataset
Keras methods like fit() can usually accept a broad range of inputs - a list/tuple/dict of np.ndarray or tf.Tensor, Python generators, a tf.data.Dataset, and so on. This is not the case on TPU.
On TPU, your input must always be a tf.data.Dataset. If you pass anything else to model.fit() when using a TPUStrategy, it will try to coerce it into a tf.data.Dataset. This sometimes works, but will create a lot of console spam and warnings even when it does. As a result, we recommend explicitly creating a tf.data.Dataset in all cases.
# The batch size will be split among TPU workers
# so we scale it up based on how many of them there are
BATCH_SIZE = 8 * strategy.num_replicas_in_sync
tf_dataset = tf.data.Dataset.from_tensor_slices((train_data, train_labels))
tf_dataset = tf_dataset.shuffle(len(tf_dataset))
# You should use drop_remainder on TPU where possible, because a change in the
# batch size will require a new XLA compilation
tf_dataset = tf_dataset.batch(BATCH_SIZE, drop_remainder=True)
tf_dataset
If you made it this far, then this next line should feel very familiar. Note that fit() doesn't actually need to be inside the scope(), as long as the model and dataset were created there!
model.fit(tf_dataset)
And that's it! You just trained a Hugging Face model on TPU.
Although the code above is perfectly usable, the dataset creation has been very simplified: we padded every sample to a fixed maximum length, and we loaded the whole dataset into memory. When your data is too big for this to work, you will need to use a different approach.
Below, we're going to list a few possible approaches to try. Note that some of them may not work on Colab or a TPU Node, so don't panic if you get errors! We'll try to indicate which code will work where, and what the advantages and disadvantages of each method are. When adapting this code for your own projects, pick just one of these approaches - don't try to do them all at once!
TFRecord
TFRecord is the standard tf.data format for storing training data. For very large training jobs, it's often worth preprocessing your data and storing it all as TFRecord, then building your own tf.data pipeline on top of it. This is more work, and often requires you to pay for cloud storage, but it works for training on a wide range of devices (including TPU VM, TPU Node and Colab), and allows for truly massive data pipeline throughput.
When converting to TFRecord, it's a good idea to do your preprocessing and tokenization before writing the TFRecord, so you don't have to do it every time the data is loaded. However, if you intend to use train-time augmentations, be careful not to apply them before writing the TFRecord, or else you'll get exactly the same augmentation each epoch, which defeats the purpose of augmenting your data in the first place! Instead, apply augmentations in the tf.data pipeline that loads your data, as sketched below.
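For example, here is a minimal sketch of what a train-time augmentation could look like inside the loading pipeline. The decode_image_record function and the filename are hypothetical stand-ins for an image TFRecord pipeline; the augmentation map is the point here:
import tensorflow as tf

def augment_fn(sample):
    # Random ops run each time a sample is loaded, so every epoch sees different augmentations
    image = tf.image.random_flip_left_right(sample["pixel_values"])
    image = tf.image.random_brightness(image, max_delta=0.1)
    return {"pixel_values": image, "labels": sample["labels"]}

# decode_image_record is a hypothetical parsing function for an image TFRecord file
# records = tf.data.TFRecordDataset(["images.tfrecords"]).map(decode_image_record)
# records = records.map(augment_fn).batch(BATCH_SIZE, drop_remainder=True)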
First, we initialize our TPU. Skip this block if you're running on CPU.
import tensorflow as tf
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
# On TPU VMs use this line instead:
# resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
Next, we load our strategy, dataset, tokenizer and model just like we did in the first example.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from datasets import load_dataset
strategy = tf.distribute.TPUStrategy(resolver)
# For testing without a TPU use this line instead:
# strategy = tf.distribute.OneDeviceStrategy("/cpu:0")
BATCH_SIZE = 8 * strategy.num_replicas_in_sync
dataset = load_dataset("glue", "cola", split="train")
model_checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
with strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint)
    model.compile(optimizer="adam")
Now, let's tokenize our Hugging Face dataset.
Tip: When using this method in practice, you probably won't be able to load your entire dataset in memory. Instead, load a chunk of the dataset at a time and convert it to a TFRecord file, repeating until you've covered the entire dataset; later, use the list of all the files to create the TFRecordDataset. In this example, we'll just create a single file for simplicity, and sketch the chunked version after the code below.
tokenized_data = tokenizer(
    dataset["sentence"],
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="np",
)
labels = dataset["label"]
with tf.io.TFRecordWriter("dataset.tfrecords") as file_writer:
    for i in range(len(labels)):
        features = {
            "input_ids": tf.train.Feature(
                int64_list=tf.train.Int64List(value=tokenized_data["input_ids"][i])
            ),
            "attention_mask": tf.train.Feature(
                int64_list=tf.train.Int64List(value=tokenized_data["attention_mask"][i])
            ),
            "labels": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[labels[i]])
            ),
        }
        features = tf.train.Features(feature=features)
        example = tf.train.Example(features=features)
        record_bytes = example.SerializeToString()
        file_writer.write(record_bytes)
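As mentioned in the tip above, for a dataset that doesn't fit in memory you would usually write several smaller TFRecord shards rather than one big file. Here is a minimal sketch of that pattern; the shard size and filename pattern are arbitrary choices for illustration:
# Write the data as multiple smaller TFRecord shards instead of one file
shard_size = 2000
shard_filenames = []
for shard_idx, start in enumerate(range(0, len(dataset), shard_size)):
    chunk = dataset[start : start + shard_size]  # slicing a Dataset gives a dict of columns
    chunk_tokens = tokenizer(
        chunk["sentence"],
        padding="max_length",
        truncation=True,
        max_length=128,
        return_tensors="np",
    )
    filename = f"dataset-{shard_idx:05d}.tfrecords"
    with tf.io.TFRecordWriter(filename) as file_writer:
        for i in range(len(chunk["label"])):
            features = tf.train.Features(feature={
                "input_ids": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=chunk_tokens["input_ids"][i])
                ),
                "attention_mask": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=chunk_tokens["attention_mask"][i])
                ),
                "labels": tf.train.Feature(int64_list=tf.train.Int64List(value=[chunk["label"][i]])),
            })
            example = tf.train.Example(features=features)
            file_writer.write(example.SerializeToString())
    shard_filenames.append(filename)
# The list of shard filenames (or their gs:// equivalents) can then be passed to TFRecordDataset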
Now, to load the dataset, we build a TFRecordDataset using the filenames of the file(s) we saved. Ordinarily, you would need to create your own bucket in Google Cloud Storage, upload the files there, and handle authenticating your Python code so it can access them. However, for the sake of this example, we have uploaded the example file to a public bucket for you, so you can get started quickly!
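If you do end up using your own bucket, a minimal sketch of the authentication step on Colab is shown below; it uses the standard google.colab auth helper to set up credentials for your Google account (on a TPU VM or elsewhere you would typically rely on a service account or gcloud auth application-default login instead):
# Colab only: opens a login flow and sets up application default credentials
# that Google Cloud client libraries can use when reading from your bucket
# from google.colab import auth
# auth.authenticate_user()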
def decode_fn(sample):
    features = {
        "input_ids": tf.io.FixedLenFeature((128,), dtype=tf.int64),
        "attention_mask": tf.io.FixedLenFeature((128,), dtype=tf.int64),
        "labels": tf.io.FixedLenFeature((1,), dtype=tf.int64),
    }
    return tf.io.parse_example(sample, features)
# TFRecordDataset can handle gs:// paths!
tf_dataset = tf.data.TFRecordDataset(["gs://matt-tf-tpu-tutorial-datasets/cola/dataset.tfrecords"])
tf_dataset = tf_dataset.map(decode_fn)
tf_dataset = tf_dataset.shuffle(len(dataset)).batch(BATCH_SIZE, drop_remainder=True)
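# A TFRecordDataset can't know how many samples it contains, so we tell tf.data the
# batch count explicitly - this lets Keras show a proper progress bar during fit()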
tf_dataset = tf_dataset.apply(
    tf.data.experimental.assert_cardinality(len(labels) // BATCH_SIZE)
)
And now we can simply fit our dataset as before.
model.fit(tf_dataset)
In summary:
TFRecord advantages: works on the widest range of setups (Colab, TPU Node and TPU VM), and the precomputed records allow very high data pipeline throughput.
TFRecord disadvantages: requires an extra conversion step before training, and usually requires (paid) cloud storage to hold the records.
tf.data pipeline
In all of the examples above, we preprocessed data with a tokenizer and then loaded the preprocessed data to fit our model. However, there is an alternative approach: the data can be stored in its native format, and the preprocessing can be done in the tf.data pipeline itself as the data is loaded!
This is probably the most complex approach, but it can be useful if converting to TFRecord is difficult, such as when you don't want to save preprocessed images. It's especially useful when the dataset you want is already publicly available in cloud storage - this saves you having to create (and pay for!) your own cloud storage bucket.
Many Hugging Face NLP models have complex tokenization schemes that are not yet supported as tf.data operations, and so this approach will not work for them. However, some (e.g. BERT) do have fully TF-compilable tokenization, as sketched below. This is often a great approach for image models, though!
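As a brief aside, here is a minimal sketch of in-graph tokenization for such a model. It assumes your installed transformers version includes TFBertTokenizer (which needs the tensorflow_text package) and that text_dataset stands for a tf.data.Dataset of {"sentence", "labels"} records:
import tensorflow as tf
from transformers import TFBertTokenizer  # requires tensorflow_text

# TFBertTokenizer is built from TensorFlow ops, so tf.data can compile it into the pipeline
tf_tokenizer = TFBertTokenizer.from_pretrained("bert-base-cased")

def tokenize_batch(batch):
    # The tokenizer takes a batch of raw strings and returns input_ids, attention_mask, etc.
    return {**dict(tf_tokenizer(batch["sentence"])), "labels": batch["labels"]}

# Tokenization happens after batching, inside the pipeline itself:
# text_dataset = text_dataset.batch(BATCH_SIZE, drop_remainder=True).map(tokenize_batch)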
Let's see an example of this in action.
First, we initialize our TPU. Skip this block if you're running on CPU.
import tensorflow as tf
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
# On TPU VMs use this line instead:
# resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
Next, we create our strategy as we did in the first example.
import tensorflow as tf
strategy = tf.distribute.TPUStrategy(resolver)
# For testing without a TPU use this line instead:
# strategy = tf.distribute.OneDeviceStrategy("/cpu:0")
Next, let's download an image dataset. We'll use Hugging Face datasets for this, but you can use any other source too.
from datasets import load_dataset
image_dataset = load_dataset("beans", split="train")
Now, let's get a list of the underlying image file paths and labels.
filenames = image_dataset["image_file_path"]
labels = image_dataset["labels"]
Ordinarily, at this point you would need to create your own bucket in Google Cloud Storage, upload the image files there, and handle authenticating your Python code so it can access them. However, for the sake of this example we have uploaded the images to a public bucket for you, so you can get started quickly!
We'll use the quick conversion below to turn the local filenames in the dataset into gs:// paths in Google Cloud Storage.
# Strip everything but the category directory and filenames
base_filenames = ['/'.join(filename.split('/')[-2:]) for filename in filenames]
# Prepend the Google Cloud base path to everything instead
gs_paths = ["gs://matt-tf-tpu-tutorial-datasets/beans/"+filename for filename in base_filenames]
tf_dataset = tf.data.Dataset.from_tensor_slices(
    {"filename": gs_paths, "labels": labels}
)
tf_dataset = tf_dataset.shuffle(len(tf_dataset))
That was pretty painless, but now we come to the tricky bit. It's extremely important to preprocess data in the way that the model expects. Classes like AutoTokenizer and AutoImageProcessor are designed to easily load the exact configuration for any model, so that you're guaranteed that your preprocessing will be correct.
However, this might seem like a problem when we need to do the preprocessing in tf.data! These classes contain framework-agnostic code which tf.data will usually not be able to compile into a pipeline. Don't panic, though - for image datasets we can simply get the normalization values from those classes and then use them in our tf.data pipeline.
Let's use ViT as our image model, and get the mean and std values used to normalize images.
from transformers import AutoImageProcessor
image_model_checkpoint = "google/vit-base-patch16-224"
processor = AutoImageProcessor.from_pretrained(image_model_checkpoint)
image_size = (processor.size["height"], processor.size["width"])
image_mean = processor.image_mean
image_std = processor.image_std
Now we can write a function to load and preprocess the images:
BATCH_SIZE = 8 * strategy.num_replicas_in_sync
def decode_fn(sample):
    image_data = tf.io.read_file(sample["filename"])
    image = tf.io.decode_jpeg(image_data, channels=3)
    image = tf.image.resize(image, image_size)
    array = tf.cast(image, tf.float32)
    array /= 255.0
    array = (array - image_mean) / image_std
    array = tf.transpose(array, perm=[2, 0, 1])  # Swap to channels-first
    return {"pixel_values": array, "labels": sample["labels"]}
tf_dataset = tf_dataset.map(decode_fn)
tf_dataset = tf_dataset.batch(BATCH_SIZE, drop_remainder=True)
print(tf_dataset.element_spec)
Nice! Now we have a pipeline we can feed our model with. Let's try it!
from transformers import TFAutoModelForImageClassification
with strategy.scope():
    model = TFAutoModelForImageClassification.from_pretrained(image_model_checkpoint)
    model.compile(optimizer="adam")
model.fit(tf_dataset)
In summary:
tf.data pipeline advantages: no conversion step, and no need for your own storage when the raw data is already publicly available in cloud storage; works especially well for image models.
tf.data pipeline disadvantages: the most complex approach to write, and it won't work for models whose tokenization can't be expressed as tf.data operations.
model.prepare_tf_dataset()
If you've read any of our other example notebooks, you'll notice we often use the method Dataset.to_tf_dataset() or its higher-level wrapper model.prepare_tf_dataset() to convert Hugging Face Datasets to tf.data.Dataset. These methods can work for TPU, but with several caveats!
The main thing to know is that these methods do not actually convert the entire Hugging Face Dataset. Instead, they create a tf.data pipeline that loads samples from the Dataset. This pipeline uses tf.numpy_function or Dataset.from_generator() to access the underlying Dataset, and as a result the whole pipeline cannot be compiled by TensorFlow. Because of this, and because the pipeline streams from data on a local disk, these methods will not work on Colab TPU or TPU Nodes.
However, if you're running on a TPU VM and you can tolerate TensorFlow throwing some warnings, this method can work! Let's see it in action. By default, the code below will run on CPU so you can try it on Colab, but if you have a TPU VM feel free to try running it on TPU there.
First, we initialize our TPU. Skip this block if you're running on CPU.
import tensorflow as tf
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
# On TPU VMs use this line instead:
# resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
Next, we load our strategy, dataset, tokenizer and model just like we did in the first example.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from datasets import load_dataset
# By default, we run on CPU so you can try this code on Colab
strategy = tf.distribute.OneDeviceStrategy("/cpu:0")
# When actually running on a TPU VM use this line instead:
# strategy = tf.distribute.TPUStrategy(resolver)
BATCH_SIZE = 8 * strategy.num_replicas_in_sync
dataset = load_dataset("glue", "cola", split="train")
model_checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
with strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint)
    model.compile(optimizer="adam")
Next, we add the tokenizer output as columns in the dataset. Since the dataset is stored on disk, this means we can handle data much bigger than our available memory. Once that's done, we can use prepare_tf_dataset to stream data from the Hugging Face Dataset by wrapping it in a tf.data pipeline.
def tokenize_function(examples):
    return tokenizer(
        examples["sentence"], padding="max_length", truncation=True, max_length=128
    )
# This will add the tokenizer output to the dataset as new columns
dataset = dataset.map(tokenize_function)
# prepare_tf_dataset() will choose columns that match the model's input names
tf_dataset = model.prepare_tf_dataset(
    dataset, batch_size=BATCH_SIZE, shuffle=True, tokenizer=tokenizer
)
And now you can fit this dataset just like before!
model.fit(tf_dataset) # Note - will be very slow if you're on CPU
In summary:
prepare_tf_dataset() advantages: by far the simplest code, and it streams samples from disk, so the dataset doesn't have to fit in memory.
prepare_tf_dataset() disadvantages: the pipeline can't be compiled by TensorFlow, so it only works on TPU VMs (not Colab TPU or TPU Nodes), and it may produce a lot of warnings.