# Analyze `tf.data` performance with the TF Profiler

## Overview

This guide assumes familiarity with the TensorFlow [Profiler](https://www.tensorflow.org/guide/profiler) and [`tf.data`](https://www.tensorflow.org/guide/data). It aims to provide step-by-step instructions with examples to help users diagnose and fix input pipeline performance issues.

To begin, collect a profile of your TensorFlow job. Instructions on how to do so are available for [CPUs/GPUs](https://www.tensorflow.org/guide/profiler#collect_performance_data) and [Cloud TPUs](https://cloud.google.com/tpu/docs/cloud-tpu-tools#capture_profile).

![TensorFlow Trace Viewer](images/data_performance_analysis/trace_viewer.png "The trace viewer page of the TensorFlow Profiler")

The analysis workflow detailed below focuses on the trace viewer tool in the Profiler. This tool displays a timeline that shows the duration of ops executed by your TensorFlow program and allows you to identify which ops take the longest to execute. For more information on the trace viewer, check out [this section](https://www.tensorflow.org/guide/profiler#trace_viewer) of the TF Profiler guide. In general, `tf.data` events appear on the host CPU timeline.

## Analysis Workflow

_Please follow the workflow below. If you have feedback to help us improve it, please [create a GitHub issue](https://github.com/tensorflow/tensorflow/issues/new/choose) with the label “comp:data”._

### 1. Is your `tf.data` pipeline producing data fast enough?

Begin by ascertaining whether the input pipeline is the bottleneck for your TensorFlow program.

To do so, look for `IteratorGetNext::DoCompute` ops in the trace viewer. In general, you expect to see these at the start of a step. These slices represent the time it takes for your input pipeline to yield a batch of elements when it is requested. If you’re using Keras or iterating over your dataset in a `tf.function`, these should be found in `tf_data_iterator_get_next` threads.

Note that if you’re using a [distribution strategy](https://www.tensorflow.org/guide/distributed_training), you may see `IteratorGetNextAsOptional::DoCompute` events instead of `IteratorGetNext::DoCompute` (as of TF 2.3).

![image](images/data_performance_analysis/get_next_fast.png "If your IteratorGetNext::DoCompute calls return quickly, `tf.data` is not your bottleneck.")

**If the calls return quickly (<= 50 us),** your data is available when it is requested. The input pipeline is not your bottleneck; see the [Profiler guide](https://www.tensorflow.org/guide/profiler) for more generic performance analysis tips.

![image](images/data_performance_analysis/get_next_slow.png "If your IteratorGetNext::DoCompute calls return slowly, `tf.data` is not producing data quickly enough.")

**If the calls return slowly,** `tf.data` is unable to keep up with the consumer’s requests. Continue to the next section.

### 2. Are you prefetching data?

The best practice for input pipeline performance is to insert a `tf.data.Dataset.prefetch` transformation at the end of your `tf.data` pipeline. This transformation overlaps the input pipeline’s preprocessing computation with the next step of model computation and is required for optimal input pipeline performance when training your model. If you’re prefetching data, you should see an `Iterator::Prefetch` slice on the same thread as the `IteratorGetNext::DoCompute` op.

![image](images/data_performance_analysis/prefetch.png "If you're prefetching data, you should see an `Iterator::Prefetch` slice in the same stack as the `IteratorGetNext::DoCompute` op.")

**If you don’t have a `prefetch` at the end of your pipeline**, you should add one. For more information about `tf.data` performance recommendations, see the [tf.data performance guide](https://www.tensorflow.org/guide/data_performance#prefetching).
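As a minimal sketch (assuming `dataset` is the pipeline you already feed to your model), adding prefetching is a one-line change:

```python
dataset = ...  # your existing input pipeline
# Overlap preprocessing of upcoming batches with the current training step.
# tf.data.AUTOTUNE lets the runtime pick the buffer size dynamically.
dataset = dataset.prefetch(tf.data.AUTOTUNE)
```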
**If you’re already prefetching data**, and the input pipeline is still your bottleneck, continue to the next section to further analyze performance.

### 3. Are you reaching high CPU utilization?

`tf.data` achieves high throughput by trying to make the best possible use of available resources. In general, even when running your model on an accelerator like a GPU or TPU, the `tf.data` pipelines are run on the CPU. You can check your utilization with tools like [sar](https://linux.die.net/man/1/sar) and [htop](https://en.wikipedia.org/wiki/Htop), or in the [cloud monitoring console](https://cloud.google.com/monitoring/docs/monitoring_in_console) if you’re running on GCP.

**If your utilization is low,** this suggests that your input pipeline may not be taking full advantage of the host CPU. You should consult the [tf.data performance guide](https://www.tensorflow.org/guide/data_performance) for best practices. If you have applied the best practices and utilization and throughput remain low, continue to [Bottleneck analysis](#4_bottleneck_analysis) below.

**If your utilization is approaching the resource limit**, to improve performance further you need to either improve the efficiency of your input pipeline (for example, by avoiding unnecessary computation) or offload computation.

You can improve the efficiency of your input pipeline by avoiding unnecessary computation in `tf.data`. One way of doing this is inserting a [`tf.data.Dataset.cache`](https://www.tensorflow.org/guide/data_performance#caching) transformation after computation-intensive work if your data fits into memory; this reduces computation at the cost of increased memory usage. Additionally, disabling intra-op parallelism in `tf.data` has the potential to increase efficiency by more than 10%, and can be done by setting the following option on your input pipeline:

```python
dataset = ...
options = tf.data.Options()
options.experimental_threading.max_intra_op_parallelism = 1
dataset = dataset.with_options(options)
```

### 4. Bottleneck Analysis

The following section walks through how to read `tf.data` events in the trace viewer to understand where the bottleneck is and possible mitigation strategies.

#### Understanding `tf.data` events in the Profiler

Each `tf.data` event in the Profiler has the name `Iterator::<Dataset>`, where `<Dataset>` is the name of the dataset source or transformation. Each event also has the long name `Iterator::<Dataset_1>::...::<Dataset_n>`, which you can see by clicking on the `tf.data` event. In the long name, `<Dataset_n>` matches `<Dataset>` from the (short) name, and the other datasets in the long name represent downstream transformations.

![image](images/data_performance_analysis/map_long_name.png "tf.data.Dataset.range(10).map(lambda x: x).repeat(2).batch(5)")

For example, the above screenshot was generated from the following code:

```python
dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: x)
dataset = dataset.repeat(2)
dataset = dataset.batch(5)
```

Here, the `Iterator::Map` event has the long name `Iterator::BatchV2::FiniteRepeat::Map`. Note that the dataset names may differ slightly from the Python API (for example, FiniteRepeat instead of Repeat), but should be intuitive enough to parse.
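If you want to reproduce a trace like this yourself, one option is to capture a profile programmatically while iterating over the dataset. The following is a minimal sketch; `/tmp/logdir` is an arbitrary log directory that you can point TensorBoard at:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: x)
dataset = dataset.repeat(2)
dataset = dataset.batch(5)

# Record a profile while the pipeline runs; open the result in
# TensorBoard's Profile tab and select the trace viewer tool.
tf.profiler.experimental.start("/tmp/logdir")
for batch in dataset:
  pass
tf.profiler.experimental.stop()
```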
##### Synchronous and asynchronous transformations

For synchronous `tf.data` transformations (such as `Batch` and `Map`), you will see events from upstream transformations on the same thread. In the above example, since all the transformations used are synchronous, all the events appear on the same thread.

For asynchronous transformations (such as `Prefetch`, `ParallelMap`, `ParallelInterleave` and `MapAndBatch`), events from upstream transformations will be on a different thread. In such cases, the “long name” can help you identify which transformation in a pipeline an event corresponds to.

![image](images/data_performance_analysis/async_long_name.png "tf.data.Dataset.range(10).map(lambda x: x).repeat(2).batch(5).prefetch(1)")

For example, the above screenshot was generated from the following code:

```python
dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: x)
dataset = dataset.repeat(2)
dataset = dataset.batch(5)
dataset = dataset.prefetch(1)
```

Here, the `Iterator::Prefetch` events are on the `tf_data_iterator_get_next` threads. Since `Prefetch` is asynchronous, its input events (`BatchV2`) will be on a different thread, and can be located by searching for the long name `Iterator::Prefetch::BatchV2`. In this case, they are on the `tf_data_iterator_resource` thread. From its long name, you can deduce that `BatchV2` is upstream of `Prefetch`. Furthermore, the `parent_id` of the `BatchV2` event will match the ID of the `Prefetch` event.

#### Identifying the bottleneck

In general, to identify the bottleneck in your input pipeline, walk the pipeline from the outermost transformation all the way to the source. Starting from the final transformation in your pipeline, recurse into upstream transformations until you find a slow transformation or reach a source dataset, such as `TFRecord`. In the example above, you would start from `Prefetch`, then walk upstream to `BatchV2`, `FiniteRepeat`, `Map`, and finally `Range`.

In general, a slow transformation is one whose events are long, but whose input events are short. Some examples follow below.

Note that the final (outermost) transformation in most host input pipelines is the `Iterator::Model` event. The Model transformation is introduced automatically by the `tf.data` runtime and is used for instrumenting and autotuning the input pipeline’s performance.

If your job is using a [distribution strategy](https://www.tensorflow.org/guide/distributed_training), the trace viewer will contain additional events that correspond to the device input pipeline. The outermost transformation of the device pipeline (nested under `IteratorGetNextOp::DoCompute` or `IteratorGetNextAsOptionalOp::DoCompute`) will be an `Iterator::Prefetch` event with an upstream `Iterator::Generator` event. You can find the corresponding host pipeline by searching for `Iterator::Model` events.

##### Example 1

![image](images/data_performance_analysis/example_1_cropped.png "Example 1")

The above screenshot is generated from the following input pipeline:

```python
dataset = tf.data.TFRecordDataset(filename)
dataset = dataset.map(parse_record)
dataset = dataset.batch(32)
dataset = dataset.repeat()
```

In the screenshot, observe that (1) the `Iterator::Map` events are long, but (2) their input events (`Iterator::FlatMap`) return quickly. This suggests that the sequential `Map` transformation is the bottleneck.

Note that in the screenshot, the `InstantiatedCapturedFunction::Run` event corresponds to the time it takes to execute the map function.
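Because the sequential `Map` is the bottleneck here, one likely mitigation (covered under “Addressing the bottleneck” below) is to parallelize it. As a sketch, reusing the `parse_record` function and `filename` from the example above:

```python
dataset = tf.data.TFRecordDataset(filename)
# Parse multiple records concurrently instead of one at a time;
# tf.data.AUTOTUNE lets the runtime choose the degree of parallelism.
dataset = dataset.map(parse_record, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(32)
dataset = dataset.repeat()
```

Example 2 shows how such a parallel map appears in a trace.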
##### Example 2

![image](images/data_performance_analysis/example_2_cropped.png "Example 2")

The above screenshot is generated from the following input pipeline:

```python
dataset = tf.data.TFRecordDataset(filename)
dataset = dataset.map(parse_record, num_parallel_calls=2)
dataset = dataset.batch(32)
dataset = dataset.repeat()
```

This example is similar to the one above, but uses `ParallelMap` instead of `Map`. Notice here that (1) the `Iterator::ParallelMap` events are long, but (2) their input events `Iterator::FlatMap` (which are on a different thread, since `ParallelMap` is asynchronous) are short. This suggests that the `ParallelMap` transformation is the bottleneck.

#### Addressing the bottleneck

##### Source datasets

If you’ve identified a dataset source as the bottleneck, such as reading from TFRecord files, you can improve performance by parallelizing data extraction. To do so, ensure that your data is sharded across multiple files and use `tf.data.Dataset.interleave` with the `num_parallel_calls` parameter set to `tf.data.AUTOTUNE`. If determinism is not important to your program, you can further improve performance by setting `deterministic=False` on `tf.data.Dataset.interleave` (available as of TF 2.2). For example, if you’re reading from TFRecords, you can do the following:

```python
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.interleave(tf.data.TFRecordDataset,
                             num_parallel_calls=tf.data.AUTOTUNE,
                             deterministic=False)
```

Note that sharded files should be reasonably large to amortize the overhead of opening a file. For more details on parallel data extraction, see [this section](https://www.tensorflow.org/guide/data_performance#parallelizing_data_extraction) of the `tf.data` performance guide.

##### Transformation datasets

If you’ve identified an intermediate `tf.data` transformation as the bottleneck, you can address it by parallelizing the transformation or [caching the computation](https://www.tensorflow.org/guide/data_performance#caching) if your data fits into memory and it is appropriate.
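For instance, if the parsing work from Example 1 is expensive but the parsed records fit in memory, caching them avoids re-running the map on every epoch. A sketch, again reusing the hypothetical `parse_record` and `filename` from the examples above:

```python
dataset = tf.data.TFRecordDataset(filename)
dataset = dataset.map(parse_record)
# Materialize the parsed records in memory during the first epoch;
# subsequent epochs read from the cache instead of re-running the map.
dataset = dataset.cache()
dataset = dataset.batch(32)
dataset = dataset.repeat()
```

Transformations placed after `cache` (here, `batch`) still run every epoch, so place the cache directly after the expensive work you want to avoid repeating.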
Some transformations, such as `Map`, have parallel counterparts; the [`tf.data` performance guide demonstrates](https://www.tensorflow.org/guide/data_performance#parallelizing_data_transformation) how to parallelize these. Other transformations, such as `Filter`, `Unbatch`, and `Batch`, are inherently sequential; you can parallelize them by introducing “outer parallelism”. For example, supposing your input pipeline initially looks like the following, with `Batch` as the bottleneck:

```python
filenames = tf.data.Dataset.list_files(file_path, shuffle=is_training)
dataset = filenames_to_dataset(filenames)
dataset = dataset.batch(batch_size)
```

You can introduce “outer parallelism” by running multiple copies of the input pipeline over sharded inputs and combining the results:

```python
filenames = tf.data.Dataset.list_files(file_path, shuffle=is_training)

def make_dataset(shard_index):
  # Each copy of the pipeline processes a disjoint shard of the files.
  shard = filenames.shard(NUM_SHARDS, shard_index)
  dataset = filenames_to_dataset(shard)
  return dataset.batch(batch_size)

indices = tf.data.Dataset.range(NUM_SHARDS)
dataset = indices.interleave(make_dataset,
                             num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
```

## Additional resources

*   [tf.data performance guide](https://www.tensorflow.org/guide/data_performance) on how to write performant `tf.data` input pipelines
*   [Inside TensorFlow video: `tf.data` best practices](https://www.youtube.com/watch?v=ZnukSLKEw34)
*   [Profiler guide](https://www.tensorflow.org/guide/profiler)
*   [Profiler tutorial with Colab](https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras)