The following is the structure of a TIFF file.
Each Image File Directory (IFD) stores the metadata of one resolution level (including the main/highest-resolution image) as TAGs.
(Above image is from http://paulbourke.net/dataformats/tiff/tiff_summary.pdf [accessed Dec 9th, 2020])
For a tiled multi-resolution TIFF image, the TileWidth and TileLength TAGs of an IFD store the tile dimensions, while the TileOffsets and TileByteCounts TAGs store each tile's byte offset and the number of (compressed) bytes in the tile.
(This link shows all the TAGs available through the libtiff library.)
(Above images are from https://www.blackice.com/images/Cisco.GIF and https://docs.nframes.com/input-%2526-output/output-formats/ [accessed July 30th, 2020])
Since TileOffsets and TileByteCounts are arrays of numbers used to locate each tile's raw (compressed) data, it is important to read the relevant tiles' compressed raw image data from the file quickly under any access pattern.
This access pattern (#1: sequential access) can occur when a TIFF file is read by a single thread/process, or by multiple threads/processes, for conversion or inference without any optimization.
This access pattern (#2: partial, out-of-order access) usually occurs in deep-learning model inference use cases. For inference, only parts of the image are used, and tiles are not accessed sequentially.
For example, a list of regions to be loaded/processed can be split across multiple threads/processes, so tiles can be accessed out of order. Furthermore, the (internal) tiles to be read for a specific region (patch) are not necessarily contiguous (e.g., the tile indices for positions [x, y] = (0, 0) and (0, 1) wouldn't be contiguous).
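To make the non-contiguity concrete, here is a small sketch (plain Python; `tile_index` is an illustrative helper, not a cuCIM API) that maps a pixel position to the flat index of its tile:

```python
import math

def tile_index(x, y, image_width, tile_size=256):
    """Map a pixel position to the flat index of the tile containing it.

    TIFF stores tiles row by row, so the flat index is
    row * tiles_across + column.
    """
    tiles_across = math.ceil(image_width / tile_size)
    return (y // tile_size) * tiles_across + (x // tile_size)
```

With the experiment image (92,344 pixels wide, 256-pixel tiles), `tiles_across` is 361, so vertically adjacent tiles are 361 indices apart and therefore far apart in the file as well.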
This access pattern (#3: random patch access) usually occurs in deep-learning model training use cases.
To obtain unbiased neural-network weights, randomized, augmented training data must be provided during training, which means a random partial image region (patch), together with its label, needs to be picked from the possible patch positions and file paths.
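The random patch sampling described above can be sketched as follows (a minimal illustration; `sample_patch`, the file list, and the patch size are made-up names/values, and a real pipeline would also apply augmentations):

```python
import random

def sample_patch(file_list, image_sizes, patch_size=256):
    """Pick a random (file, x, y) patch position across a dataset.

    image_sizes maps each file path to its (width, height); the patch
    position is clamped so the patch lies fully inside the image.
    """
    path = random.choice(file_list)
    width, height = image_sizes[path]
    x = random.randint(0, width - patch_size)
    y = random.randint(0, height - patch_size)
    return path, x, y
```

Because both the file and the position are random, consecutive reads rarely touch neighboring tiles, let alone sequential file offsets.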
In the following experiment, we explore the implications of various file access methods for reading partial images with different access patterns. We haven't experimented with access pattern #3 yet, but the results for patterns #1 and #2 give some insight into the possible improvements.
Information on the TIFF file under experiment:
# 92344 x 81017 pixels (@highest resolution) JPEG-compressed RGB image. Tile size: 256x256
# (input/image2.tif)
- file_size : 3,253,334,884
- tile_count : 114,437 (at the highest resolution)
- min_tile_bytecount: 1,677
- max_tile_bytecount: 31,361
- avg_tile_bytecount: 17,406.599404038905
- min_tile_offset : 1,373,824
- max_tile_offset : 1,993,332,840
We tried to load all tiles' raw data in the 3GB TIFF image 1) sequentially and 2) randomly, using the following methods:
Using pread() with a regular file descriptor, read each tile's raw (compressed) data into CPU memory.
import cucim.clara.filesystem as fs
fd = fs.open("image2.tif", "rnp")
...
fd.close()
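For illustration, the per-tile read in this method can be sketched with Python's standard os.pread (a minimal sketch of the underlying POSIX call, not cuCIM's implementation; `read_tile` is a hypothetical helper):

```python
import os

def read_tile(fd, offset, bytecount):
    """Read one tile's raw (compressed) bytes at a known file offset."""
    return os.pread(fd, bytecount, offset)
```

Since pread() takes an explicit offset and does not move the file position, multiple threads can share one descriptor without seeking.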
Using pread() with a file descriptor opened with the O_DIRECT flag, read each tile's raw (compressed) data into CPU memory.
cuCIM's filesystem API handles unaligned memory/file offset for direct access (O_DIRECT).
import cucim.clara.filesystem as fs
fd = fs.open("image2.tif", "rp")
...
fd.close()
Load the whole data block needed to access all tiles at the highest resolution into temporary CPU memory at once (with the O_DIRECT flag). Then, copy the necessary data for each tile into the target buffer.
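The pre-load window can be computed from the tiles' minimum and maximum offsets, aligned to the block sector size (a Python sketch mirroring the ALIGN_DOWN/ALIGN_UP macros used in the C++ test code; 4096 is the assumed sector size):

```python
BLOCK_SECTOR_SIZE = 4096  # assumed O_DIRECT alignment requirement

def align_down(x, align_to=BLOCK_SECTOR_SIZE):
    return x & ~(align_to - 1)

def align_up(x, align_to=BLOCK_SECTOR_SIZE):
    return (x + align_to - 1) & ~(align_to - 1)

def preload_window(min_offset, max_offset):
    """Return (start, size) of one sector-aligned block covering all tiles.

    max_offset is the end of the last tile (offset + bytecount), as in
    the experiment's statistics; O_DIRECT requires aligned reads.
    """
    start = align_down(min_offset)
    size = align_up(max_offset) - start
    return start, size

# Each tile is then served with a memcpy from
# buffer[tile_offset - start : tile_offset - start + tile_bytecount].
```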
Using mmap() internally, read each tile's raw (compressed) data into CPU memory.
import cucim.clara.filesystem as fs
fd = fs.open("image2.tif", "rm")
...
fd.close()
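For illustration, the same idea can be sketched with Python's standard mmap module (illustrative only; cuCIM's "rm" mode handles this internally, and `read_tile_mmap` is a made-up helper):

```python
import mmap
import os

def read_tile_mmap(path, offset, bytecount):
    """Map the file read-only and slice out one tile's raw bytes.

    The kernel pages in only the regions actually touched, so random
    tile access avoids reading the whole file into memory.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[offset:offset + bytecount])
    finally:
        os.close(fd)
```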
Note: The actual experiment was done with the C++ implementation/APIs.
Link to the spreadsheet: https://docs.google.com/spreadsheets/d/1DbPe0m2KRqlEFbZZTmP9rhDLZdG6mn_Uv8_DrZy97Uc/edit#gid=1257255419
- Reading each tile with O_DIRECT doesn't perform well due to its unaligned memory/file access.
- The O_DIRECT pre-load approach performs best, and using mmap performs better than the Regular POSIX or plain O_DIRECT methods.
- The O_DIRECT pre-load approach requires more CPU memory for pre-loading data, so it may not be a good fit when only a very small number of patches is needed from the file, or when the list of patches to load is not available in advance.
- Using mmap to access TIFF tiles is a viable solution for improving cuCIM's performance (OpenSlide and cuCIM currently use regular POSIX APIs to access tile data), and we can leverage the O_DIRECT pre-load approach depending on the workflow.
The following variables were changed according to the configuration.
constexpr bool SHUFFLE_LIST = true;
constexpr int iter_max = 32;
constexpr int skip_count = 2;
/*
* Copyright (c) 2020, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "cuslide/tiff/tiff.h"
#include "config.h"
#include <catch2/catch_test_macros.hpp>
#include <fmt/format.h>
#include <cucim/filesystem/cufile_driver.h>
#include <cstdlib>
#include <ctime>
#include <fcntl.h>
#include <unistd.h>
#include <cucim/logger/timer.h>
#include <iostream>
#include <fstream>
#include <sys/stat.h>
#include <sys/mman.h>
#define ALIGN_UP(x, align_to) (((uint64_t)(x) + ((uint64_t)(align_to)-1)) & ~((uint64_t)(align_to)-1))
#define ALIGN_DOWN(x, align_to) ((uint64_t)(x) & ~((uint64_t)(align_to)-1))
static void shuffle_offsets(uint32_t count, uint64_t* offsets, uint64_t* bytecounts)
{
// Fisher-Yates shuffle
for (uint32_t i = 0; i < count; ++i)
{
uint32_t j = (std::rand() % (count - i)) + i;
std::swap(offsets[i], offsets[j]);
std::swap(bytecounts[i], bytecounts[j]);
}
}
TEST_CASE("Verify raw tiff read", "[test_read_rawtiff.cpp]")
{
cudaError_t cuda_status;
int err;
constexpr int BLOCK_SECTOR_SIZE = 4096;
constexpr bool SHUFFLE_LIST = true;
constexpr int iter_max = 32;
constexpr int skip_count = 2;
std::srand(std::time(nullptr));
auto input_file = g_config.input_file.c_str(); // "/nvme/image2.tif"
struct stat sb;
auto fd_temp = ::open(input_file, O_RDONLY);
fstat(fd_temp, &sb);
uint64_t test_file_size = sb.st_size;
::close(fd_temp);
auto tif = std::make_shared<cuslide::tiff::TIFF>(input_file, O_RDONLY);
tif->construct_ifds();
tif->ifd(0)->write_offsets_(input_file);
std::ifstream offsets(fmt::format("{}.offsets", input_file), std::ios::in | std::ios::binary);
std::ifstream bytecounts(fmt::format("{}.bytecounts", input_file), std::ios::in | std::ios::binary);
// Read image piece count
uint32_t image_piece_count_ = 0;
offsets.read(reinterpret_cast<char*>(&image_piece_count_), sizeof(image_piece_count_));
bytecounts.read(reinterpret_cast<char*>(&image_piece_count_), sizeof(image_piece_count_));
uint64_t image_piece_offsets_[image_piece_count_];
uint64_t image_piece_bytecounts_[image_piece_count_];
uint64_t min_bytecount = 9999999999;
uint64_t max_bytecount = 0;
uint64_t sum_bytecount = 0;
uint64_t min_offset = 9999999999;
uint64_t max_offset = 0;
for (uint32_t i = 0; i < image_piece_count_; i++)
{
offsets.read((char*)&image_piece_offsets_[i], sizeof(image_piece_offsets_[i]));
bytecounts.read((char*)&image_piece_bytecounts_[i], sizeof(image_piece_bytecounts_[i]));
min_bytecount = std::min(min_bytecount, image_piece_bytecounts_[i]);
max_bytecount = std::max(max_bytecount, image_piece_bytecounts_[i]);
sum_bytecount += image_piece_bytecounts_[i];
min_offset = std::min(min_offset, image_piece_offsets_[i]);
max_offset = std::max(max_offset, image_piece_offsets_[i] + image_piece_bytecounts_[i]);
}
bytecounts.close();
offsets.close();
fmt::print("file_size : {}\n", test_file_size);
fmt::print("min_bytecount: {}\n", min_bytecount);
fmt::print("max_bytecount: {}\n", max_bytecount);
fmt::print("avg_bytecount: {}\n", static_cast<double>(sum_bytecount) / image_piece_count_);
fmt::print("min_offset : {}\n", min_offset);
fmt::print("max_offset : {}\n", max_offset);
uint64_t test_size = max_offset + max_bytecount;
// Shuffle offsets
if (SHUFFLE_LIST)
{
shuffle_offsets(image_piece_count_, image_piece_offsets_, image_piece_bytecounts_);
}
// Allocate memory
uint8_t* unaligned_host = static_cast<uint8_t*>(malloc(test_file_size + BLOCK_SECTOR_SIZE * 2));
uint8_t* buffer_host = static_cast<uint8_t*>(malloc(test_file_size + BLOCK_SECTOR_SIZE * 2));
uint8_t* aligned_host = reinterpret_cast<uint8_t*>(ALIGN_UP(unaligned_host, BLOCK_SECTOR_SIZE));
cucim::filesystem::discard_page_cache(input_file);
fmt::print("count:{} \n", image_piece_count_);
SECTION("Regular POSIX")
{
fmt::print("Regular POSIX\n");
double total_elapsed_time = 0;
for (int iter = 0; iter < iter_max; ++iter)
{
cucim::filesystem::discard_page_cache(input_file);
auto fd = cucim::filesystem::open(input_file, "rpn");
{
cucim::logger::Timer timer("- read whole : {:.7f}\n", true, false);
ssize_t read_cnt = fd->pread(aligned_host, test_file_size, 0);
double elapsed_time = timer.stop();
if (iter >= skip_count)
{
total_elapsed_time += elapsed_time;
}
timer.print();
}
}
fmt::print("- Read whole average: {}\n", total_elapsed_time / (iter_max - skip_count));
total_elapsed_time = 0;
for (int iter = 0; iter < iter_max; ++iter)
{
cucim::filesystem::discard_page_cache(input_file);
auto fd = cucim::filesystem::open(input_file, "rpn");
{
cucim::logger::Timer timer("- read tiles : {:.7f}\n", true, false);
for (uint32_t i = 0; i < image_piece_count_; ++i)
{
ssize_t read_cnt = fd->pread(aligned_host, image_piece_bytecounts_[i], image_piece_offsets_[i]);
}
double elapsed_time = timer.stop();
if (iter >= skip_count)
{
total_elapsed_time += elapsed_time;
}
timer.print();
}
}
fmt::print("- Read tiles average: {}\n", total_elapsed_time / (iter_max - skip_count));
}
SECTION("O_DIRECT")
{
fmt::print("O_DIRECT\n");
double total_elapsed_time = 0;
for (int iter = 0; iter < iter_max; ++iter)
{
cucim::filesystem::discard_page_cache(input_file);
auto fd = cucim::filesystem::open(input_file, "rp");
{
cucim::logger::Timer timer("- read whole : {:.7f}\n", true, false);
ssize_t read_cnt = fd->pread(aligned_host, test_file_size, 0);
double elapsed_time = timer.stop();
if (iter >= skip_count)
{
total_elapsed_time += elapsed_time;
}
timer.print();
}
}
fmt::print("- Read whole average: {}\n", total_elapsed_time / (iter_max - skip_count));
total_elapsed_time = 0;
for (int iter = 0; iter < iter_max; ++iter)
{
cucim::filesystem::discard_page_cache(input_file);
auto fd = cucim::filesystem::open(input_file, "rp");
{
cucim::logger::Timer timer("- read tiles : {:.7f}\n", true, false);
for (uint32_t i = 0; i < image_piece_count_; ++i)
{
ssize_t read_cnt = fd->pread(buffer_host, image_piece_bytecounts_[i], image_piece_offsets_[i]);
}
double elapsed_time = timer.stop();
if (iter >= skip_count)
{
total_elapsed_time += elapsed_time;
}
timer.print();
}
}
fmt::print("- Read tiles average: {}\n", total_elapsed_time / (iter_max - skip_count));
}
SECTION("O_DIRECT pre-load")
{
fmt::print("O_DIRECT pre-load\n");
size_t file_start_offset = ALIGN_DOWN(min_offset, BLOCK_SECTOR_SIZE);
size_t end_boundary_offset = ALIGN_UP(max_offset, BLOCK_SECTOR_SIZE); // max_offset already includes the last tile's bytecount
size_t large_block_size = end_boundary_offset - file_start_offset;
fmt::print("- size:{}\n", end_boundary_offset - file_start_offset);
double total_elapsed_time = 0;
for (int iter = 0; iter < iter_max; ++iter)
{
cucim::filesystem::discard_page_cache(input_file);
auto fd = cucim::filesystem::open(input_file, "rp");
{
cucim::logger::Timer timer("- preload : {:.7f}\n", true, false);
ssize_t read_cnt = fd->pread(aligned_host, large_block_size, file_start_offset);
double elapsed_time = timer.stop();
if (iter >= skip_count)
{
total_elapsed_time += elapsed_time;
}
timer.print();
}
}
fmt::print("- Preload average: {}\n", total_elapsed_time / (iter_max - skip_count));
total_elapsed_time = 0;
for (int iter = 0; iter < iter_max; ++iter)
{
cucim::filesystem::discard_page_cache(input_file);
auto fd = cucim::filesystem::open(input_file, "rp");
{
cucim::logger::Timer timer("- read tiles : {:.7f}\n", true, false);
for (uint32_t i = 0; i < image_piece_count_; ++i)
{
memcpy(buffer_host, aligned_host + image_piece_offsets_[i] - file_start_offset,
image_piece_bytecounts_[i]);
}
double elapsed_time = timer.stop();
if (iter >= skip_count)
{
total_elapsed_time += elapsed_time;
}
timer.print();
}
}
fmt::print("- Read tiles average: {}\n", total_elapsed_time / (iter_max - skip_count));
}
SECTION("mmap")
{
fmt::print("mmap\n");
double total_elapsed_time = 0;
for (int iter = 0; iter < iter_max; ++iter)
{
cucim::filesystem::discard_page_cache(input_file);
auto fd_mmap = open(input_file, O_RDONLY);
{
cucim::logger::Timer timer("- open/close : {:.7f}\n", true, false);
void* mmap_host = mmap((void*)0, test_file_size, PROT_READ, MAP_SHARED, fd_mmap, 0);
REQUIRE(mmap_host != MAP_FAILED);
if (mmap_host != MAP_FAILED)
{
REQUIRE(munmap(mmap_host, test_file_size) != -1);
close(fd_mmap);
}
double elapsed_time = timer.stop();
if (iter >= skip_count)
{
total_elapsed_time += elapsed_time;
}
timer.print();
}
}
fmt::print("- mmap/munmap average: {}\n", total_elapsed_time / (iter_max - skip_count));
total_elapsed_time = 0;
for (int iter = 0; iter < iter_max; ++iter)
{
cucim::filesystem::discard_page_cache(input_file);
auto fd = cucim::filesystem::open(input_file, "rm");
{
cucim::logger::Timer timer("- read tiles : {:.7f}\n", true, false);
for (uint32_t i = 0; i < image_piece_count_; ++i)
{
ssize_t read_cnt = fd->pread(buffer_host, image_piece_bytecounts_[i], image_piece_offsets_[i]);
}
double elapsed_time = timer.stop();
if (iter >= skip_count)
{
total_elapsed_time += elapsed_time;
}
timer.print();
}
}
fmt::print("- Read tiles average: {}\n", total_elapsed_time / (iter_max - skip_count));
}
free(unaligned_host);
free(buffer_host);
}