NVIDIA® GPUDirect® Storage (GDS) must be installed to use the GDS feature. The file access APIs still work without GDS, but you won't see the speedup.
Please follow the release notes or the installation guide to install GDS on your host system.
The following examples assume that the loaded files are stored on an NVMe storage device and that the CuPy and PyTorch packages are installed.
Please execute the following commands to install the dependent libraries.
!conda install -c pytorch -c conda-forge pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0
(If `import torch; torch.cuda.is_available()` doesn't return `True`, use the `pip` installation method for PyTorch instead.)
or
!pip install cupy-cuda110==9.0.0b3
!pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
You can use either the `CuFileDriver` class or the `open()` method in the `cucim.clara.filesystem` package.

A file descriptor is needed to create a `CuFileDriver` instance. To use GDS, the file needs to be opened with `os.O_DIRECT`; see the NVIDIA GPUDirect Storage O_DIRECT Requirements Guide. Please also see `os.open()` for the detailed options available.
import os
from cucim.clara.filesystem import CuFileDriver

fno = os.open("input/image.tif", os.O_RDONLY | os.O_DIRECT)
fno2 = os.dup(fno)

# Use GDS when it is supported for the file.
fd = CuFileDriver(fno, False)
fd.close()

# Do not use GDS even when GDS can be supported for the file.
fd2 = CuFileDriver(fno2, True)
fd2.close()
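Note that `O_DIRECT` is not supported by every filesystem (tmpfs, for example, typically rejects it with `EINVAL`). The following is a defensive sketch with a hypothetical helper name `open_direct` (not part of cuCIM) that falls back to a plain open when direct IO is unavailable:

```python
import errno
import os


def open_direct(path):
    """Open `path` with O_DIRECT, falling back to a plain open when the
    filesystem does not support direct IO (EINVAL)."""
    try:
        return os.open(path, os.O_RDONLY | os.O_DIRECT)
    except OSError as e:
        if e.errno == errno.EINVAL:
            # Filesystem does not support O_DIRECT; GDS won't be used here.
            return os.open(path, os.O_RDONLY)
        raise
```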
help(CuFileDriver.__init__)
Help on instancemethod in module cucim.clara._cucim.filesystem:

__init__(...)
    __init__(self: cucim.clara._cucim.filesystem.CuFileDriver, fd: int, no_gds: bool = False, use_mmap: bool = False, file_path: str = '') -> None

    Constructor of CuFileDriver.

    Args:
        fd: A file descriptor (in `int` type) which is available through `os.open()` method.
        no_gds: If True, use POSIX APIs only even when GDS can be supported for the file.
        use_mmap: If True, use memory-mapped IO. This flag is supported only for the read-only file descriptor. Default value is `False`.
        file_path: A file path for the file descriptor. It would retrieve the absolute file path of the file descriptor if not specified.
`open()` method in `cucim.clara.filesystem` package

The `cucim.clara.filesystem.open()` method accepts three parameters (`file_path`, `flags`, `mode`).

- `file_path`: A string for the file path.
- `flags`: Can be one of the following flag strings:
  - `os.O_RDONLY`
  - `os.O_RDWR`
  - `os.O_RDWR | os.O_CREAT | os.O_TRUNC`
  - `os.O_RDWR | os.O_CREAT`

  In addition to the above flags, the method appends `os.O_CLOEXEC` and `os.O_DIRECT` by default.

  The following optional flags can be added to the above string:
  - When `'m'` is used, `PROT_READ` and `MAP_SHARED` are used as the parameters of the `mmap()` function.
- `mode`: A file mode. The default value is `0o644`.
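For reference, the `PROT_READ`/`MAP_SHARED` mapping implied by the `'m'` flag corresponds to the following plain-Python usage of the standard `mmap` module (a sketch to illustrate the semantics, not cuCIM code):

```python
import mmap
import os

# Create a small sample file to map
with open("sample.bin", "wb") as f:
    f.write(b"hello world")

fd = os.open("sample.bin", os.O_RDONLY)
# Read-only, shared mapping -- the same protection/flags the 'm' mode implies
mm = mmap.mmap(fd, 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
head = bytes(mm[:5])
print(head)  # b'hello'
mm.close()
os.close(fd)
```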
import cucim.clara.filesystem as fs
fd = fs.open("input/image.tif", "r")
fs.close(fd)
# Open file without using GDS
fd2 = fs.open("input/image.tif", "rp")
fs.close(fd2)
True
You can use the `pread()`/`pwrite()` methods in either the `CuFileDriver` class or the `cucim.clara.filesystem` package.

These methods are similar to the POSIX `pread()`/`pwrite()` methods, which require `buf`, `count`, and `offset` (`file_offset`) parameters. However, for convenience, an optional `buf_offset` parameter (default value: `0`) is also available to specify an offset into the input/output buffer.

Any Python object supporting `__array_interface__` (such as a `numpy.ndarray`), or any pointer address (`int` type), can be used for the `buf` parameter.
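To make the `buf_offset` semantics concrete, here is a plain-POSIX sketch of the same signature using NumPy (`pread_into` is a hypothetical helper, not cuCIM's implementation):

```python
import os

import numpy as np


def pread_into(fd, buf, count, file_offset, buf_offset=0):
    """Read up to `count` bytes at `file_offset` into `buf`, starting at
    `buf_offset`, and return the number of bytes actually read."""
    data = os.pread(fd, count, file_offset)
    buf[buf_offset:buf_offset + len(data)] = np.frombuffer(data, dtype=np.uint8)
    return len(data)
```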
from cucim.clara.filesystem import CuFileDriver
import cucim.clara.filesystem as fs
import os, numpy as np, torch

# Write a file with size 10 (in bytes)
with open("input.raw", "wb") as input_file:
    input_file.write(bytearray([101, 102, 103, 104, 105, 106, 107, 108, 109, 110]))

# Create an array with size 10 (in bytes)
np_arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.uint8)
torch_arr = torch.from_numpy(np_arr)  # Note: np_arr shares internal data with torch_arr

# Using CuFileDriver
fno = os.open("input.raw", os.O_RDONLY)
fd = CuFileDriver(fno)
read_count = fd.pread(np_arr, 8, 0, 2)              # read 8 bytes starting from file offset 0 into buffer offset 2
print("{:10} cnt: {} content: {}".format("np_arr", read_count, np_arr))
read_count = fd.pread(np_arr, 10, 0)                # read 10 bytes starting from file offset 0
print("{:10} cnt: {} content: {}".format("np_arr", read_count, np_arr))
read_count = fd.pread(torch_arr.data_ptr(), 10, 3)  # read 10 bytes starting from file offset 3
print("{:10} cnt: {} content: {}".format("torch_arr", read_count, torch_arr))
fd.close()

fno = os.open("output.raw", os.O_RDWR | os.O_CREAT | os.O_TRUNC)
fd = CuFileDriver(fno)
write_count = fd.pwrite(np_arr, 10, 5)  # write 10 bytes from np_arr to the file, starting from file offset 5
fd.close()
print("{:10} cnt: {} content: {}".format("output.raw", write_count, list(open("output.raw", "rb").read())))
print()

# Using the filesystem package
fd = fs.open("output.raw", "r")
read_count = fs.pread(fd, np_arr, 10, 0)  # read 10 bytes starting from file offset 0
print("{:10} cnt: {} content: {}".format("np_arr", read_count, np_arr))
fs.close(fd)  # same as fd.close()
np_arr     cnt: 8 content: [  1   2 101 102 103 104 105 106 107 108]
np_arr     cnt: 10 content: [101 102 103 104 105 106 107 108 109 110]
torch_arr  cnt: 7 content: tensor([104, 105, 106, 107, 108, 109, 110, 108, 109, 110], dtype=torch.uint8)
output.raw cnt: 10 content: [0, 0, 0, 0, 0, 104, 105, 106, 107, 108, 109, 110, 108, 109, 110]

np_arr     cnt: 10 content: [  0   0   0   0   0 104 105 106 107 108]
True
Any Python object supporting `__cuda_array_interface__` (such as a `cupy.ndarray` or a PyTorch CUDA tensor), or any pointer address (`int` type), can be used for the `buf` parameter.
from cucim.clara.filesystem import CuFileDriver
import cucim.clara.filesystem as fs
import os
import cupy as cp
import torch

# Write a file with size 10 (in bytes)
with open("input.raw", "wb") as input_file:
    input_file.write(bytearray([101, 102, 103, 104, 105, 106, 107, 108, 109, 110]))

# Create an array with size 10 (in bytes)
cp_arr = cp.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=cp.uint8)
cuda0 = torch.device('cuda:0')
torch_arr = torch.zeros(10, dtype=torch.uint8, device=cuda0)

# Using CuFileDriver
fno = os.open("input.raw", os.O_RDONLY | os.O_DIRECT)
fd = CuFileDriver(fno)
read_count = fd.pread(cp_arr, 8, 0, 2)   # read 8 bytes starting from file offset 0 into buffer offset 2
print("{:20} cnt: {} content: {}".format("cp_arr", read_count, cp_arr))
read_count = fd.pread(cp_arr, 10, 0)     # read 10 bytes starting from file offset 0
print("{:20} cnt: {} content: {}".format("cp_arr", read_count, cp_arr))
read_count = fd.pread(torch_arr, 10, 3)  # read 10 bytes starting from file offset 3
print("{:20} cnt: {} content: {}".format("torch_arr", read_count, torch_arr))
fd.close()

fno = os.open("output.raw", os.O_RDWR | os.O_CREAT | os.O_TRUNC)
fd = CuFileDriver(fno)
write_count = fd.pwrite(cp_arr, 10, 5)   # write 10 bytes from cp_arr to the file, starting from file offset 5
fd.close()
print("{:20} cnt: {} content: {}".format("output.raw", write_count, list(open("output.raw", "rb").read())))
print()

# Using the filesystem package
fd = fs.open("output.raw", "r")
read_count = fs.pread(fd, cp_arr, 10, 0)  # read 10 bytes starting from file offset 0
print("{:20} cnt: {} content: {}".format("cp_arr", read_count, cp_arr))
fs.close(fd)  # same as fd.close()
cp_arr               cnt: 8 content: [  1   2 101 102 103 104 105 106 107 108]
cp_arr               cnt: 10 content: [101 102 103 104 105 106 107 108 109 110]
torch_arr            cnt: 7 content: tensor([104, 105, 106, 107, 108, 109, 110, 0, 0, 0], device='cuda:0', dtype=torch.uint8)
output.raw           cnt: 10 content: [0, 0, 0, 0, 0, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110]

cp_arr               cnt: 10 content: [  0   0   0   0   0 101 102 103 104 105]
True
cp_arr.__cuda_array_interface__
{'shape': (10,), 'typestr': '|u1', 'descr': [('', '|u1')], 'stream': 1, 'version': 3, 'strides': None, 'data': (140035445751808, False)}
torch_arr.__cuda_array_interface__
{'typestr': '|u1', 'shape': (10,), 'strides': None, 'data': (140035106013184, False), 'version': 2}
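The CPU-side counterpart, `__array_interface__`, exposes the same kind of pointer/shape/type information for host arrays; it is what `pread()`/`pwrite()` consume when a NumPy array is passed as `buf`. A quick look (plain NumPy, no cuCIM required):

```python
import numpy as np

arr = np.arange(10, dtype=np.uint8)
iface = arr.__array_interface__
ptr, read_only = iface["data"]  # raw pointer address and a read-only flag
print(iface["shape"], iface["typestr"], read_only)  # (10,) |u1 False
```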
You can use the `discard_page_cache()` method to discard the system (page) cache for a given file before any performance measurement on that file.
import cucim.clara.filesystem as fs
fs.discard_page_cache("input/image.tif")
# ... file APIs on `input/image.tif`
Its implementation looks like the following:
bool discard_page_cache(const char* file_path)
{
    int fd = ::open(file_path, O_RDONLY);
    if (fd < 0)
    {
        return false;
    }
    if (::fdatasync(fd) < 0)
    {
        return false;
    }
    if (::posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) < 0)
    {
        return false;
    }
    if (::close(fd) < 0)
    {
        return false;
    }
    return true;
}
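A rough Python equivalent of the same logic can be written with `os.posix_fadvise`, which is available on Unix. This is a sketch of the technique, not part of cuCIM's API:

```python
import os


def discard_page_cache(file_path):
    """Flush dirty pages and ask the kernel to drop cached pages for the file."""
    try:
        fd = os.open(file_path, os.O_RDONLY)
    except OSError:
        return False
    try:
        os.fsync(fd)
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    except OSError:
        return False
    finally:
        os.close(fd)
    return True
```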
It helps measure file access performance accurately, without the effect of the page cache.
The experiments were conducted on an Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz with a Samsung SSD 970 PRO (NVMe SSD, 1TB).

Reading 10GB of data (time in seconds):

- posix + cudamemcpy : 5.031040154863149
- posix + O_DIRECT + cudamemcpy : 4.7419630330987275
- gds : 4.235773948952556

Performance gain: 1.19x

Reading 2GB of data (time in seconds):

- posix : 1.0681836600415409
- posix + O_DIRECT : 0.9496012150775641
- gds : 0.8406150250229985

Performance gain: 1.27x