NVIDIA® GPUDirect® Storage (GDS) must be installed to use the GDS feature. The file access APIs still work without GDS, but you won't see the speedup.
Please follow the release notes or the installation guide to install GDS on your host system.
The following examples assume that the loaded files are stored on an NVMe storage device and that the CuPy and PyTorch packages are installed.
Please execute the following commands to install the dependent libraries.
!conda install -c pytorch -c conda-forge pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0
(If `import torch; torch.cuda.is_available()` doesn't return `True`, use the `pip` installation method for PyTorch instead.)
or
!pip install cupy-cuda110==9.0.0b3
!pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
You can use either the `CuFileDriver` class or the `open()` method in the `cucim.clara.filesystem` package.

A file descriptor is needed to create a `CuFileDriver` instance. To use GDS, the file needs to be opened with `os.O_DIRECT`; see the NVIDIA GPUDirect Storage O_DIRECT Requirements Guide. Please also see `os.open()` for the detailed options available.
import os
from cucim.clara.filesystem import CuFileDriver

fno = os.open("input/image.tif", os.O_RDONLY | os.O_DIRECT)
fno2 = os.dup(fno)

# Use GDS when it is supported for the file.
fd = CuFileDriver(fno, False)
fd.close()

# Do not use GDS even when GDS can be supported for the file.
fd2 = CuFileDriver(fno2, True)
fd2.close()
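Note that `O_DIRECT` is not supported by every filesystem (tmpfs, for example, typically rejects it with `EINVAL`). The following is a defensive sketch with a hypothetical helper name `open_direct` (not part of cuCIM) that falls back to a plain open when direct IO is unavailable:

```python
import errno
import os


def open_direct(path):
    """Open `path` with O_DIRECT, falling back to a plain open when the
    filesystem does not support direct IO (EINVAL)."""
    try:
        return os.open(path, os.O_RDONLY | os.O_DIRECT)
    except OSError as e:
        if e.errno == errno.EINVAL:
            # Filesystem does not support O_DIRECT; GDS won't be used here.
            return os.open(path, os.O_RDONLY)
        raise
```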
help(CuFileDriver.__init__)
Help on instancemethod in module cucim.clara._cucim.filesystem:

__init__(...)
    __init__(self: cucim.clara._cucim.filesystem.CuFileDriver, fd: int, no_gds: bool = False, use_mmap: bool = False, file_path: str = '') -> None

    Constructor of CuFileDriver.

    Args:
        fd: A file descriptor (in `int` type) which is available through `os.open()` method.
        no_gds: If True, use POSIX APIs only even when GDS can be supported for the file.
        use_mmap: If True, use memory-mapped IO. This flag is supported only for the read-only file descriptor. Default value is `False`.
        file_path: A file path for the file descriptor. It would retrieve the absolute file path of the file descriptor if not specified.
`open()` method in `cucim.clara.filesystem` package

The `cucim.clara.filesystem.open()` method accepts three parameters (`file_path`, `flags`, `mode`).

- `file_path`: A string for the file path.
- `flags`: Can be one of the following flag strings:
  - `os.O_RDONLY`
  - `os.O_RDWR`
  - `os.O_RDWR | os.O_CREAT | os.O_TRUNC`
  - `os.O_RDWR | os.O_CREAT`

  In addition to the above flags, the method appends `os.O_CLOEXEC` and `os.O_DIRECT` by default.

  The following optional flags can be added to the above string:
  - When `'m'` is used, `PROT_READ` and `MAP_SHARED` are used as the parameters of the `mmap()` function.
- `mode`: A file mode. The default value is `0o644`.
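For reference, the `PROT_READ`/`MAP_SHARED` mapping implied by the `'m'` flag corresponds to the following plain-Python usage of the standard `mmap` module (a sketch to illustrate the semantics, not cuCIM code):

```python
import mmap
import os

# Create a small sample file to map
with open("sample.bin", "wb") as f:
    f.write(b"hello world")

fd = os.open("sample.bin", os.O_RDONLY)
# Read-only, shared mapping -- the same protection/flags the 'm' mode implies
mm = mmap.mmap(fd, 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
head = bytes(mm[:5])
print(head)  # b'hello'
mm.close()
os.close(fd)
```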
import cucim.clara.filesystem as fs
fd = fs.open("input/image.tif", "r")
fs.close(fd)
# Open file without using GDS
fd2 = fs.open("input/image.tif", "rp")
fs.close(fd2)
True
You can use the `pread()`/`pwrite()` methods in either the `CuFileDriver` class or the `cucim.clara.filesystem` package.

These methods are similar to the POSIX `pread()`/`pwrite()` methods, which require `buf`, `count`, and `offset` (`file_offset`) parameters. However, for convenience, an optional `buf_offset` parameter (default value: `0`) is also available to specify an offset into the input/output buffer.

Any Python object supporting `__array_interface__` (such as a `numpy.ndarray`), or any pointer address (`int` type), can be used for the `buf` parameter.
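To make the `buf_offset` semantics concrete, here is a plain-POSIX sketch of the same signature using NumPy (`pread_into` is a hypothetical helper, not cuCIM's implementation):

```python
import os

import numpy as np


def pread_into(fd, buf, count, file_offset, buf_offset=0):
    """Read up to `count` bytes at `file_offset` into `buf`, starting at
    `buf_offset`, and return the number of bytes actually read."""
    data = os.pread(fd, count, file_offset)
    buf[buf_offset:buf_offset + len(data)] = np.frombuffer(data, dtype=np.uint8)
    return len(data)
```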
from cucim.clara.filesystem import CuFileDriver
import cucim.clara.filesystem as fs
import os, numpy as np, torch

# Write a file with size 10 (in bytes)
with open("input.raw", "wb") as input_file:
    input_file.write(bytearray([101, 102, 103, 104, 105, 106, 107, 108, 109, 110]))

# Create an array with size 10 (in bytes)
np_arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.uint8)
torch_arr = torch.from_numpy(np_arr)  # Note: np_arr shares internal data with torch_arr

# Using CuFileDriver
fno = os.open("input.raw", os.O_RDONLY)
fd = CuFileDriver(fno)
read_count = fd.pread(np_arr, 8, 0, 2)              # read 8 bytes starting from file offset 0 into buffer offset 2
print("{:10} cnt: {} content: {}".format("np_arr", read_count, np_arr))
read_count = fd.pread(np_arr, 10, 0)                # read 10 bytes starting from file offset 0
print("{:10} cnt: {} content: {}".format("np_arr", read_count, np_arr))
read_count = fd.pread(torch_arr.data_ptr(), 10, 3)  # read 10 bytes starting from file offset 3
print("{:10} cnt: {} content: {}".format("torch_arr", read_count, torch_arr))
fd.close()

fno = os.open("output.raw", os.O_RDWR | os.O_CREAT | os.O_TRUNC)
fd = CuFileDriver(fno)
write_count = fd.pwrite(np_arr, 10, 5)  # write 10 bytes from np_arr to the file, starting from file offset 5
fd.close()
print("{:10} cnt: {} content: {}".format("output.raw", write_count, list(open("output.raw", "rb").read())))
print()

# Using the filesystem package
fd = fs.open("output.raw", "r")
read_count = fs.pread(fd, np_arr, 10, 0)  # read 10 bytes starting from file offset 0
print("{:10} cnt: {} content: {}".format("np_arr", read_count, np_arr))
fs.close(fd)  # same as fd.close()
np_arr     cnt: 8 content: [  1   2 101 102 103 104 105 106 107 108]
np_arr     cnt: 10 content: [101 102 103 104 105 106 107 108 109 110]
torch_arr  cnt: 7 content: tensor([104, 105, 106, 107, 108, 109, 110, 108, 109, 110], dtype=torch.uint8)
output.raw cnt: 10 content: [0, 0, 0, 0, 0, 104, 105, 106, 107, 108, 109, 110, 108, 109, 110]

np_arr     cnt: 10 content: [  0   0   0   0   0 104 105 106 107 108]
True
Any Python object supporting `__cuda_array_interface__` (such as a `cupy.ndarray` or a PyTorch CUDA tensor), or any pointer address (`int` type), can be used for the `buf` parameter.
from cucim.clara.filesystem import CuFileDriver
import cucim.clara.filesystem as fs
import os
import cupy as cp
import torch

# Write a file with size 10 (in bytes)
with open("input.raw", "wb") as input_file:
    input_file.write(bytearray([101, 102, 103, 104, 105, 106, 107, 108, 109, 110]))

# Create an array with size 10 (in bytes)
cp_arr = cp.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=cp.uint8)
cuda0 = torch.device('cuda:0')
torch_arr = torch.zeros(10, dtype=torch.uint8, device=cuda0)

# Using CuFileDriver
fno = os.open("input.raw", os.O_RDONLY | os.O_DIRECT)
fd = CuFileDriver(fno)
read_count = fd.pread(cp_arr, 8, 0, 2)   # read 8 bytes starting from file offset 0 into buffer offset 2
print("{:20} cnt: {} content: {}".format("cp_arr", read_count, cp_arr))
read_count = fd.pread(cp_arr, 10, 0)     # read 10 bytes starting from file offset 0
print("{:20} cnt: {} content: {}".format("cp_arr", read_count, cp_arr))
read_count = fd.pread(torch_arr, 10, 3)  # read 10 bytes starting from file offset 3
print("{:20} cnt: {} content: {}".format("torch_arr", read_count, torch_arr))
fd.close()

fno = os.open("output.raw", os.O_RDWR | os.O_CREAT | os.O_TRUNC)
fd = CuFileDriver(fno)
write_count = fd.pwrite(cp_arr, 10, 5)   # write 10 bytes from cp_arr to the file, starting from file offset 5
fd.close()
print("{:20} cnt: {} content: {}".format("output.raw", write_count, list(open("output.raw", "rb").read())))
print()

# Using the filesystem package
fd = fs.open("output.raw", "r")
read_count = fs.pread(fd, cp_arr, 10, 0)  # read 10 bytes starting from file offset 0
print("{:20} cnt: {} content: {}".format("cp_arr", read_count, cp_arr))
fs.close(fd)  # same as fd.close()
cp_arr               cnt: 8 content: [  1   2 101 102 103 104 105 106 107 108]
cp_arr               cnt: 10 content: [101 102 103 104 105 106 107 108 109 110]
torch_arr            cnt: 7 content: tensor([104, 105, 106, 107, 108, 109, 110, 0, 0, 0], device='cuda:0', dtype=torch.uint8)
output.raw           cnt: 10 content: [0, 0, 0, 0, 0, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110]

cp_arr               cnt: 10 content: [  0   0   0   0   0 101 102 103 104 105]
True
cp_arr.__cuda_array_interface__
{'shape': (10,), 'typestr': '|u1', 'descr': [('', '|u1')], 'stream': 1, 'version': 3, 'strides': None, 'data': (140035445751808, False)}
torch_arr.__cuda_array_interface__
{'typestr': '|u1', 'shape': (10,), 'strides': None, 'data': (140035106013184, False), 'version': 2}
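The CPU-side counterpart, `__array_interface__`, exposes the same kind of pointer/shape/type information for host arrays; it is what `pread()`/`pwrite()` consume when a NumPy array is passed as `buf`. A quick look (plain NumPy, no cuCIM required):

```python
import numpy as np

arr = np.arange(10, dtype=np.uint8)
iface = arr.__array_interface__
ptr, read_only = iface["data"]  # raw pointer address and a read-only flag
print(iface["shape"], iface["typestr"], read_only)  # (10,) |u1 False
```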
You can use the `discard_page_cache()` method to discard the system (page) cache for a given file before any performance measurement on that file.
import cucim.clara.filesystem as fs
fs.discard_page_cache("input/image.tif")
# ... file APIs on `input/image.tif`
Its implementation looks like the following:
bool discard_page_cache(const char* file_path)
{
    int fd = ::open(file_path, O_RDONLY);
    if (fd < 0)
    {
        return false;
    }
    if (::fdatasync(fd) < 0)
    {
        return false;
    }
    if (::posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) < 0)
    {
        return false;
    }
    if (::close(fd) < 0)
    {
        return false;
    }
    return true;
}
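A rough Python equivalent of the same logic can be written with `os.posix_fadvise`, which is available on Unix. This is a sketch of the technique, not part of cuCIM's API:

```python
import os


def discard_page_cache(file_path):
    """Flush dirty pages and ask the kernel to drop cached pages for the file."""
    try:
        fd = os.open(file_path, os.O_RDONLY)
    except OSError:
        return False
    try:
        os.fsync(fd)
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    except OSError:
        return False
    finally:
        os.close(fd)
    return True
```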
It helps measure file access performance accurately, without the effect of the page cache.
The experiments were conducted on an Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz with a Samsung SSD 970 PRO (NVMe SSD, 1TB).

Reading 10GB of data (time in seconds):

- posix + cudamemcpy : 5.031040154863149
- posix + O_DIRECT + cudamemcpy : 4.7419630330987275
- gds : 4.235773948952556

Performance gain: 1.19x

Reading 2GB of data (time in seconds):

- posix : 1.0681836600415409
- posix + O_DIRECT : 0.9496012150775641
- gds : 0.8406150250229985

Performance gain: 1.27x