Reading speed of GPU Direct Storage (GDS) is far slower than expected


I’m trying to use GDS to load data from an NVMe SSD directly to the GPU, but the performance of GDS is far worse than I expected (and than benchmark results I found on the Internet).

Without GDS

Initially, when the filesystem mount options did not include data=ordered, the read/write speed was similar to the Storage->CPU->GPU path, and cufile.log contained these errors:

 18-11-2022 12:53:19:400 [pid=3937788 tid=3938762] ERROR  cufio-fs:79 mount option not found in mount table data device: /dev/md0
 18-11-2022 12:53:19:400 [pid=3937788 tid=3938762] ERROR  cufio-fs:148 EXT4 journal options not found in mount table for device,can't verify data=ordered mode journalling
 18-11-2022 12:53:19:400 [pid=3937788 tid=3938762] NOTICE  cufio:1036 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled

So I unmounted the device and remounted it with data=ordered.
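
For anyone who hits the same cufile.log errors, the fix is just a remount; a sketch using the device and mount point from this post (adjust for your own system):

```shell
# Remount the md RAID device with ordered-mode journalling so that
# cuFile can verify the journal option and enable the GDS path.
sudo umount /mnt/raid0nvme1
sudo mount -o data=ordered /dev/md0 /mnt/raid0nvme1

# Confirm the option took effect before re-running the benchmark:
mount | grep "/dev/md0"
```

The `mount` output should now list data=ordered among the options, as shown further down in this post.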

With GDS

However, after enabling GDS, gdsio reports that the sequential read/write speed with GDS (xfer_type=0, Storage->GPU) is even lower than when transferring without GDS (xfer_type=2, Storage->CPU->GPU).
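
For reference, gdsio invocations matching the runs below might look something like this (flag meanings per the gdsio help text; the target directory and GPU index are placeholders, and the sizes are illustrative):

```shell
# Sequential read, 8 worker threads (-w), 1 MiB I/Os (-i), 16 GiB dataset (-s),
# 60-second timed run (-T), on GPU 0 (-d).
# -I selects the I/O type (0 = read, 1 = write);
# -x selects the transfer path (0 = Storage->GPU via GDS, 2 = Storage->CPU->GPU).
gdsio -D /mnt/raid0nvme1/gds_dir -d 0 -w 8 -s 16G -i 1M -I 0 -x 0 -T 60
gdsio -D /mnt/raid0nvme1/gds_dir -d 0 -w 8 -s 16G -i 1M -I 0 -x 2 -T 60
```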

Is this because my GDS installation is not complete? I’d really appreciate it if anyone could point out a way to solve this.

Below is some relevant information.

Write speed:

IoType: WRITE XferType: GPUD Threads: 8 DataSetSize: 259806208/16777216(KiB) IOSize: 1024(KiB) Throughput: 4.096943 GiB/sec, Avg_Latency: 1906.918900 usecs ops: 253717 total_time 60.476923 secs
IoType: WRITE XferType: CPUONLY Threads: 8 DataSetSize: 377206784/16777216(KiB) IOSize: 1024(KiB) Throughput: 5.940653 GiB/sec, Avg_Latency: 1315.092447 usecs ops: 368366 total_time 60.554358 secs
IoType: WRITE XferType: CPU_GPU Threads: 8 DataSetSize: 360371200/16777216(KiB) IOSize: 1024(KiB) Throughput: 5.716904 GiB/sec, Avg_Latency: 1366.563535 usecs ops: 351925 total_time 60.115884 secs
IoType: WRITE XferType: CPU_ASYNC_GPU Threads: 8 DataSetSize: 360446976/16777216(KiB) IOSize: 1024(KiB) Throughput: 5.740631 GiB/sec, Avg_Latency: 1360.873393 usecs ops: 351999 total_time 59.880006 secs
IoType: WRITE XferType: CPU_CACHED_GPU Threads: 8 DataSetSize: 1400578048/16777216(KiB) IOSize: 1024(KiB) Throughput: 22.455711 GiB/sec, Avg_Latency: 347.904008 usecs ops: 1367752 total_time 59.481320 secs

Read speed:

IoType: READ XferType: GPUD Threads: 8 DataSetSize: 326001664/16777216(KiB) IOSize: 1024(KiB) Throughput: 5.229893 GiB/sec, Avg_Latency: 1493.813862 usecs ops: 318361 total_time 59.446613 secs
IoType: READ XferType: CPUONLY Threads: 8 DataSetSize: 600714240/16777216(KiB) IOSize: 1024(KiB) Throughput: 9.558972 GiB/sec, Avg_Latency: 817.302324 usecs ops: 586635 total_time 59.931730 secs
IoType: READ XferType: CPU_GPU Threads: 8 DataSetSize: 574359552/16777216(KiB) IOSize: 1024(KiB) Throughput: 9.234180 GiB/sec, Avg_Latency: 846.043356 usecs ops: 560898 total_time 59.317879 secs
IoType: READ XferType: CPU_ASYNC_GPU Threads: 8 DataSetSize: 410250240/16777216(KiB) IOSize: 1024(KiB) Throughput: 6.522218 GiB/sec, Avg_Latency: 1197.779030 usecs ops: 400635 total_time 59.986512 secs
IoType: READ XferType: CPU_CACHED_GPU Threads: 8 DataSetSize: 1407857664/16777216(KiB) IOSize: 1024(KiB) Throughput: 22.440167 GiB/sec, Avg_Latency: 348.157378 usecs ops: 1374861 total_time 59.831893 secs
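
As a unit sanity check, the reported throughput is simply DataSetSize divided by total_time, converted to GiB/s; for example, reproducing the GPUD read row:

```shell
# 326001664 KiB moved in 59.446613 s:
# divide by 1048576 (KiB per GiB), then by the elapsed time.
awk 'BEGIN { printf "%.2f GiB/sec\n", 326001664 / 1048576 / 59.446613 }'
# prints 5.23 GiB/sec, matching the GPUD line above
```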

This is my gdsio configuration:

#IO type, rw=read, rw=write, rw=randread, rw=randwrite
#block size, for variable block size can specify range e.g. bs=1M:4M:1M, (1M : start block size, 4M : end block size, 1M :steps in which size is varied)
#use 1 for enabling verification
#skip cufile buffer registration, ignored in cpu mode
#set up NVlinks, recommended if p2p traffic is cross node
#use random seed
#fill request buffer with random data
#refill io buffer after every write
#use random offsets which are not page-aligned
#file offset to start read/write from

#numa node
#gpu device index (check nvidia-smi)
#For Xfer mode 6, num_threads will be used as batch_size
#enable either directory or filename or url

According to gdscheck, GDS is enabled and NVMe is supported:

 GDS release version:
 nvidia_fs version:  2.12 libcufile version: 2.12
 Platform: x86_64
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 1024
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 18014398509481980
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 GPU index 0 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 2 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 3 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 4 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 5 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 6 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 7 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 IOMMU: disabled
 Platform verification succeeded

Other system information:
System: Ubuntu 20.04
Linux Kernel Version: 5.15.0-52-generic
CUDA Driver Version: 515.65.01
CUDA Version: 11.7
Mount options: /dev/md0 on /mnt/raid0nvme1 type ext4 (rw,relatime,discard,stripe=256,data=ordered)

There are no warnings or errors in cufile.log.


What NVMe SSD is being used here? What GPU is being used? Are both devices connected via PCIe gen 3 or PCIe gen 4? Is the GPU on an x16 link?

Throughput of 5GB/sec does not strike me as particularly slow for an NVMe device. What kind of throughput were you expecting and why?


Hi njuff, thanks for your reply. Sorry I forgot to mention some essential information.

  1. Two Intel P5510 SSDs configured as a RAID 0 array.
  2. 8 A5000 GPUs
  3. Yes, all devices are connected via PCIe gen 4. And all GPUs are on x16 links.
  4. As the two NVMe devices are in a RAID 0 array, fio reports a read throughput of 12.4 GiB/s (13.3 GB/s) with a similar configuration.

Here is the fio configuration:


Actually, as we can also observe from gdsio’s results: reading from SSD through the CPU to the GPU (9.23 GiB/s) is faster than GDS (5.23 GiB/s), which really confuses me.

Hi fuyao360,
The performance of GDS in P2P mode depends on the PCIe distance between the GPU and the NVMe drives. The best performance is seen when the NVMe and the GPU are under a common PCIe switch, so if the NVMes are attached through the CPU root ports, P2P performance might not be optimal.

You also seem to have a RAID 0 of two NVMes. If the two NVMes sit behind different CPU sockets, then P2P might not be efficient.

Please read the GDS benchmarking documentation for the recommended configuration.
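
To see where the drives and GPUs sit relative to each other, the standard tools give a quick picture (a sketch; device names and output vary per system):

```shell
# GPU topology matrix: PIX means peers share a single PCIe bridge/switch,
# while NODE/SYS means traffic must cross a CPU root complex or the
# socket interconnect, which hurts P2P throughput.
nvidia-smi topo -m

# PCIe device tree: shows which root port each NVMe and GPU hangs off.
lspci -tv | grep -i -E "nvme|nvidia"

# Member devices of the md RAID 0 array (to check if they share a socket):
cat /proc/mdstat
```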

Apparently, that model comes in more than one capacity. For NVMe SSDs, capacity is typically not independent of read/write performance, so one would also need to know the capacity to determine the relevant throughput numbers for this model. Looking at published numbers from reviews, the Intel P5510 is most frequently reported as providing a read throughput of 6500 MB/sec and a write throughput of 4200 MB/sec.

I have no experience with RAID configurations and their impact on performance.

GDS is a specialized topic, and does not seem a good fit for a general CUDA programming sub-forum such as this. I tried to find a more appropriate sub-forum but could not find a better match.

This ties in well with the OP’s fio read result of roughly twice that figure for a pair of them in RAID 0: 12.4 GiB/s (13.3 GB/s).
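
The arithmetic checks out under the assumption that RAID 0 read bandwidth scales with the number of member drives: two drives at the commonly reported 6500 MB/s each gives about 13 GB/s, which is roughly 12.1 GiB/s:

```shell
# 2 drives x 6500 MB/s (decimal megabytes), converted to GB/s and GiB/s.
awk 'BEGIN {
  gb  = 2 * 6500 / 1000;              # 13.0 GB/s
  gib = 2 * 6500 * 1e6 / 1073741824;  # bytes/s divided by 2^30
  printf "%.1f GB/s  %.2f GiB/s\n", gb, gib
}'
# prints 13.0 GB/s  12.11 GiB/s
```

That is close to the 12.4 GiB/s fio reported, so the drives themselves appear to be performing to spec; the shortfall is specific to the GDS path.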

The forum community manager redirected a GDS query to the GPU - Hardware forum.