Reading speed of GPU Direct Storage (GDS) is far slower than expected


I’m trying to use GDS to load data from an NVMe SSD directly to the GPU, but the performance of GDS is far worse than I expected (and than benchmark results I found on the Internet).

Without GDS

Initially, when the filesystem mount options did not include data=ordered, the read/write speed was similar to the Storage->CPU->GPU path, and cufile.log contained these errors:

 18-11-2022 12:53:19:400 [pid=3937788 tid=3938762] ERROR  cufio-fs:79 mount option not found in mount table data device: /dev/md0
 18-11-2022 12:53:19:400 [pid=3937788 tid=3938762] ERROR  cufio-fs:148 EXT4 journal options not found in mount table for device,can't verify data=ordered mode journalling
 18-11-2022 12:53:19:400 [pid=3937788 tid=3938762] NOTICE  cufio:1036 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled

So I unmounted the device and remounted it with data=ordered.
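
For anyone who hits the same cufile.log errors, the fix is just a remount; a sketch using the device and mount point from this post (adjust for your own system):

```shell
# Remount the md RAID device with ordered-mode journalling so that
# cuFile can verify the journal option and enable the GDS path.
sudo umount /mnt/raid0nvme1
sudo mount -o data=ordered /dev/md0 /mnt/raid0nvme1

# Confirm the option took effect before re-running the benchmark:
mount | grep "/dev/md0"
```

The `mount` output should now list data=ordered among the options, as shown further down in this post.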

With GDS

However, after enabling GDS, gdsio reports that the sequential read/write speed with GDS (xfer_type=0, Storage->GPU) is even lower than when transferring without GDS (xfer_type=2, Storage->CPU->GPU).
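
For reference, gdsio invocations matching the runs below might look something like this (flag meanings per the gdsio help text; the target directory and GPU index are placeholders, and the sizes are illustrative):

```shell
# Sequential read, 8 worker threads (-w), 1 MiB I/Os (-i), 16 GiB dataset (-s),
# 60-second timed run (-T), on GPU 0 (-d).
# -I selects the I/O type (0 = read, 1 = write);
# -x selects the transfer path (0 = Storage->GPU via GDS, 2 = Storage->CPU->GPU).
gdsio -D /mnt/raid0nvme1/gds_dir -d 0 -w 8 -s 16G -i 1M -I 0 -x 0 -T 60
gdsio -D /mnt/raid0nvme1/gds_dir -d 0 -w 8 -s 16G -i 1M -I 0 -x 2 -T 60
```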

Is this because my GDS installation is not complete? I’d really appreciate it if anyone could point out a way to solve this.

Below is some relevant information.

Write speed:

IoType: WRITE XferType: GPUD Threads: 8 DataSetSize: 259806208/16777216(KiB) IOSize: 1024(KiB) Throughput: 4.096943 GiB/sec, Avg_Latency: 1906.918900 usecs ops: 253717 total_time 60.476923 secs
IoType: WRITE XferType: CPUONLY Threads: 8 DataSetSize: 377206784/16777216(KiB) IOSize: 1024(KiB) Throughput: 5.940653 GiB/sec, Avg_Latency: 1315.092447 usecs ops: 368366 total_time 60.554358 secs
IoType: WRITE XferType: CPU_GPU Threads: 8 DataSetSize: 360371200/16777216(KiB) IOSize: 1024(KiB) Throughput: 5.716904 GiB/sec, Avg_Latency: 1366.563535 usecs ops: 351925 total_time 60.115884 secs
IoType: WRITE XferType: CPU_ASYNC_GPU Threads: 8 DataSetSize: 360446976/16777216(KiB) IOSize: 1024(KiB) Throughput: 5.740631 GiB/sec, Avg_Latency: 1360.873393 usecs ops: 351999 total_time 59.880006 secs
IoType: WRITE XferType: CPU_CACHED_GPU Threads: 8 DataSetSize: 1400578048/16777216(KiB) IOSize: 1024(KiB) Throughput: 22.455711 GiB/sec, Avg_Latency: 347.904008 usecs ops: 1367752 total_time 59.481320 secs

Read speed:

IoType: READ XferType: GPUD Threads: 8 DataSetSize: 326001664/16777216(KiB) IOSize: 1024(KiB) Throughput: 5.229893 GiB/sec, Avg_Latency: 1493.813862 usecs ops: 318361 total_time 59.446613 secs
IoType: READ XferType: CPUONLY Threads: 8 DataSetSize: 600714240/16777216(KiB) IOSize: 1024(KiB) Throughput: 9.558972 GiB/sec, Avg_Latency: 817.302324 usecs ops: 586635 total_time 59.931730 secs
IoType: READ XferType: CPU_GPU Threads: 8 DataSetSize: 574359552/16777216(KiB) IOSize: 1024(KiB) Throughput: 9.234180 GiB/sec, Avg_Latency: 846.043356 usecs ops: 560898 total_time 59.317879 secs
IoType: READ XferType: CPU_ASYNC_GPU Threads: 8 DataSetSize: 410250240/16777216(KiB) IOSize: 1024(KiB) Throughput: 6.522218 GiB/sec, Avg_Latency: 1197.779030 usecs ops: 400635 total_time 59.986512 secs
IoType: READ XferType: CPU_CACHED_GPU Threads: 8 DataSetSize: 1407857664/16777216(KiB) IOSize: 1024(KiB) Throughput: 22.440167 GiB/sec, Avg_Latency: 348.157378 usecs ops: 1374861 total_time 59.831893 secs
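
As a unit sanity check, the reported throughput is simply DataSetSize divided by total_time, converted to GiB/s; for example, reproducing the GPUD read row:

```shell
# 326001664 KiB moved in 59.446613 s:
# divide by 1048576 (KiB per GiB), then by the elapsed time.
awk 'BEGIN { printf "%.2f GiB/sec\n", 326001664 / 1048576 / 59.446613 }'
# prints 5.23 GiB/sec, matching the GPUD line above
```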

This is my gdsio configuration:

#IO type, rw=read, rw=write, rw=randread, rw=randwrite
#block size, for variable block size can specify range e.g. bs=1M:4M:1M, (1M : start block size, 4M : end block size, 1M :steps in which size is varied)
#use 1 for enabling verification
#skip cufile buffer registration, ignored in cpu mode
#set up NVlinks, recommended if p2p traffic is cross node
#use random seed
#fill request buffer with random data
#refill io buffer after every write
#use random offsets which are not page-aligned
#file offset to start read/write from

#numa node
#gpu device index (check nvidia-smi)
#For Xfer mode 6, num_threads will be used as batch_size
#enable either directory or filename or url

According to gdscheck, GDS is enabled and NVMe is supported:

 GDS release version:
 nvidia_fs version:  2.12 libcufile version: 2.12
 Platform: x86_64
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 1024
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 18014398509481980
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 GPU index 0 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 2 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 3 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 4 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 5 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 6 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 7 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 IOMMU: disabled
 Platform verification succeeded

Other system information:
System: Ubuntu 20.04
Linux Kernel Version: 5.15.0-52-generic
CUDA Driver Version: 515.65.01
CUDA Version: 11.7
Mount options: /dev/md0 on /mnt/raid0nvme1 type ext4 (rw,relatime,discard,stripe=256,data=ordered)

There are no warnings or errors in cufile.log.


What NVMe SSD is being used here? What GPU is being used? Are both devices connected via PCIe gen 3 or PCIe gen 4? Is the GPU on an x16 link?

Throughput of 5GB/sec does not strike me as particularly slow for an NVMe device. What kind of throughput were you expecting and why?


Hi njuff, thanks for your reply. Sorry I forgot to mention some essential information.

  1. Two Intel P5510 SSDs configured as a RAID 0 array.
  2. 8 A5000 GPUs
  3. Yes, all devices are connected via PCIe gen 4. And all GPUs are on x16 links.
  4. As the two NVMe devices are in a RAID 0 array, fio reports a read throughput of 12.4 GiB/s (13.3 GB/s) with a similar configuration.

Here is the fio configuration:


Actually, as we can also observe from gdsio’s results: reading from SSD through the CPU to the GPU (9.23 GiB/s) is faster than GDS (5.23 GiB/s), which really confuses me.

Hi fuyao360,
The performance of GDS in P2P mode depends on the PCIe distance between the GPU and the NVMe drives. The best performance is seen when the NVMe and the GPU are under a common PCIe switch, so if the NVMes are attached through the CPU root ports, P2P performance might not be optimal.

You also seem to have a RAID 0 of two NVMes. If the two NVMes sit behind different CPU sockets, then P2P might not be efficient.

Please read the GDS benchmarking documentation for the recommended configuration.
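
To see where the drives and GPUs sit relative to each other, the standard tools give a quick picture (a sketch; device names and output vary per system):

```shell
# GPU topology matrix: PIX means peers share a single PCIe bridge/switch,
# while NODE/SYS means traffic must cross a CPU root complex or the
# socket interconnect, which hurts P2P throughput.
nvidia-smi topo -m

# PCIe device tree: shows which root port each NVMe and GPU hangs off.
lspci -tv | grep -i -E "nvme|nvidia"

# Member devices of the md RAID 0 array (to check if they share a socket):
cat /proc/mdstat
```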

Apparently, that model comes in more than one capacity. For NVMe SSDs, capacity is typically not independent of read/write performance, so one would also need to know the capacity to determine the relevant throughput numbers for this model. Looking at published numbers from reviews, the Intel P5510 is most frequently reported as providing a read throughput of 6500 MB/sec and a write throughput of 4200 MB/sec.

I have no experience with RAID configurations and their impact on performance.

GDS is a specialized topic, and does not seem a good fit for a general CUDA programming sub-forum such as this. I tried to find a more appropriate sub-forum but could not find a better match.

This ties in well with the OP’s fio read result of roughly twice that figure for a pair of them in RAID 0: 12.4 GiB/s (13.3 GB/s).
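
The arithmetic checks out under the assumption that RAID 0 read bandwidth scales with the number of member drives: two drives at the commonly reported 6500 MB/s each gives about 13 GB/s, which is roughly 12.1 GiB/s:

```shell
# 2 drives x 6500 MB/s (decimal megabytes), converted to GB/s and GiB/s.
awk 'BEGIN {
  gb  = 2 * 6500 / 1000;              # 13.0 GB/s
  gib = 2 * 6500 * 1e6 / 1073741824;  # bytes/s divided by 2^30
  printf "%.1f GB/s  %.2f GiB/s\n", gb, gib
}'
# prints 13.0 GB/s  12.11 GiB/s
```

That is close to the 12.4 GiB/s fio reported, so the drives themselves appear to be performing to spec; the shortfall is specific to the GDS path.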

The forum community manager redirected a GDS query to the GPU - Hardware forum.