Understanding the latency of 4K read workload in gdsio

Hi,

I’m using the gdsio benchmark to evaluate the performance of GPU Direct Storage (GDS) with a 4K random-read workload. I observed that the average latency in ASYNC mode is significantly higher than in SYNC mode, and I’m wondering whether this is expected behavior.

Hardware Setup

  • CPU: INTEL(R) XEON(R) GOLD 6526Y, 64 cores
  • GPU: NVIDIA A100-SXM4-40GB
  • SSD: SAMSUNG MZQL21T9HCJR-00A07, local
  • GPU and SSD are on the same NUMA node (node 1)

Software Setup

  • Ubuntu 22.04, Linux kernel 5.15.0
  • MLNX_OFED: MLNX_OFED_LINUX-24.10-1.1.4.0-ubuntu22.04-x86_64
  • cuda 12.2
  • GDS release version: 1.11.1.6
  • nvidia_fs version: 2.22
  • libcufile version: 2.12
  • Platform: x86_64
  • NVIDIA driver: 535.247.01

Here is the output of gdscheck

$ sudo ./gdscheck -p
 GDS release version: 1.11.1.6
 nvidia_fs version:  2.22 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Supported
 Userspace RDMA     : Supported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Loaded (libcufile_rdma.so)
 --rdma devices        : Configured
 --rdma_device_status  : Up: 1 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : true
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported only on (nvidia-fs version <= 2.17.4)
 Cuda Driver Version Installed:  12020
 Platform: R283-S93-AAF1-000, Arch: x86_64(Linux 5.15.134)
 Platform verification succeeded

Experiments

Storage → GPU (SYNC)

$ sudo ./gdsio -f /mnt/nvme_ext4/32GFile -d 0 -n 1 -w 1 -i 4K -I 2 -x 0 -T 20

IoType: RANDREAD XferType: GPUD Threads: 1 DataSetSize: 1072852/4(KiB) IOSize: 4(KiB) Throughput: 0.052870 GiB/sec, Avg_Latency: 72.151905 usecs ops: 268213 total_time 19.352230 secs

$ sudo ./gdsio -f /mnt/nvme_ext4/32GFile -d 0 -n 1 -w 1 -i 4K -I 2 -x 0 -T 20 -s 32G

IoType: RANDREAD XferType: GPUD Threads: 1 DataSetSize: 812696/33554432(KiB) IOSize: 4(KiB) Throughput: 0.038563 GiB/sec, Avg_Latency: 98.920999 usecs ops: 203174 total_time 20.098367 secs

Storage → CPU

$ sudo ./gdsio -f /mnt/nvme_ext4/32GFile -d 0 -n 1 -w 1 -i 4K -I 2 -x 1 -T 20

IoType: RANDREAD XferType: CPUONLY Threads: 1 DataSetSize: 1290988/4(KiB) IOSize: 4(KiB) Throughput: 0.064274 GiB/sec, Avg_Latency: 59.350699 usecs ops: 322747 total_time 19.155352 secs

$ sudo ./gdsio -f /mnt/nvme_ext4/32GFile -d 0 -n 1 -w 1 -i 4K -I 2 -x 1 -T 20 -s 32G

IoType: RANDREAD XferType: CPUONLY Threads: 1 DataSetSize: 1245260/33554432(KiB) IOSize: 4(KiB) Throughput: 0.060835 GiB/sec, Avg_Latency: 62.705600 usecs ops: 311315 total_time 19.521306 secs

Observations
In SYNC mode, growing the dataset to 32 GiB increases the GPU-path average latency by about 27 µs (72 → 99 µs), while the effect on the CPU-only path is marginal (59 → 63 µs).

Storage → GPU (ASYNC)

$ sudo ./gdsio -f /mnt/nvme_ext4/32GFile -d 0 -n 1 -w 1 -i 4K -I 2 -x 5 -T 20

IoType: RANDREAD XferType: ASYNC Threads: 1 DataSetSize: 525232/4(KiB) IOSize: 4(KiB) Throughput: 0.025580 GiB/sec, Avg_Latency: 149.071854 usecs ops: 131308 total_time 19.581467 secs

$ sudo ./gdsio -f /mnt/nvme_ext4/32GFile -d 0 -n 1 -w 1 -i 4K -I 2 -x 5 -T 20 -s 32G

IoType: RANDREAD XferType: ASYNC Threads: 1 DataSetSize: 524320/33554432(KiB) IOSize: 4(KiB) Throughput: 0.025199 GiB/sec, Avg_Latency: 151.330340 usecs ops: 131080 total_time 19.843498 secs

Observations
Compared to SYNC mode (~72 µs), ASYNC mode shows more than double the latency (~149 µs), i.e. about 77 µs of additional per-IO overhead apparently introduced by the asynchronous API path. Since these are single-threaded runs, Avg_Latency is effectively total_time / ops (e.g. 19.35 s / 268213 ≈ 72 µs for SYNC and 19.58 s / 131308 ≈ 149 µs for ASYNC), so the lower ASYNC throughput and the higher ASYNC latency are two views of the same slowdown.

I also checked the cufile.log file in the working directory while running all of the experiments, and it stayed empty. Since the default logging level only records errors, I take this to mean that no errors were hit and the GDS I/O path is functioning correctly.

Questions

  1. Is the ~2× latency increase when using ASYNC mode a known or expected behavior with GDS for small block sizes (4K)?
  2. Why does increasing the dataset size lead to higher latency in Storage → GPU SYNC mode?

I also have a few more questions from profiling my workload with Nsight Systems (2025.3.1).

Sync (-x 0) Result (Nsight Systems screenshot):

Async (-x 5) Result (Nsight Systems screenshot):

  • (1) On the CUDA HW Track, I observed two blocks labeled cuFileSparseGPU and cuFileCopyGPU. Does this indicate that the data movement is executed via CUDA kernels on the GPU? The documentation states, however, that “All APIs are issued from the CPU, not the GPU”.
  • (2) The cuFileReadAsync function appears to take 20.676 µs. Does this mean that the submission of the read request alone (excluding the actual I/O execution) takes approximately 20 µs?

I also implemented a single-threaded test program that issues 4 KB read requests using cuFileReadAsync and polls for completion with CUDA events:

// cuFileReadAsync takes the size and the offsets by pointer.
size_t  io_size    = 4096;
off_t   file_off   = 0;   // file offset of this 4 KiB read
off_t   gpu_off    = 0;   // offset into the registered GPU buffer
ssize_t done_bytes = 0;   // bytes actually read, filled in on completion

cudaEventRecord(start_event, stream);
cuFileReadAsync(cf_handle, gpu_buf, &io_size, &file_off, &gpu_off, &done_bytes, stream);
cudaEventRecord(end_event, stream);

// Busy-poll until everything enqueued before end_event has completed.
while (cudaEventQuery(end_event) != cudaSuccess) { }

float latency_ms = 0.0f;
cudaEventElapsedTime(&latency_ms, start_event, end_event);  // result is in milliseconds
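
For reference, the rest of the test program is the usual cuFile boilerplate. A simplified sketch of the setup it assumes (error handling omitted; filepath and the 4 KiB buffer size are placeholders) looks roughly like this:

#include <fcntl.h>
#include <cuda_runtime.h>
#include <cufile.h>

int fd = open(filepath, O_RDONLY | O_DIRECT);     // filepath is a placeholder

CUfileDescr_t descr = {};
descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
descr.handle.fd = fd;

CUfileHandle_t cf_handle;
cuFileDriverOpen();
cuFileHandleRegister(&cf_handle, &descr);

void *gpu_buf = nullptr;
cudaMalloc(&gpu_buf, 4096);
cuFileBufRegister(gpu_buf, 4096, 0);              // register the 4 KiB GPU buffer with cuFile

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaEvent_t start_event, end_event;
cudaEventCreate(&start_event);
cudaEventCreate(&end_event);
// ... 4 KiB read + event timing loop from the snippet above ...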

  • (1) Does this program correctly capture the latency of an I/O issued via the cuFile async API?
  • (2) According to NVTX, cuFileReadAsync and cuFileStreamSubmitIo each take about 4 µs on the CUDA HW Track. However, on the main thread track, they show durations of 71 µs and 67 µs, respectively. I assume these correspond to the same operations. Why is there such a large discrepancy between the two tracks? What does this imply?
  • (3) Since both cuFileReadAsync and cuFileStreamSubmitIo appear on the CUDA HW Track, does this mean these functions are executed on the GPU rather than the CPU?
  • (4) The thread track shows a latency of 71 µs for cuFileReadAsync. Does this represent the time spent submitting the request, excluding the actual I/O execution and completion? If so, is this latency considered normal?
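
To sanity-check (4) myself, I plan to also time the call with a host-side clock, on the assumption that the wall-clock time spent inside cuFileReadAsync covers only submission while the CUDA events capture the full stream-ordered execution. A rough sketch, reusing the variables from the snippet above:

#include <chrono>

auto t0 = std::chrono::steady_clock::now();
cudaEventRecord(start_event, stream);
cuFileReadAsync(cf_handle, gpu_buf, &io_size, &file_off, &gpu_off, &done_bytes, stream);
auto t1 = std::chrono::steady_clock::now();   // call returned; the I/O may still be in flight
cudaEventRecord(end_event, stream);

cudaStreamSynchronize(stream);                // wait for the read to actually complete

float stream_ms = 0.0f;
cudaEventElapsedTime(&stream_ms, start_event, end_event);
double submit_us = std::chrono::duration<double, std::micro>(t1 - t0).count();
// submit_us: CPU-side cost of enqueuing the request (presumably what the thread track shows)
// stream_ms: stream-ordered time until the data has landed in the GPU buffer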