GDS performance not as expected

Hey there,
I need some help with NVIDIA GPUDirect Storage performance. My initial thought was that using GDS would increase throughput, but in my case CPU->Storage and GPU->Storage transfers perform the same.

Let me first describe how everything is set up:

CPU: AMD Ryzen 5900X
GPU: Nvidia RTX A5000
NVMe: 2x Samsung 980 Pro (only one is mounted (nvme1n1), the other (nvme0n1) contains the OS installation and is not used for GDS)
Ubuntu: 20.04.6 LTS, Kernel: 5.15.0-73-generic
MLNX_OFED version: MLNX_OFED_LINUX-5.8-2.0.3.0
GDS release version: 1.6.1.9
nvidia_fs version:  2.15
libcufile version: 2.12
Nvidia driver: 530.30.02
CUDA version: 12.1
Platform: x86_64
IOMMU and ACS are disabled

Everything runs locally, with no NICs or other network devices involved, and no RAID configuration. I just want to exchange data directly between NVMe and GPU, excluding any CPU traffic or interaction. All additional information is attached as .TXT files.

The following gdsio commands were executed, with the corresponding results:

krb@wslinux-rdma:~/cuda/gds/tools$ sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 0 -I 1 -T 120
IoType: WRITE XferType: GPUD Threads: 4 DataSetSize: 258038784/2048000(KiB) IOSize: 1024(KiB) Throughput: 2.048213 GiB/sec, Avg_Latency: 1907.022613 usecs ops: 251991 total_time 120.146179 secs

krb@wslinux-rdma:~/cuda/gds/tools$ sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 0 -I 0 -T 120
IoType: READ XferType: GPUD Threads: 4 DataSetSize: 391164928/2048000(KiB) IOSize: 1024(KiB) Throughput: 3.095833 GiB/sec, Avg_Latency: 1261.758202 usecs ops: 381997 total_time 120.498722 secs

krb@wslinux-rdma:~/cuda/gds/tools$ sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 1 -I 1 -T 120
IoType: WRITE XferType: CPUONLY Threads: 4 DataSetSize: 317429760/2048000(KiB) IOSize: 1024(KiB) Throughput: 2.512381 GiB/sec, Avg_Latency: 1554.694700 usecs ops: 309990 total_time 120.493104 secs

krb@wslinux-rdma:~/cuda/gds/tools$ sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 1 -I 0 -T 120
IoType: READ XferType: CPUONLY Threads: 4 DataSetSize: 370682880/2048000(KiB) IOSize: 1024(KiB) Throughput: 2.958396 GiB/sec, Avg_Latency: 1320.380376 usecs ops: 361995 total_time 119.494064 secs

krb@wslinux-rdma:~/cuda/gds/tools$ sudo rmmod nvidia-fs

krb@wslinux-rdma:~/cuda/gds/tools$ sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 0 -I 1 -T 120
IoType: WRITE XferType: GPUD Threads: 4 DataSetSize: 321523712/2048000(KiB) IOSize: 1024(KiB) Throughput: 2.549138 GiB/sec, Avg_Latency: 1532.274661 usecs ops: 313988 total_time 120.287275 secs

krb@wslinux-rdma:~/cuda/gds/tools$ sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 0 -I 0 -T 120
IoType: READ XferType: GPUD Threads: 4 DataSetSize: 376827904/2048000(KiB) IOSize: 1024(KiB) Throughput: 2.992857 GiB/sec, Avg_Latency: 1305.098689 usecs ops: 367996 total_time 120.076275 secs

As you can see, I changed the transfer type (“-x <0,1>”) when running the benchmark to see whether it makes any difference in throughput, but the throughput appears to be identical between “-x 0” (GPU Direct) and “-x 1” (CPU only). I therefore ran “top” while gdsio was running with “-x 0” and observed a CPU usage of around 10% by gdsio, although I assumed there should be almost no load on the CPU.

I already made sure that it is not running in compatibility mode by looking at cufile.log after running gdsio. There was no output, so everything seemed to be alright. cufile.log only indicated that gdsio was running in compatibility mode after I removed the “nvidia-fs” module and ran another gdsio session, which makes sense to me.
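
For completeness, another way to verify this, assuming the stock tools layout and the default config location, is to run gdscheck and to raise the cuFile log verbosity:

krb@wslinux-rdma:~/cuda/gds/tools$ ./gdscheck -p

gdscheck -p reports whether GDS is supported on the current setup, and setting "logging": { "level": "TRACE" } in /etc/cufile.json makes any compatibility-mode fallback show up explicitly in cufile.log during the next gdsio run.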

As a cross-check I ran fio against the same NVMe to see what throughput can be expected. See the following results:

krb@wslinux-rdma:~/cuda/gds/tools$ sudo fio --rw=read --name=test --size=10G --numjobs=8 --filename=/dev/nvme1n1 --allow_mounted_write=1

[...]

Run status group 0 (all jobs):
   READ: bw=11.4GiB/s (12.3GB/s), 1465MiB/s-1465MiB/s (1536MB/s-1536MB/s), io=80.0GiB (85.9GB), run=6992-6992msec
krb@wslinux-rdma:~/cuda/gds/tools$ sudo fio --rw=write --name=test --size=10G --numjobs=8 --filename=/dev/nvme1n1 --allow_mounted_write=1

[...]

Run status group 0 (all jobs):
  WRITE: bw=9511MiB/s (9973MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=80.0GiB (85.9GB), run=8612-8613msec

As you can see, the expected throughput should be around 10 GiB/s according to fio, although I am not sure how comparable the fio results are with gdsio. The PCIe device topology with respect to the root complex is stored in “lspci.tv.txt”. I am not sure whether the PCIe topology is causing the slow throughput, but since GDS itself reports that it is working fine, my suspicion is that the data is being routed through the CPU to the storage and vice versa.
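
One thing I can still check, in case the topology is the issue, is the negotiated PCIe link of the NVMe and the GPU, for example with something along these lines (01:00.0 is just a placeholder, the actual bus addresses are the ones listed in lspci.tv.txt):

krb@wslinux-rdma:~/cuda/gds/tools$ sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'

If LnkSta reports fewer lanes or a lower speed than LnkCap, the link is degraded and that alone would limit the throughput.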

Can you please help me understand and explain this behavior?

cufile.json.txt (10.8 KB)
disk.by-path.txt (409 Bytes)
fio-read.txt (11.8 KB)
fio-write.txt (12.2 KB)
gds_stats.txt (260 Bytes)
gdscheck.py.txt (701 Bytes)
gdscheck.txt (2.3 KB)
gdsio.txt (1.9 KB)
lspci.tv.txt (2.7 KB)
nv.topology.txt (290 Bytes)
mount.ext4.txt (237 Bytes)

I would appreciate any help or hints on what to look for.

I have reinstalled everything and moved the OS from nvme0n1 to nvme1n1 just to rule out a bottleneck caused by the PCIe topology. Unfortunately, it did not help.

I ran the following commands in three separate terminals simultaneously:

  • sudo ./gdsio -D /mnt/nvme0n1/gds_dir -d 0 -w 8 -s 1G -i 1M -x 0 -I 0 -T 300
  • iostat -cxzk 1
  • nvidia-smi dmon -i 0 -s putcm

The results of both monitoring commands are attached as a PNG. They show that the CPU's %iowait is relatively high. Is this already an indication that the data is redirected via the CPU?

I also checked whether the fio results make sense at all: the manufacturer's specification for the Samsung 980 (non-Pro, a mistake on my part above) is a read/write throughput of roughly 3.5/3.0 GB/s. So I guess the fio results are for some reason not suitable for comparison with gdsio, or the CLI arguments I used are wrong.

Thanks for the detailed information.

With fio as run above you are testing buffered I/O, which goes through the page cache.
To test the bandwidth from the disk itself, you can use --direct=1 (see the fio documentation).
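
For example, something along these lines (the exact parameters would of course need to be adjusted to match your gdsio runs):

sudo fio --rw=read --bs=1M --iodepth=32 --ioengine=libaio --direct=1 --name=test --size=10G --numjobs=8 --filename=/dev/nvme1n1

With --direct=1 the page cache is bypassed, so the reported bandwidth reflects what the device itself can deliver.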

The GPUD, CPUONLY and CPU_GPU results look good, as the disk bandwidth (PCIe Gen 3 x4) is the bottleneck here and not the mode of transfer. With GPUD you would see a slight decrease in CPU usage at these IO sizes.

GPUD still consumes CPU because the application is host code and the control path still runs on the CPU; the data itself flows directly from storage to the GPU using GPUDirect DMA.
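
If you want to quantify the difference, one option (assuming the sysstat package is installed) is to watch gdsio's per-process CPU usage while a run is active and compare it across the -x modes:

pidstat -u -p $(pgrep -x gdsio) 1

The %CPU column should drop somewhat with -x 0 compared to -x 1 and -x 2, since only the control path runs on the CPU.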

Hope this helps.

Hey @kmodukuri,

I ran some tests and monitoring to check whether I can see the decrease in CPU usage across the different transfer types (“-x [0-2]”). Here are the results:

  • gdsio command: sudo ./gdsio -D /mnt/nvme0n1/gds_dir/ -d 0 -w 8 -s 1G -i 128K -x 0 -I 0 -T 300

    gdsio result: IoType: READ XferType: GPUD Threads: 8 DataSetSize: 843705216/8388608(KiB) IOSize: 128(KiB) Throughput: 2.681960 GiB/sec, Avg_Latency: 364.114332 usecs ops: 6591447 total_time 300.011981 secs

  • gdsio command: sudo ./gdsio -D /mnt/nvme0n1/gds_dir/ -d 0 -w 8 -s 1G -i 128K -x 1 -I 0 -T 300

    gdsio result: IoType: READ XferType: CPUONLY Threads: 8 DataSetSize: 841726336/8388608(KiB) IOSize: 128(KiB) Throughput: 2.678544 GiB/sec, Avg_Latency: 364.578803 usecs ops: 6575987 total_time 299.689988 secs

  • gdsio command: sudo ./gdsio -D /mnt/nvme0n1/gds_dir/ -d 0 -w 8 -s 1G -i 128K -x 2 -I 0 -T 300

    gdsio result: IoType: READ XferType: CPU_GPU Threads: 8 DataSetSize: 840902912/8388608(KiB) IOSize: 128(KiB) Throughput: 2.677671 GiB/sec, Avg_Latency: 364.697513 usecs ops: 6569554 total_time 299.494429 secs

I would say one can clearly see that the CPU usage drops by around 30 percentage points between GPUD and CPU_GPU, the latter being the transfer type without DMA and the default data path when moving data between storage and GPU memory. Seems good I guess, thanks.

Using “--direct=1” with fio results in even lower throughput than gdsio.
