Hey there,
I need some help with NVIDIA GPUDirect Storage (GDS) performance. My initial thought was that using GDS would increase throughput, but in my case the GPU↔storage path performs the same as the CPU↔storage path.
Let me first describe how everything is set up:
CPU: AMD Ryzen 5900X
GPU: Nvidia RTX A5000
NVMe: 2x Samsung 980 Pro (only one is mounted (nvme1n1), the other (nvme0n1) contains the OS installation and is not used for GDS)
Ubuntu: 20.04.6 LTS, Kernel: 5.15.0-73-generic
MLNX_OFED version: MLNX_OFED_LINUX-5.8-2.0.3.0
GDS release version: 1.6.1.9
nvidia_fs version: 2.15
libcufile version: 2.12
Nvidia driver: 530.30.02
CUDA version: 12.1
Platform: x86_64
IOMMU and ACS are disabled
Everything runs locally; no NICs or other devices are involved, and there is no RAID configuration. I just want to exchange data directly between the NVMe and the GPU, excluding any CPU traffic or interaction. All additional information is attached as .txt files.
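For completeness, these are the sanity checks I ran beforehand (the gdscheck path is the default location under my CUDA 12.1 install and may differ on other systems):

/usr/local/cuda/gds/tools/gdscheck -p   # reports whether cuFile/GDS is supported and which drivers are detected
lsmod | grep nvidia_fs                  # confirm the nvidia-fs kernel module is loaded
lspci -tv                               # PCIe topology, attached as lspci.tv.txt
nvidia-smi topo -m                      # device-to-device topology as seen by the NVIDIA driver
dmesg | grep -i -e iommu -e acs         # double-check that IOMMU and ACS are really disabled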
The following gdsio commands were executed, with the corresponding results:
krb@wslinux-rdma:~/cuda/gds/tools$ sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 0 -I 1 -T 120
IoType: WRITE XferType: GPUD Threads: 4 DataSetSize: 258038784/2048000(KiB) IOSize: 1024(KiB) Throughput: 2.048213 GiB/sec, Avg_Latency: 1907.022613 usecs ops: 251991 total_time 120.146179 secs
krb@wslinux-rdma:~/cuda/gds/tools$ sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 0 -I 0 -T 120
IoType: READ XferType: GPUD Threads: 4 DataSetSize: 391164928/2048000(KiB) IOSize: 1024(KiB) Throughput: 3.095833 GiB/sec, Avg_Latency: 1261.758202 usecs ops: 381997 total_time 120.498722 secs
krb@wslinux-rdma:~/cuda/gds/tools$ sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 1 -I 1 -T 120
IoType: WRITE XferType: CPUONLY Threads: 4 DataSetSize: 317429760/2048000(KiB) IOSize: 1024(KiB) Throughput: 2.512381 GiB/sec, Avg_Latency: 1554.694700 usecs ops: 309990 total_time 120.493104 secs
krb@wslinux-rdma:~/cuda/gds/tools$ sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 1 -I 0 -T 120
IoType: READ XferType: CPUONLY Threads: 4 DataSetSize: 370682880/2048000(KiB) IOSize: 1024(KiB) Throughput: 2.958396 GiB/sec, Avg_Latency: 1320.380376 usecs ops: 361995 total_time 119.494064 secs
krb@wslinux-rdma:~/cuda/gds/tools$ sudo rmmod nvidia-fs
krb@wslinux-rdma:~/cuda/gds/tools$ sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 0 -I 1 -T 120
IoType: WRITE XferType: GPUD Threads: 4 DataSetSize: 321523712/2048000(KiB) IOSize: 1024(KiB) Throughput: 2.549138 GiB/sec, Avg_Latency: 1532.274661 usecs ops: 313988 total_time 120.287275 secs
krb@wslinux-rdma:~/cuda/gds/tools$ sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 0 -I 0 -T 120
IoType: READ XferType: GPUD Threads: 4 DataSetSize: 376827904/2048000(KiB) IOSize: 1024(KiB) Throughput: 2.992857 GiB/sec, Avg_Latency: 1305.098689 usecs ops: 367996 total_time 120.076275 secs
As you can see, I changed the XFER_TYPE (“-x <0|1>”) between runs to see whether it makes any difference in throughput, but the throughput is essentially identical between “-x 0” (GPU Direct) and “-x 1” (CPU only). I therefore ran “top” while gdsio was running with “-x 0” and observed a CPU usage of around 10% for gdsio, although I had assumed there would be almost no load on the CPU.
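To double-check whether the I/O actually goes through the peer-to-peer DMA path, my next step would be to sample the nvidia-fs statistics before and after a short gdsio run (assuming the stats interface described in the GDS documentation, /proc/driver/nvidia-fs/stats, which is also what gds_stats reads; the gdsio options are the same as above, just with a shorter runtime):

cat /proc/driver/nvidia-fs/stats > stats_before.txt
sudo ./gdsio -D /mnt/nvme_gds/gds_dir -d 0 -w 4 -s 500M -i 1M -x 0 -I 0 -T 30
cat /proc/driver/nvidia-fs/stats > stats_after.txt
diff stats_before.txt stats_after.txt   # the read counters should increase if the DMA path is actually used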
I already made sure that it is not running in compatibility mode by looking at cufile.log after running gdsio: there was no output, so everything seems to be fine. cufile.log only indicated that gdsio was running in compatibility mode after I removed the “nvidia-fs” module and ran another gdsio session, which makes sense to me.
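For the next run I also plan to make cuFile logging more verbose by raising the log level in /etc/cufile.json; the keys below are the ones already present in my attached cufile.json.txt, I would only be changing their values:

grep -A 3 '"logging"' /etc/cufile.json       # "level" is currently "ERROR"; setting it to "TRACE" logs per-IO detail
grep 'allow_compat_mode' /etc/cufile.json    # if compat mode is allowed, cuFile can silently fall back to the CPU path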
As a cross-check, I ran fio against the same NVMe to see what throughput can be expected. See the following results:
krb@wslinux-rdma:~/cuda/gds/tools$ sudo fio --rw=read --name=test --size=10G --numjobs=8 --filename=/dev/nvme1n1 --allow_mounted_write=1
[...]
Run status group 0 (all jobs):
READ: bw=11.4GiB/s (12.3GB/s), 1465MiB/s-1465MiB/s (1536MB/s-1536MB/s), io=80.0GiB (85.9GB), run=6992-6992msec
krb@wslinux-rdma:~/cuda/gds/tools$ sudo fio --rw=write --name=test --size=10G --numjobs=8 --filename=/dev/nvme1n1 --allow_mounted_write=1
[...]
Run status group 0 (all jobs):
WRITE: bw=9511MiB/s (9973MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=80.0GiB (85.9GB), run=8612-8613msec
As you can see, fio suggests an expected throughput of around 10 GiB/s, although I am not sure how comparable the fio results are with gdsio. The PCIe device topology relative to the root complex is stored in “lspci.tv.txt”. I am not sure whether the PCIe topology is causing the low throughput, but since GDS itself reports that it is working fine, it seems likely that the data is being routed through the CPU to the storage and vice versa.
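One thing I noticed about my own fio runs: by default fio does buffered I/O, and with 8 jobs reading the same 10G region of the raw device a good part of it probably comes out of the page cache, which would explain numbers above what a single 980 Pro should physically deliver. A closer equivalent to the gdsio runs would probably be direct I/O with a matching block size and thread count against a file on the mounted filesystem, something like this (standard fio options; the directory is the same one I used for gdsio):

sudo fio --name=gdscompare --rw=read --bs=1M --direct=1 --ioengine=libaio --iodepth=4 \
         --numjobs=4 --size=500M --directory=/mnt/nvme_gds/gds_dir --group_reporting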
Can you please help me understand and explain this behavior?
cufile.json.txt (10.8 KB)
disk.by-path.txt (409 Bytes)
fio-read.txt (11.8 KB)
fio-write.txt (12.2 KB)
gds_stats.txt (260 Bytes)
gdscheck.py.txt (701 Bytes)
gdscheck.txt (2.3 KB)
gdsio.txt (1.9 KB)
lspci.tv.txt (2.7 KB)
nv.topology.txt (290 Bytes)
mount.ext4.txt (237 Bytes)