Nvidia GDS issue on HGX A100 - Dell PowerEdge XE9680

I am experiencing a problem when attempting to write with GPUDirect Storage to a local NVMe SSD (ext4). Below are the details of the issue:

Read Test:
Command:
/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg1G -d 4 -s 1G -x 0 -I 0

Output:
IoType: READ XferType: GPUD Threads: 1 DataSetSize: 1048576/1048576(KiB) IOSize: 1024(KiB)
Throughput: 12.704382 GiB/sec, Avg_Latency: 76.674805 usecs ops: 1024 total_time 0.078713 secs

Write Test:
Command:
/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg1G -d 4 -s 1G -x 0 -I 1

Output:
write io failed of type 1 size: 1048576 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :207 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size :1048576

I have attempted the same test with three additional disks attached to the server, using all eight GPUs in the system, and with several versions of drivers and Mellanox OFED software, but the result remains the same.
I am including detailed information about the system in the attachment to this message.

Report Nvidia GDS issue on HGX A100 - Dell PowerEdge XE9680.pdf (71.5 KB)
cufile.log (6.2 MB)


Please enable the following by loading nvidia_peermem.ko:

–Mellanox PeerDirect : Enabled
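For reference, a minimal command sequence might look like the following (a sketch only: the gdscheck.py path is assumed to match the gdsio path used above, and the modules-load.d step is optional):

```shell
# Load the peer-memory module so GDS can register GPU memory for P2P DMA
sudo modprobe nvidia-peermem

# Confirm the module is loaded and that GDS now reports PeerDirect enabled
lsmod | grep nvidia_peermem
/usr/local/cuda-12.4/gds/tools/gdscheck.py -p | grep -i peerdirect

# Optionally load it automatically at boot
echo nvidia-peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf
```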

Hello,
After running “sudo modprobe nvidia-peermem”, the output of “gdscheck.py -p” shows “–Mellanox PeerDirect : Enabled”.
The results remain the same:

Read test:
/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg1G -d 4 -s 1G -x 0 -I 0
IoType: READ XferType: GPUD Threads: 1 DataSetSize: 1048576/1048576(KiB) IOSize: 1024(KiB) Throughput: 12.594617 GiB/sec, Avg_Latency: 77.337891 usecs ops: 1024 total_time 0.079399 secs

Write test:
/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg1G -d 4 -s 1G -x 0 -I 1
write io failed of type 1 size: 1048576 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :207 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size :1048576

The ‘cufile.log’ from these tests is attached to this message.
cufile.log (1.4 MB)

Regards,

The read throughput of 12 GB/s looks interesting. Are these Gen5 drives? Let’s verify whether reads are actually working or whether those numbers are misleading.
Can you try the following and share the output?

  1. Repeat the above tests with -x 1 and -x 2 modes, together with -I 1 (write) and the -V (verify) option, and see if they pass.
  2. If the above tests pass, then perform a read (-I 0) with -x 0 mode and the -V option. This verification should pass if reads are working fine.
  3. Does a simple 4K write fail as well? If yes, can you run a simple 4K write test with kernel logging for nvidia-fs enabled and share the dmesg and cufile.log output?
     If 4K doesn’t fail, try 64K, 128K, … to see where it fails first and attach that log.
     To enable kernel logs:
     echo 1 > /sys/module/nvidia_fs/parameters/dbg_enabled

Thanks
Sourab
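For step 3, the command sequence might look like this (a sketch using the file path from earlier in the thread; run with root privileges):

```shell
# Turn on nvidia-fs kernel debug logging (assumes the nvidia_fs module is loaded)
echo 1 | sudo tee /sys/module/nvidia_fs/parameters/dbg_enabled

# Run a single 4K GPUDirect write
/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg4K -d 4 -i 4K -s 4K -x 0 -I 1

# Capture the kernel messages, then turn debug logging back off
sudo dmesg > dmesg_nvidia_fs_dbg.log
echo 0 | sudo tee /sys/module/nvidia_fs/parameters/dbg_enabled
```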

Please check whether checksum (end-to-end protection information) is enabled on these drives. If it is enabled, please disable it and try again. The following command should show that:
sudo nvme id-ns /dev/nvme -H
You may share the output here to confirm as well.
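If it helps, here is a rough sketch of how to read that output programmatically. This is a hypothetical helper, not part of any tool: it assumes the field labels that nvme-cli prints with -H, and checks that the in-use LBA format carries no metadata and that dps reports protection information disabled.

```python
import re

def checksum_enabled(idns_output: str) -> bool:
    """Return True if the in-use LBA format carries metadata or PI is on.

    GDS doesn't support checksums today, so the in-use LBA format should
    have Metadata Size 0 and 'dps' should report PI disabled.
    """
    # Metadata size of the LBA format currently in use ("(in use)" marker).
    meta = re.search(r"Metadata Size:\s*(\d+)\s*bytes.*\(in use\)", idns_output)
    meta_bytes = int(meta.group(1)) if meta else 0
    # dps field: the low 3 bits select the active protection-information type.
    dps = re.search(r"^dps\s*:\s*(\S+)", idns_output, re.M)
    dps_val = int(dps.group(1), 0) if dps else 0
    return meta_bytes > 0 or (dps_val & 0x7) != 0
```

If this returns True, the namespace would need to be reformatted to an LBA format without metadata before GDS writes can work.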

Hello Sourab,

The tests output:

/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg1G -d 4 -s 1G -x 1 -V -I 1
IoType: WRITE XferType: CPUONLY Threads: 1 DataSetSize: 1048576/1048576(KiB) IOSize: 1024(KiB) Throughput: 2.089384 GiB/sec, Avg_Latency: 332.090820 usecs ops: 1024 total_time 0.478610 secs
Verifying data
IoType: READ XferType: CPUONLY Threads: 1 DataSetSize: 1048576/1048576(KiB) IOSize: 1024(KiB) Throughput: 1.544094 GiB/sec, Avg_Latency: 631.155273 usecs ops: 1024 total_time 0.647629 secs

/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg1G -d 4 -s 1G -x 2 -V -I 1
IoType: WRITE XferType: CPU_GPU Threads: 1 DataSetSize: 1048576/1048576(KiB) IOSize: 1024(KiB) Throughput: 3.635002 GiB/sec, Avg_Latency: 266.953125 usecs ops: 1024 total_time 0.275103 secs
Verifying data
IoType: READ XferType: CPU_GPU Threads: 1 DataSetSize: 1048576/1048576(KiB) IOSize: 1024(KiB) Throughput: 1.885768 GiB/sec, Avg_Latency: 517.189453 usecs ops: 1024 total_time 0.530288 secs

/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg1G -d 4 -s 1G -x 0 -V -I 0
Error : files not created with -V mode or data verification failed at offset : 0x0(0) bs :1048576 failing index :0x1f590 tid: 0
Actual Data:4a4a4a4a4a4a4a4a4a4a4a4a4a4a4a4a
Expected Data:47445343484b300088ac0f0000000000

/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg4K -d 4 -i 4K -s 4K -x 0 -V -I 1
write io failed of type 1 size: 4096 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :207 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size :4096

/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg4K -d 4 -s 4K -x 0 -V -I 1
data size has to be at least multiple of max block size used * worker_count. use -s 1G

/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg4K -d 4 -i 4K -s 64K -x 0 -V -I 1
write io failed of type 1 size: 4096 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :207 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size :4096

/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg4K -d 4 -i 4K -s 128K -x 0 -V -I 1
write io failed of type 1 size: 4096 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :207 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size :4096

/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg128K -d 4 -i 4K -s 128K -x 0 -V -I 1
write io failed of type 1 size: 4096 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :207 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size :4096

/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg512K -d 4 -i 4K -s 512K -x 0 -V -I 1
write io failed of type 1 size: 4096 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :207 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size :4096

/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg1024K -d 4 -i 4K -s 1024K -x 0 -V -I 1
write io failed of type 1 size: 4096 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :207 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size :4096

Thank you for your help!
Luciano
cufile.log (823.3 KB)
dmesg.log (1.2 MB)

Can you also check comment #5 and respond? That will rule out any NVMe-related issue, as GDS doesn’t support checksums today.

I couldn’t determine whether the checksum is enabled.

The “sudo nvme id-ns /dev/nvme3n1 -H” output:

NVME Identify Namespace 1:
nsze : 0x37e3e92b0
ncap : 0x37e3e92b0
nuse : 0x73f7940
nsfeat : 0x1e
[4:4] : 0x1 NPWG, NPWA, NPDG, NPDA, and NOWS are Supported
[2:2] : 0x1 Deallocated or Unwritten Logical Block error Supported
[1:1] : 0x1 Namespace uses NAWUN, NAWUPF, and NACWU
[0:0] : 0 Thin Provisioning Not Supported

nlbaf : 1
flbas : 0
[4:4] : 0 Metadata Transferred in Separate Contiguous Buffer
[3:0] : 0 Current LBA Format Selected

mc : 0
[1:1] : 0 Metadata Pointer Not Supported
[0:0] : 0 Metadata as Part of Extended Data LBA Not Supported

dpc : 0
[4:4] : 0 Protection Information Transferred as Last 8 Bytes of Metadata Not Supported
[3:3] : 0 Protection Information Transferred as First 8 Bytes of Metadata Not Supported
[2:2] : 0 Protection Information Type 3 Not Supported
[1:1] : 0 Protection Information Type 2 Not Supported
[0:0] : 0 Protection Information Type 1 Not Supported

dps : 0
[3:3] : 0 Protection Information is Transferred as Last 8 Bytes of Metadata
[2:0] : 0 Protection Information Disabled

nmic : 0
[0:0] : 0 Namespace Multipath Not Capable

rescap : 0
[6:6] : 0 Exclusive Access - All Registrants Not Supported
[5:5] : 0 Write Exclusive - All Registrants Not Supported
[4:4] : 0 Exclusive Access - Registrants Only Not Supported
[3:3] : 0 Write Exclusive - Registrants Only Not Supported
[2:2] : 0 Exclusive Access Not Supported
[1:1] : 0 Write Exclusive Not Supported
[0:0] : 0 Persist Through Power Loss Not Supported

fpi : 0x80
[7:7] : 0x1 Format Progress Indicator Supported
[6:0] : 0 Format Progress Indicator (Remaining 0%)

dlfeat : 9
[4:4] : 0 Guard Field of Deallocated Logical Blocks is set to 0xFFFF
[3:3] : 0x1 Deallocate Bit in the Write Zeroes Command is Supported
[2:0] : 0x1 Bytes Read From a Deallocated Logical Block and its Metadata are 0x00

nawun : 65535
nawupf : 7
nacwu : 65535
nabsn : 65535
nabo : 0
nabspf : 7
noiob : 0
nvmcap : 7681501126656
npwg : 7
npwa : 7
npdg : 7
npda : 7
nows : 2047
nsattr : 0
nvmsetid: 0
anagrpid: 0
endgid : 1
nguid : fd5b42cebc8f9d54ace42e003609d38e
eui64 : ace42e003609d38e
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best

Regards,
Luciano

This looks okay: Metadata Size is 0, which means checksum is not enabled on the drive. I will take a look at the rest of the logs and get back to you.
Just to confirm: is the MOFED version you installed MLNX_OFED_LINUX-24.01-0.3.3.1?
“ofed_info -s” should be able to confirm.

The MOFED version is MLNX_OFED_LINUX-24.01-0.3.3.1.
Additional information:
nvidia-driver-550-open
nvidia-fabricmanager-550
nvidia-gds-12-4
Ubuntu 20.04.6 LTS

From the logs, read verification has also failed, which means reads are not working correctly either. I am still trying to figure out what the issue could be. In the meantime, can you try a few experiments?

  1. Check whether Intel VT-d is enabled in the BIOS and, if enabled, try disabling it. This link might be handy for your system.
  2. If the experiment in #1 fails, then keep both VT-d and the IOMMU enabled and retest. We are trying to determine whether there is some issue with the PCI bridge such that it cannot handle the P2P transactions.
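As a quick sanity check for the experiments above, here is a rough heuristic for classifying the kernel command line in /proc/cmdline. This is a sketch only: the effective IOMMU state also depends on the BIOS VT-d setting and the distro’s kernel defaults, so treat the result as a hint, not ground truth.

```python
def iommu_mode(cmdline: str) -> str:
    """Classify IOMMU intent from a kernel command-line string.

    Returns 'off', 'passthrough', or 'translated'. GDS P2P DMA is
    typically expected to work with the IOMMU off or in passthrough mode.
    """
    args = cmdline.split()
    if "intel_iommu=off" in args or "amd_iommu=off" in args:
        return "off"
    if "iommu=pt" in args:
        # Devices get identity-mapped; P2P usually still works.
        return "passthrough"
    if "intel_iommu=on" in args or "amd_iommu=on" in args:
        return "translated"
    # No explicit arguments: BIOS and kernel build defaults decide,
    # so report 'off' as a best guess.
    return "off"

# Example: classify the running system's command line
# with open("/proc/cmdline") as f:
#     print(iommu_mode(f.read()))
```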

Can you share following outputs as well?

  1. lspci -nn
  2. nvme error-log /dev/nvme

Hello,

Test results with VT-d disabled:
/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg1G -d 4 -s 1GB -x 0 -V -I 1
IoType: WRITE XferType: GPUD Threads: 1 DataSetSize: 1048576/1048576(KiB) IOSize: 1024(KiB) Throughput: 2.605761 GiB/sec, Avg_Latency: 370.110352 usecs ops: 1024 total_time 0.383765 secs
Verifying data
IoType: READ XferType: GPUD Threads: 1 DataSetSize: 1048576/1048576(KiB) IOSize: 1024(KiB) Throughput: 2.408518 GiB/sec, Avg_Latency: 403.853516 usecs ops: 1024 total_time 0.415193 secs

/usr/local/cuda-12.4/gds/tools/gdsio -f /home/m324528/mnt_nvme3/reg1G -d 4 -s 1GB -x 0 -V -I 0
IoType: READ XferType: GPUD Threads: 1 DataSetSize: 1048576/1048576(KiB) IOSize: 1024(KiB) Throughput: 2.385104 GiB/sec, Avg_Latency: 407.967773 usecs ops: 1024 total_time 0.419269 secs

It works for all four SSD drives!

Note: with VT-d and the IOMMU enabled, the error still occurs.

proc_cpuinfo.txt (202.6 KB)
dmesg_vt-d_off.log (488.1 KB)
cufile_vt-d_off.log (2.2 MB)
nvme3m1_error_log.txt (18.8 KB)

Regards,