GDS read/write error: use gdsio -I 1 to write

Hello,

I installed GDS on Ubuntu 20.04 and the installation appears to be correct. Here is my gdscheck -p output:

 GDS release version: 1.14.1.1
 nvidia_fs version:  2.25 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe P2PDMA        : Unsupported
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 ScaTeFS            : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_pci_p2pdma : true
 properties.use_compat_mode : false
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.per_buffer_cache_size_kb : 1024
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 64 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.scatefs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: true
 fs.gpfs.gds_async_support: true
 profile.nvtx : true
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA RTX A4000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12080
 Platform: 30FNSBYL00, Arch: x86_64(Linux 5.15.0-139-generic)
 Platform verification succeeded

I used gdsio to test GDS. The read test (-I 0) fails data verification, and the write test (-I 1) fails with I/O errors:

 /usr/local/cuda-12.4/gds/tools/gdsio -f /media/ct/nvme/dd.txt -d 0 -w 4 -s 10G -i 1M -I 0 -x 0 -V
Error : files not created with -V mode or data verification failed at offset : 0xa0000000(2684354560) bs :1048576 failing index :0x1eff0 tid: 1 
Actual Data:41414141414141414141414141414141
Expected Data:47445343484b3100887f0fa000000000
Error : files not created with -V mode or data verification failed at offset : 0x140000000(5368709120) bs :1048576 failing index :0x1f890 tid: 2 
Actual Data:41414141414141414141414141414141
Expected Data:47445343484b320088c40f4001000000
Error : files not created with -V mode or data verification failed at offset : 0x0(0) bs :1048576 failing index :0x1efd0 tid: 0 
Actual Data:41414141414141414141414141414141
Expected Data:47445343484b3000887e0f0000000000
Error : files not created with -V mode or data verification failed at offset : 0x1e0000000(8053063680) bs :1048576 failing index :0x1e670 tid: 3 
Actual Data:41414141414141414141414141414141
Expected Data:47445343484b330088330fe001000000
sudo /usr/local/cuda-12.4/gds/tools/gdsio -f /media/ct/nvme/dd.txt -d 0 -w 4 -s 10G -i 1M -I 1 -x 0 -V
write io failed of type 1 size: 1048576 , ret: 0 
failed to submit io of type 1 ret: -5 
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :8053063680, block size  :1048576
write io failed of type 1 size: 1048576 , ret: 0 
failed to submit io of type 1 ret: -5 
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :ret :-5 errno :write io failed of type 1 size: 1048576 , ret: 0 
failed to submit io of type 1 ret: -5 
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :5, file offset :2684354560, block size  :1048576
ret :-5 errno :5, file offset :5368709120, block size  :1048576
write io failed of type 1 size: 1048576 , ret: 0 
failed to submit io of type 1 ret: -5 
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size  :1048576
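
The hex dumps in the -I 0 output above are easier to read once decoded. The following is only an illustrative sketch using values copied from that output (tid 1). Treating the trailing 8 bytes of the expected pattern as a little-endian integer is an assumption, not documented gdsio behavior.

```python
# Hex dumps copied from the gdsio -V output above (tid 1).
actual = bytes.fromhex("41414141414141414141414141414141")
expected = bytes.fromhex("47445343484b3100887f0fa000000000")

actual_text = actual.decode("ascii")         # what is really on disk
expected_tag = expected[:7].decode("ascii")  # ASCII tag at the start of the pattern
# Assumption: the last 8 bytes form a little-endian integer.
expected_value = int.from_bytes(expected[8:], "little")

# -s 10G shared by -w 4 workers: each worker starts 2.5 GiB after the previous one.
file_size = 10 * 1024**3
workers = 4
start_offsets = [tid * file_size // workers for tid in range(workers)]

print(actual_text)
print(expected_tag)
print(hex(expected_value))
print(start_offsets)
```

The data on disk is a uniform run of 0x41 ('A'), while the expected data starts with the tag GDSCHK1. That is consistent with the first possibility named in the error message itself: the file was not created with -V. The computed start offsets match the four offsets reported in the log.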

The problem is similar to "Nvidia GDS issue on HGX A100 - Dell PowerEdge XE9680" (post #14 by 271732480). In my case, however, I made sure that VT-d is disabled in the BIOS and that the IOMMU is disabled, and the problem still exists.

Here are some details of my environment:

Ubuntu 20.04, kernel 5.15.0-139-generic
MLNX_OFED_LINUX-5.8-7.0.6.1
nvidia-driver 570.181
nvidia-gds-12-4
nvidia-fabricmanager-570/unknown,now 570.172.08-1 amd64
libnvidia-nscq-570/unknown,now 570.172.08-1 amd64

dmesg shows:

[ 3105.809681] blk_update_request: I/O error, dev nvme0n1, sector 1979531264 op 0x1:(WRITE) flags 0x8800 phys_seg 17 prio class 0
[ 3105.809735] nvidia-fs:write IO failed :-5
[ 3105.819265] blk_update_request: I/O error, dev nvme0n1, sector 1308442624 op 0x1:(WRITE) flags 0xc800 phys_seg 127 prio class 0
[ 3105.819285] blk_update_request: I/O error, dev nvme0n1, sector 637353984 op 0x1:(WRITE) flags 0xc800 phys_seg 127 prio class 0
[ 3105.819309] blk_update_request: I/O error, dev nvme0n1, sector 637355016 op 0x1:(WRITE) flags 0x8800 phys_seg 127 prio class 0
[ 3105.819317] blk_update_request: I/O error, dev nvme0n1, sector 1308443832 op 0x1:(WRITE) flags 0x8800 phys_seg 105 prio class 0
[ 3105.819340] nvidia-fs:write IO failed :-5
[ 3105.819365] nvidia-fs:write IO failed :-5
[ 3105.819591] blk_update_request: I/O error, dev nvme0n1, sector 344064 op 0x1:(WRITE) flags 0x8800 phys_seg 81 prio class 0
[ 3105.819625] nvidia-fs:write IO failed :-5
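
To summarize the dmesg lines above, a small parser can help. This is a minimal sketch, not an official tool: the regular expression is written only against the lines shown here, and the 512-byte unit reflects how the Linux block layer reports sectors.

```python
import errno
import re

# Lines copied from the dmesg output above (timestamps trimmed).
DMESG = """\
blk_update_request: I/O error, dev nvme0n1, sector 1979531264 op 0x1:(WRITE) flags 0x8800 phys_seg 17 prio class 0
blk_update_request: I/O error, dev nvme0n1, sector 1308442624 op 0x1:(WRITE) flags 0xc800 phys_seg 127 prio class 0
blk_update_request: I/O error, dev nvme0n1, sector 637353984 op 0x1:(WRITE) flags 0xc800 phys_seg 127 prio class 0
blk_update_request: I/O error, dev nvme0n1, sector 637355016 op 0x1:(WRITE) flags 0x8800 phys_seg 127 prio class 0
blk_update_request: I/O error, dev nvme0n1, sector 1308443832 op 0x1:(WRITE) flags 0x8800 phys_seg 105 prio class 0
blk_update_request: I/O error, dev nvme0n1, sector 344064 op 0x1:(WRITE) flags 0x8800 phys_seg 81 prio class 0
"""

PATTERN = re.compile(
    r"dev (?P<dev>[^,\s]+), sector (?P<sector>\d+) "
    r"op 0x[0-9a-f]+:\((?P<op>\w+)\) "
    r"flags (?P<flags>0x[0-9a-f]+) phys_seg (?P<segs>\d+)"
)
SECTOR_SIZE = 512  # the kernel block layer counts sectors in 512-byte units

failures = [m.groupdict() for m in PATTERN.finditer(DMESG)]
for f in failures:
    byte_offset = int(f["sector"]) * SECTOR_SIZE
    print(f["dev"], f["op"], "device byte offset", byte_offset, "segments", f["segs"])

# "write IO failed :-5" and gdsio's "errno :5" refer to the same error code.
print(errno.errorcode[5])
```

All six failed requests are writes to nvme0n1, and return code -5 corresponds to EIO, the same errno 5 that gdsio reports.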

In addition, here is the output of sudo nvme id-ns /dev/nvme0n1 -H:

sudo nvme id-ns /dev/nvme0n1 -H
NVME Identify Namespace 1:
nsze    : 0xe8e088b0
ncap    : 0xe8e088b0
nuse    : 0x5045c00
nsfeat  : 0
  [4:4] : 0	NPWG, NPWA, NPDG, NPDA, and NOWS are Not Supported
  [2:2] : 0	Deallocated or Unwritten Logical Block error Not Supported
  [1:1] : 0	Namespace uses AWUN, AWUPF, and ACWU
  [0:0] : 0	Thin Provisioning Not Supported

nlbaf   : 0
flbas   : 0
  [4:4] : 0	Metadata Transferred in Separate Contiguous Buffer
  [3:0] : 0	Current LBA Format Selected

mc      : 0
  [1:1] : 0	Metadata Pointer Not Supported
  [0:0] : 0	Metadata as Part of Extended Data LBA Not Supported

dpc     : 0
  [4:4] : 0	Protection Information Transferred as Last 8 Bytes of Metadata Not Supported
  [3:3] : 0	Protection Information Transferred as First 8 Bytes of Metadata Not Supported
  [2:2] : 0	Protection Information Type 3 Not Supported
  [1:1] : 0	Protection Information Type 2 Not Supported
  [0:0] : 0	Protection Information Type 1 Not Supported

dps     : 0
  [3:3] : 0	Protection Information is Transferred as Last 8 Bytes of Metadata
  [2:0] : 0	Protection Information Disabled

nmic    : 0
  [0:0] : 0	Namespace Multipath Not Capable

rescap  : 0
  [6:6] : 0	Exclusive Access - All Registrants Not Supported
  [5:5] : 0	Write Exclusive - All Registrants Not Supported
  [4:4] : 0	Exclusive Access - Registrants Only Not Supported
  [3:3] : 0	Write Exclusive - Registrants Only Not Supported
  [2:2] : 0	Exclusive Access Not Supported
  [1:1] : 0	Write Exclusive Not Supported
  [0:0] : 0	Persist Through Power Loss Not Supported

fpi     : 0x80
  [7:7] : 0x1	Format Progress Indicator Supported
  [6:0] : 0	Format Progress Indicator (Remaining 0%)

dlfeat  : 1
  [4:4] : 0	Guard Field of Deallocated Logical Blocks is set to 0xFFFF
  [3:3] : 0	Deallocate Bit in the Write Zeroes Command is Not Supported
  [2:0] : 0x1	Bytes Read From a Deallocated Logical Block and its Metadata are 0x00

nawun   : 0
nawupf  : 0
nacwu   : 0
nabsn   : 0
nabo    : 0
nabspf  : 0
noiob   : 0
nvmcap  : 2000398934016
nsattr	: 0
nvmsetid: 0
anagrpid: 0
endgid  : 1
nguid   : 00000000000000000000000000000000
eui64   : 0025384a21414e15
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)

[4:4] : 0 NPWG, NPWA, NPDG, NPDA, and NOWS are Not Supported
[2:2] : 0 Deallocated or Unwritten Logical Block error Not Supported
[1:1] : 0 Namespace uses AWUN, AWUPF, and ACWU
[0:0] : 0 Thin Provisioning Not Supported.

Do these nsfeat bits indicate that the problem stems from my NVMe SSD?
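
One sanity check that can be made on the id-ns output is whether the reported sizes agree with each other. The sketch below uses only values copied from that output; it is a rough consistency check, not a diagnosis.

```python
# Values copied from the nvme id-ns output above.
LBA_SIZE = 512          # "LBA Format 0 ... Data Size: 512 bytes ... (in use)"
nsze = 0xE8E088B0       # namespace size, in logical blocks
nuse = 0x5045C00        # namespace utilization, in logical blocks
nvmcap = 2000398934016  # NVM capacity, in bytes

namespace_bytes = nsze * LBA_SIZE
used_bytes = nuse * LBA_SIZE

print("namespace bytes:", namespace_bytes)
print("matches nvmcap:", namespace_bytes == nvmcap)
print("used GiB:", round(used_bytes / 1024**3, 1))
```

The namespace size and nvmcap agree for a 512-byte LBA format, so the reported geometry is self-consistent. On its own, this neither confirms nor rules out the SSD as the cause of the write errors.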

nvidia-fs 2.25.6 has a known bug with P2P mode:

  • Using nvidia-fs version 2.25.6 may cause cuFile API failures when GDS peer-to-peer (P2P) mode is enabled with the nvidia-fs kernel driver. We recommend upgrading to nvidia-fs version 2.25.7 to resolve this issue.

Use the latest nvidia-fs-dkms driver: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/nvidia-fs-dkms_2.25.7-1_amd64.deb

Thanks a lot for your reply. I was already using nvidia-fs 2.25.7 before. I have now also tried nvidia-fs 2.20, but the problem still exists. Here is my current nvidia-fs installation:

nvidia-fs-dkms/unknown,now 2.20.6-1 amd64 [installed,upgradable to: 2.25.7-1]
nvidia-fs/unknown,now 2.20.6-1 amd64 [installed,upgradable to: 2.25.7-1]

Here is my previous nvidia-fs:

nvidia-fs-dkms/unknown,now 2.25.7-1 amd64 [installed]
nvidia-fs/unknown,now 2.25.7-1 amd64 [installed]

@kmodukuri Sorry to bother you again, but do you have any solutions? I tried reinstalling the Ubuntu system, and now the NVMe device shows “unsupported”.