Hello,
I installed GDS on Ubuntu 20.04 and the installation seems to be correct. Here is my gdscheck -p output:
GDS release version: 1.14.1.1
nvidia_fs version: 2.25 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe P2PDMA : Unsupported
NVMe : Supported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
ScaTeFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Enabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_pci_p2pdma : true
properties.use_compat_mode : false
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.per_buffer_cache_size_kb : 1024
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 64
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.scatefs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: true
fs.gpfs.gds_async_support: true
profile.nvtx : true
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 4
execution.max_io_queue_depth : 128
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false
=========
GPU INFO:
=========
GPU index 0 NVIDIA RTX A4000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 12080
Platform: 30FNSBYL00, Arch: x86_64(Linux 5.15.0-139-generic)
Platform verification succeeded
I used gdsio to test GDS, but both the verified read (-I 0) and the write (-I 1) runs fail:
/usr/local/cuda-12.4/gds/tools/gdsio -f /media/ct/nvme/dd.txt -d 0 -w 4 -s 10G -i 1M -I 0 -x 0 -V
Error : files not created with -V mode or data verification failed at offset : 0xa0000000(2684354560) bs :1048576 failing index :0x1eff0 tid: 1
Actual Data:41414141414141414141414141414141
Expected Data:47445343484b3100887f0fa000000000
Error : files not created with -V mode or data verification failed at offset : 0x140000000(5368709120) bs :1048576 failing index :0x1f890 tid: 2
Actual Data:41414141414141414141414141414141
Expected Data:47445343484b320088c40f4001000000
Error : files not created with -V mode or data verification failed at offset : 0x0(0) bs :1048576 failing index :0x1efd0 tid: 0
Actual Data:41414141414141414141414141414141
Expected Data:47445343484b3000887e0f0000000000
Error : files not created with -V mode or data verification failed at offset : 0x1e0000000(8053063680) bs :1048576 failing index :0x1e670 tid: 3
Actual Data:41414141414141414141414141414141
Expected Data:47445343484b330088330fe001000000
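For reference, the hex dumps in the verification errors above can be decoded to ASCII with a short Python sketch. The helper name `decode_dump` is hypothetical and only for illustration. The two hex strings are copied from the tid 1 error. Reading the result as "the file holds a plain fill pattern rather than -V data" is my assumption. It is consistent with the "files not created with -V mode" wording, but it is not confirmed behaviour of gdsio.

```python
def decode_dump(hex_string: str) -> str:
    """Render a hex dump as ASCII, replacing non-printable bytes with '.'."""
    raw = bytes.fromhex(hex_string)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Values copied from the "tid: 1" error above.
actual = decode_dump("41414141414141414141414141414141")
expected = decode_dump("47445343484b3100887f0fa000000000")

print("actual  :", actual)    # sixteen 'A' characters
print("expected:", expected)  # starts with the ASCII tag "GDSCHK1"
```

The data on disk is all 0x41 ('A'). The expected block begins with what looks like a "GDSCHK" tag followed by a digit that matches the thread id in each error (1, 2, 0, 3).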
sudo /usr/local/cuda-12.4/gds/tools/gdsio -f /media/ct/nvme/dd.txt -d 0 -w 4 -s 10G -i 1M -I 1 -x 0 -V
write io failed of type 1 size: 1048576 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :8053063680, block size :1048576
write io failed of type 1 size: 1048576 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :ret :-5 errno :write io failed of type 1 size: 1048576 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :5, file offset :2684354560, block size :1048576
ret :-5 errno :5, file offset :5368709120, block size :1048576
write io failed of type 1 size: 1048576 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size :1048576
The problem is similar to: Nvidia GDS issue on HGX A100 - Dell PowerEdge XE9680 - #14 by 271732480. In my case, however, I made sure that VT-d is disabled in the BIOS and that the IOMMU is disabled, and the problem still exists.
Here is some of my environment:
Ubuntu 20.04, kernel: 5.15.0-139-generic
MLNX_OFED_LINUX-5.8-7.0.6.1
nvidia-driver 570.181
nvidia-gds-12-4
nvidia-fabricmanager-570/unknown,now 570.172.08-1 amd64
libnvidia-nscq-570/unknown,now 570.172.08-1 amd64
dmesg shows:
[ 3105.809681] blk_update_request: I/O error, dev nvme0n1, sector 1979531264 op 0x1:(WRITE) flags 0x8800 phys_seg 17 prio class 0
[ 3105.809735] nvidia-fs:write IO failed :-5
[ 3105.819265] blk_update_request: I/O error, dev nvme0n1, sector 1308442624 op 0x1:(WRITE) flags 0xc800 phys_seg 127 prio class 0
[ 3105.819285] blk_update_request: I/O error, dev nvme0n1, sector 637353984 op 0x1:(WRITE) flags 0xc800 phys_seg 127 prio class 0
[ 3105.819309] blk_update_request: I/O error, dev nvme0n1, sector 637355016 op 0x1:(WRITE) flags 0x8800 phys_seg 127 prio class 0
[ 3105.819317] blk_update_request: I/O error, dev nvme0n1, sector 1308443832 op 0x1:(WRITE) flags 0x8800 phys_seg 105 prio class 0
[ 3105.819340] nvidia-fs:write IO failed :-5
[ 3105.819365] nvidia-fs:write IO failed :-5
[ 3105.819591] blk_update_request: I/O error, dev nvme0n1, sector 344064 op 0x1:(WRITE) flags 0x8800 phys_seg 81 prio class 0
[ 3105.819625] nvidia-fs:write IO failed :-5