Hello,
I’ve been unsuccessful deploying DOCA + GDS on Red Hat 9.5 with kernel version 5.14.0-503.33.1.el9_5.x86_64. Any attempt to use gdscheck, gdsio, etc. fails with a "bad address" error from the kernel, which also propagates up to the application. I used the kernel support package provided by doca-extra to generate the kernel modules for this host (8x A6000). (As a wishlist item, it would be terrific if that process could be wrapped in DKMS.) The kernel log and the gdsio output are below:
[ +0.000001] nvidia-fs:rw_verify_area failed with -14
[ +0.000011] nvidia-fs:read IO failed :-14
[root@gpu0300 ~]# /usr/local/cuda-12.9/gds/tools/gdsio -f /mnt/nvme/test -d 0 -w 4 -s 1G -x 0 -i 4K:32K:1K -I 0
io failed of type 0 size: 4096 , ret: 0
failed to submit io of type 0 ret: -5
Error: IO failed stopping traffic, fd :207 ret:-5 errno :14
io failed :ret :-5 errno :14, file offset :0, block size :4096
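For reference, the numeric codes in the output above can be looked up with Python's standard errno module. This is only a lookup sketch, not a diagnosis: it assumes the values are negated Linux errno codes, which is the usual kernel convention, and the helper name describe is made up for illustration. On that assumption, -14 is EFAULT (the "bad address" error) and -5 is EIO.

```python
import errno
import os


def describe(ret):
    """Map a negative kernel-style return code to (symbolic name, message).

    Assumes `ret` is a negated errno value, e.g. -14 for EFAULT.
    """
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Codes taken from the dmesg and gdsio output above.
for ret in (-14, -5):
    name, message = describe(ret)
    print(f"{ret}: {name} ({message})")
```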
From what I can see, all of the kernel modules have loaded successfully, and the platform check reports that things are OK:
[root@gpu0300 ~]# /usr/local/cuda-12.6/gds/tools/gdscheck.py -p
GDS release version: 1.11.1.6
nvidia_fs version: 2.25 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Enabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
<snip>
GPU index 0 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
GPU index 1 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
GPU index 2 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
GPU index 3 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
GPU index 4 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
GPU index 5 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
GPU index 6 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
GPU index 7 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 12060
Platform: G482-Z54-00, Arch: x86_64(Linux 5.14.0-503.33.1.el9_5.x86_64)
Platform verification succeeded
[root@gpu0300 ~]# lsmod | grep -e nv -e mln
nvidia_uvm 6918144 4
nvidia_drm 126976 0
nvidia_modeset 1556480 9 nvidia_drm
video 73728 1 nvidia_modeset
drm_kms_helper 274432 3 ast,nvidia_drm
nvidia_peermem 20480 0
nvidia_fs 323584 0
nvidia 9773056 66 nvidia_uvm,nvidia_peermem,nvidia_fs,nvidia_modeset
drm 782336 6 drm_kms_helper,ast,drm_shmem_helper,nvidia,nvidia_drm
ib_uverbs 229376 3 nvidia_peermem,rdma_ucm,mlx5_ib
nvme 73728 7
nvme_core 233472 8 nvme
nvme_auth 28672 1 nvme_core
mlx_compat 20480 14 rdma_cm,ib_ipoib,mlxdevm,nvme,mlxfw,iw_cm,nvme_core,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
t10_pi 20480 1 nvme_core
[root@gpu0300 ~]# /usr/local/cuda-12.9/gds/tools/gdscheck.py -V
FILESYSTEM VERSION CHECK:
Pre-requisite:
nvidia_peermem is loaded as required
GDS mode is enabled.
nvme module is loaded
nvme module is correctly patched
nvme-rdma module is not loaded
ScaleFlux module is not loaded
NVMesh module is not loaded
Lustre module is not loaded
BeeGFS module is not loaded
GPFS module is not loaded
rpcrdma module is not loaded
ofed_info:
current version: OFED-internal-25.04-0.6.1: (Supported)
min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
Does anything stand out that would explain why I am receiving these errors?
Thanks!