Bad Address error when using GDS

Hello,

I’ve been unsuccessful deploying DOCA + GDS on Red Hat 9.5 with kernel version 5.14.0-503.33.1.el9_5.x86_64. Any attempt to use gdscheck, gdsio, etc. fails with a bad address error from the kernel, which also propagates up to the application. I used the doca-extra-provided kernel support package to generate the kernel modules for this host (8x A6000). (As a wishlist item, it would be terrific if that process could be wrapped in DKMS.)

[  +0.000001] nvidia-fs:rw_verify_area failed with -14
[  +0.000011] nvidia-fs:read IO failed :-14

[root@gpu0300 ~]# /usr/local/cuda-12.9/gds/tools/gdsio -f /mnt/nvme/test -d 0 -w 4 -s 1G -x 0 -i 4K:32K:1K -I 0
io failed of type 0 size: 4096 , ret: 0 
failed to submit io of type 0 ret: -5 
Error: IO failed stopping traffic, fd :207 ret:-5 errno :14
io failed :ret :-5 errno :14, file offset :0, block size  :4096

From what I can see, all of the kernel modules have loaded successfully, and the platform test reports that everything is OK:

[root@gpu0300 ~]# /usr/local/cuda-12.6/gds/tools/gdscheck.py -p
 GDS release version: 1.11.1.6
 nvidia_fs version:  2.25 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
<snip>
 GPU index 0 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 2 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 3 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 4 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 5 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 6 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 GPU index 7 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12060
 Platform: G482-Z54-00, Arch: x86_64(Linux 5.14.0-503.33.1.el9_5.x86_64)
 Platform verification succeeded

[root@gpu0300 ~]# lsmod | grep -e nv -e mln
nvidia_uvm           6918144  4
nvidia_drm            126976  0
nvidia_modeset       1556480  9 nvidia_drm
video                  73728  1 nvidia_modeset
drm_kms_helper        274432  3 ast,nvidia_drm
nvidia_peermem         20480  0
nvidia_fs             323584  0
nvidia               9773056  66 nvidia_uvm,nvidia_peermem,nvidia_fs,nvidia_modeset
drm                   782336  6 drm_kms_helper,ast,drm_shmem_helper,nvidia,nvidia_drm
ib_uverbs             229376  3 nvidia_peermem,rdma_ucm,mlx5_ib
nvme                   73728  7
nvme_core             233472  8 nvme
nvme_auth              28672  1 nvme_core
mlx_compat             20480  14 rdma_cm,ib_ipoib,mlxdevm,nvme,mlxfw,iw_cm,nvme_core,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
t10_pi                 20480  1 nvme_core

[root@gpu0300 ~]# /usr/local/cuda-12.9/gds/tools/gdscheck.py -V
FILESYSTEM VERSION CHECK:
Pre-requisite:
nvidia_peermem is loaded as required
GDS mode is enabled.
nvme module is loaded
nvme module is correctly patched
nvme-rdma module is not loaded
ScaleFlux module is not loaded
NVMesh module is not loaded
Lustre module is not loaded
BeeGFS module is not loaded
GPFS module is not loaded
rpcrdma module is not loaded
ofed_info:
current version: OFED-internal-25.04-0.6.1: (Supported)
min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1

Is there anything that stands out about why I would be receiving these errors?

Thanks!

Can you please check whether you are using version 2.25.6 of nvidia-fs (cat /proc/driver/nvidia-fs/stats)? If so, please update to 2.25.7 directly from GitHub, since there was a recent regression in that version.
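If it helps, here is a hedged sketch of how a provisioning script might gate on the affected release by parsing the version line from the stats file. The sample string below is copied from this thread rather than read from a live host; on a real machine you would read /proc/driver/nvidia-fs/stats instead.

```shell
# Extract the NVFS driver version from a stats line and flag the
# known-bad 2.25.6 release. On a live host, replace the sample with:
#   stats=$(cat /proc/driver/nvidia-fs/stats)
stats='NVFS Driver(version: 2.25.6)'
ver=$(printf '%s\n' "$stats" | sed -n 's/.*NVFS Driver(version: \([0-9.]*\)).*/\1/p')
if [ "$ver" = "2.25.6" ]; then
    echo "nvidia-fs $ver is affected; update to 2.25.7"
fi
```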

I certainly am…

[root@gpu0300 ~]# cat /proc/driver/nvidia-fs/stats
GDS Version: 1.14.0.28 
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.25.6)
Mellanox PeerDirect Supported: True
IO stats: Disabled, peer IO stats: Disabled
Logging level: info

Let me build/install from GitHub and report back. Thanks for the response!

Hi @sougupta - thanks for the pointer. With the updated version installed via DKMS (after manually uninstalling the rpm-installed one), GDS works:

[root@gpu0300 ~]# /usr/local/cuda-12.9/gds/tools/gdsio -f /mnt/nvme/test -d 0 -w 4 -s 1G -x 0 -i 4K:32K:1K -I 0
IoType: READ XferType: GPUD Threads: 4 DataSetSize: 259/1024(KiB) IOSize: 4-32-1(KiB) Throughput: 0.066757 GiB/sec, Avg_Latency: 4296.000000 usecs ops: 22 total_time 0.003700 secs

[root@gpu0300 ~]# dkms status
Deprecated feature: REMAKE_INITRD (/var/lib/dkms/nvidia-fs/2.25.6/source/dkms.conf)
Deprecated feature: REMAKE_INITRD (/var/lib/dkms/nvidia-fs/2.7/source/dkms.conf)
nvidia-fs/2.25.6, 5.14.0-503.33.1.el9_5.x86_64, x86_64: built
nvidia-fs/2.7, 5.14.0-503.33.1.el9_5.x86_64, x86_64: installed (Differences between built and installed modules)
nvidia-open/560.35.05, 5.14.0-503.33.1.el9_5.x86_64, x86_64: installed
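For reference, the manual flow I used was roughly the following. This is a sketch, not official instructions: the repository URL, source path, and version string are assumptions based on this thread, so check the nvidia-fs README for the authoritative steps. With DRY_RUN=1 (the default here) it only prints the commands for review.

```shell
# Hedged sketch of the manual DKMS flow: remove the rpm-installed
# module, then register/build/install 2.25.7 from source via DKMS.
# Repo URL, source path, and version are assumptions; verify first.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run git clone https://github.com/NVIDIA/gds-nvidia-fs.git
run rpm -e nvidia-fs                # drop the rpm-installed 2.25.6
run dkms add ./gds-nvidia-fs/src    # registers nvidia-fs/2.25.7
run dkms build nvidia-fs/2.25.7
run dkms install nvidia-fs/2.25.7
run modprobe nvidia-fs
```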

What is the timescale for pushing out a new version of the RPMs with the fix? I can attempt to change our Ansible to do the DKMS orchestration, but it would be much simpler to wait for an RPM. Alternatively, is there a previous “known-good” version it makes sense to revert to?

Thanks!

I’m a bit curious: if I check the commit history, it appears there are no changes except for the access_ok check in configure. Was that the root cause of the error I mentioned above?

Yes, that was the root cause, and it is fixed in 12.9 Update 1, which has now been released as well.

@sougupta I believe I’m facing the same issue.

 cat /proc/driver/nvidia-fs/stats
GDS Version: 1.14.0.33 
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.25.7)
Mellanox PeerDirect Supported: False
IO stats: Disabled, peer IO stats: Disabled
Logging level: info

Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads                           : err=0 io_state_err=0
Sparse Reads                    : n=0 io=0 holes=0 pages=0 
Writes                          : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap                            : n=2 ok=2 err=0 munmap=2
Bar1-map                        : n=2 ok=0 err=2 free=0 callbacks=0 active=0 delay-frees=0
Error                           : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops                             : Read=0 Write=0 BatchIO=0

I installed using DOCA. gdscheck reports:

 GDS release version: 1.14.1.1
 nvidia_fs version:  2.25 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe P2PDMA        : Unsupported
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 ScaTeFS            : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_pci_p2pdma : false
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 1024
 properties.max_device_cache_size_kb : 131072
 properties.per_buffer_cache_size_kb : 1024
 properties.max_device_pinned_mem_size_kb : 18014398509481980
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 64 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.scatefs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 fs.gpfs.gds_async_support: true
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 0
 execution.max_io_queue_depth : 128
 execution.parallel_io : false
 execution.min_io_threshold_size_kb : 1024
 execution.max_request_parallelism : 0
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA RTX A4000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12090
 Platform: S2600STB, Arch: x86_64(Linux 6.11.11+)
 Platform verification succeeded

And in dmesg I see:

[ +38.314705] nvidia-fs:nvfs_mgroup_pin_shadow_pages:397 Unable to pin shadow buffer pages 32 ret= -14
[  +0.009145] nvidia-fs:nvfs_map:1509 Error nvfs_setup_shadow_buffer
[  +0.007312] nvidia-fs:nvfs_mgroup_pin_shadow_pages:397 Unable to pin shadow buffer pages 256 ret= -14
[  +0.009223] nvidia-fs:nvfs_map:1509 Error nvfs_setup_shadow_buffer

Can you please help me out here?

@utkarsh02t This is a separate issue on kernel 6.11 and onwards. It is being worked on, and the fix will be posted to GitHub soon. If possible, please use an older kernel to make progress.


@sougupta, your response is much appreciated.
Which repository should I keep a watch on?
