Failing to use GDS on A4000

Hello,
I recently upgraded my system to kernel 6.11 on Ubuntu 24.04.2.
Before the upgrade, I had MLNX_OFED installed and could run the cufile_sample_* programs without any issues.

But after upgrading, I see:

./cufile_sample_001 /mnt/nvme/bar 0
opening file /mnt/nvme/bar
registering device memory of size :131072
writing from device memory
write failed : Operation not permitted
deregistering device memory
buffer deregister failed:device pointer lookup failure
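
In case it helps with triage: whether nvidia_fs is actually loaded for the new kernel, and what it reports, can be checked with something like this (a generic sanity check; the stats path assumes the usual nvidia-fs proc interface):

# is the module loaded, and was it built against this kernel?
lsmod | grep nvidia_fs
modinfo nvidia_fs | grep -E 'version|vermagic'
# per-driver counters exposed by nvidia-fs
cat /proc/driver/nvidia-fs/stats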

I thought I needed an upgrade, so I uninstalled everything: the drivers, mlnx-ofed, all of it.
I then installed doca-extra and doca-ofed, did some further setup, and gdscheck again reported NVMe as supported.

Here is my complete gdscheck.py -p output:

GDS release version: 1.13.1.3
 nvidia_fs version:  2.24 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe P2PDMA        : Unsupported
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_pci_p2pdma : true
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 64 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 fs.gpfs.gds_async_support: true
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA RTX A4000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12080
 Platform: S2600STB, Arch: x86_64(Linux 6.11.0-19-generic)
 Platform verification succeeded

Here is cufile.log after running ./cufile_sample_001 /mnt/nvme/bar 0 with the logging level set to TRACE.
cufile.log (74.1 KB)

The NVMe partition is mounted with data=ordered; the fstab entry is:
/dev/nvme0n1p1 /mnt/nvme ext4 defaults,data=ordered 0 2
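
The effective mount options can be double-checked with findmnt (a generic check against this mountpoint; note that data=ordered may not appear in the output when it is the filesystem default):

# confirm the filesystem type and the effective mount options
findmnt -no FSTYPE,OPTIONS /mnt/nvme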

It used to work; I have no idea why it broke.

There are similar issues [1] [2], but those are both caused by using a non-Quadro/non-Tesla GPU, and I believe the A4000 is supported.

Hi,
Thanks for sharing cufile.log. The error seems to be coming from the nvidia_fs driver. Just to be sure, I would suggest recompiling the nvidia_fs driver and loading it on the newly upgraded kernel. If you still see the issue, please run the following and share the dmesg output and the /var/log/kern.log file.

echo 3 | sudo tee /sys/module/nvidia_fs/parameters/dbg_enabled
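
For reference, the recompile amounts to roughly the following (a sketch based on the NVIDIA/gds-nvidia-fs README; exact paths and the module file name may differ between releases):

# fetch and build the nvidia-fs kernel module from source
git clone https://github.com/NVIDIA/gds-nvidia-fs.git
cd gds-nvidia-fs/src
export CONFIG_MOFED_VERSION=$(ofed_info -s | cut -d '-' -f 2)
make
# swap the running module for the freshly built one
sudo rmmod nvidia_fs
sudo insmod nvidia-fs.ko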

Hey, I removed nvidia-fs and reinstalled it. I still have no success :(
Here are my files after enabling your debugging command.
mydmesg.log (196.3 KB)
kern.log (3.0 MB)

@rmitra Sorry for the ping, but do you have any further tips?

Just making sure: did you recompile the driver before installing it? Here is the GitHub link for the driver: https://github.com/NVIDIA/gds-nvidia-fs (NVIDIA GPUDirect Storage Driver).

@rmitra
Yes I did, it did not work :(
After running ./cufile_sample_001 /mnt/nvme/bar 0
mydmesg.log (48.4 KB)
cufile.log (75.6 KB)

Also, one thing I observed:
export CONFIG_MOFED_VERSION=$(ofed_info -s | cut -d '-' -f 2) [from the nvidia_fs build guide]
This results in:

$ echo $CONFIG_MOFED_VERSION
internal

Is this supposed to be correct? I installed doca-ofed using doca-extra.
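
For what it's worth, that cut simply takes the second '-'-separated field of whatever ofed_info -s prints, so the result depends entirely on the shape of the version string. For illustration (both strings below are made-up examples, not output from this system):

$ echo 'MLNX_OFED_LINUX-5.8-2.0.3.0' | cut -d '-' -f 2
5.8
$ echo 'OFED-internal-24.07-0.6.1' | cut -d '-' -f 2
internal

So "internal" just means the second dash-separated field of my ofed_info output is the word internal; whether the nvidia_fs build expects that value is the real question.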

Also, thanks a lot for your time and response.