Hello,
I recently upgraded my system to kernel 6.11 on Ubuntu 24.04.2.
Before upgrading, I had mlnx-ofed installed and could run the cufile_sample_* programs without any issues.
After the upgrade, I see:
./cufile_sample_001 /mnt/nvme/bar 0
opening file /mnt/nvme/bar
registering device memory of size :131072
writing from device memory
write failed : Operation not permitted
deregistering device memory
buffer deregister failed:device pointer lookup failure
I thought I needed an upgrade, so I uninstalled everything, including the drivers and mlnx-ofed.
I then installed doca-extra and doca-ofed and, after some more setup, got NVMe reported as supported again.
Here is my complete gdscheck.py -p output:
GDS release version: 1.13.1.3
nvidia_fs version: 2.24 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe P2PDMA : Unsupported
NVMe : Supported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_pci_p2pdma : true
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 64
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
fs.gpfs.gds_async_support: true
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 4
execution.max_io_queue_depth : 128
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false
=========
GPU INFO:
=========
GPU index 0 NVIDIA RTX A4000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 12080
Platform: S2600STB, Arch: x86_64(Linux 6.11.0-19-generic)
Platform verification succeeded
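To compare this output against a run from before the kernel upgrade, the DRIVER CONFIGURATION block can be turned into a dictionary and diffed. This is only a minimal sketch in Python: the function name parse_driver_config is hypothetical (it is not part of gdscheck.py), and it assumes the "name : status" line layout shown above.

```python
def parse_driver_config(text):
    """Return {name: status} for the lines of the DRIVER CONFIGURATION block.

    Hypothetical helper, not part of gdscheck.py. It assumes the
    'name : status' layout seen in the pasted gdscheck.py -p output.
    """
    status = {}
    in_block = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "DRIVER CONFIGURATION:":
            in_block = True
            continue
        if stripped == "CUFILE CONFIGURATION:":
            break  # the next section has started
        # Skip everything before the block, blank lines, and '=====' rulers.
        if not in_block or set(stripped) <= {"="}:
            continue
        name, sep, value = stripped.partition(" : ")
        if sep:
            # Sub-entries are prefixed with '--'; drop the prefix.
            status[name.lstrip("-").strip()] = value.strip()
    return status


sample = """\
=====================
DRIVER CONFIGURATION:
=====================
NVMe P2PDMA : Unsupported
NVMe : Supported
NVMeOF : Unsupported
--Mellanox PeerDirect : Disabled
=====================
CUFILE CONFIGURATION:
=====================
properties.use_pci_p2pdma : true
"""

config = parse_driver_config(sample)
print(config["NVMe"], config["NVMe P2PDMA"])
```

Running the same function on output saved before and after the upgrade and comparing the two dictionaries would show which entries changed.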
Here is cufile.log after running ./cufile_sample_001 /mnt/nvme/bar 0 with the logging level set to TRACE:
cufile.log (74.1 KB)
The NVMe drive is mounted with data=ordered. The fstab entry is:
/dev/nvme0n1p1 /mnt/nvme ext4 defaults,data=ordered 0 2
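For completeness, here is a small Python sketch of how the mount options in the fstab entry above can be checked programmatically. The helper name fstab_options is hypothetical. The fstab entry only shows what is configured; lines in /proc/mounts use the same field order, so the same helper could be pointed at the live mount table.

```python
def fstab_options(line):
    """Return the mount options (4th field) of one fstab-style line as a list.

    Hypothetical helper for illustration only.
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 whitespace-separated fields")
    return fields[3].split(",")


# The entry quoted in this post.
entry = "/dev/nvme0n1p1 /mnt/nvme ext4 defaults,data=ordered 0 2"
opts = fstab_options(entry)
print("data=ordered" in opts)
```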
It used to work, and I have no idea why it broke.
There are similar issues [1] [2], but both were caused by using a GPU that is not a Quadro or Tesla. I believe the A4000 is supported.