NVMeoF RDMA over IB intermittently fails with CQE local protection error 4 during gdsio cuFileWrite

Hi,

I have run into a subtle issue with gdsio writes against NVMe-oF RDMA drives. Intermittently, the mlx5 driver logs the following error, which forces the NVMe controller to recover and causes gdsio to fail.

[170519.273396] infiniband mlx5_0: dump_cqe:277:(pid 0): WC error: 4, Message: local protection error
[170519.273402] cqe_dump: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[170519.273404] cqe_dump: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[170519.273405] cqe_dump: 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[170519.273406] cqe_dump: 00000030: 00 00 00 00 02 00 51 04 00 00 07 dc 00 01 43 e2
[170519.273412] nvme nvme12: RECV for CQE 0x0000000050e94c54 failed with status local protection error (4)
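
As a reading aid, here is a minimal Python sketch that decodes the 64-byte error CQE from the dump above. The field offsets are an assumption based on my reading of `struct mlx5_err_cqe` in the Linux kernel's `include/linux/mlx5/device.h` and should be checked against the headers of the running kernel. The helper names `parse_cqe_dump` and `decode_err_cqe` are made up for this example.

```python
import re

# The four cqe_dump lines copied from the kernel log above.
CQE_DUMP = """\
[170519.273402] cqe_dump: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[170519.273404] cqe_dump: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[170519.273405] cqe_dump: 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[170519.273406] cqe_dump: 00000030: 00 00 00 00 02 00 51 04 00 00 07 dc 00 01 43 e2
"""

# Matches "cqe_dump: <8-hex-digit offset>: <hex bytes>" and captures the bytes.
DUMP_LINE = re.compile(r"cqe_dump:\s*[0-9a-fA-F]{8}:\s*((?:[0-9a-fA-F]{2}\s*)+)$")


def parse_cqe_dump(text):
    """Concatenate the hex bytes from every cqe_dump line into one bytes object."""
    data = bytearray()
    for line in text.splitlines():
        match = DUMP_LINE.search(line)
        if match:
            data += bytes.fromhex(match.group(1))
    return bytes(data)


def decode_err_cqe(cqe):
    """Decode the tail of an error CQE.

    Assumed layout (struct mlx5_err_cqe): byte 54 vendor_err_synd,
    byte 55 syndrome, bytes 56-59 s_wqe_opcode_qpn (big endian, QPN in the
    low 24 bits), bytes 60-61 wqe_counter, byte 62 signature, byte 63 op_own
    (CQE opcode in the high nibble).
    """
    if len(cqe) != 64:
        raise ValueError(f"expected a 64-byte CQE, got {len(cqe)} bytes")
    return {
        "vendor_err_synd": cqe[54],
        "syndrome": cqe[55],
        "qpn": int.from_bytes(cqe[56:60], "big") & 0xFFFFFF,
        "wqe_counter": int.from_bytes(cqe[60:62], "big"),
        "cqe_opcode": cqe[63] >> 4,
    }


fields = decode_err_cqe(parse_cqe_dump(CQE_DUMP))
for name, value in fields.items():
    print(f"{name}: {value:#x}")
```

Under that assumed layout, the syndrome byte is 0x04, which matches the "local protection error" text in the log. The QPN field is 0x7dc and the vendor error syndrome is 0x51. The CQE opcode nibble is 0xe, which I believe is the responder-error opcode in the same header, consistent with the failed RECV reported by the nvme driver.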

Hardware/Software configuration

  • Dell XE9680 with 8 H100 GPUs, 4 CX-7 IB NICs
  • IOMMU is off; ACS is disabled
  • DMA-BUF is used (nvidia-peermem is not loaded)
  • RHEL9.5, kernel: 5.14.0-503.40.1.el9_5.x86_64
  • DOCA-OFED: MLNX_OFED_LINUX-25.07-0.9.7.0, CX-7 fw 28.46.3048
  • CUDA 13.0, Driver version: 580.65.06
  • GDS version: 1.15.0.42; nvidia_fs version: 2.26; libcufile version: 2.12

Output of “gdscheck.py -p”

GDS release version: 1.15.0.42
nvidia_fs version: 2.26 libcufile version: 2.12
Platform: x86_64

ENVIRONMENT:

CUFILE_ENV_PATH_JSON : /root/xzhou/LITCGIO/GDSIO/cufile.json

DRIVER CONFIGURATION:

NVMe P2PDMA : Unsupported
NVMe : Supported
NVMeOF : Supported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
ScaTeFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0

CUFILE CONFIGURATION:

properties.use_pci_p2pdma : true
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 1048576
properties.per_buffer_cache_size_kb : 4096
properties.max_device_pinned_mem_size_kb : 16777216
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 128
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : true
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.scatefs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
fs.gpfs.gds_async_support: true
profile.nvtx : false
profile.cufile_stats : 1
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 32
execution.max_io_queue_depth : 256
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false

GPU INFO:

GPU index 0 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 1 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 2 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 3 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 4 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 5 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 6 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 7 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled

PLATFORM INFO:

IOMMU: disabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 13000
Platform: PowerEdge XE9680, Arch: x86_64(Linux 5.14.0-503.40.1.el9_5.x86_64)
Platform verification succeeded

I don’t see this CQE error when I run fio with the libaio engine against the same set of NVMe-oF drives.
Any idea what may cause this issue?

Thanks

Here are more details regarding the CQE 0x4 error.

Native NVMe multipath is enabled with the round-robin load-balancing policy. The test system is configured with four paths (two per socket over InfiniBand). The CQE error is observed only during cuFileWrite operations (RDMA READ), and it appears when the load-balancing policy is switched from numa to round-robin.
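
For anyone trying to reproduce this, here is a minimal Python sketch of how the multipath policy can be inspected and switched through sysfs. It assumes native NVMe multipath exposes one `iopolicy` file per subsystem under `/sys/class/nvme-subsystem/`. The function names and the `nvme-subsys0` name in the usage note are illustrative, and writing the file requires root.

```python
from pathlib import Path

# Assumed sysfs location of the per-subsystem multipath policy files.
SYSFS_ROOT = Path("/sys/class/nvme-subsystem")


def read_iopolicies(root=SYSFS_ROOT):
    """Return {subsystem name: current iopolicy} for every NVMe subsystem found."""
    policies = {}
    for policy_file in sorted(Path(root).glob("nvme-subsys*/iopolicy")):
        policies[policy_file.parent.name] = policy_file.read_text().strip()
    return policies


def set_iopolicy(subsystem, policy, root=SYSFS_ROOT):
    """Write a new policy (for example "numa" or "round-robin") for one subsystem."""
    (Path(root) / subsystem / "iopolicy").write_text(policy + "\n")
```

Calling `read_iopolicies()` before and after a gdsio run shows which policy was active when the error appeared, and `set_iopolicy("nvme-subsys0", "numa")` switches a subsystem back to the policy under which the error has not been seen.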

Notably, this issue cannot be reproduced on another test system that has two CX-6 adapters using RoCEv2.

Question:
Does GDS support NVMe-oF RDMA with multipath round-robin or queue-depth–based load-balancing policies?