Hi,
I ran into a subtle issue with gdsio writes against NVMe-oF RDMA drives. Intermittently, the mlx5 driver logs the following error, which forces the NVMe controller to recover and causes gdsio to fail.
[170519.273396] infiniband mlx5_0: dump_cqe:277:(pid 0): WC error: 4, Message: local protection error
[170519.273402] cqe_dump: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[170519.273404] cqe_dump: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[170519.273405] cqe_dump: 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[170519.273406] cqe_dump: 00000030: 00 00 00 00 02 00 51 04 00 00 07 dc 00 01 43 e2
[170519.273412] nvme nvme12: RECV for CQE 0x0000000050e94c54 failed with status local protection error (4)
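For reference, here is a minimal sketch of how the log lines above can be decoded. It is an illustration, not a diagnostic tool. The status names follow the first entries of the verbs work-completion status enum, where 4 is the local protection error. The byte offsets for the vendor syndrome (0x36) and syndrome (0x37) are an assumption based on the `mlx5_err_cqe` layout in the Linux kernel headers, and the helper names (`parse_wc_error`, `parse_cqe_dump`, `error_syndromes`) are made up for this example.

```python
import re

# First entries of the verbs work-completion status enum (ibv_wc_status / ib_wc_status).
WC_STATUS = {
    0: "SUCCESS",
    1: "LOC_LEN_ERR",
    2: "LOC_QP_OP_ERR",
    3: "LOC_EEC_OP_ERR",
    4: "LOC_PROT_ERR",
    5: "WR_FLUSH_ERR",
}

WC_RE = re.compile(r"WC error: (\d+), Message: (.+)$")
DUMP_RE = re.compile(r"cqe_dump: ([0-9a-f]{8}): ((?:[0-9a-f]{2} ?){16})")

# Sample lines copied from the dmesg output in this post.
LOG = """\
[170519.273396] infiniband mlx5_0: dump_cqe:277:(pid 0): WC error: 4, Message: local protection error
[170519.273402] cqe_dump: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[170519.273404] cqe_dump: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[170519.273405] cqe_dump: 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[170519.273406] cqe_dump: 00000030: 00 00 00 00 02 00 51 04 00 00 07 dc 00 01 43 e2
""".splitlines()


def parse_wc_error(line):
    """Return (code, enum name, driver message) or None if the line does not match."""
    m = WC_RE.search(line)
    if not m:
        return None
    code = int(m.group(1))
    return code, WC_STATUS.get(code, "UNKNOWN"), m.group(2).strip()


def parse_cqe_dump(lines):
    """Concatenate the hex bytes of all cqe_dump rows, in the order they appear."""
    cqe = bytearray()
    for line in lines:
        m = DUMP_RE.search(line)
        if m:
            cqe += bytes.fromhex(m.group(2).replace(" ", ""))
    return bytes(cqe)


def error_syndromes(cqe):
    """Return (vendor_err_synd, syndrome).

    Assumption: offsets 0x36 and 0x37 per the kernel's mlx5_err_cqe struct.
    """
    if len(cqe) < 64:
        raise ValueError("expected a full 64-byte CQE dump")
    return cqe[0x36], cqe[0x37]


if __name__ == "__main__":
    print(parse_wc_error(LOG[0]))
    vendor, syndrome = error_syndromes(parse_cqe_dump(LOG))
    print(f"vendor_err_synd=0x{vendor:02x} syndrome=0x{syndrome:02x}")
```

Under that layout assumption, the syndrome byte in the dump agrees with the WC error code reported by the driver.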
Hardware/Software configuration
- Dell XE9680 with 8 H100 GPUs, 4 CX-7 IB NICs
- IOMMU is off; ACS is disabled
- DMA-BUF is used (nvidia-peermem is not loaded)
- RHEL9.5, kernel: 5.14.0-503.40.1.el9_5.x86_64
- DOCA-OFED: MLNX_OFED_LINUX-25.07-0.9.7.0, CX-7 fw 28.46.3048
- CUDA 13.0, Driver version: 580.65.06
- GDS version: 1.15.0.42; nvidia_fs version: 2.26; libcufile version: 2.12
Output of “gdscheck.py -p”
GDS release version: 1.15.0.42
nvidia_fs version: 2.26 libcufile version: 2.12
Platform: x86_64
ENVIRONMENT:
CUFILE_ENV_PATH_JSON : /root/xzhou/LITCGIO/GDSIO/cufile.json
DRIVER CONFIGURATION:
NVMe P2PDMA : Unsupported
NVMe : Supported
NVMeOF : Supported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
ScaTeFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
CUFILE CONFIGURATION:
properties.use_pci_p2pdma : true
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 1048576
properties.per_buffer_cache_size_kb : 4096
properties.max_device_pinned_mem_size_kb : 16777216
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 128
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : true
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.scatefs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
fs.gpfs.gds_async_support: true
profile.nvtx : false
profile.cufile_stats : 1
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 32
execution.max_io_queue_depth : 256
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false
GPU INFO:
GPU index 0 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 1 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 2 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 3 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 4 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 5 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 6 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 7 NVIDIA H100 80GB HBM3 bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
PLATFORM INFO:
IOMMU: disabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 13000
Platform: PowerEdge XE9680, Arch: x86_64(Linux 5.14.0-503.40.1.el9_5.x86_64)
Platform verification succeeded
I don’t see this CQE error when I run fio with the libaio engine against the same set of NVMe-oF drives.
Any idea what may cause this issue?
Thanks