GDS on NVMe-oF (RDMA) reports "No matching pair for network device to closest GPU" although RDMA devices are up

Hello, all

I’m unable to get GPUDirect Storage (GDS) to use the direct RDMA path on a non-DGX server with NVMe-oF over RDMA.
The GPU and IB NICs are NUMA-local and the IOMMU is disabled, but I keep seeing the error below whenever I run gdsio.

Environment

  • GPU: NVIDIA Tesla T4 (0000:af:00.0, NUMA node 1)
  • NICs: Mellanox ConnectX-4
    • mlx5_0 (netdev ibs3): 192.168.1.101
    • mlx5_1 (netdev ibs5): 192.168.1.201
  • Software:
    • GDS 1.7.0.149
    • nvidia_fs 2.16
    • libcufile 2.12
    • MLNX_OFED installed
  • Kernel driver: mlx5_core
  • IOMMU: Disabled (cat /proc/cmdline confirms intel_iommu=off)
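For reference, this is roughly how I verified the NUMA affinity claimed above (standard sysfs paths; the PCI address and HCA names are from this box, and the snippet falls back to "unknown" where a path is absent):

```shell
# NUMA node of the T4 (PCI 0000:af:00.0) and of each ConnectX-4 HCA.
# A value of -1 would mean the kernel reports no NUMA affinity.
gpu_numa=$(cat /sys/bus/pci/devices/0000:af:00.0/numa_node 2>/dev/null || echo unknown)
echo "GPU 0000:af:00.0 NUMA node: $gpu_numa"
for hca in mlx5_0 mlx5_1; do
    nic_numa=$(cat /sys/class/infiniband/$hca/device/numa_node 2>/dev/null || echo unknown)
    echo "$hca NUMA node: $nic_numa"
done
```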

ibstat Output

CA 'mlx5_0'
    State: Active
    Physical state: LinkUp
    Rate: 100
    Link layer: InfiniBand

CA 'mlx5_1'
    State: Active
    Physical state: LinkUp
    Rate: 100
    Link layer: InfiniBand

Test Command

./gdsio -f /dev/nvme0n1 -d 0 -w 4 -s 100G -i 1M -I 0 -x 0
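For clarity, my reading of the flags (as I understand them from `gdsio -h`; printed rather than executed here, since the target device is specific to this box):

```shell
# -f  target file/device (here the raw NVMe-oF namespace)
# -d  GPU index (0 = the T4)
# -w  number of worker threads
# -s  file/workload size
# -i  per-I/O transfer size
# -I  I/O type: 0 = sequential read
# -x  transfer type: 0 = GDS (GPU Direct) path
cmd='./gdsio -f /dev/nvme0n1 -d 0 -w 4 -s 100G -i 1M -I 0 -x 0'
echo "$cmd"
```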

Error (in cufile.log)

ERROR  cufio-dr:226 No matching pair for network device to closest GPU found in the platform

GDS Check

GDS release version: 1.7.0.149
 nvidia_fs version:  2.16 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 CUFILE_ENV_PATH_JSON : /etc/cufile.json
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 WekaFS             : Supported
 Userspace RDMA     : Supported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Loaded (libcufile_rdma.so)
 --rdma devices        : Configured
 --rdma_device_status  : Up: 2 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : false
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 1
 properties.rdma_dynamic_routing_order : GPU_MEM_NVLINKS GPU_MEM SYS_MEM P2P 
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 Tesla T4 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Platform verification succeeded
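Since the error mentions pairing a network device with the "closest GPU", I also looked at the PCIe relationship between the T4 and the NICs. If I understand the GDS docs correctly, the NIC and GPU should ideally sit under a common PCIe switch (PIX/PXB in the matrix), while SYS means traffic crosses the inter-socket link. A sketch of the check (guards against nvidia-smi being unavailable):

```shell
# Print the GPU/NIC PCIe and NUMA topology matrix, if nvidia-smi exists.
if command -v nvidia-smi >/dev/null 2>&1; then
    topo=$(nvidia-smi topo -m)
else
    topo="nvidia-smi not available"
fi
printf '%s\n' "$topo"
```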

/etc/cufile.json (relevant excerpt)

"properties": {
    "rdma_dev_addr_list": ["192.168.1.101", "192.168.1.201"],
    "gds_rdma_write_support": true,
    "rdma_load_balancing_policy": "RoundRobin",
    "rdma_dynamic_routing": true,
    "rdma_dynamic_routing_order": ["GPU_MEM_NVLINKS", "GPU_MEM", "SYS_MEM", "P2P"]
}