Hello, all
I’m unable to get GPUDirect Storage (GDS) to use the direct RDMA path on a non-DGX server with NVMe-oF over RDMA.
The GPU and IB NICs are NUMA-local and IOMMU is disabled, but I keep seeing the error below when running gdsio.
Environment
- GPU: NVIDIA Tesla T4 (0000:af:00.0, NUMA node 1)
- NICs: Mellanox ConnectX-4
  - mlx5_0 → ibs3 → 192.168.1.101
  - mlx5_1 → ibs5 → 192.168.1.201
- Software:
  - GDS 1.7.0.149
  - nvidia_fs 2.16
  - libcufile 2.12
  - MLNX_OFED installed
- Kernel driver: mlx5_core
- IOMMU: Disabled (cat /proc/cmdline confirms intel_iommu=off)
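For anyone who wants to double-check the NUMA placement listed above, here is a minimal Python sketch. It assumes the usual Linux sysfs layout (`/sys/bus/pci/devices/<BDF>/numa_node` and `/sys/class/infiniband/<dev>/device/numa_node`). The helper names `pci_numa_node`, `ib_numa_node`, and `numa_report` are made up for illustration and are not part of any GDS tool. Matching NUMA nodes do not by themselves show which PCIe switch or root port each device sits under.

```python
from pathlib import Path


def _read_int(path):
    """Read a single integer from a sysfs file; return None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def pci_numa_node(bdf, sysfs_root="/sys"):
    """NUMA node reported for a PCI device (-1 means the kernel has no info)."""
    return _read_int(Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "numa_node")


def ib_numa_node(ib_dev, sysfs_root="/sys"):
    """NUMA node reported for the PCI function behind an InfiniBand device."""
    return _read_int(
        Path(sysfs_root) / "class" / "infiniband" / ib_dev / "device" / "numa_node"
    )


def numa_report(gpu_bdf, ib_devs, sysfs_root="/sys"):
    """Collect the NUMA node of the GPU and of each IB device in one dict."""
    report = {"gpu": pci_numa_node(gpu_bdf, sysfs_root)}
    for dev in ib_devs:
        report[dev] = ib_numa_node(dev, sysfs_root)
    return report


if __name__ == "__main__":
    # Device names taken from the environment list in this post.
    print(numa_report("0000:af:00.0", ["mlx5_0", "mlx5_1"]))
```

A `None` entry means the sysfs file could not be read, for example because the device name is different on the machine where the script runs.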
ibstat Output
CA 'mlx5_0'
State: Active
Physical state: LinkUp
Rate: 100
Link layer: InfiniBand
CA 'mlx5_1'
State: Active
Physical state: LinkUp
Rate: 100
Link layer: InfiniBand
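The same link check can be scripted. The sketch below parses `ibstat`-style text like the trimmed output above and reports whether each CA is Active and LinkUp. It assumes one port per CA, as in this output; with several ports per CA, later ports would overwrite earlier ones. `parse_ibstat` and `ports_ready` are hypothetical helper names.

```python
import re


def parse_ibstat(text):
    """Turn ibstat-style text into {ca_name: {field: value}}."""
    cas = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"CA '([^']+)'", line)
        if match:
            # A new "CA '<name>'" header starts a new record.
            current = cas.setdefault(match.group(1), {})
            continue
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return cas


def ports_ready(cas):
    """Map each CA to True when it is both Active and LinkUp."""
    return {
        name: fields.get("State") == "Active"
        and fields.get("Physical state") == "LinkUp"
        for name, fields in cas.items()
    }
```

To use it on a live system, feed it the captured output, for example `ports_ready(parse_ibstat(subprocess.run(["ibstat"], capture_output=True, text=True).stdout))`.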
Test Command
./gdsio -f /dev/nvme0n1 -d 0 -w 4 -s 100G -i 1M -I 0 -x 0
Error (in cufile.log)
ERROR cufio-dr:226 No matching pair for network device to closest GPU found in the platform
GDS Check
GDS release version: 1.7.0.149
nvidia_fs version: 2.16 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
CUFILE_ENV_PATH_JSON : /etc/cufile.json
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Supported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
WekaFS : Supported
Userspace RDMA : Supported
--Mellanox PeerDirect : Enabled
--rdma library : Loaded (libcufile_rdma.so)
--rdma devices : Configured
--rdma_device_status : Up: 2 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : false
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 1
properties.rdma_dynamic_routing_order : GPU_MEM_NVLINKS GPU_MEM SYS_MEM P2P
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 4
execution.max_io_queue_depth : 128
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false
=========
GPU INFO:
=========
GPU index 0 Tesla T4 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Platform verification succeeded
/etc/cufile.json
"properties": {
"rdma_dev_addr_list": ["192.168.1.101", "192.168.1.201"],
"gds_rdma_write_support": true,
"rdma_load_balancing_policy": "RoundRobin",
"rdma_dynamic_routing": true,
"rdma_dynamic_routing_order": ["GPU_MEM_NVLINKS", "GPU_MEM", "SYS_MEM", "P2P"]
}
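To rule out a typo in the RDMA settings shown above, a small Python sketch like the following can sanity-check them. It assumes the full file is a JSON object with a top-level "properties" key (only that block is shown here) and that any comments are whole lines starting with `//`. The function names are hypothetical. It only checks the shape of the values, not whether libcufile accepts or uses each key.

```python
import ipaddress
import json


def load_cufile_json(text):
    """Parse cufile.json text after dropping whole-line // comments."""
    kept = [ln for ln in text.splitlines() if not ln.lstrip().startswith("//")]
    return json.loads("\n".join(kept))


def check_rdma_properties(config):
    """Return a list of human-readable problems found in the RDMA settings."""
    props = config.get("properties", {})
    problems = []

    addrs = props.get("rdma_dev_addr_list", [])
    if not addrs:
        problems.append("rdma_dev_addr_list is empty or missing")
    for addr in addrs:
        try:
            ipaddress.ip_address(addr)
        except ValueError:
            problems.append(f"not an IP address: {addr!r}")
    if len(set(addrs)) != len(addrs):
        problems.append("duplicate entries in rdma_dev_addr_list")

    # Dynamic routing needs an ordered list of routing policies to try.
    order = props.get("rdma_dynamic_routing_order", [])
    if props.get("rdma_dynamic_routing") and not order:
        problems.append("rdma_dynamic_routing is on but no routing order is set")
    return problems
```

Typical use would be `check_rdma_properties(load_cufile_json(open("/etc/cufile.json").read()))`; an empty list means none of these particular checks fired.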