Dear all,
I am doing RDMA data transfers between a workstation and an NVIDIA GPU, using a RoCEv2 UD queue pair with SEND/RECEIVE verbs.
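For reference, the send path is essentially the minimal sketch below (the function name and the already-created qp, ah, and mr are illustrative placeholders, not my actual code):

```c
/* Minimal sketch: post one SEND WR on a RoCEv2 UD QP.
 * Assumes the QP, address handle, remote QPN/QKey and a registered
 * 4096-byte buffer already exist (names are illustrative). */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_one_send(struct ibv_qp *qp, struct ibv_ah *ah,
                         uint32_t remote_qpn, uint32_t remote_qkey,
                         struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,      /* 4096 bytes per WR in my tests */
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode            = IBV_WR_SEND;
    wr.sg_list           = &sge;
    wr.num_sge           = 1;
    wr.send_flags        = IBV_SEND_SIGNALED;
    wr.wr.ud.ah          = ah;          /* address handle for the RoCEv2 peer */
    wr.wr.ud.remote_qpn  = remote_qpn;
    wr.wr.ud.remote_qkey = remote_qkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```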
Hardware: Mellanox ConnectX-5 100 Gb/s NICs over a direct fiber link (no switch, workstation to workstation), and an NVIDIA Quadro P6000 on the same PCIe root complex as the NIC, in an Intel(R) Xeon(R) Gold 6134 workstation.
Performance measurements with perftest --use_cuda and a custom application on a large dataset: 4096-byte buffers, 8192 WRs, 1000 iterations:
- ConnectX-5 NIC to CPU memory (backed by hugepages): 97.4 Gb/s, OK
- CPU memory to GPU: 100 Gb/s, OK (cudaMemcpy from host pinned memory)
- NIC to GPU (GPUDirect RDMA via nv_peer_mem): only 74 Gb/s, about 25% less than the maximum.

What is the root cause of this bottleneck? Are there any workarounds?
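For context, the GPU-side receive buffer is registered roughly as in the sketch below (function name and error handling are illustrative); with nv_peer_mem loaded, ibv_reg_mr() accepts the cudaMalloc() pointer, so the ConnectX-5 DMAs directly into the P6000:

```c
/* Minimal sketch of the GPUDirect RDMA registration path, assuming the
 * nv_peer_mem module is loaded so ibv_reg_mr() accepts a cudaMalloc()'d
 * device pointer. The PD comes from the usual ibv_alloc_pd(); the buffer
 * size matches my 4096-byte WRs. Names are illustrative. */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

static struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len,
                                          void **gpu_buf_out)
{
    void *gpu_buf = NULL;

    /* Allocate the receive buffer directly in GPU memory. */
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return NULL;
    }

    /* With nv_peer_mem loaded, the kernel resolves this device pointer
     * to GPU BAR pages, so the NIC writes straight into GPU memory. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        fprintf(stderr, "ibv_reg_mr on GPU memory failed\n");
        cudaFree(gpu_buf);
        return NULL;
    }

    *gpu_buf_out = gpu_buf;
    return mr;
}
```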
I have seen the post below about an architectural PCIe bottleneck, but it covers Sandy Bridge platforms:
https://devblogs.nvidia.com/benchmarking-gpudirect-rdma-on-modern-server-platforms/