We have a PCIe device with two x8 PCIe Gen3 endpoints that we are trying to interface with a Tesla V100, but we are seeing subpar rates when using RDMA.
We are using a SuperMicro X11 motherboard with all of the components attached to the same CPU, and all software is run with CPU/CUDA affinity set to that CPU.
When transferring data between our device and host RAM over DMA we see rates of about 12 GB/s (from the device) and 15 GB/s (to the device), but when we use RDMA we achieve only about 10 GB/s from our device to the GPU, and only 5 GB/s from the GPU to our device.
The software interfaces we’ve built are virtually identical for device-GPU and device-host transfers. The only difference is the underlying buffers: for host-device DMA they are pre-allocated on the host, while for GPU-device RDMA we initially call cudaMalloc and then pin the allocation in our kernel driver with nvidia_p2p_get_pages. The various measurements we’ve done indicate no software-caused slowdown between the two scenarios.
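Roughly, the kernel-side pinning looks like the sketch below (simplified; error handling and our driver's ioctl plumbing are omitted, and the helper name pin_gpu_buffer is just a placeholder — the nvidia_p2p_* calls and the nv-p2p.h header are from the GPUDirect RDMA kernel API):

```c
#include <linux/kernel.h>
#include <nv-p2p.h>   /* nvidia_p2p_* API shipped with the NVIDIA driver */

#define GPU_PAGE_SIZE  0x10000ULL            /* 64 KB GPU page size */
#define GPU_PAGE_MASK  (~(GPU_PAGE_SIZE - 1))

struct gpu_pin {
    u64 va;                                   /* page-aligned GPU virtual address */
    u64 len;
    struct nvidia_p2p_page_table *page_table;
};

static void free_callback(void *data)
{
    struct gpu_pin *pin = data;
    /* GPU allocation disappeared underneath us (e.g. cudaFree): drop the mapping */
    nvidia_p2p_free_page_table(pin->page_table);
    pin->page_table = NULL;
}

/* 'addr'/'size' arrive from user space via our ioctl, after cudaMalloc() */
static int pin_gpu_buffer(struct gpu_pin *pin, u64 addr, u64 size)
{
    int ret;

    pin->va  = addr & GPU_PAGE_MASK;
    pin->len = ((addr + size) - pin->va + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;

    ret = nvidia_p2p_get_pages(0, 0, pin->va, pin->len,
                               &pin->page_table, free_callback, pin);
    if (ret)
        return ret;

    /* pin->page_table->pages[i]->physical_address now holds the bus addresses
       (in page_table->page_size chunks) that our DMA engine targets. */
    return 0;
}
```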
As a point of reference, we get about 12-13 GB/s in host-GPU benchmarks using pinned host memory with the same buffer size (4 MB) as we use for the device-host / device-GPU DMA/RDMA transfers.
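That host-GPU benchmark is essentially the standard pinned-memory cudaMemcpyAsync loop with 4 MB buffers, along these lines (simplified sketch with error checking trimmed):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t buf_size = 4 << 20;          // 4 MB, same as our DMA/RDMA buffers
    const int    iters    = 1000;

    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, buf_size);          // pinned host memory
    cudaMalloc(&d_buf, buf_size);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyAsync(d_buf, h_buf, buf_size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host->GPU: %.1f GB/s\n",
           (double)buf_size * iters / (ms * 1e-3) / 1e9);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```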
Any ideas on how to improve RDMA rates?