Low performance on V100 to/from RDMA device

Hello.

We have a PCIe device with two x8 PCIe Gen3 endpoints that we are trying to interface with a Tesla V100, but we are seeing subpar rates when using RDMA.

We are using a SuperMicro X11 motherboard, with all of the components attached to the same CPU, and any software using CUDA runs with affinity for that CPU.

When transferring data between our device and host RAM over DMA we see rates of about 12 GB/s (from the device) and 15 GB/s (to the device), but when we use RDMA we achieve only about 10 GB/s going from our device to the GPU, and only 5 GB/s going from the GPU to our device.

The software interfaces we’ve built are virtually identical for the device-GPU and device-host paths. The only difference is the underlying buffers: for host-device (DMA) they are pre-allocated on the host, while for GPU-device (RDMA) we simply cudaMalloc the memory and then map it at kernel level with nvidia_p2p_get_pages. The various measurements we’ve done indicate no software-caused slowdown between the two scenarios.
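
For reference, our kernel-side mapping step follows the usual GPUDirect RDMA pattern. A simplified sketch (the helper names are placeholders, how the GPU virtual address reaches the driver is elided, and the exact nvidia_p2p_get_pages signature differs a little between driver versions):

```c
/* Minimal sketch of the kernel-side GPUDirect RDMA mapping step, assuming the
 * classic nv-p2p.h interface; map_gpu_buffer and my_free_callback are
 * placeholder names. */
#include <linux/types.h>
#include <nv-p2p.h>

#define GPU_PAGE_SHIFT 16                       /* GPU pages are 64 KiB */
#define GPU_PAGE_SIZE  (1ULL << GPU_PAGE_SHIFT)
#define GPU_PAGE_MASK  (~(GPU_PAGE_SIZE - 1))

static void my_free_callback(void *data)
{
    /* Invoked by the NVIDIA driver if the CUDA allocation is torn down
     * while we still hold the mapping. */
}

static int map_gpu_buffer(u64 gpu_va, u64 len,
                          struct nvidia_p2p_page_table **pt)
{
    u64 va = gpu_va & GPU_PAGE_MASK;                         /* 64 KiB aligned */
    u64 sz = (gpu_va + len - va + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;
    int ret;

    /* Pin the cudaMalloc'd range and obtain its GPU page table. */
    ret = nvidia_p2p_get_pages(0, 0, va, sz, pt, my_free_callback, NULL);
    if (ret)
        return ret;

    /* (*pt)->pages[i]->physical_address are the bus addresses we program
     * into our DMA engine instead of host buffer addresses. */
    return 0;
}
```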

As a point of reference, we get rates of about 12-13 GB/s in host-GPU benchmarks using pinned host memory, with the same buffer size (4 MB) as we use for the device-host / device-GPU DMA/RDMA transfers.
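
That reference number comes from the usual pinned-memory copy loop, roughly like this (a sketch, not our exact benchmark; error checking omitted):

```cpp
/* Rough sketch of the host->GPU pinned-memory bandwidth check,
 * using the same 4 MB block size as our DMA path. */
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t chunk = 4ull << 20;   /* 4 MB */
    const int iters = 1000;

    void *h, *d;
    cudaHostAlloc(&h, chunk, cudaHostAllocDefault);   /* pinned host buffer */
    cudaMalloc(&d, chunk);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyAsync(d, h, chunk, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("H2D: %.2f GB/s\n", (double)chunk * iters / (ms * 1e6));
    return 0;
}
```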

Any ideas on how to improve RDMA rates?

I cannot make sense out of the performance data with my limited knowledge of PCIe. Maybe you could clarify a few points about your setup to help the next, more knowledgeable, person who reads this thread.

For sufficiently large transfers, GPUs achieve 12-13 GB/sec uni-directional transfer rates to/from the host because they use a PCIe gen3 x16 link. The packet size used by GPUs is either 128 or 256 bytes; I do not recall which off the top of my head. GPU transfer rates are close to optimal for the packet size used.
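
Back of the envelope, the link-level numbers work out roughly as follows; the per-packet overhead figure here is my assumption, not a value quoted from the spec:

```c
/* Rough PCIe gen3 throughput estimate; the ~24 bytes of per-TLP overhead
 * (header + framing + DLLP share) is an assumed figure. */
#include <stdio.h>

int main(void) {
    double raw_x16 = 16 * 8e9 * 128.0 / 130.0 / 8.0;  /* 8 GT/s, 128b/130b, bits -> bytes/s */
    double payload  = 256.0;                          /* bytes per TLP */
    double overhead = 24.0;                           /* assumed per-TLP overhead */
    double eff = payload / (payload + overhead);

    printf("x16 raw:            %.2f GB/s\n", raw_x16 / 1e9);            /* ~15.75 */
    printf("x16, 256B payload:  %.2f GB/s\n", raw_x16 * eff / 1e9);      /* ~14.4  */
    printf("x8,  256B payload:  %.2f GB/s\n", raw_x16 * eff / 2 / 1e9);  /* ~7.2   */
    return 0;
}
```

Which is why an observed 12-13 GB/sec is about as good as it gets on an x16 link, and why I would expect a single x8 link to top out around 7 GB/sec.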

How exactly does your device interface? Are the two PCIe gen3 x8 links somehow ganged together to effectively achieve x16 transfer rates? Not sure how that would work; I cannot recall having seen such a setup before.

How do you achieve 15 GB/sec transfer rates from the host? Does the device use much larger packet sizes than those used by the GPU? When your device communicates with the GPU, it presumably does so using a single PCIe gen 3 x8 link at the packet size the GPU uses, so I do not understand how you could get a 10 GB/sec uni-directional transfer rate.

I guess I didn’t express myself well. Yes, for all intents and purposes the two links are “ganged” to get x16 rates to the same device memory (Xilinx UltraScale-based). We are capable of a 512-byte packet size, but normally negotiate 256. (The V100’s MaxPayload is 256 bytes.)

So in our case, to get to the nitty-gritty: we allocate two GPU memory ranges (cudaMalloc), translate them (nvidia_p2p_get_pages), and then iterate over the allocated GPU memory using the maximum-size DMA transaction block we can (which is 4 MB), as with cudaMemcpy. Each GPU memory range interfaces with one of the two endpoints.
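
Schematically, the user-space side of that is just the following (submit_to_endpoint is a placeholder standing in for our driver’s actual submission mechanism; the kernel side has already pinned each range with nvidia_p2p_get_pages):

```cpp
/* Simplified outline of the two-range / 4 MB-block iteration described above. */
#include <cstddef>
#include <cuda_runtime.h>

/* Placeholder: in reality this hands one DMA block to one endpoint. */
static void submit_to_endpoint(int endpoint, const void *gpu_addr, size_t len)
{
    (void)endpoint; (void)gpu_addr; (void)len;
}

int main() {
    const size_t range = 256ull << 20;   /* GPU range per endpoint (example size) */
    const size_t chunk = 4ull << 20;     /* 4 MB max DMA transaction block */

    void *gpu[2];
    cudaMalloc(&gpu[0], range);          /* one cudaMalloc range per endpoint */
    cudaMalloc(&gpu[1], range);

    /* Walk each range in 4 MB blocks; range 0 feeds endpoint 0, range 1 endpoint 1. */
    for (int e = 0; e < 2; ++e)
        for (size_t off = 0; off < range; off += chunk)
            submit_to_endpoint(e, static_cast<char *>(gpu[e]) + off, chunk);

    cudaFree(gpu[0]); cudaFree(gpu[1]);
    return 0;
}
```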

The 10 GB/s number comes from each endpoint performing at about 5 GB/s.

Interestingly, when doing this against a single GPU we get the ~10 GB/s / ~5 GB/s figures above, but if we point each endpoint at a separate GPU we see ~2 GB/s (!) to the GPUs and ~8 GB/s from them. That is a secondary issue for us at the moment, though.

EDIT: As an additional data point, we are seeing rates of ~10 GB/s when going GPU<->GPU (cudaMemcpyPeer).
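
That figure comes from a straightforward peer-copy loop along these lines (again a sketch, not our exact code; error checking omitted):

```cpp
/* Rough sketch of a GPU<->GPU peer bandwidth check via cudaMemcpyPeer. */
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t chunk = 4ull << 20;   /* 4 MB */
    const int iters = 1000;

    void *d0, *d1;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  /* allow direct P2P with GPU 1 */
    cudaMalloc(&d0, chunk);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&d1, chunk);

    cudaSetDevice(0);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeerAsync(d1, 1, d0, 0, chunk, 0);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("GPU0 -> GPU1: %.2f GB/s\n", (double)chunk * iters / (ms * 1e6));
    return 0;
}
```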

Even for 512-byte packets, the reported 15 GB/sec host->“your device” transfer rate seems phenomenal, as the speed-of-light rate for a PCIe gen3 x16 link at infinite packet size would be 15.75 GB/sec.

Your question definitely requires a PCIe expert for an authoritative answer. Maybe someone from NVIDIA will read this and follow up internally.

It’s pretty good, sure, but these are empty buffers we’re transferring just to exercise the DMA limits; in a real-life scenario we would only see 14+ GB/s or so (~7 + ~7).