GPUDirect performance: 25% less bandwidth than cudaMemcpy from host pinned memory

Dear all,

I am doing RDMA data transfers between a workstation and an NVIDIA GPU, using a RoCEv2 UD queue pair with SEND/RECEIVE verbs.

Hardware: Mellanox ConnectX-5 100 Gb/s NIC on a direct fiber link (no switch, workstation to workstation), and an NVIDIA Quadro P6000 under the same root complex on an Intel(R) Xeon(R) Gold 6134 workstation.

Performance was measured with perftest --use_cuda and with a custom application for large datasets (4096-byte buffers, 8192 WRs, 1000 iterations):

  • ConnectX-5 NIC to CPU memory (backed by hugepages): 97.4 Gb/s, OK

  • CPU memory to GPU: 100 Gb/s, OK (cudaMemcpy from host pinned memory)

But NIC to GPU (GPUDirect / nv_peer_mem) reaches only 74 Gb/s, about 25% less than the maximum. What is the root cause of this bottleneck? Is there any workaround?
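To put the numbers above side by side, here is a small sketch that computes the shortfall of the NIC-to-GPU path against both references, plus the message rate each throughput implies. It assumes one 4096-byte message per SEND and ignores RoCEv2 header overhead; the function names are just illustrative.

```python
# Throughput figures quoted in the post, in Gb/s.
LINE_RATE_GBPS = 100.0    # nominal ConnectX-5 link rate
NIC_TO_CPU_GBPS = 97.4    # NIC -> CPU memory (hugepages)
NIC_TO_GPU_GBPS = 74.0    # NIC -> GPU memory (GPUDirect / nv_peer_mem)
MESSAGE_BYTES = 4096      # buffer size used in the tests


def shortfall_percent(measured_gbps, reference_gbps):
    """Percentage by which measured_gbps falls below reference_gbps."""
    return (reference_gbps - measured_gbps) / reference_gbps * 100.0


def messages_per_second(throughput_gbps, message_bytes=MESSAGE_BYTES):
    """Message rate implied by a payload throughput (headers ignored)."""
    return throughput_gbps * 1e9 / (message_bytes * 8)


print(f"vs line rate : {shortfall_percent(NIC_TO_GPU_GBPS, LINE_RATE_GBPS):.1f}% lower")
print(f"vs NIC-to-CPU: {shortfall_percent(NIC_TO_GPU_GBPS, NIC_TO_CPU_GBPS):.1f}% lower")
for name, gbps in (("NIC-to-CPU", NIC_TO_CPU_GBPS), ("NIC-to-GPU", NIC_TO_GPU_GBPS)):
    print(f"{name}: {messages_per_second(gbps) / 1e6:.2f} M messages/s")
```

The "25%" in the title sits between the two ways of counting: roughly 26% below line rate and roughly 24% below the NIC-to-CPU result.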

See below about the architectural bottleneck (PCIe), though it is written for Sandy Bridge:

Did you try checking with ‘perf’ where the application spends its time?

There is also a utility named ‘neo-host’, available from the Mellanox site, that may help get more details on the issue.

What about READ/WRITE instead of SEND/RECEIVE?

btw, is this one a duplicate?

Yes, I ran tests with Mellanox perftest using --use_cuda and got the same results.

I am almost sure it is a hardware limit of DMA/PCIe (smaller payload size in device-to-device transactions than in device-to-memory or memory-to-device transactions), but I need an expert's confirmation…

not exactly a duplicate…

During my first tests, I saw slowdowns in network throughput that I could not explain. They were actually caused by PFC (priority flow control) kicking in, because the sink (GPU memory) is too slow for the source (NIC).

Now I want to understand the root cause: why is a DMA transfer via GPUDirect from the NIC (ConnectX-5) to GPU memory slower than a DMA transfer (cudaMemcpyAsync) from CPU pinned memory to GPU memory?

I use SEND/RECEIVE because one of the major requirements of my use case is unidirectional transactions.

The data source is an image detector (a simulator so far), and the detector electronics are embedded in an FPGA. It can transmit UDP/RoCEv2 packets, but cannot receive anything.

So we are using a UD queue pair and SEND/RECEIVE verbs.

BTW, we achieve almost 100 Gb/s when using an intermediate CPU memory buffer:

NIC → RDMA to CPU memory → DMA to GPU memory → DMA from GPU memory to CPU memory to store results (3 concurrent DMAs)

Below I copy and paste some input from an NVIDIA engineer, the GPUDirect/nv_peer_mem expert (D. Rossetti).

In short, to achieve full P2P BW, we have to consider a motherboard with an embedded PCIe switch.

The root complex is a bottleneck for device P2P.

I understand that it is related to a hardware limit of the PCIe implementation.

[DR] Note that on recent server-grade CPUs, RCs have improved a little on the P2P PCIe read front, but very low BW is still observed.

It looks like PCIe transactions from a device (NIC) to CPU memory, or from CPU memory to a device (GPU), are faster than PCIe transactions from device (NIC) to device (GPU), and that this is related to payload size.

[DR] Not necessarily related to the PCIe payload size. It should be more related to the number of outstanding PCIe transactions that can be forwarded across the RC peer-to-peer data path.
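A rough way to see why the number of outstanding transactions can cap throughput is a bandwidth-delay estimate: sustained bandwidth is at most (transactions in flight × payload per transaction) / latency. The sketch below is only this simplified model, not a description of any real root complex. The 256-byte payload and 1 µs latency are hypothetical placeholders, not values from this thread or from any datasheet.

```python
import math


def sustained_gbps(outstanding, payload_bytes, latency_s):
    """Upper bound on throughput with a fixed number of transactions in flight."""
    return outstanding * payload_bytes * 8 / latency_s / 1e9


def required_outstanding(target_gbps, payload_bytes, latency_s):
    """Transactions that must be in flight to sustain target_gbps."""
    bytes_in_flight = target_gbps * 1e9 / 8 * latency_s
    return math.ceil(bytes_in_flight / payload_bytes)


# Hypothetical parameters, chosen only to illustrate the model.
PAYLOAD_BYTES = 256   # assumed payload per PCIe transaction
LATENCY_S = 1e-6      # assumed latency across the peer-to-peer path

for target_gbps in (74.0, 100.0):
    n = required_outstanding(target_gbps, PAYLOAD_BYTES, LATENCY_S)
    print(f"{target_gbps:.0f} Gb/s needs at least {n} transactions in flight")
```

In this model, a data path that can forward fewer transactions at once caps the bandwidth below line rate regardless of how large the RDMA messages are, which is in line with the answer above.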

Can we say that the PCIe root complex is better at handling the relatively small RoCE packets (4096 B) going to a CPU memory sink than to another PCIe device such as a GPU?

[DR] Experimentally that is what we observe. Different CPU RCs have shown different capabilities and performance, e.g. AMD, Intel, IBM.

For your information, on NVIDIA DGX systems we purposely deploy PCIe switch chips so as to achieve ~90% of the peak PCIe RDMA BW.
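For scale, here is a short calculation of what ~90% of peak could mean, under the assumption that both devices sit on PCIe Gen3 x16 links (the thread does not state the link generation or width) and that "peak" means the raw link rate after 128b/130b encoding, before TLP header overhead.

```python
# Assumption: PCIe Gen3 x16 link. Neither value is stated in the thread.
GT_PER_S_PER_LANE = 8.0            # PCIe Gen3 signalling rate per lane
LANES = 16
ENCODING_EFFICIENCY = 128 / 130    # 128b/130b line encoding used by Gen3


def pcie_raw_gbps(gt_per_s=GT_PER_S_PER_LANE, lanes=LANES,
                  efficiency=ENCODING_EFFICIENCY):
    """Raw unidirectional link rate in Gb/s, before TLP/DLLP overhead."""
    return gt_per_s * lanes * efficiency


peak_gbps = pcie_raw_gbps()
print(f"Gen3 x16 raw rate : {peak_gbps:.1f} Gb/s")
print(f"90% of raw rate   : {0.9 * peak_gbps:.1f} Gb/s")
print(f"74 Gb/s is {74.0 / peak_gbps * 100:.0f}% of the raw rate")
```

Under these assumptions, 90% of peak is above the 100 Gb/s network line rate, while the measured 74 Gb/s is well below it.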

In other words, if we were using another RDMA protocol with larger datagrams, would we get better GPUDirect throughput?

[DR] As I mentioned above, I don’t think so. If you need full P2P BW, you should consider using a motherboard with a PCIe switch.