I am doing RDMA data transfers between a workstation and an NVIDIA GPU. I am using a RoCEv2 UD queue pair with SEND/RECEIVE verbs. Hardware is an 8/16-core x86_64 box with a Mellanox ConnectX-5 100 Gb/s NIC, using a direct fiber link (no switch, workstation to workstation), and an NVIDIA Quadro P6000 on the same NUMA node.
I do large transfers: 4096-byte buffers, 4096*2 work requests, iterated 1000 times.
Sender to receiver using memory backed by hugepages (no GPU): 97.4 Gb/s sustained. OK.
Sender to receiver using GPU memory mapped with the nv_peer_mem kernel module: bandwidth starts around 70 Gb/s (OK), BUT then falls slowly (within a couple of seconds) to 20 Gb/s. BAD!!!
There are no packet drops (verified live), but throughput decreases. Nothing visible in Wireshark, with or without the sniffer.
On this workstation, the bandwidth test using host pinned memory (CUDA sample) reports 12 GB/s.
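For reference, here is a minimal sketch of the two buffer setups being compared (names, sizes and error handling are illustrative, not my actual benchmark code):

```c
/* Minimal sketch of the two buffer setups (hugepage host memory vs. GPU
 * memory via nv_peer_mem). Sizes and names are illustrative. */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <sys/mman.h>
#include <stddef.h>

#define BUF_SIZE  4096
#define NUM_WR    (4096 * 2)

/* Host path: hugepage-backed buffer, registered like ordinary memory. */
static struct ibv_mr *reg_host_hugepages(struct ibv_pd *pd, size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    return ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
}

/* GPU path: with the nv_peer_mem module loaded, a cudaMalloc'd pointer can
 * be handed directly to ibv_reg_mr (GPUDirect RDMA). */
static struct ibv_mr *reg_gpu_memory(struct ibv_pd *pd, size_t len)
{
    void *dbuf = NULL;
    if (cudaMalloc(&dbuf, len) != cudaSuccess)
        return NULL;
    return ibv_reg_mr(pd, dbuf, len, IBV_ACCESS_LOCAL_WRITE);
}
```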
You might want to clarify the description of your setup. When you refer to sender/receiver, which of the two devices “mellanox connectx-5” and “nvidia quadro p6000” is the sender, which is the receiver?
12 GB/sec uni-directional throughput is perfectly normal for a PCIe gen3 x16 link, as you no doubt are aware.
Both workstations have a ConnectX-5 100 Gb/s NIC.
The receiver has a Quadro P6000 on the same PCIe root complex.
I use ibv_post_send (IBV_WR_SEND_WITH_IMM) on the sender side and ibv_post_recv on the receiver side. With a UD queue pair there is no handshake.
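The sender-side post looks roughly like this (a sketch only; the address handle, remote QPN and Q_Key are exchanged out of band, and the helper name is made up):

```c
/* Sketch of the sender-side post (error handling stripped; ah, remote_qpn
 * and remote_qkey are assumed to have been exchanged out of band). */
#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <stdint.h>

static int post_ud_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                        uint32_t len, struct ibv_ah *ah,
                        uint32_t remote_qpn, uint32_t remote_qkey,
                        uint32_t imm)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,                  /* <= path MTU (4096) on a UD QP */
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t)buf,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND_WITH_IMM,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.imm_data          = htonl(imm);
    wr.wr.ud.ah          = ah;
    wr.wr.ud.remote_qpn  = remote_qpn;
    wr.wr.ud.remote_qkey = remote_qkey;

    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```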
Actually I found the culprit: the dropless_rq flag on the ConnectX-5 was set to off. In that case, some packets may be dropped when processing of incoming RDMA datagrams on the receiver is slower than their generation on the sender side. To prevent data loss, the card emits Global Pause frames that slow down the traffic.
With dropless_rq ON, incoming RDMA requests are buffered in the NIC, even when there are no more outstanding receive requests, so they can be processed later. When this buffer fills up with unprocessed requests, the card still emits Global Pause frames, but less often…
That is my understanding of the issue. Feel free to comment.
The question remains: is 72 Gb/s the maximum achievable throughput from the ConnectX to GPU device memory using GPUDirect? (The same code achieves a sustained 97.4 Gb/s from the Mellanox NIC to CPU memory.)
My code is written so that (via double buffering, sketched below) there are always receive requests in the work queue ready for processing. Is there a hardware limit that could be the reason for this lower transfer throughput?
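Roughly, the receive side does something like this (an illustrative sketch, not my exact code; buffer bookkeeping is simplified to a single region):

```c
/* Sketch of how the receiver keeps the RQ populated: each completion is
 * immediately followed by a re-post of the same buffer. Names and the
 * single-region bookkeeping are illustrative only. */
#include <infiniband/verbs.h>
#include <stdint.h>

#define PAYLOAD 4096
#define GRH_LEN 40          /* UD receives are prefixed with a 40-byte GRH */

static int repost_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = PAYLOAD + GRH_LEN,    /* buffer was registered that large */
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = {
        .wr_id   = (uintptr_t)buf,
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad_wr;
    return ibv_post_recv(qp, &wr, &bad_wr);
}

static void drain_and_repost(struct ibv_cq *cq, struct ibv_qp *qp,
                             struct ibv_mr *mr)
{
    struct ibv_wc wc[64];
    int n = ibv_poll_cq(cq, 64, wc);
    for (int i = 0; i < n; i++) {
        if (wc[i].status != IBV_WC_SUCCESS)
            continue;                   /* real code would report the error */
        /* consume the payload (it starts after the GRH), then re-post */
        repost_recv(qp, mr, (void *)(uintptr_t)wc[i].wr_id);
    }
}
```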
I can’t give you an authoritative answer since I haven’t used a setup like yours. But consider that
The bandwidth of your system memory is significantly higher than the bandwidth of the Mellanox device or the GPU. One channel of DDR4-2666 provides a theoretical bandwidth of 21.3 GB/sec, and your system memory presumably comprises four or six such channels. That makes system memory a much faster data source/sink.
You may want to inquire with Mellanox regarding optimal configuration settings for RDMA usage with GPUs. There is a chance that an NVIDIA engineer with knowledge of this specialized subject matter may encounter this thread and be able to provide tips from the GPU perspective. Note that these forums are not designed as an NVIDIA support channel, so that chance is likely slim.
The entire point of GPUDirect is to lower latency. As far as throughput is concerned, any properly configured modern system can saturate a PCIe gen 3 x16 link (achieve 12+ GB/sec throughput per direction) via copies through system memory (a staged-copy sketch follows below).
Since PCIe uses packetized transport and there are various fixed overheads, the actual throughput achieved depends on the size of individual transfers. Typically, PCIe bandwidth is fully exploited once the transfer size reaches 4 MB or so. It is easy enough to write a test program to explore the relationship between transfer size and PCIe throughput to/from the GPU (see the sketch below); the CUDA sample app bandwidthTest may even be sufficient.
PCIe gen 4 is around the corner, but the way these things tend to go, I would expect improved throughput but little to no reduction in overhead, so throughput will still be poor for small transfers.
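To illustrate the point above about copies through system memory: here is a minimal sketch of staging RDMA data through a pinned host buffer. Buffer names, the chunk size, and the stream handling are assumptions, not a reference implementation.

```c
/* Sketch of staging through system memory: RDMA lands in a host buffer,
 * which is pinned and pushed to the GPU with an asynchronous copy.
 * Names and sizes are illustrative. */
#include <cuda_runtime.h>
#include <stddef.h>

#define CHUNK ((size_t)4096 * 8192)     /* one batch of receives */

int stage_to_gpu(void *host_rdma_buf, void *dev_buf, cudaStream_t stream)
{
    /* Pin the host buffer so the copy engine can DMA from it.
     * (Real code would register once up front, or allocate the RDMA
     * buffer with cudaMallocHost in the first place.) */
    if (cudaHostRegister(host_rdma_buf, CHUNK,
                         cudaHostRegisterDefault) != cudaSuccess)
        return -1;

    /* Asynchronous host-to-device copy; can overlap with the next
     * batch of RDMA receives into a second buffer. */
    if (cudaMemcpyAsync(dev_buf, host_rdma_buf, CHUNK,
                        cudaMemcpyHostToDevice, stream) != cudaSuccess)
        return -1;

    /* Caller synchronizes the stream before reusing host_rdma_buf. */
    return 0;
}
```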
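And a quick-and-dirty version of the test program suggested above, timing pinned host-to-device copies over a range of transfer sizes (bandwidthTest does this more thoroughly; the sizes and repetition count here are arbitrary):

```c
/* Time pinned host-to-device copies for increasing transfer sizes to see
 * where PCIe throughput levels off. Error checking omitted for brevity. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int reps = 100;
    for (size_t size = 4096; size <= ((size_t)64 << 20); size <<= 1) {
        void *h = NULL, *d = NULL;
        cudaMallocHost(&h, size);           /* pinned host memory */
        cudaMalloc(&d, size);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        for (int i = 0; i < reps; i++)
            cudaMemcpyAsync(d, h, size, cudaMemcpyHostToDevice, 0);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%10zu bytes : %7.2f GB/s\n",
               size, (double)size * reps / (ms * 1e-3) / 1e9);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFreeHost(h);
        cudaFree(d);
    }
    return 0;
}
```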
So, can we say that this happens because RoCEv2 RDMA relies on a maximum payload of 4 KB (4096-byte MTU), and the resulting PCIe transactions toward the sink are 'slow' when the sink is another PCIe device?
You would want to discuss this with an expert in PCIe. I am not one of them, I only know the basics.
The maximum payload size of individual low-level packets determines how much of the theoretical PCIe bandwidth is available to user data throughput. I vaguely recall that the maximum payload size supported by GPUs is 128 bytes, which results in an efficiency of 86% (the other 14% being taken up by headers). This reduces the maximum achievable bandwidth to about 12.5 GB/sec for a x16 link.
I am guessing that the 4K size you mention is the PCIe maximum read request size? I don’t know what role that plays. The transfer size I was referring to in my earlier post is the size of an entire high-level (application-level) transaction, such as a cudaMemcpy() operation. The sample app bandwidthTest that ships with CUDA can give a reasonable overview of the throughput achievable for various transaction sizes. On my platform I see this:
Thanks for your interesting comments, I will try to get in touch with DR…
What I would like to know is why the full PCIe bandwidth is not achievable in the case of network-card-to-GPU transfers. Presumably because the PCIe packets used in the transport are too small, but why? Is that related to the fact that incoming Ethernet datagrams are of limited size (4096 bytes in the RoCE RDMA protocol)?
For those interested in this topic, I copy/paste the answers of D. Rossetti [DR], NVIDIA engineer and GPUDirect/nv_peer_mem specialist.
…
I understand that it is related to a hardware limit of the PCIe implementation.
[DR] Note that on recent server-grade CPUs, RCs (root complexes) have improved a little bit on the P2P PCIe read front, but still very low BW is observed.
It looks like PCIe transactions from device (NIC) to CPU memory, or from CPU memory to device (GPU), are faster than PCIe transactions from device (NIC) to device (GPU), and that should be related to the payload size.
[DR] Not necessarily related to the PCIe payload size. It should be more related to the number of outstanding PCIe transactions that can be forwarded across the RC peer-to-peer data path.
Can we say that the PCIe root complex is better at handling the relatively small RoCE packets (4096 B) when the sink is CPU memory than when it is another PCIe device (the GPU)?
[DR] Experimentally, that is what we observe. Different CPU RCs have been shown to have different capabilities and performance, e.g. AMD, Intel, IBM.
For your information, on NVIDIA DGX systems we purposely deploy PCIe switch chips in order to achieve ~90% of the peak PCIe RDMA BW.
In other words, if we were using another RDMA protocol with larger datagrams, would we get better GPUDirect throughput?
[DR] As I mentioned above, I don't think so. If you need full P2P BW, you should consider using a motherboard with a PCIe switch.