RDMA GPUDirect Slow

Hi all,

I am doing RDMA data transfer between a workstation and an NVIDIA GPU, using RoCEv2 with a UD queue pair and SEND/RECEIVE verbs. Hardware: 8/16-core x86_64, with a Mellanox ConnectX-5 100 Gb/s NIC over a direct fiber link (no switch, workstation to workstation), and an NVIDIA Quadro P6000 on the same NUMA node.

The transfers are large: 4096-byte buffers, 4096*2 work requests, iterated 1000 times.

  • Sender to receiver using memory backed by hugepages (no GPU): 97.4 Gb/s sustained. OK.
  • Sender to receiver using GPU memory mapped with the nv_peer_mem kernel module: bandwidth starts around 70 Gb/s (OK), BUT then falls slowly (within a couple of seconds) to 20 Gb/s. BAD!
    There are no packet drops (verified live), but throughput decreases. Nothing visible in Wireshark, with or without a sniffer.

On this workstation, the bandwidth test using host pinned memory (CUDA sample) reports 12 GB/s.

Any ideas?

The slowdown on the sending side is caused by Pause Frame notifications, but I am looking for the root cause.
The workstation has 48 PCIe lanes…

You might want to clarify the description of your setup. When you refer to sender/receiver, which of the two devices, the “mellanox connectx-5” and the “nvidia quadro p6000”, is the sender and which is the receiver?

12 GB/sec uni-directional throughput is perfectly normal for a PCIe gen3 x16 link, as you no doubt are aware.

Both workstations have a ConnectX-5 100 Gb/s NIC.
The receiver has a Quadro P6000 on the same PCIe root complex.

I use ibv_post_send (IBV_WR_SEND_WITH_IMM) on the sender side and ibv_post_recv on the receiver side. With a UD queue pair there is no handshake.
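
For reference, here is a minimal sketch of that send path, assuming the UD queue pair, address handle and registered buffer already exist; the helper name and the immediate value are illustrative, not my exact code:

/* Post one 4096-byte SEND with immediate data on a UD queue pair.
 * Assumes qp, ah and the buffer's MR were created beforehand. */
#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

static int post_send_with_imm(struct ibv_qp *qp, struct ibv_ah *ah,
                              uint32_t remote_qpn, uint32_t remote_qkey,
                              void *buf, uint32_t len, uint32_t lkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* buffer registered with ibv_reg_mr()     */
        .length = len,              /* 4096 bytes, i.e. one UD MTU, in my test */
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND_WITH_IMM;   /* SEND carrying immediate data   */
    wr.imm_data   = htonl(0x1234);          /* arbitrary tag, delivered in the
                                               receiver's completion          */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.ud.ah          = ah;              /* UD: address handle of the peer */
    wr.wr.ud.remote_qpn  = remote_qpn;
    wr.wr.ud.remote_qkey = remote_qkey;

    return ibv_post_send(qp, &wr, &bad_wr); /* 0 on success */
}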


Actually I found the culprit: the dropless_rq flag on the ConnectX-5 was set to off. In that case, some packets may be dropped when processing of incoming RDMA datagrams on the receiver is slower than their generation on the sender side. To prevent data loss, the card emits Global Pause frames that slow down the traffic.

With dropless_rq ON, incoming RDMA requests are buffered in the NIC, even when there are no more outstanding receive requests, so they can be processed later. When this buffer fills up with unprocessed requests, the card still emits Global Pause frames, but less often…
That is my understanding of the issue. Feel free to comment.

The question remains: is 72 Gb/s the maximum achievable throughput from the ConnectX to GPU device memory using GPUDirect? (The same code achieves 97.4 Gb/s sustained from the Mellanox NIC to CPU memory.)

My code is arranged (double buffering) so that there are always receive requests in the work queue ready for processing. Is there a hardware limit that may be the reason for this lower transfer throughput?
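
For context, a minimal sketch of that receive-side arrangement, assuming the QP, CQ and registered ring buffer are already set up; the slot bookkeeping and names are illustrative, not my exact code:

/* Keep the receive queue stocked: repost each slot as soon as its
 * completion has been polled, so the RQ never runs dry. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define GRH_LEN   40          /* UD receives are prefixed by a 40-byte GRH   */
#define MSG_LEN   4096
#define NUM_SLOTS (4096 * 2)  /* matches the 4096*2 work requests above;
                                 base must cover NUM_SLOTS slots and be
                                 registered with ibv_reg_mr()                */

static int post_recv_slot(struct ibv_qp *qp, char *base, uint32_t lkey, int slot)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)(base + (size_t)slot * (GRH_LEN + MSG_LEN)),
        .length = GRH_LEN + MSG_LEN,
        .lkey   = lkey,
    };
    struct ibv_recv_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = slot;        /* slot index comes back in the completion */
    wr.sg_list = &sge;
    wr.num_sge = 1;
    return ibv_post_recv(qp, &wr, &bad_wr);
}

/* Poll loop: consume a completion, hand off the payload, repost the slot. */
static void service_cq(struct ibv_cq *cq, struct ibv_qp *qp,
                       char *base, uint32_t lkey)
{
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status == IBV_WC_SUCCESS) {
            /* payload starts GRH_LEN bytes into the slot;
               wc.byte_len - GRH_LEN bytes are valid data */
        }
        post_recv_slot(qp, base, lkey, (int)wc.wr_id);
    }
}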

I can’t give you an authoritative answer since I haven’t used a setup like yours. But consider the following:

The bandwidth of your system memory is significantly higher than the bandwidth of the Mellanox device or the GPU. One channel of DDR4-2666 provides a theoretical bandwidth of 21.3 GB/sec, and your system memory presumably comprises four or six such channels. That makes system memory a much faster data source/sink.
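
As a back-of-the-envelope check (assuming 64-bit channels, i.e. 8 bytes per transfer, and four populated channels):

\[
2666\ \text{MT/s} \times 8\ \text{B} \approx 21.3\ \text{GB/s per channel}, \qquad
4 \times 21.3\ \text{GB/s} \approx 85\ \text{GB/s}
\]

which dwarfs the roughly 12.5 GB/sec available on a PCIe gen3 x16 link.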

You may want to inquire with Mellanox regarding optimal configuration settings for RDMA usage with GPUs. There is a chance that an NVIDIA engineer with knowledge of this specialized subject matter may encounter this thread and be able to provide tips from the GPU perspective. Note that these forums are not designed as an NVIDIA support channel, so that chance is likely slim.

Thank you.

Sure; it seems that the bottleneck lies somewhere in a PCIe architectural flaw.

I did not find any other info apart from this blog post from D. Rossetti that notes the same poor performance:

If so, GPUDirect is good for lower latency, but not so much for throughput…

D. Rossetti is the expert on GPUdirect.

The entire point of GPUdirect is to lower latency. As far as throughput is concerned, any properly configured modern system can saturate a PCIe gen 3 x16 link (achieve 12+ GB/sec throughput per direction) via copies through system memory.

Since PCIe uses packetized transport and there are various fixed overheads, actual throughput achieved depends on the size of individual transfers. Typically, PCIe bandwidth is fully exploited once transfer size reaches 4 MB or so. It is easy enough to write a test program to explore the relationship between transfer size and PCIe throughput to/from the GPU; the CUDA sample app bandwidthTest may even be sufficient.
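
To illustrate, here is a minimal sketch of such a test program using the CUDA runtime API (error checking omitted; the sizes and repetition count are arbitrary, and it measures only pinned host-to-device copies):

/* Sweep host-to-device copy sizes and report throughput for each. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int reps = 100;
    for (size_t size = 8192; size <= (size_t)32 << 20; size *= 2) {
        void *h = NULL, *d = NULL;
        cudaMallocHost(&h, size);                 /* pinned host buffer */
        cudaMalloc(&d, size);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        for (int i = 0; i < reps; i++)
            cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%10zu bytes : %8.1f MB/s\n",
               size, (double)size * reps / (ms * 1e-3) / 1e6);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        cudaFreeHost(h);
    }
    return 0;
}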

PCIe gen 4 is around the corner, but the way these things tend to go I would expect improved throughput but little to no reduction in overhead, so that throughput will still be poor for small transfers.

So, can we say that this is because RDMA over RoCEv2 relies on a 4 KB maximum payload (4096-byte MTU), so that PCIe transactions toward the sink are ‘slow’ when the sink is another PCIe device?

You would want to discuss this with an expert in PCIe. I am not one; I only know the basics.

The maximum payload size of individual low-level packets determines how much of the theoretical PCIe bandwidth is available to user data throughput. I vaguely recall that the maximum payload size supported by GPUs is 128 bytes, which results in an efficiency of 86% (the other 14% being taken up by headers). This reduces the maximum achievable bandwidth to about 12.5 GB/sec for a x16 link.
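
As a rough sketch of where those numbers come from (assuming about 20 bytes of TLP header and framing overhead per 128-byte payload; the exact overhead depends on the platform):

\[
\frac{128}{128 + 20} \approx 0.86, \qquad
8\ \text{GT/s} \times 16\ \text{lanes} \times \frac{128}{130} \approx 15.8\ \text{GB/s}, \qquad
15.8 \times 0.86 \approx 13.5\ \text{GB/s}
\]

Link-layer traffic (ACK and flow-control packets) takes a further slice, which is how one lands near the ~12.5 GB/sec figure in practice.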

I am guessing that the 4K size you mention is the PCIe maximum read request size? I don’t know what role that plays. The transfer size I was referring to in my earlier post is the size of an entire high-level (application-level) transaction, such as a cudaMemcpy() operation. The sample app bandwidthTest that ships with CUDA can give a reasonable overview of the throughput achievable for various transaction sizes. On my platform I see this:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\extras\demo_suite>bandwidthTest.exe --htod --mode=range --start=8192 --end=344064 --increment=8192
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro P2000
 Range Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   8192                         2149.6
   16384                        3717.5
   24576                        4845.0
   32768                        5681.8
   40960                        6367.8
   49152                        6976.7
   57344                        7390.2
   65536                        8097.2
   73728                        8081.9
   81920                        8311.2
   90112                        8672.3
   98304                        8368.2
   106496                       9147.2
   114688                       8678.4
   122880                       9375.0
   131072                       9535.2
   139264                       9681.1
   147456                       9816.8
   155648                       9895.8
   163840                       9998.0
   172032                       10174.4
   180224                       10202.2
   188416                       10302.8
   196608                       9945.3
   204800                       10389.0
   212992                       10455.2
   221184                       10535.4
   229376                       10607.7
   237568                       10611.8
   245760                       10697.5
   253952                       10711.8
   262144                       10443.9
   270336                       10523.0
   278528                       10582.7
   286720                       10922.5
   294912                       10943.6
   303104                       10991.0
   311296                       10995.4
   319488                       11079.5
   327680                       11075.4
   335872                       11111.1
   344064                       11122.9

For very large transfers I get almost the anticipated 12.5 GB/sec:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\extras\demo_suite>bandwidthTest.exe
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro P2000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12366.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12462.5

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     119682.1

Result = PASS

Thanks for your interesting comments; I will try to get in touch with DR…

What I would like to know is why the full PCIe bandwidth is not achievable in the case of a network-card-to-GPU transfer. Presumably it is because the PCIe packets used in transport are too small, but why? Is that related to the fact that incoming Ethernet datagrams are of limited size (4096 bytes in the RoCE RDMA protocol)?

For those who are interested in this topic, I copy/paste the answer of D. Rossetti [DR], NVIDIA engineer and GPUDirect/nv_peer_mem specialist.


I understand that it is related to a hardware limit of the PCIe implementation.

[DR] Note that on recent server-grade CPUs, RCs have improved a little bit on the P2P PCIe read front, but still very low BW is observed.

It looks like PCIe transactions from device (NIC) to CPU memory, or from CPU memory to device (GPU), are faster than PCIe transactions from device (NIC) to device (GPU), and that this should be related to the payload size.

[DR] Not necessarily related to the PCIe payload size. It should be more related to the number of outstanding PCIe transactions that can be forwarded across the RC peer-to-peer data path.

Can we say that the PCIe root complex is better at handling the relatively small RoCE packets (4096 B) when the sink is CPU memory than when it is another PCIe GPU device?

[DR] Experimentally, that is what we observe. Different CPU RCs have been shown to have different capabilities and performance, e.g. AMD, Intel, IBM.
For your information, on NVIDIA DGX systems we purposely deploy PCIe switch chips so as to achieve ~90% of the peak PCIe RDMA BW.

In other words, if we were using another RDMA protocol with larger datagrams, would we get better GPUDirect throughput?

[DR] As I mentioned above, I don’t think so. If you need full P2P BW, you should consider using a motherboard with a PCIe switch.