I have a question about the launch overhead of cudaMemcpyPeerAsync when P2P access is not available. For large (>5 MB) transfers the launch overhead is on the order of 100-200 us, which is much more than the ~5 us of a typical kernel launch or cudaMemcpyAsync. My understanding of what is actually happening is that two host-pinned staging buffers have to be allocated somewhere to allow the communication to be pipelined. From my estimates the buffer appears to be ~2 MB, which is believable. Is this the root cause of the long delay? The launch latency really matters when you are trying to launch many cudaMemcpyPeerAsync calls transferring data between many GPUs.
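For context, the launch latency I'm quoting is measured roughly along these lines (a simplified sketch rather than my exact code; the 8 MB size and device IDs 0/1 are just illustrative):

```cpp
// Sketch: time only the host-side submission of cudaMemcpyPeerAsync,
// not the completion of the copy. Sizes and device IDs are illustrative.
#include <cuda_runtime.h>
#include <chrono>
#include <ratio>
#include <cstdio>

int main() {
    const size_t bytes = 8 << 20;  // 8 MB, large enough to show the effect
    void *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm up so any one-time setup cost is not counted.
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);
    cudaStreamSynchronize(stream);

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);  // async submission only
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaStreamSynchronize(stream);

    printf("launch latency: %.1f us\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count());

    cudaStreamDestroy(stream);
    cudaFree(src);
    cudaSetDevice(1);
    cudaFree(dst);
    return 0;
}
```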
Yes, for the reasons you indicate, a device-to-device cudaMemcpyPeerAsync in a non-P2P environment is going to have additional overhead. There are actually 2 “transfers”: one from device to host, and the other from host to device. If it were me, I would not refer to it as “kernel launch overhead”, but I understand what you mean. Since the staging buffer(s) allocated are not necessarily as large as the transfer itself, the “transfer” I refer to above may actually be multiple transfers:
Device1 → Host (2MB)
Host → Device2 (2MB)
Device1 → Host (2MB)
…
(yes, I acknowledge things may be double-buffered; the above is not intended to be an exact depiction of the sequence, but merely to point out that there are multiple steps involved)
If each of those steps incurs ~5 us of “launch” latency, then the overall latency of the transfer can add up based on transfer size. Of course we expect this for transfers in general: the larger the transfer, the longer it takes to complete. But if it were me, I would not assume that the scaling behavior of a single device-to-host cudaMemcpyAsync transfer to pinned memory is equivalent to the scaling behavior of cudaMemcpyPeerAsync device to device in a non-P2P setting.
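To make the chunking concrete, here is a rough sketch of what such a staged copy looks like if you write it by hand. It is not what the driver actually does internally (among other things the real path is double-buffered so the two hops can overlap, and the 2 MB chunk size is just your estimate), but it shows why the cost grows with the number of chunks:

```cpp
// Rough, illustrative approximation of the staged (non-P2P) path described
// above. Assumes unified virtual addressing, so cudaMemcpyDefault lets the
// runtime infer the direction of each hop.
#include <cuda_runtime.h>
#include <algorithm>

void stagedPeerCopy(void* dst, const void* src, size_t bytes,
                    cudaStream_t stream) {
    const size_t kChunk = 2ull << 20;   // ~2 MB staging buffer (the estimate above)
    void* staging = nullptr;
    cudaMallocHost(&staging, kChunk);   // pinned host bounce buffer

    for (size_t off = 0; off < bytes; off += kChunk) {
        size_t n = std::min(kChunk, bytes - off);

        // Hop 1: source device -> pinned host buffer
        cudaMemcpyAsync(staging, (const char*)src + off, n,
                        cudaMemcpyDefault, stream);
        // Hop 2: pinned host buffer -> destination device.
        // Stream ordering guarantees hop 1 finished before hop 2 reads 'staging'.
        cudaMemcpyAsync((char*)dst + off, staging, n,
                        cudaMemcpyDefault, stream);
        // Each pair of hops pays its own submission cost, which is where the
        // size-dependent "launch overhead" comes from.
    }

    cudaFreeHost(staging);
}
```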
Thanks for your detailed response. Yes, I have noticed that larger transfers have a larger “launch overhead”, which would be consistent with there internally being multiple pipelined DtoH and HtoD transfers. I actually ended up only using cudaMemcpyPeerAsync when there are exactly 2 GPUs, so the “kernel launch overhead” isn’t a huge deal.
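For reference, this is the kind of check that can be done up front to see whether the direct peer path is even available before relying on cudaMemcpyPeerAsync (a sketch, not my exact code; error handling omitted):

```cpp
// Sketch: decide whether the direct (P2P) path exists between two devices;
// if not, cudaMemcpyPeerAsync will fall back to staging through the host.
#include <cuda_runtime.h>

bool enablePeerAccessIfPossible(int devA, int devB) {
    int aToB = 0, bToA = 0;
    cudaDeviceCanAccessPeer(&aToB, devA, devB);
    cudaDeviceCanAccessPeer(&bToA, devB, devA);
    if (!aToB || !bToA) return false;      // only the staged path is available

    cudaSetDevice(devA);
    cudaDeviceEnablePeerAccess(devB, 0);   // flags must be 0
    cudaSetDevice(devB);
    cudaDeviceEnablePeerAccess(devA, 0);
    return true;                           // direct peer copies are now possible
}
```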
Thanks,
Gaetan