I have a question about the launch overhead of cudaMemcpyPeerAsync when P2P access is not available. For large (>5 MB) transfers the launch overhead is on the order of 100-200 us, which is much more than the ~5 us of a typical kernel launch or cudaMemcpyAsync. My understanding of what is actually happening is that two host-pinned staging buffers have to be allocated somewhere to allow the communication to be pipelined. From my estimates the buffer appears to be ~2 MB, which is believable. Is this the root cause of the long delay? The launch latency really matters when you are trying to launch many cudaMemcpyPeerAsync calls transferring data between many GPUs.
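For context, the launch latency I'm quoting is measured roughly along these lines (a simplified sketch rather than my exact code; the 8 MB size and device IDs 0/1 are just illustrative):

```cpp
// Sketch: time only the host-side submission of cudaMemcpyPeerAsync,
// not the completion of the copy. Sizes and device IDs are illustrative.
#include <cuda_runtime.h>
#include <chrono>
#include <ratio>
#include <cstdio>

int main() {
    const size_t bytes = 8 << 20;  // 8 MB, large enough to show the effect
    void *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm up so any one-time setup cost is not counted.
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);
    cudaStreamSynchronize(stream);

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);  // async submission only
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaStreamSynchronize(stream);

    printf("launch latency: %.1f us\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count());

    cudaStreamDestroy(stream);
    cudaFree(src);
    cudaSetDevice(1);
    cudaFree(dst);
    return 0;
}
```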
Yes, for the reasons you indicate, a device-to-device cudaMemcpyPeerAsync in a non-P2P environment is going to have additional overhead. There are actually 2 “transfers”: one from device to host, and the other from host to device. If it were me, I would not refer to it as “kernel launch overhead”, but I understand what you mean. Since the staging buffer(s) allocated are not necessarily as large as the transfer itself, the “transfer” I refer to above may actually be multiple transfers:
Device1 → Host (2MB)
Host → Device2 (2MB)
Device1 → Host (2MB)
…
(yes, I acknowledge things may be double-buffered; the above is not intended to be an exact depiction of the sequence, but merely to point out that there are multiple steps involved)
If each of those steps incurs ~5 us of “launch” latency, then the overall latency of the transfer can add up based on transfer size. Of course we expect this for transfers in general: the larger the transfer, the longer it takes to complete. But if it were me, I would not assume that the scaling behavior of a single device-to-host cudaMemcpyAsync transfer to pinned memory is equivalent to the scaling behavior of cudaMemcpyPeerAsync device to device in a non-P2P setting.
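To make the chunking concrete, here is a rough sketch of what such a staged copy looks like if you write it by hand. It is not what the driver actually does internally (among other things the real path is double-buffered so the two hops can overlap, and the 2 MB chunk size is just your estimate), but it shows why the cost grows with the number of chunks:

```cpp
// Rough, illustrative approximation of the staged (non-P2P) path described
// above. Assumes unified virtual addressing, so cudaMemcpyDefault lets the
// runtime infer the direction of each hop.
#include <cuda_runtime.h>
#include <algorithm>

void stagedPeerCopy(void* dst, const void* src, size_t bytes,
                    cudaStream_t stream) {
    const size_t kChunk = 2ull << 20;   // ~2 MB staging buffer (the estimate above)
    void* staging = nullptr;
    cudaMallocHost(&staging, kChunk);   // pinned host bounce buffer

    for (size_t off = 0; off < bytes; off += kChunk) {
        size_t n = std::min(kChunk, bytes - off);

        // Hop 1: source device -> pinned host buffer
        cudaMemcpyAsync(staging, (const char*)src + off, n,
                        cudaMemcpyDefault, stream);
        // Hop 2: pinned host buffer -> destination device.
        // Stream ordering guarantees hop 1 finished before hop 2 reads 'staging'.
        cudaMemcpyAsync((char*)dst + off, staging, n,
                        cudaMemcpyDefault, stream);
        // Each pair of hops pays its own submission cost, which is where the
        // size-dependent "launch overhead" comes from.
    }

    cudaFreeHost(staging);
}
```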
Thanks for your detailed response. Yes, I have noticed that larger transfers have a larger “launch overhead”, which would be consistent with there internally being multiple pipelined DtoH and HtoD transfers. I actually ended up only using cudaMemcpyPeerAsync when there are exactly 2 GPUs, so the “kernel launch overhead” isn’t a huge deal.
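For reference, this is the kind of check that can be done up front to see whether the direct peer path is even available before relying on cudaMemcpyPeerAsync (a sketch, not my exact code; error handling omitted):

```cpp
// Sketch: decide whether the direct (P2P) path exists between two devices;
// if not, cudaMemcpyPeerAsync will fall back to staging through the host.
#include <cuda_runtime.h>

bool enablePeerAccessIfPossible(int devA, int devB) {
    int aToB = 0, bToA = 0;
    cudaDeviceCanAccessPeer(&aToB, devA, devB);
    cudaDeviceCanAccessPeer(&bToA, devB, devA);
    if (!aToB || !bToA) return false;      // only the staged path is available

    cudaSetDevice(devA);
    cudaDeviceEnablePeerAccess(devB, 0);   // flags must be 0
    cudaSetDevice(devB);
    cudaDeviceEnablePeerAccess(devA, 0);
    return true;                           // direct peer copies are now possible
}
```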
Thanks,
Gaetan