Async Memcpy Sharing bandwidth with Kernel?


I’ve got a question: do async memcpys share device memory bandwidth with concurrently running kernels? I ask because if using async memcpys can save me ~10% of execution time, but my bandwidth-limited kernel also gets ~10% slower as a result, it might not be worth the trouble.


Yes, they share the same total device memory bandwidth.
In your case zero-copy might help you squeeze out the final 10% of execution time, as it adds PCIe bandwidth on top of device memory bandwidth.

I’m not sure zero-copy will help - what you have is GPU memory bandwidth being divided between PCIe transactions and kernel accesses. It doesn’t really make a difference whether the PCIe transactions are due to cudaMemcpy(Async) calls or zero-copy accesses.

Now, it is most likely that overlapping kernels and memcopies is beneficial to the app. Say your kernel achieves 100 GB/s throughput, whereas the PCIe copy achieves 6 GB/s. If we assume the kernel takes roughly the same amount of time as the memcopy, say 100 s, then here are the choices:

    Concurrent (overlapped) kernel and memcopy: together they still total 100 GB/s, so the kernel is observed to achieve 94 GB/s and takes ~106 s. Total time is ~106 s.

    Kernel, then memcopy: total time is now 200 s.

So, for this case overlapping the kernel and memcopy is a clear win - 106 s vs. 200 s. You get the idea; now you just have to plug in your specific numbers.
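To make the overlap concrete, here is a minimal sketch of the pattern being discussed (the `process` kernel, chunk sizes, and buffer names are placeholders, not from this thread): one stream copies the next chunk host-to-device while another stream runs the kernel on the current chunk. Note that the host buffer must be pinned (page-locked) for cudaMemcpyAsync to actually overlap with kernel execution.

```cuda
#include <cuda_runtime.h>

// Hypothetical bandwidth-limited kernel; stands in for your real workload.
__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int N = 1 << 20;       // elements per chunk (placeholder size)
    const int CHUNKS = 2;        // double-buffer: copy one chunk, run the other
    float *h_buf, *d_buf[2];

    // Pinned host memory: required for the async copy to run concurrently.
    cudaMallocHost(&h_buf, CHUNKS * N * sizeof(float));
    cudaMalloc(&d_buf[0], N * sizeof(float));
    cudaMalloc(&d_buf[1], N * sizeof(float));

    cudaStream_t copyStream, execStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&execStream);

    // Prime the pipeline with chunk 0.
    cudaMemcpyAsync(d_buf[0], h_buf, N * sizeof(float),
                    cudaMemcpyHostToDevice, copyStream);
    cudaStreamSynchronize(copyStream);

    for (int i = 0; i < CHUNKS; ++i) {
        // Stage the next chunk's copy while the kernel runs on this one.
        // Both the copy engine and the kernel draw on the same device
        // memory bandwidth, so the kernel runs somewhat slower during
        // the copy - but less than running the two steps back to back.
        if (i + 1 < CHUNKS)
            cudaMemcpyAsync(d_buf[(i + 1) % 2], h_buf + (size_t)(i + 1) * N,
                            N * sizeof(float), cudaMemcpyHostToDevice,
                            copyStream);
        process<<<(N + 255) / 256, 256, 0, execStream>>>(d_buf[i % 2], N);
        cudaDeviceSynchronize();  // join both streams before the next round
    }

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(execStream);
    cudaFree(d_buf[0]);
    cudaFree(d_buf[1]);
    cudaFreeHost(h_buf);
    return 0;
}
```

The per-iteration cudaDeviceSynchronize keeps the sketch simple; in a real pipeline you would synchronize with per-chunk events so iteration i+1 can start as soon as its own copy is done.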