I’m not sure 0-copy will help: what you have is GPU memory bandwidth being divided between PCIe transactions and kernel accesses. It doesn’t really make a difference whether the PCIe transactions come from cudaMemcpy(Async) calls or from 0-copy accesses.
Now, it is quite likely that overlapping kernels and memcopies will benefit the app. Say your kernel achieves 100 GB/s throughput, whereas a PCIe copy achieves 6 GB/s. If we assume the kernel takes roughly the same amount of time as the memcopy, say 100 s, then here are the choices:
- Concurrent (overlapped) kernel and memcopy: together they still achieve 100 GB/s, so with 6 GB/s going to the copy, the kernel is observed to achieve 94 GB/s and takes about 106 s. Total time: 106 s.
- Kernel, then memcopy: total time is now 200 s.
So, for this case overlapping the kernel and memcopy is a clear win: 106 s vs. 200 s. You get the idea; now you just have to plug in your specific numbers.
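To actually get that overlap, the copy and the kernel have to be issued into different CUDA streams, and the host buffer involved in the async copy must be pinned (allocated with cudaMallocHost/cudaHostAlloc), otherwise the transfer falls back to synchronous behavior. Here is a minimal sketch of the pattern; the kernel, sizes, and buffer names are all illustrative, not taken from your app:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: scales an array in place.
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);

    // Pinned host memory is required for cudaMemcpyAsync to
    // overlap with kernel execution.
    float *h_copy, *h_work;
    cudaMallocHost(&h_copy, bytes);
    cudaMallocHost(&h_work, bytes);

    float *d_copy, *d_work;
    cudaMalloc(&d_copy, bytes);
    cudaMalloc(&d_work, bytes);
    cudaMemcpy(d_work, h_work, bytes, cudaMemcpyHostToDevice);

    // Two streams: one for the PCIe transfer, one for compute.
    cudaStream_t sCopy, sComp;
    cudaStreamCreate(&sCopy);
    cudaStreamCreate(&sComp);

    // Issue the memcopy and the kernel into different streams so
    // the hardware can execute them concurrently.
    cudaMemcpyAsync(d_copy, h_copy, bytes, cudaMemcpyHostToDevice, sCopy);
    scale<<<(n + 255) / 256, 256, 0, sComp>>>(d_work, n, 2.0f);

    cudaStreamSynchronize(sCopy);
    cudaStreamSynchronize(sComp);

    cudaStreamDestroy(sCopy);
    cudaStreamDestroy(sComp);
    cudaFree(d_copy);
    cudaFree(d_work);
    cudaFreeHost(h_copy);
    cudaFreeHost(h_work);
    return 0;
}
```

Timing the two-stream version against a version that puts everything in the default stream (with events or nvprof/Nsight) will show you how close your overlap gets to the 106 s vs. 200 s arithmetic above.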