I’m currently attempting to overlap the memory copies between the CPU and the GPU with GPU kernel execution. However, I’ve noticed that doing so slows both operations down, as discussed in this forum thread.
Is there any API or technique to isolate the bandwidth used by the memory copies from the bandwidth used by the kernel, so that the kernel keeps its original performance?
The original performance is presumably the kernel duration when it runs without any other activity. The code is probably memory bound and already uses approximately the full memory bandwidth. When you add a copy operation, some of that bandwidth is consumed by the copy, so the kernel runs slower.
No, there is no hidden reserve of memory bandwidth you can tap into so that the kernel runs as if it had full bandwidth while some other activity is using part of it. Partitioning or isolating the bandwidth would simply reduce the bandwidth available to the kernel anyway.
If you really want the kernel to run at full speed as the top priority, then put the copy operations in question in the same stream as the kernel. That forces those copies to run before or after, but not during, the kernel execution.
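A minimal sketch of that approach, assuming a simple stand-in kernel and pinned host buffers (all names and sizes here are illustrative):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for whatever memory-bound kernel you are running.
__global__ void scale(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main()
{
    const int n = 1 << 24;
    const size_t bytes = n * sizeof(float);

    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in, bytes);   // pinned host memory, so the async copies are truly asynchronous
    cudaMallocHost(&h_out, bytes);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All three operations are enqueued in the same stream, so they execute
    // strictly one after another: the copies never compete for bandwidth
    // during the kernel, they only run before or after it.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("h_out[0] = %f\n", h_out[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```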
You can always reduce the required memory bandwidth (not only for the copies, but also during the kernel run) by reading and writing only an artificially small range of memory, so that the working set fits entirely into the L2 cache. Of course you would not get correct numeric results as output, but it can help with determining bottlenecks or theoretical speed numbers for the isolated computational part, if that is what you want.
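A rough sketch of that trick, using a vector-add kernel as a stand-in (only the kernel is shown; the 4 MiB window is an arbitrary assumption and should be sized below your GPU's L2 capacity):

```cpp
// Every thread still performs the same arithmetic, but all reads and writes
// are wrapped into a small window that fits in L2, so DRAM traffic collapses.
// The numeric output is wrong by design; only the timing is of interest.
__global__ void add_l2_only(const int* a, const int* b, int* c, int n, int window)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = i % window;   // confine all traffic to the first `window` elements
        c[j] = a[j] + b[j];
    }
}

// Example launch: a window of 4 MiB worth of ints instead of the full n elements.
// add_l2_only<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n, (4 << 20) / sizeof(int));
```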
Regarding this, I want to estimate how much GPU bandwidth my kernel is using.
For example, let’s assume a simple vector addition kernel that calculates a + b = c.
If a, b, and c are each int vectors of 50 MB and this kernel takes 0.2 ms,
is it reasonable to say that this kernel uses approximately 150 MB * (1000 / 0.2) = 750,000 MB/s = 750 GB/s of bandwidth?
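For reference, this is roughly how I am measuring it (a simplified sketch; the kernel, sizes, and launch configuration are illustrative):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Plain vector add: reads a and b, writes c.
__global__ void vecAdd(const int* a, const int* b, int* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    // ~50 MB per vector, as in the example above.
    const int n = 50 * 1000 * 1000 / sizeof(int);
    const size_t bytes = n * sizeof(int);

    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the timed run does not include one-time overheads.
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaEventRecord(start);
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Two reads + one write per element = 3 * bytes of DRAM traffic.
    double traffic = 3.0 * bytes;
    printf("time: %.3f ms, effective bandwidth: %.1f GB/s (%.1f GiB/s)\n",
           ms,
           traffic / (ms * 1e-3) / 1e9,        // decimal GB/s
           traffic / (ms * 1e-3) / (1 << 30)); // binary GiB/s

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```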
That sounds right. You can check with Nsight Compute, too. Also, when talking about GB, the difference between GB and GiB is about 7%. But feel free to ignore that for a first estimation.
There are some interfaces that can do either reads or writes (but not both at the same time), and some interfaces that can do both in parallel.
Global memory within the GPU has to handle read and write accesses serially, whereas PCIe communication to/from the host is full duplex and can do concurrent reads and writes. Keep this distinction in mind when comparing to official numbers.
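For example, a sketch of driving both PCIe directions at once (buffer sizes are arbitrary; overlap needs pinned host memory and a GPU with a copy engine per direction):

```cpp
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256 << 20;   // 256 MiB per direction (arbitrary size)

    // Pinned host buffers are required for truly asynchronous PCIe transfers.
    void *h_up, *h_down, *d_up, *d_down;
    cudaMallocHost(&h_up, bytes);
    cudaMallocHost(&h_down, bytes);
    cudaMalloc(&d_up, bytes);
    cudaMalloc(&d_down, bytes);

    cudaStream_t s_up, s_down;
    cudaStreamCreate(&s_up);
    cudaStreamCreate(&s_down);

    // One host-to-device and one device-to-host copy issued in different
    // streams: with a copy engine per direction, both transfers can be in
    // flight at the same time over the full-duplex PCIe link.
    cudaMemcpyAsync(d_up, h_up, bytes, cudaMemcpyHostToDevice, s_up);
    cudaMemcpyAsync(h_down, d_down, bytes, cudaMemcpyDeviceToHost, s_down);

    cudaStreamSynchronize(s_up);
    cudaStreamSynchronize(s_down);

    cudaStreamDestroy(s_up);
    cudaStreamDestroy(s_down);
    cudaFreeHost(h_up); cudaFreeHost(h_down);
    cudaFree(d_up); cudaFree(d_down);
    return 0;
}
```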