I expected the dual copy engines would allow the memory copies to run in parallel.
From Nsight, it does not appear that this is the case.
As an additional experiment, I modified Thread 2 to kick off a busy-wait kernel. Thread 2 now queues the kernel, then the cudaMemcpyAsync, and finally does a stream synchronize.
Here I saw that a memory copy from Thread 1 would execute in parallel with the kernel, but then the next Thread 1 copy would be blocked until the Thread 2 memory copy finished.
Note: I do sometimes see a minor overlap between the two streams' copies.
Note: Since these are async calls, in Nsight I can see all the requests coming in (and being queued), a small delay, and then the copies and the kernel starting to run.
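For reference, here is a minimal sketch of the setup described above (buffer sizes, names, and the busy-wait cycle count are my assumptions, not from the original measurements): two host threads, each issuing a device-to-device cudaMemcpyAsync on its own stream, with Thread 2 first launching a busy-wait kernel before its copy.

```cpp
#include <cuda_runtime.h>
#include <thread>

// Busy-wait kernel: spins until roughly `cycles` GPU clock ticks have elapsed.
__global__ void busyWait(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* spin */ }
}

// Each host thread issues a D2D copy on its own stream;
// Thread 2 additionally launches the busy-wait kernel first.
void threadFunc(cudaStream_t stream, void* dst, const void* src,
                size_t bytes, bool launchKernel) {
    if (launchKernel)
        busyWait<<<1, 1, 0, stream>>>(1000000000LL);  // Thread 2 only
    cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, stream);
    cudaStreamSynchronize(stream);
}

int main() {
    const size_t bytes = 256u << 20;  // 256 MiB per buffer (assumed size)
    void *src1, *dst1, *src2, *dst2;
    cudaMalloc(&src1, bytes); cudaMalloc(&dst1, bytes);
    cudaMalloc(&src2, bytes); cudaMalloc(&dst2, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    std::thread t1(threadFunc, s1, dst1, src1, bytes, false);
    std::thread t2(threadFunc, s2, dst2, src2, bytes, true);
    t1.join();
    t2.join();

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(src1); cudaFree(dst1); cudaFree(src2); cudaFree(dst2);
    return 0;
}
```

With this in place, profiling under Nsight is where I observe the Thread 1 copy overlapping the kernel but serializing against the Thread 2 copy.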
- Why are the D2D memory copies not running in parallel?
- What role do the dual copy engines play here? Are they used strictly for off-GPU (host) copies?
- Is it possible to achieve truly simultaneous D2D memory copies?