how to use concurrent copy and execute with multiple devices?

I’ve got multiple GPUs in one machine (a few more than two). I want to use concurrent copy and execute so that I can copy a buffer between GPUs while other kernels are running, but it’s not clear from the specs what happens regarding dependencies on the target GPU.

To be more specific, what I need to do:

  1. GPU 0 produces buffer A
  2. GPU 0 copies buffer A asynchronously to GPU 1
  3. GPU 0 starts producing buffer B
  4. GPU 1 copies buffer A to GPU 2 (once it’s ready for copy)
  5. GPU 0 copies buffer B to GPU 1
  6. GPU 1 copies buffer B to GPU 2
  7. GPU 1 starts processing based on buffers A and B

My problem is with the dependencies in steps 4, 6, and 7. I need to know when a buffer has finished moving so I can move it onward, and I also need to know when I can start processing that buffer.
cudaMemcpyPeer (and cudaMemcpy with UVA) block everything, so concurrent copy and execute don't work.
cudaMemcpyPeerAsync only takes a stream on the current device, so there is no control over dependencies on the target GPU, and I don't know when I can start working on that buffer.
Making GPU 1 pull from GPU 0 solves the problem of knowing when the data has arrived, but introduces the problem of knowing when the data is ready to be pulled.
It looks like what I need is a more complex synchronization mechanism, as in OpenCL, only OpenCL doesn't know how to do a peer memcpy in the first place.
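For reference, here is a minimal sketch of what I'm trying now (device IDs, buffer names, and the producer kernel are placeholders; error checking omitted). It shows the gap: the peer copy is ordered only within GPU 0's stream, so nothing on GPU 1 can depend on it:

```cuda
// Sketch only: the async peer copy is ordered in a stream on the
// issuing device (GPU 0); GPU 1 never learns when the copy completes.
cudaStream_t s0;
cudaSetDevice(0);
cudaStreamCreate(&s0);

// Step 1: GPU 0 produces buffer A (produceA is a placeholder kernel).
produceA<<<grid, block, 0, s0>>>(devA0);

// Step 2: copy A from GPU 0 to GPU 1, ordered after produceA in s0.
cudaMemcpyPeerAsync(devA1, 1, devA0, 0, size, s0);

// Step 4 is the problem: GPU 1 has no stream or event that depends
// on the copy above, so it cannot safely forward devA1 to GPU 2.
```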

Any ideas?

I hope that I've managed to get the point across … it actually gets much more complex than this on the real system.


Would cudaStreamWaitEvent() help with your problem?
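Something along these lines might work (a sketch; kernel names, launch parameters, and buffer names are assumptions, and error checking is omitted). Record an event in GPU 0's stream right after the peer copy, then make GPU 1's stream wait on that event; cudaStreamWaitEvent() can wait on an event recorded on a different device:

```cuda
cudaStream_t s0, s1;
cudaEvent_t aOnGpu1;

cudaSetDevice(0);
cudaStreamCreate(&s0);
// Timing is not needed, only ordering.
cudaEventCreateWithFlags(&aOnGpu1, cudaEventDisableTiming);

cudaSetDevice(1);
cudaStreamCreate(&s1);

cudaSetDevice(0);
produceA<<<grid, block, 0, s0>>>(devA0);            // step 1
cudaMemcpyPeerAsync(devA1, 1, devA0, 0, size, s0);  // step 2
cudaEventRecord(aOnGpu1, s0);   // marks "A has arrived on GPU 1"
produceB<<<grid, block, 0, s0>>>(devB0);            // step 3

cudaSetDevice(1);
cudaStreamWaitEvent(s1, aOnGpu1, 0);  // s1 waits for the copy of A
cudaMemcpyPeerAsync(devA2, 2, devA1, 1, size, s1);  // step 4
// ... repeat the same event pattern for buffer B (steps 5-6), then
// launch the consumer kernel in s1 (step 7) after both waits.
```

The waits are enqueued, not blocking on the host, so the host thread can keep issuing work for all devices.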

This indeed seems like the way to go. I'm testing it now. The problem is that this computer runs Linux, so no Nsight, and the Visual Profiler has some issues with showing whether concurrent copy and execute actually works.


Visual Profiler can correctly show concurrent copy and execute on the timeline. Which version of the CUDA Toolkit are you using?