Is it possible to overlap computation with a device-to-device memcopy?

Hello fellow CUDA programmers,

I have a quick question for anyone who is currently overlapping computation and memcopies. Is it possible to overlap the execution of a kernel with a device-to-device memcopy? The programming guide makes it clear that this IS possible for device-to-host memcopies, as well as host-to-device memcopies.

However, it is not clear to me what will happen if I pass the cudaMemcpyDeviceToDevice constant to the memcopy call being overlapped.

I care about this because I want to copy a large amount of data from global memory into a 3D texture, and I have plenty of computations that do not depend on the resulting 3D texture. If I could overlap the memcopy with these computations, I would certainly see a performance benefit.

Has anyone tried this? :)


Actually, never mind. The programming guide is quite clear that this is not possible.

  1. Whether kernel execution can overlap with memory copies depends on the card. Query the device properties (for example the deviceOverlap field of cudaDeviceProp) to find out.

  2. Look into the *Async calls (e.g. cudaMemcpyAsync), pinned (page-locked) host memory, and so on for overlapping CPU work with memory copies.
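
For what it's worth, here is a rough sketch of the two points above: it queries the device for copy/compute overlap support, then issues a copy from pinned host memory in one stream and a kernel in another. The kernel `scale`, the helper `run_overlap_demo`, the buffer names and the sizes are all made up for illustration, and it reads the copy-engine count through cudaDeviceGetAttribute rather than the deviceOverlap field. Note that it uses a host-to-device copy, not the device-to-device case asked about in the original post, and whether the two operations actually run concurrently depends on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing function with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel standing in for work that does not depend on the copy.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int run_overlap_demo() {
    const int device = 0;  // assumption: use the first device
    CHECK(cudaSetDevice(device));

    // Point 1: ask the device how many copy engines it has.
    // 0 means copies cannot overlap with kernel execution.
    int engines = 0;
    CHECK(cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, device));
    printf("async copy engines: %d\n", engines);

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Point 2: async copies need page-locked (pinned) host memory.
    float *h_in = nullptr, *d_in = nullptr, *d_work = nullptr;
    CHECK(cudaMallocHost((void **)&h_in, bytes));
    CHECK(cudaMalloc((void **)&d_in, bytes));
    CHECK(cudaMalloc((void **)&d_work, bytes));
    CHECK(cudaMemset(d_work, 0, bytes));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    // Separate non-default streams so the copy and the kernel may overlap.
    cudaStream_t copyStream, computeStream;
    CHECK(cudaStreamCreate(&copyStream));
    CHECK(cudaStreamCreate(&computeStream));

    // Both calls return immediately; the CPU is free to do other work.
    CHECK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, copyStream));
    scale<<<(n + 255) / 256, 256, 0, computeStream>>>(d_work, n, 2.0f);
    CHECK(cudaGetLastError());

    // Wait for both streams before touching the results.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaStreamDestroy(copyStream));
    CHECK(cudaStreamDestroy(computeStream));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_work));
    CHECK(cudaFreeHost(h_in));
    return 0;
}

int main() {
    return run_overlap_demo();
}
```

Running this under a profiler would show whether the copy and the kernel actually overlapped on a given card; the code only makes overlap possible, it does not guarantee it.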