Hello fellow CUDA programmers,
I have a quick question for anyone who is currently overlapping computation and memcopies. Is it possible to overlap the execution of a kernel with a device-to-device memcopy? The programming guide makes it clear that this IS possible for device-to-host memcopies, as well as host-to-device memcopies.
However, it is not clear to me what will happen if I pass the cudaMemcpyDeviceToDevice constant to the asynchronous memcopy call that is being overlapped.
I care about this because I want to copy a large amount of data from global memory into a 3D texture, and I have plenty of computations that do not depend on the resulting 3D texture. If I could overlap the memcopy with these computations, I would expect to see a performance benefit.
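To make the question concrete, here is a minimal sketch of the setup I have in mind. The kernel name independentKernel, the volume dimensions, and the buffer sizes are placeholders I made up for illustration. The sketch only issues the device-to-device copy and the kernel on two different streams; whether the two actually run concurrently is exactly what I am unsure about, and I assume it depends on the hardware and driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for computation that does not depend on the texture.
__global__ void independentKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

int main()
{
    const int W = 64, H = 64, D = 64;   // placeholder volume size (elements)
    const int n = 1 << 20;              // placeholder problem size for the kernel

    // Source volume in linear global memory, plus an output buffer for the kernel.
    float *dSrc = nullptr, *dOut = nullptr;
    cudaMalloc(&dSrc, sizeof(float) * W * H * D);
    cudaMalloc(&dOut, sizeof(float) * n);

    // Destination: the 3D CUDA array that would back the 3D texture.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(W, H, D);   // width in elements for arrays
    cudaArray_t dArr = nullptr;
    cudaMalloc3DArray(&dArr, &desc, extent);

    // One stream for the copy, one for the independent computation.
    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    // Device-to-device copy from linear memory into the 3D array.
    cudaMemcpy3DParms p = {0};
    p.srcPtr   = make_cudaPitchedPtr(dSrc, W * sizeof(float), W, H);
    p.dstArray = dArr;
    p.extent   = extent;
    p.kind     = cudaMemcpyDeviceToDevice;
    cudaMemcpy3DAsync(&p, copyStream);

    // Kernel issued on the other stream; it never touches dSrc or dArr.
    independentKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(dOut, n);

    // Wait for both streams and report any error from either operation.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
    cudaFreeArray(dArr);
    cudaFree(dOut);
    cudaFree(dSrc);
    return 0;
}
```

Timing this against a version that issues both operations on the same stream, or looking at the timeline in a profiler, would presumably show whether any overlap actually happens.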
Has anyone tried this? :)