Can cudaMemcpyDeviceToDevice be used to move data between devices?

Or do I have to copy from device 1 to host and then from host to device 2?

I’m fairly certain DeviceToDevice can only copy within one device. It doesn’t work for copying from device 1 to device 2.

Unfortunately, yes. NVIDIA has been promising fast device to device memory copies for a long time now…

I believe that is not true. As I understand it, copies between devices currently pass through the host, and NVIDIA is working on direct copies between two GPU devices that bypass host memory.

Hmm, then I guess I remember wrong. I thought they were working, just not fast yet…

NVIDIA provides nothing to copy automatically from one GPU to another; it is up to the user.

The user must implement it as a copy from GPU1 -> host mem -> GPU2. This is not as fast as it could be because pinned memory is only pinned to a particular GPU, so one of those two copies must be slow. It would be great if pinned memory worked for all GPUs, but it doesn’t.
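For anyone who wants to see the staged route spelled out, here is a minimal sketch of GPU1 -> host mem -> GPU2 using the runtime API. It assumes a CUDA version that supports the cudaHostAllocPortable flag, which asks for the pinned buffer to be treated as pinned by all CUDA contexts; device indices 0 and 1, the 1 MiB size, the 0x5A fill pattern, and the CHECK macro are placeholders of mine, not anything from the posts above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = 1 << 20;  // 1 MiB, arbitrary size for the sketch
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        std::printf("This sketch needs at least two GPUs.\n");
        return 0;
    }

    void *src = nullptr, *dst = nullptr, *staging = nullptr;

    // Source buffer on device 0, filled with a known byte pattern.
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMemset(src, 0x5A, bytes));

    // Pinned host staging buffer, requested as portable across contexts.
    CHECK(cudaHostAlloc(&staging, bytes, cudaHostAllocPortable));

    // Destination buffer on device 1.
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&dst, bytes));

    // Step 1: device 0 -> host.
    CHECK(cudaSetDevice(0));
    CHECK(cudaMemcpy(staging, src, bytes, cudaMemcpyDeviceToHost));

    // Step 2: host -> device 1.
    CHECK(cudaSetDevice(1));
    CHECK(cudaMemcpy(dst, staging, bytes, cudaMemcpyHostToDevice));

    // Verify: clear the host buffer, read device 1 back, compare bytes.
    std::memset(staging, 0, bytes);
    CHECK(cudaMemcpy(staging, dst, bytes, cudaMemcpyDeviceToHost));
    const unsigned char *p = static_cast<const unsigned char *>(staging);
    bool ok = true;
    for (size_t i = 0; i < bytes; ++i) {
        if (p[i] != 0x5A) { ok = false; break; }
    }
    std::printf("staged copy %s\n", ok ? "matches" : "MISMATCH");

    CHECK(cudaFree(dst));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(src));
    CHECK(cudaFreeHost(staging));
    return ok ? 0 : 1;
}
```

Both copies are synchronous here, so the two legs run back to back. Splitting the buffer into chunks and overlapping the legs with streams would be the usual next step, but that is beyond this sketch.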

Even better would be a way to copy from one GPU directly to the other (e.g. over SLI, over the PCIe link currently being advertised in the 790i chipset, or by some other method using PCIe).

I have no idea what NVIDIA has in mind for this, only that there have been 2 or 3 forum posts stating that “fast gpu to gpu transfers are under consideration for a future version of CUDA” or some such.

Is there any news about copying from one GPU to another?

Sidney Lima
Recife Brasil

Sorry to bring this back to life. Is there any guarantee that the copy will occur through NVLink when using 2 (or more) GPUs? Where is the documentation related to this aspect?

I don’t know what (or more) means.

It will take place over NVLink if there is a direct NVLink connection between the two GPUs in question, and you have properly enabled CUDA peer access between the 2 devices, and you use the cudaMemcpyPeer* family of functions.

See here and you can also just search for references to “peer” in the programming guide.

From that particular section:

Note that if peer-to-peer access is enabled between two devices via cudaDeviceEnablePeerAccess() as described in Peer-to-Peer Memory Access, peer-to-peer memory copy between these two devices no longer needs to be staged through the host and is therefore faster.
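To make that concrete, here is a minimal sketch of the peer path: check with cudaDeviceCanAccessPeer(), enable access with cudaDeviceEnablePeerAccess(), then copy with cudaMemcpyPeer(). Device indices 0 and 1, the buffer size, the fill pattern, and the CHECK and enablePeer helpers are illustrative assumptions of mine. Note that the code does not check whether the link between the two GPUs is NVLink or PCIe; that depends on the hardware topology.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical helper: let `dev` access memory on `peer` if the hardware
// allows it. Returns true when peer access was enabled.
static bool enablePeer(int dev, int peer) {
    int canAccess = 0;
    CHECK(cudaDeviceCanAccessPeer(&canAccess, dev, peer));
    if (!canAccess) return false;
    CHECK(cudaSetDevice(dev));
    CHECK(cudaDeviceEnablePeerAccess(peer, 0));  // flags must be 0
    return true;
}

int main() {
    const size_t bytes = 1 << 20;  // 1 MiB, arbitrary size for the sketch
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        std::printf("This sketch needs at least two GPUs.\n");
        return 0;
    }

    // Peer access is enabled per direction, so try both.
    bool p01 = enablePeer(0, 1);
    bool p10 = enablePeer(1, 0);
    std::printf("peer access 0->1: %s, 1->0: %s\n",
                p01 ? "enabled" : "unavailable",
                p10 ? "enabled" : "unavailable");

    void *src = nullptr, *dst = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMemset(src, 0x5A, bytes));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&dst, bytes));

    // Copy device 0 -> device 1. Per the guide text quoted above, this is
    // staged through the host when peer access is not enabled.
    CHECK(cudaMemcpyPeer(dst, 1, src, 0, bytes));

    // Verify by reading device 1 back into ordinary host memory.
    std::vector<unsigned char> host(bytes, 0);
    CHECK(cudaMemcpy(host.data(), dst, bytes, cudaMemcpyDeviceToHost));
    bool ok = true;
    for (size_t i = 0; i < bytes; ++i) {
        if (host[i] != 0x5A) { ok = false; break; }
    }
    std::printf("peer copy %s\n", ok ? "matches" : "MISMATCH");

    CHECK(cudaFree(dst));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(src));
    return ok ? 0 : 1;
}
```

The cudaMemcpyPeer() call is the same whether or not peer access could be enabled, so the program still produces the right result on systems without a direct link; only the speed differs.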