Possibility to do d2d memcpy w/o CPU or w/o PCIe?


I'd like to do some device-to-device memcopies without involving the CPU (I know that host-to-device transfers, or vice versa, aren't possible without it). As far as I know, you do that by calling cudaMemcpy with the cudaMemcpyDeviceToDevice parameter. But this function is called by the CPU, which is exactly what I would like to avoid, so is there a way to do a memcpy from the device only? If not, is it planned to offer this in future CUDA releases? An official statement, or a pointer to one, would be very welcome here.
I'm also wondering whether you can copy from device to device over some physical connection, e.g. an SLI bridge between your CUDA cards (I know CUDA and SLI are different things), without using PCIe.

No, it is not possible to make CUDA API calls (cudaMemcpy or anything else) from the device. The CPU always has to be involved to some extent.

You can’t transfer arbitrary data over the SLI bridge (it’s just a digital video connection).

Why do you want to avoid the CPU and PCIe bus?

We were just hoping for a small performance gain when doing d2d copies. Could it be possible in future releases to do the whole transaction without the CPU in this case?

I think you have fundamentally misunderstood what the cudaMemcpyDeviceToDevice flag means. In the runtime API it literally means "copy where the source and destination memory reside in the same GPU context (i.e. in the memory of the same GPU)". It is a convenient way for the CPU to move data around inside a GPU's memory without needing to run a kernel. In device code you have pointers and can move memory around any way you like, without needing any API calls at all. This is in contrast to the other memcpy options, which literally mean "copy where only one of the source and destination memories resides in a GPU context, and the other in host memory".
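To make the distinction concrete, here is a minimal sketch (buffer names and sizes are made up): the same copy done once via the runtime API flag discussed above, and once by a trivial kernel using nothing but device pointers.

```cuda
#include <cuda_runtime.h>

// In device code, a copy is just pointer arithmetic -- no API call needed.
__global__ void copyKernel(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Host-initiated copy within one GPU's memory:
    cudaMemcpy(b, a, n * sizeof(float), cudaMemcpyDeviceToDevice);

    // Equivalent copy performed entirely by the GPU
    // (the CPU only launches the kernel):
    copyKernel<<<(n + 255) / 256, 256>>>(a, b, n);
    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Either way the data never leaves the card; the difference is only in who issues the copy.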

The basic premises of CUDA contexts are that they are associated with exactly one GPU and one host thread at a time. None of this has anything to do with managing memory on multiple GPUs, which it seems is what you are implicitly asking about.

Thanks for your statement.

You are right, I'm looking for a possibility to transfer memory from one GPU to another without needing the CPU. In the app I'm going to write, the memcpy makes up a significant part of the whole run time. I haven't looked into multi-GPU programming yet, because one GPU is enough for that app. But as far as I can tell, it becomes a bigger bottleneck if I add a second GPU later, because then I need to memcpy from host to device once per GPU. So it would be nice to have the option to copy directly from one GPU to another without the CPU doing any of the work.
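For reference, the staged transfer described above looks roughly like this (a hedged sketch with made-up buffer names; it assumes a runtime that lets one host thread switch devices with cudaSetDevice — with the per-thread context model described earlier in the thread, each device would instead need its own host thread):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Sketch: moving a buffer from GPU 0 to GPU 1 by staging through
// host memory -- the CPU and PCIe bus are involved in both hops.
int main()
{
    const size_t bytes = 1 << 20;
    float *d0, *d1;
    float *h = (float *)malloc(bytes);  // staging buffer in host RAM

    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    // ... produce data in d0 on GPU 0 ...
    cudaMemcpy(h, d0, bytes, cudaMemcpyDeviceToHost);   // GPU 0 -> host

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);
    cudaMemcpy(d1, h, bytes, cudaMemcpyHostToDevice);   // host -> GPU 1

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    free(h);
    return 0;
}
```

Using page-locked (cudaMallocHost) memory for the staging buffer would speed up both PCIe hops, but the round trip through the host is unavoidable here.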