2-way memcpy?

If you copy a block of memory from one device to another, the data generally goes through the host. As it happens, I’d actually like to leave a copy of it in the host memory.

I could, of course, just do a memcpy to the host first, followed by a memcpy from the host to the second device. However, this doubles the latency.

I could also break the block of memory into chunks, and run separate streams copying those chunks to the host and then from the host to the second device.

My preliminary experiments seem to show that this works OK. However, I’m wondering if there is a simpler solution, perhaps some API I don’t know about?

It is not clear to me what you are trying to accomplish. From what I understand, you want to copy some data from device A to device B, but also copy the same data from device A to the host? Ideally you want to do that simultaneously, i.e. basically a multicast operation.

I don’t think anything like that is supported. The closest you can probably come in terms of latency is to do a peer-to-peer transfer from device A to device B, followed by a transfer from device A to the host. Note that a number of limitations affect peer-to-peer transfers; check the CUDA documentation for the details. There should also be an example program in the collection of sample apps that ship with CUDA.

Yes. I’m also assuming there is no P2P.

Thanks

If P2P is not available, your current solution is fine: copy from device A to the host, then copy from the host to device B. You would want to use pinned host memory for those transfers if at all possible.

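My solution so far was to do the two transfers (mostly) simultaneously. From the OP:

“I could also break the block of memory into chunks, and run separate streams copying those chunks to the host and then from the host to the second device.”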

The copy from device A to host must be complete before you start the copy from host to device B, otherwise you have a race condition. Because of the data dependency, those two copies should be issued in the same stream.

I assume what you are doing is using two host buffers? While stream 1 copies from device A to host buffer #1, stream 2 copies from host buffer #2 to device B. In the next stage, stream 1 copies from buffer #1 to device B, while stream 2 copies from device A to host buffer #2. The two stages repeat until the copy is complete.

This only applies to each chunk. Quoting from my own quote:

“I could also break the block of memory into chunks, and run separate streams copying those chunks to the host and then from the host to the second device.”

You could start copying chunk_1 from host to device B, while chunk_2 hasn’t been fully copied to the host yet, and so on.

“Quoting from my own quote”