2-way memcpy?

alexgg · April 14, 2015, 9:13pm

If you copy a block of memory from one device to another, the data generally goes through the host. As it happens, I’d actually like to leave a copy of it in the host memory.

I could, of course, just do a memcpy to the host first, followed by a memcpy from the host to the second device. However, this doubles the latency.

I could also break the block of memory into chunks, and run separate streams copying those chunks to the host and then from the host to the second device.

My preliminary experiments seem to show that this works OK. However, I’m wondering if there is a simpler solution, perhaps some API I don’t know about?

njuffa · April 14, 2015, 9:56pm

It is not clear to me what you are trying to accomplish. From what I understand, you want to copy some data from device A to device B, but also copy the same data from device A to the host? Ideally you want to do that simultaneously, i.e. basically a multicast operation.

I don’t think anything like that is supported. The closest you can probably come in terms of latency is to do a peer-to-peer transfer from device A to device B, followed by a transfer from device A to the host. Note that a number of limitations affect peer-to-peer transfers; check the CUDA documentation for the details. There should also be an example program in the collection of sample apps that ship with CUDA.
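For illustration, the peer-to-peer suggestion might look roughly like the sketch below. The helper names `enablePeerAccess` and `copyToPeerThenHost` are hypothetical, and the sketch assumes both device buffers are already allocated. It enables peer access in both directions to be on the safe side; as far as I know, `cudaMemcpyPeer` still works when peer access is unavailable, but the runtime then stages the data through the host.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: let `dev` access memory on `peer`.
static cudaError_t enablePeerAccess(int dev, int peer)
{
    cudaError_t err = cudaSetDevice(dev);
    if (err != cudaSuccess) return err;
    err = cudaDeviceEnablePeerAccess(peer, 0);
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError();  // already enabled is fine; reset the error state
        return cudaSuccess;
    }
    return err;
}

// Hypothetical helper: copy `bytes` from device A to device B,
// then copy the same data from device A to the host.
cudaError_t copyToPeerThenHost(void* dstB, int devB,
                               const void* srcA, int devA,
                               void* host, size_t bytes)
{
    cudaError_t err = cudaSuccess;
    int aToB = 0, bToA = 0;
    if (devA != devB) {
        err = cudaDeviceCanAccessPeer(&aToB, devA, devB);
        if (err != cudaSuccess) return err;
        err = cudaDeviceCanAccessPeer(&bToA, devB, devA);
        if (err != cudaSuccess) return err;
    }
    if (aToB && bToA) {
        err = enablePeerAccess(devA, devB);
        if (err != cudaSuccess) return err;
        err = enablePeerAccess(devB, devA);
        if (err != cudaSuccess) return err;
    }

    err = cudaSetDevice(devA);
    if (err != cudaSuccess) return err;

    // Device A -> device B.
    err = cudaMemcpyPeer(dstB, devB, srcA, devA, bytes);
    if (err != cudaSuccess) return err;

    // Device A -> host.
    return cudaMemcpy(host, srcA, bytes, cudaMemcpyDeviceToHost);
}
```

Both copies here are synchronous for simplicity; `cudaMemcpyPeerAsync` and `cudaMemcpyAsync` in separate streams would let the two transfers out of device A overlap where the hardware allows it.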

alexgg · April 16, 2015, 2:48am

Yes. I’m also assuming there is no P2P.

Thanks

njuffa · April 16, 2015, 4:12am

If P2P is not available, your current solution is fine: copy from device A to the host, then copy from the host to device B. You would want to use pinned host memory for those transfers if at all possible.
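A minimal sketch of that two-step copy through pinned host memory follows. The helper name `stagedCopy` is hypothetical. It assumes the device buffers already exist and allocates the host buffer with `cudaHostAllocPortable` so that both devices treat it as pinned. The host buffer is returned to the caller, since the original question wanted to keep a copy on the host.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: device A -> pinned host buffer -> device B.
// On success, *hostOut points to the pinned host copy; the caller
// releases it with cudaFreeHost().
cudaError_t stagedCopy(void* dstB, int devB,
                       const void* srcA, int devA,
                       size_t bytes, void** hostOut)
{
    *hostOut = nullptr;

    // Page-locked (pinned) host memory, usable by all CUDA contexts.
    void* host = nullptr;
    cudaError_t err = cudaHostAlloc(&host, bytes, cudaHostAllocPortable);
    if (err != cudaSuccess) return err;

    // Step 1: device A -> host. The call blocks until the copy is done.
    err = cudaSetDevice(devA);
    if (err == cudaSuccess)
        err = cudaMemcpy(host, srcA, bytes, cudaMemcpyDeviceToHost);

    // Step 2: host -> device B, started only after step 1 has finished.
    if (err == cudaSuccess)
        err = cudaSetDevice(devB);
    if (err == cudaSuccess)
        err = cudaMemcpy(dstB, host, bytes, cudaMemcpyHostToDevice);

    if (err != cudaSuccess) {
        cudaFreeHost(host);
        return err;
    }
    *hostOut = host;
    return cudaSuccess;
}
```

Because the second copy starts only after the first has completed, the total time is roughly the sum of the two transfers, which is the doubled latency the original post wanted to avoid.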

alexgg · April 16, 2015, 4:40am

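My solution so far was to do the two transfers (mostly) simultaneously. From the OP:

“I could also break the block of memory into chunks, and run separate streams copying those chunks to the host and then from the host to the second device.”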

njuffa · April 16, 2015, 4:52am

The copy from device A to host must be complete before you start the copy from host to device B, otherwise you have a race condition. Because of the data dependency, those two copies would want to be in the same stream.

I assume what you are doing is using two host buffers? While stream 1 copies from device A to host buffer #1, stream 2 copies from host buffer #2 to device B. In the next stage stream 1 copies from buffer #1 to device B, while stream 2 copies from device A to host buffer #2. The two stages repeat until the copy is complete.

alexgg · April 16, 2015, 5:29am

This only applies to each chunk. Quoting from my own quote:

“I could also break the block of memory into chunks, and run separate streams copying those chunks to the host and then from the host to the second device.”

You could start copying chunk_1 from host to device B, while chunk_2 hasn’t been fully copied to the host yet, and so on.
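One way to sketch this chunked pipeline is shown below. This is not necessarily the code used in the experiments described in the thread. The name `pipelinedCopy` and the `TRY` macro are hypothetical, and `pinnedHost` is assumed to be a pinned buffer of at least `bytes` bytes (for example from `cudaHostAlloc` with `cudaHostAllocPortable`). One stream per device is used, and a per-chunk event enforces the dependency discussed above: the host-to-device copy of a chunk starts only after that chunk has fully arrived on the host.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Run a CUDA call only if no earlier call has failed.
#define TRY(call) do { if (err == cudaSuccess) err = (call); } while (0)

// Hypothetical helper: copy device A -> pinned host -> device B in chunks,
// overlapping the two directions. The full copy stays in pinnedHost.
cudaError_t pipelinedCopy(void* dstB, int devB, const void* srcA, int devA,
                          void* pinnedHost, size_t bytes, size_t chunkBytes)
{
    if (chunkBytes == 0) return cudaErrorInvalidValue;
    const size_t numChunks = (bytes + chunkBytes - 1) / chunkBytes;

    cudaError_t err = cudaSuccess;
    cudaStream_t streamA = nullptr, streamB = nullptr;
    std::vector<cudaEvent_t> arrived(numChunks, nullptr);

    // Streams and events belong to the device that is current at creation.
    TRY(cudaSetDevice(devA));
    TRY(cudaStreamCreate(&streamA));
    for (size_t i = 0; i < numChunks; ++i)
        TRY(cudaEventCreateWithFlags(&arrived[i], cudaEventDisableTiming));
    TRY(cudaSetDevice(devB));
    TRY(cudaStreamCreate(&streamB));

    const char* src = static_cast<const char*>(srcA);
    char* host = static_cast<char*>(pinnedHost);
    char* dst = static_cast<char*>(dstB);

    for (size_t i = 0; i < numChunks; ++i) {
        const size_t off = i * chunkBytes;
        const size_t n = std::min(chunkBytes, bytes - off);

        // Chunk i: device A -> host, then mark its arrival on stream A.
        TRY(cudaSetDevice(devA));
        TRY(cudaMemcpyAsync(host + off, src + off, n,
                            cudaMemcpyDeviceToHost, streamA));
        TRY(cudaEventRecord(arrived[i], streamA));

        // Chunk i: host -> device B, but only after it has arrived.
        // Meanwhile stream A is free to start on chunk i + 1.
        TRY(cudaSetDevice(devB));
        TRY(cudaStreamWaitEvent(streamB, arrived[i], 0));
        TRY(cudaMemcpyAsync(dst + off, host + off, n,
                            cudaMemcpyHostToDevice, streamB));
    }

    // Stream B waits on every event, so once it drains both copies are done.
    TRY(cudaSetDevice(devB));
    TRY(cudaStreamSynchronize(streamB));

    // Clean up whatever was created, even after a failure.
    if (streamB) cudaStreamDestroy(streamB);
    if (streamA) cudaStreamDestroy(streamA);
    for (size_t i = 0; i < numChunks; ++i)
        if (arrived[i]) cudaEventDestroy(arrived[i]);
    return err;
}
```

The sketch creates one event per chunk, which keeps the logic easy to follow but is wasteful for very small chunks; a small ring of reusable events would also work. The chunk size is a tuning parameter: smaller chunks shorten the pipeline fill time but add per-copy overhead.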

little_jimmy · April 16, 2015, 2:20pm

“Quoting from my own quote”
