If you copy a block of memory from one device to another, the data generally goes through the host. As it happens, I’d actually like to leave a copy of it in the host memory.
I could, of course, just do a memcpy to the host first, followed by a memcpy from the host to the second device. However, this roughly doubles the latency, because the second copy cannot start until the first has finished.
I could also break the block of memory into chunks and use separate streams to copy those chunks to the host and then from the host to the second device, so that the two legs of the transfer overlap.
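For concreteness, here is a minimal sketch of the chunked approach, assuming the CUDA runtime API and two GPUs. The device indices, buffer size, chunk size, and all variable names are made up for illustration and are not tuned. Each chunk is copied to a pinned host buffer on one stream, and an event tells a second stream on the other device when that chunk is safe to upload. The host buffer ends up holding the full copy.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "%s failed: %s\n", #call,               \
                         cudaGetErrorString(err_));                      \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

int main() {
    // Assumed values, chosen only for illustration.
    const int srcDev = 0, dstDev = 1;
    const size_t bytes = size_t(64) << 20;  // total block size
    const size_t chunk = size_t(4) << 20;   // size of one chunk

    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        std::printf("This sketch needs two devices.\n");
        return 0;
    }

    // Source and destination buffers, one per device.
    unsigned char *src = nullptr, *dst = nullptr, *host = nullptr;
    CHECK(cudaSetDevice(srcDev));
    CHECK(cudaMalloc((void**)&src, bytes));
    CHECK(cudaMemset(src, 0xAB, bytes));    // placeholder data
    CHECK(cudaSetDevice(dstDev));
    CHECK(cudaMalloc((void**)&dst, bytes));

    // Pinned host buffer; the portable flag makes it pinned for all devices.
    CHECK(cudaHostAlloc((void**)&host, bytes, cudaHostAllocPortable));

    // One stream per device: "down" for device-to-host, "up" for host-to-device.
    cudaStream_t down, up;
    CHECK(cudaSetDevice(srcDev));
    CHECK(cudaStreamCreate(&down));
    CHECK(cudaSetDevice(dstDev));
    CHECK(cudaStreamCreate(&up));

    const size_t nChunks = (bytes + chunk - 1) / chunk;
    std::vector<cudaEvent_t> done(nChunks);

    for (size_t i = 0; i < nChunks; ++i) {
        const size_t off = i * chunk;
        const size_t n = std::min(chunk, bytes - off);

        // Leg 1: copy chunk i to the host, then mark its completion.
        CHECK(cudaSetDevice(srcDev));
        CHECK(cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming));
        CHECK(cudaMemcpyAsync(host + off, src + off, n,
                              cudaMemcpyDeviceToHost, down));
        CHECK(cudaEventRecord(done[i], down));

        // Leg 2: upload chunk i once it has arrived in host memory.
        CHECK(cudaSetDevice(dstDev));
        CHECK(cudaStreamWaitEvent(up, done[i], 0));
        CHECK(cudaMemcpyAsync(dst + off, host + off, n,
                              cudaMemcpyHostToDevice, up));
    }

    // "up" waits on every chunk of "down", so this covers both legs.
    CHECK(cudaStreamSynchronize(up));

    // Spot check: read one byte back from the destination device.
    unsigned char last = 0;
    CHECK(cudaMemcpy(&last, dst + bytes - 1, 1, cudaMemcpyDeviceToHost));
    std::printf("host copy ok: %d, device copy ok: %d\n",
                host[0] == 0xAB, last == 0xAB);

    for (cudaEvent_t e : done) CHECK(cudaEventDestroy(e));
    CHECK(cudaStreamDestroy(up));
    CHECK(cudaSetDevice(srcDev));
    CHECK(cudaStreamDestroy(down));
    CHECK(cudaFree(src));
    CHECK(cudaSetDevice(dstDev));
    CHECK(cudaFree(dst));
    CHECK(cudaFreeHost(host));
    return 0;
}
```

With this arrangement the upload of chunk i can run while chunk i+1 is still being downloaded, so the total time should approach that of a single copy plus one chunk, provided the host buffer is pinned. The best chunk size depends on the hardware and would have to be measured.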
My preliminary experiments seem to show that this works OK. However, I’m wondering if there is a simpler solution, perhaps some API I don’t know about?