I am doing an asynchronous memcpy from gpu0 to gpu1 using cudaMemcpyPeerAsync().
cudaMemcpyPeerAsync() lets me specify the stream to use on gpu0, but not on gpu1. Can I somehow specify a stream on the receiving device as well?
I am using OpenMP threads to manage the devices, one thread per device (so they are in separate contexts).
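
For reference, here is a minimal sketch of the kind of copy I am issuing. It is not my actual code: the helper name `copyGpu0ToGpu1`, the buffer names, and the hard-coded device IDs 0 and 1 are placeholders, and the OpenMP threading is left out to keep it short. It shows that the call takes only one stream argument.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (name and structure are illustrative only):
// allocates a buffer on device 0 and one on device 1, then copies
// `bytes` bytes from device 0 to device 1 on a stream created on device 0.
cudaError_t copyGpu0ToGpu1(size_t bytes)
{
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) return err;
    if (deviceCount < 2) return cudaErrorInvalidDevice;  // needs two GPUs

    void *src = nullptr;
    void *dst = nullptr;
    cudaStream_t stream0 = nullptr;

    // Destination buffer lives on gpu1.
    err = cudaSetDevice(1);
    if (err == cudaSuccess) err = cudaMalloc(&dst, bytes);

    // Source buffer and the stream live on gpu0.
    if (err == cudaSuccess) err = cudaSetDevice(0);
    if (err == cudaSuccess) err = cudaMalloc(&src, bytes);
    if (err == cudaSuccess) err = cudaStreamCreate(&stream0);

    // Arguments: dst, dstDevice, src, srcDevice, count, stream.
    // There is a single stream parameter; I pass gpu0's stream here.
    if (err == cudaSuccess)
        err = cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream0);

    // Wait for the copy to finish before releasing the buffers.
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream0);

    // Clean up whatever was created, on the device that owns it.
    if (stream0) { cudaSetDevice(0); cudaStreamDestroy(stream0); }
    if (src)     { cudaSetDevice(0); cudaFree(src); }
    if (dst)     { cudaSetDevice(1); cudaFree(dst); }
    return err;
}
```

In my real program the thread that manages gpu0 makes this call, e.g. `copyGpu0ToGpu1(1 << 20)` for a 1 MiB transfer.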
Visual Profiler shows the stream for the sending device, but for the receiving device this memcpy only appears in the MemCpy (PtoP) row and not in any of the streams (not even the default stream).
PS: My current implementation works fine. I just want to overlap the sending and receiving transfers.