How does cudaMemcpyPeer(Async) work with streams?

I have a hard time finding where exactly this behavior is explained in the documentation, so let me give a simple example: 2 GPUs, with 2 streams on each (gpu1s1, gpu1s2, gpu2s1, gpu2s2):

On CPU thread #1:

k1_gpu1<<<..,gpu1s1>>>(..)
cudaMemcpyPeerAsync(GPU2->GPU1, gpu1s1);

At the same time (or even before), on CPU thread #2, some kernels were scheduled on GPU2:

k1_gpu2<<<..,gpu2s2>>>(..)
k2_gpu2<<<..,gpu2s2>>>(..)
k3_gpu2<<<..,gpu2s2>>>(..)

I want the gpu2s2 stream to be completely independent of GPU1, and the cudaMemcpyPeer issued by GPU1 to synchronize only the gpu2s1 and gpu1s1 streams. What will happen in the above case? Will cudaMemcpyPeer synchronize both streams on GPU2, alongside gpu1s1?

Edit: this all came out of confusion because the cudaMemcpyPeer API accepts only one stream parameter, not two.

cudaMemcpyPeerAsync(dst, dstId, src, srcId, bytes, stream) will start once all preceding work in stream has completed. It has the usual stream-ordering semantics: it is ordered only with respect to other work in that one stream, not with respect to other streams on either device.

If src or dst is still in use in a different stream, you need to create the inter-stream dependency yourself, e.g. via cudaStreamSynchronize or cudaStreamWaitEvent.

cudaSetDevice(1);
cudaMallocAsync(&ptr1, bytes, gpu1stream);

cudaSetDevice(2);
cudaMallocAsync(&ptr2, bytes, gpu2stream);
fillKernel<<<grid, block, 0, gpu2stream>>>(ptr2, bytes);

//copy from ptr2 to ptr1

//error, ptr2 may not be ready
cudaSetDevice(1);
cudaMemcpyPeerAsync(ptr1, 1, ptr2, 2, bytes, gpu1stream);

// error, ptr1 may not be ready
cudaSetDevice(2);
cudaMemcpyPeerAsync(ptr1, 1, ptr2, 2, bytes, gpu2stream);

//ok
cudaSetDevice(1);
cudaStreamSynchronize(gpu1stream);
cudaSetDevice(2);
cudaMemcpyPeerAsync(ptr1, 1, ptr2, 2, bytes, gpu2stream);

//ok
cudaSetDevice(1);
cudaEventRecord(event, gpu1stream); // event created beforehand with cudaEventCreate
cudaSetDevice(2);
cudaStreamWaitEvent(gpu2stream, event, 0);
cudaMemcpyPeerAsync(ptr1, 1, ptr2, 2, bytes, gpu2stream);

(From my observations, the API overhead is lower when the copy is issued into the source device's stream rather than the destination device's stream.)
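Putting the event-based variant together, here is a minimal end-to-end sketch. It assumes device ordinals 1 and 2 exist; grid/block sizes, fillKernel, and the setup calls around the snippets above are filled in illustratively, and error checking is elided:

// Copy ptr2 (device 2) into ptr1 (device 1), ordered after both
// allocations, without stalling any unrelated stream.
cudaStream_t gpu1stream, gpu2stream;
cudaEvent_t event;
float *ptr1, *ptr2;
size_t bytes = 1 << 20;

cudaSetDevice(1);
cudaStreamCreate(&gpu1stream);
cudaEventCreateWithFlags(&event, cudaEventDisableTiming); // timing not needed
cudaMallocAsync(&ptr1, bytes, gpu1stream);
cudaEventRecord(event, gpu1stream);        // marks "ptr1 is allocated"

cudaSetDevice(2);
cudaStreamCreate(&gpu2stream);
cudaMallocAsync(&ptr2, bytes, gpu2stream);
fillKernel<<<256, 256, 0, gpu2stream>>>(ptr2, bytes); // produce ptr2

cudaStreamWaitEvent(gpu2stream, event, 0); // gpu2stream now waits for ptr1
cudaMemcpyPeerAsync(ptr1, 1, ptr2, 2, bytes, gpu2stream);

cudaStreamSynchronize(gpu2stream);         // only when the result is needed

Note that cudaMemcpyPeerAsync works even without peer access enabled (the driver stages the copy through host memory); if the devices are peer-capable, calling cudaDeviceEnablePeerAccess on each device for the other first lets the copy go directly over PCIe/NVLink.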