I have a hard time finding where exactly in the documentation this behavior is explained, so let me give a simple example: 2 GPUs, 2 streams on each one (gpu1s1, gpu1s2, gpu2s1, gpu2s2):
On CPU thread #1:
k1_gpu1<<<..,gpu1s1>>>(..)
cudaMemcpyPeerAsync(GPU2->GPU1, gpu1s1);
At the same time (or even before), on CPU thread #2, some kernels were scheduled on GPU2:
k1_gpu2<<<..,gpu2s2>>>(..)
k2_gpu2<<<..,gpu2s2>>>(..)
k3_gpu2<<<..,gpu2s2>>>(..)
I want the gpu2s2 stream to be completely independent of GPU1, and I want the cudaMemcpyPeerAsync issued on GPU1 to synchronize only the gpu2s1 and gpu1s1 streams. What will happen in the above case? Will cudaMemcpyPeerAsync synchronize both streams on GPU2, alongside gpu1s1?
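To make the intended ordering concrete, here is a minimal sketch of the dependency I am after, written with an explicit event instead of relying on whatever the copy does implicitly. The device ordinals 0 and 1, the `touch` kernel, the buffer names, and the `gpu2s1_ready` event are placeholders of mine, and everything is issued from a single host thread for brevity. It only states the ordering I want between gpu2s1 and gpu1s1; it does not show how the peer copy treats gpu2s2, which is the open question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for k1_gpu1 / k1_gpu2 etc.; the real kernels are not shown.
__global__ void touch(float* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e)); \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;
    const int gpu1 = 0, gpu2 = 1;  // assumed device ordinals

    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) { printf("this sketch needs two GPUs\n"); return 0; }

    cudaStream_t gpu1s1, gpu2s1, gpu2s2;
    cudaEvent_t gpu2s1_ready;
    float *buf1 = nullptr, *buf2 = nullptr, *other2 = nullptr;

    // Streams and buffers belong to the device that is current when created.
    CHECK(cudaSetDevice(gpu1));
    CHECK(cudaStreamCreate(&gpu1s1));
    CHECK(cudaMalloc(&buf1, bytes));
    CHECK(cudaMemset(buf1, 0, bytes));

    CHECK(cudaSetDevice(gpu2));
    CHECK(cudaStreamCreate(&gpu2s1));
    CHECK(cudaStreamCreate(&gpu2s2));
    CHECK(cudaEventCreateWithFlags(&gpu2s1_ready, cudaEventDisableTiming));
    CHECK(cudaMalloc(&buf2, bytes));
    CHECK(cudaMalloc(&other2, bytes));
    CHECK(cudaMemset(buf2, 0, bytes));
    CHECK(cudaMemset(other2, 0, bytes));

    // Work on gpu2s2 that should stay independent of GPU1.
    touch<<<blocks, threads, 0, gpu2s2>>>(other2, n);

    // Producer work on gpu2s1, then an event marking that buf2 is ready.
    touch<<<blocks, threads, 0, gpu2s1>>>(buf2, n);
    CHECK(cudaEventRecord(gpu2s1_ready, gpu2s1));

    // On GPU1: kernel, then wait for gpu2s1 only, then the peer copy in gpu1s1.
    CHECK(cudaSetDevice(gpu1));
    touch<<<blocks, threads, 0, gpu1s1>>>(buf1, n);
    CHECK(cudaStreamWaitEvent(gpu1s1, gpu2s1_ready, 0));
    CHECK(cudaMemcpyPeerAsync(buf1, gpu1, buf2, gpu2, bytes, gpu1s1));
    CHECK(cudaStreamSynchronize(gpu1s1));

    // Drain GPU2 before cleanup.
    CHECK(cudaSetDevice(gpu2));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaEventDestroy(gpu2s1_ready));
    CHECK(cudaStreamDestroy(gpu2s1));
    CHECK(cudaStreamDestroy(gpu2s2));
    CHECK(cudaFree(buf2));
    CHECK(cudaFree(other2));

    CHECK(cudaSetDevice(gpu1));
    CHECK(cudaStreamDestroy(gpu1s1));
    CHECK(cudaFree(buf1));

    printf("done\n");
    return 0;
}
```

In this sketch gpu2s2 is never named in any call that involves GPU1, so any stall I observe on it would have to come from the copy itself rather than from my own synchronization.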
Edit: this all came from my confusion over the fact that the cudaMemcpyPeerAsync API accepts only one stream parameter, not two.