Understanding cudaMemcpyPeerAsync

Hi.

I’m trying to understand the behaviour of cudaMemcpyPeerAsync depending on the streams specified. With the help of the Visual Profiler I think I understand it when P2P is not enabled, but if I enable P2P communication the results I obtain are unexpected to me.

I used this (simplified) code with CUDA 5.0 and two GTX 550:

#define BYTES ( 1 << 25 )

int main( int argc, char* argv[] )
{
        int *send0, *send1, *recv0, *recv1;
        cudaStream_t st0, st1;

        cudaSetDevice( 0 );
        cudaDeviceEnablePeerAccess( 1, 0 );
        cudaStreamCreate( &st0 );
        cudaMalloc( &send0, 2 * BYTES );
        cudaMalloc( &recv0, BYTES );

        cudaSetDevice( 1 );
        cudaDeviceEnablePeerAccess( 0, 0 );
        cudaStreamCreate( &st1 );
        cudaMalloc( &send1, BYTES );
        cudaMalloc( &recv1, 2 * BYTES );

        cudaSetDevice(0);
        cudaMemsetAsync( send0, 0, 2 * BYTES, st0 );

        cudaSetDevice(1);
        cudaMemsetAsync( send1, 0, BYTES, st1 );

        cudaMemcpyPeerAsync( recv1, 1, send0, 0, 2 * BYTES, st1 );
        cudaMemcpyPeerAsync( recv0, 0, send1, 1, BYTES, st0 );
}

What is unexpected to me is:

  1. In the Visual Profiler the data transfer is not shown in the stream I passed as parameter (the stream in the receiving device), but in a new stream in the sending device. Since the new stream is not created by me, I cannot synchronize with it.
  2. Both data transfers begin at the same moment, when the longest kernel ends. Why? Since they run in new indepenedent streams they shouldn't wait for the kernels to end.

Can anyone explain me why cudaMemcpyPeerAsync has this behaviour?

Thanks a lot.

After reading the CUDA Programming guide again I found something that I had missed and explains why the copy is synchronized with the kernel executions: