Understanding cudaMemcpyPeerAsync

igurrutxaga · February 25, 2014, 12:15pm

Hi.

I’m trying to understand how cudaMemcpyPeerAsync behaves depending on the stream passed to it. With the help of the Visual Profiler I think I understand it when P2P is not enabled, but once I enable P2P communication the results are not what I expect.

I used this (simplified) code with CUDA 5.0 and two GTX 550:

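#include <cuda_runtime.h>

#define BYTES ( 1 << 25 )

int main( int argc, char* argv[] )
{
        int *send0, *send1, *recv0, *recv1;
        cudaStream_t st0, st1;

        // Device 0: enable access to device 1, create its stream and buffers
        cudaSetDevice( 0 );
        cudaDeviceEnablePeerAccess( 1, 0 );
        cudaStreamCreate( &st0 );
        cudaMalloc( &send0, 2 * BYTES );
        cudaMalloc( &recv0, BYTES );

        // Device 1: enable access to device 0, create its stream and buffers
        cudaSetDevice( 1 );
        cudaDeviceEnablePeerAccess( 0, 0 );
        cudaStreamCreate( &st1 );
        cudaMalloc( &send1, BYTES );
        cudaMalloc( &recv1, 2 * BYTES );

        // One memset per device, each in that device's own stream
        cudaSetDevice( 0 );
        cudaMemsetAsync( send0, 0, 2 * BYTES, st0 );

        cudaSetDevice( 1 );
        cudaMemsetAsync( send1, 0, BYTES, st1 );

        // Each peer copy is issued in the stream of the receiving device
        cudaMemcpyPeerAsync( recv1, 1, send0, 0, 2 * BYTES, st1 );
        cudaMemcpyPeerAsync( recv0, 0, send1, 1, BYTES, st0 );

        // Wait for outstanding work on both devices before exiting
        cudaSetDevice( 0 );
        cudaDeviceSynchronize();
        cudaSetDevice( 1 );
        cudaDeviceSynchronize();

        return 0;
}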

Two things are unexpected to me:

1. In the Visual Profiler the data transfer is not shown in the stream I passed as a parameter (the stream on the receiving device), but in a new stream on the sending device. Since I did not create that stream, I cannot synchronize with it.
2. Both data transfers begin at the same moment, when the longest kernel ends. Why? Since they run in new, independent streams, they should not have to wait for the kernels to finish.

Can anyone explain to me why cudaMemcpyPeerAsync behaves this way?

Thanks a lot.

igurrutxaga · February 25, 2014, 3:44pm

After reading the CUDA Programming Guide again I found something that I had missed, and it explains why the copy is synchronized with the kernel executions:
