Multi-GPU & stream management.

Hello everyone,

I have 4 kernels, two working on the first device (AA, BB) and two on the second device (CC, DD). I would like F3_1 from the previous iteration to be copied to the second device (into F3_2), but I also want to be sure that BB will not start until the copy is finished. Right now I am able to have the two devices working in parallel, but I can’t enforce my two conditions. I hope this makes sense; your help is really appreciated.

Here is my code (simplified):

cudaSetDevice(0);
cudaStream_t s0;
cudaStreamCreate(&s0);

cudaSetDevice(1);
cudaStream_t s1;
cudaStreamCreate(&s1);

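//First iteration
cudaSetDevice(0);
AA<<<block,threads,0,s0>>>(F1,F2);
BB<<<block,threads,0,s0>>>(F2,F3_1);
cudaSetDevice(1); // s1 was created on device 1, so make that device current before launching into it
DD<<<block,threads,0,s1>>>(F4,F5);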

//Other iterations
for(int t=1;t<T;t++)
{
	cudaSetDevice(0);
	AA<<<block,threads,0,s0>>>(F1,F2);
	BB<<<block,threads,0,s0>>>(F2,F3_1); 

	cudaSetDevice(1);
	cudaMemcpyPeerAsync(F3_2,1,F3_1,0,sizeF3,s1);

	CC<<<block,threads,0,s1>>>(F4,F3_2);
	DD<<<block,threads,0,s1>>>(F4,F5);
}

Robin

What about using events?

For example:

cudaStream_t streamA, streamB; 
cudaEvent_t eventA, eventB; 
 
cudaSetDevice( 0 ); 
cudaStreamCreate( &streamA ); // streamA and eventA belong to device-0 
cudaEventCreate( &eventA );
 
cudaSetDevice( 1 ); 
cudaStreamCreate( &streamB ); // streamB and eventB belong to device-1 
cudaEventCreate( &eventB ); 
 
kernel<<<..., streamB>>>(...); 
cudaEventRecord( eventB, streamB ); 
 
cudaSetDevice( 0 ); 
cudaEventSynchronize( eventB ); 
kernel<<<..., streamA>>>(...);

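Device 0 will not start executing its kernel until device 1 has finished its kernel: cudaEventSynchronize blocks the host thread until eventB has completed, so the launch into streamA is only issued after that.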
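
If blocking the host is a problem, a stream-side wait may be an option. The sketch below mirrors the snippet above but swaps cudaEventSynchronize for cudaStreamWaitEvent, which queues the wait inside streamA and returns to the host immediately. The kernel `work`, the buffers d0 and d1, and the size n are hypothetical stand-ins for the elided parts, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the elided `kernel` in the snippet above.
__global__ void work(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) { printf("This sketch needs two GPUs.\n"); return 0; }

    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;
    float *d0 = nullptr, *d1 = nullptr;
    cudaStream_t streamA, streamB;
    cudaEvent_t eventB;

    cudaSetDevice(0);                      // streamA and d0 belong to device 0
    cudaStreamCreate(&streamA);
    cudaMalloc(&d0, bytes);
    cudaMemset(d0, 0, bytes);

    cudaSetDevice(1);                      // streamB, eventB and d1 belong to device 1
    cudaStreamCreate(&streamB);
    cudaEventCreateWithFlags(&eventB, cudaEventDisableTiming);
    cudaMalloc(&d1, bytes);
    cudaMemset(d1, 0, bytes);

    work<<<blocks, threads, 0, streamB>>>(d1, n);
    cudaEventRecord(eventB, streamB);      // marks the end of the device-1 kernel

    cudaSetDevice(0);
    // The wait is queued in streamA; the host does not block here.
    cudaStreamWaitEvent(streamA, eventB, 0);
    work<<<blocks, threads, 0, streamA>>>(d0, n);

    cudaDeviceSynchronize();               // wait for device 0
    cudaFree(d0);
    cudaStreamDestroy(streamA);

    cudaSetDevice(1);
    cudaDeviceSynchronize();               // wait for device 1
    cudaFree(d1);
    cudaEventDestroy(eventB);
    cudaStreamDestroy(streamB);
    printf("finished\n");
    return 0;
}
```

The ordering guarantee is the same as with cudaEventSynchronize; the difference is that the host can go on queuing work for both devices while the dependency is resolved on the GPU side.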

Thank you for your answer. I didn’t know it was possible to associate an event with a device. I can kind of see how I could use that to do what I want.
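
For the loop in the original question, one possible way to apply events is sketched below. It uses two events: bbDone (recorded on s0 after BB) so the peer copy waits for the previous BB, and copyDone (recorded on s1 after the copy) so the next BB waits for the copy while AA is free to overlap with it. The waits are done with cudaStreamWaitEvent, which accepts an event from another device. The kernel bodies, the extra n parameter, the buffer sizes, and the event names are assumptions made for illustration; only the kernel names, buffer names, and stream names come from the question.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e)); exit(1); } } while (0)

// Placeholder bodies: the real AA, BB, CC and DD are not shown in the thread.
__global__ void AA(const float* f1, float* f2, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f2[i] = f1[i] + 1.0f;
}
__global__ void BB(const float* f2, float* f3, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f3[i] = 2.0f * f2[i];
}
__global__ void CC(float* f4, const float* f3, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f4[i] += f3[i];
}
__global__ void DD(const float* f4, float* f5, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f5[i] = f4[i];
}

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) { printf("This sketch needs two GPUs.\n"); return 0; }

    const int n = 1 << 20, T = 10;         // assumed sizes
    const size_t sizeF3 = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;
    float *F1, *F2, *F3_1, *F3_2, *F4, *F5;
    cudaStream_t s0, s1;
    cudaEvent_t bbDone, copyDone;

    CHECK(cudaSetDevice(0));               // s0, bbDone, F1, F2, F3_1 live on device 0
    CHECK(cudaStreamCreate(&s0));
    CHECK(cudaEventCreateWithFlags(&bbDone, cudaEventDisableTiming));
    CHECK(cudaMalloc(&F1, sizeF3)); CHECK(cudaMalloc(&F2, sizeF3)); CHECK(cudaMalloc(&F3_1, sizeF3));
    CHECK(cudaMemset(F1, 0, sizeF3));

    CHECK(cudaSetDevice(1));               // s1, copyDone, F3_2, F4, F5 live on device 1
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming));
    CHECK(cudaMalloc(&F3_2, sizeF3)); CHECK(cudaMalloc(&F4, sizeF3)); CHECK(cudaMalloc(&F5, sizeF3));
    CHECK(cudaMemset(F4, 0, sizeF3));

    // First iteration
    CHECK(cudaSetDevice(0));
    AA<<<blocks, threads, 0, s0>>>(F1, F2, n);
    BB<<<blocks, threads, 0, s0>>>(F2, F3_1, n);
    CHECK(cudaEventRecord(bbDone, s0));    // F3_1 is ready once this event completes
    CHECK(cudaSetDevice(1));
    DD<<<blocks, threads, 0, s1>>>(F4, F5, n);

    // Other iterations
    for (int t = 1; t < T; t++) {
        CHECK(cudaSetDevice(1));
        CHECK(cudaStreamWaitEvent(s1, bbDone, 0));    // copy waits for the previous BB
        CHECK(cudaMemcpyPeerAsync(F3_2, 1, F3_1, 0, sizeF3, s1));
        CHECK(cudaEventRecord(copyDone, s1));         // F3_1 may be overwritten after this
        CC<<<blocks, threads, 0, s1>>>(F4, F3_2, n);
        DD<<<blocks, threads, 0, s1>>>(F4, F5, n);

        CHECK(cudaSetDevice(0));
        AA<<<blocks, threads, 0, s0>>>(F1, F2, n);    // can overlap with the copy
        CHECK(cudaStreamWaitEvent(s0, copyDone, 0));  // BB waits for the copy
        BB<<<blocks, threads, 0, s0>>>(F2, F3_1, n);
        CHECK(cudaEventRecord(bbDone, s0));
    }

    CHECK(cudaSetDevice(0)); CHECK(cudaDeviceSynchronize());
    CHECK(cudaSetDevice(1)); CHECK(cudaDeviceSynchronize());
    CHECK(cudaGetLastError());
    printf("finished\n");
    return 0;                              // resources are released at process exit
}
```

Because the host never blocks inside the loop, both devices can keep working in parallel; only the copy and BB are held back by the two events.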