Hello everyone,
I have four kernels, two running on the first device (AA, BB) and two on the second device (CC, DD). I would like F3_1 from the previous iteration to be copied to the second device (into F3_2), but I also want to be sure that BB will not start until the copy is finished. Right now I am able to have the two devices working in parallel, but I can't enforce my two conditions. I hope this makes sense; your help is really appreciated.
Here is my code (simplified):
cudaSetDevice(0);
cudaStream_t s0;
cudaStreamCreate(&s0);
cudaSetDevice(1);
cudaStream_t s1;
cudaStreamCreate(&s1);
//First iteration
cudaSetDevice(0);
AA<<<block,threads,0,s0>>>(F1,F2);
BB<<<block,threads,0,s0>>>(F2,F3_1);
cudaSetDevice(1); // s1 was created on device 1, so that device must be current for this launch
DD<<<block,threads,0,s1>>>(F4,F5);
//Other iterations
for(int t=1;t<T;t++)
{
    cudaSetDevice(0);
    AA<<<block,threads,0,s0>>>(F1,F2);
    BB<<<block,threads,0,s0>>>(F2,F3_1);
    cudaSetDevice(1);
    cudaMemcpyPeerAsync(F3_2,1,F3_1,0,sizeF3,s1);
    CC<<<block,threads,0,s1>>>(F4,F3_2);
    DD<<<block,threads,0,s1>>>(F4,F5);
}
Robin
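One possible way to enforce both conditions is to use two CUDA events together with cudaStreamWaitEvent, which can make a stream on one device wait for an event recorded on another. The sketch below is only an illustration under several assumptions: the kernel bodies, the buffer size n, the event names bbDone and copyDone, and the helper runPipeline are all hypothetical, and the code falls back to a single GPU when only one is present. The copy is issued on s1 after waiting for the previous BB, and the next BB is issued on s0 after waiting for the copy, so AA can still overlap with the transfer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort on any CUDA runtime error.
#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e)); \
    exit(1); } } while (0)

// Placeholder kernel bodies (assumptions, not the original kernels).
__global__ void AA(float* F1, float* F2, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { F1[i] += 1.0f; F2[i] = F1[i]; }
}
__global__ void BB(const float* F2, float* F3_1, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) F3_1[i] = F2[i];
}
__global__ void CC(float* F4, const float* F3_2, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) F4[i] = F3_2[i];
}
__global__ void DD(const float* F4, float* F5, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) F5[i] += F4[i];
}

// Hypothetical helper: runs T iterations and returns F5[0] for a host-side check.
float runPipeline(int T) {
    const int n = 1024, threads = 256, blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(float);
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    const int dev0 = 0, dev1 = (count > 1) ? 1 : 0;  // assumption: reuse device 0 if only one GPU

    float *F1, *F2, *F3_1, *F3_2, *F4, *F5;
    cudaStream_t s0, s1;
    cudaEvent_t bbDone, copyDone;

    CHECK(cudaSetDevice(dev0));
    CHECK(cudaStreamCreate(&s0));
    CHECK(cudaEventCreateWithFlags(&bbDone, cudaEventDisableTiming));
    CHECK(cudaMalloc(&F1, bytes)); CHECK(cudaMalloc(&F2, bytes)); CHECK(cudaMalloc(&F3_1, bytes));
    CHECK(cudaMemset(F1, 0, bytes));
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaSetDevice(dev1));
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming));
    CHECK(cudaMalloc(&F3_2, bytes)); CHECK(cudaMalloc(&F4, bytes)); CHECK(cudaMalloc(&F5, bytes));
    CHECK(cudaMemset(F4, 0, bytes)); CHECK(cudaMemset(F5, 0, bytes));
    CHECK(cudaDeviceSynchronize());

    // First iteration: nothing to copy yet.
    CHECK(cudaSetDevice(dev0));
    AA<<<blocks, threads, 0, s0>>>(F1, F2, n);
    BB<<<blocks, threads, 0, s0>>>(F2, F3_1, n);
    CHECK(cudaEventRecord(bbDone, s0));           // F3_1 of iteration 0 is ready after this point
    CHECK(cudaSetDevice(dev1));
    DD<<<blocks, threads, 0, s1>>>(F4, F5, n);

    for (int t = 1; t < T; t++) {
        CHECK(cudaSetDevice(dev1));
        CHECK(cudaStreamWaitEvent(s1, bbDone, 0)); // copy waits for BB of iteration t-1
        CHECK(cudaMemcpyPeerAsync(F3_2, dev1, F3_1, dev0, bytes, s1));
        CHECK(cudaEventRecord(copyDone, s1));      // marks the end of the copy in s1
        CC<<<blocks, threads, 0, s1>>>(F4, F3_2, n);
        DD<<<blocks, threads, 0, s1>>>(F4, F5, n);

        CHECK(cudaSetDevice(dev0));
        AA<<<blocks, threads, 0, s0>>>(F1, F2, n); // does not touch F3_1, may overlap the copy
        CHECK(cudaStreamWaitEvent(s0, copyDone, 0)); // BB waits until F3_1 has been copied
        BB<<<blocks, threads, 0, s0>>>(F2, F3_1, n);
        CHECK(cudaEventRecord(bbDone, s0));
    }
    CHECK(cudaGetLastError());

    float result = 0.0f;
    CHECK(cudaSetDevice(dev0));
    CHECK(cudaStreamSynchronize(s0));
    CHECK(cudaFree(F1)); CHECK(cudaFree(F2)); CHECK(cudaFree(F3_1));
    CHECK(cudaEventDestroy(bbDone)); CHECK(cudaStreamDestroy(s0));
    CHECK(cudaSetDevice(dev1));
    CHECK(cudaStreamSynchronize(s1));
    CHECK(cudaMemcpy(&result, F5, sizeof(float), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(F3_2)); CHECK(cudaFree(F4)); CHECK(cudaFree(F5));
    CHECK(cudaEventDestroy(copyDone)); CHECK(cudaStreamDestroy(s1));
    return result;
}
```

With these placeholder kernels, F3_1 holds t+1 after iteration t, so if CC always sees the value from the previous iteration, F5[0] should end up as T(T-1)/2; for example, runPipeline(5) should return 10. Because the wait on bbDone is issued before bbDone is re-recorded in the same iteration, each copy should depend only on the BB of the previous iteration.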