CUDA stream management

Hi All,

I am trying to run multiple kernels simultaneously as follows:

void *buf_a, *buf_a1, *buf_b, *buf_b1;
// ... initialize buf_a and buf_b ...
cudaDeviceSynchronize();

for (int i=0; i<maxIter; ++i)
{
     computeKernelA<<<,,,stream1>>>(someout_a, buf_a);
     computeKernelB<<<,,,stream1>>>(someout_b, buf_b, someout_a);
     // ... do something with someout_b ...
     bufKernelA<<<,,,stream2>>>(buf_a1);
     bufKernelB<<<,,,stream3>>>(buf_b1);
     swap(buf_a1, buf_a);
     swap(buf_b1, buf_b);
}

First, buf_a and buf_b are initialized. Within the loop, computeKernelA is launched on stream1 to process the data stored in buf_a. computeKernelB is then launched on stream1 as well, because it has a data dependency on the output of computeKernelA. In the meantime, stream2 and stream3 run bufKernelA and bufKernelB, respectively, to fetch data in advance for the next iteration.

The simplest way to do this correctly is to put cudaDeviceSynchronize() before the swap calls, but that causes a very long latency before the next iteration starts. Also, I want computeKernelA for the next iteration to start as soon as bufKernelA on stream2 has completed, while computeKernelB for the next iteration needs to wait for the result of computeKernelA as well as for the completion of bufKernelB on stream3 from the previous iteration.

Can I do this without using cudaDeviceSynchronize()?

Thanks!

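Yes. You should use CUDA events to synchronize between your streams: record an event on the producing stream with cudaEventRecord() and make the consuming stream wait on it with cudaStreamWaitEvent(). Please read about CUDA streams and you will know what to do: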
[url]http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf[/url]
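
For reference, the event-based approach might look roughly like the sketch below. It is not a drop-in solution: the kernel bodies, the single bufKernel standing in for both bufKernelA and bufKernelB, the event names, the runPipeline wrapper, and the launch configuration are all assumptions made for illustration. Only the stream and event calls are actual CUDA runtime API. Each iteration issues its waits before re-recording the same events, so every wait refers to the work recorded in the previous iteration.

```cuda
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Toy stand-ins for the kernels in the question (bodies are assumptions).
__global__ void bufKernel(float* dst, int n, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = value;                 // "prefetch" data for the next iteration
}
__global__ void computeKernelA(float* outA, const float* bufA, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) outA[i] = 2.0f * bufA[i];
}
__global__ void computeKernelB(float* outB, const float* bufB, const float* outA, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) outB[i] += outA[i] + bufB[i];   // accumulate so every iteration is checked
}

// Runs the pipeline; returns the number of wrong output elements (-1 on a CUDA error).
int runPipeline(int maxIter) {
    const int n = 1 << 16, threads = 256, blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(float);
    float *buf_a, *buf_a1, *buf_b, *buf_b1, *someout_a, *someout_b;
    float** bufs[] = {&buf_a, &buf_a1, &buf_b, &buf_b1, &someout_a, &someout_b};
    for (float** p : bufs) {
        if (cudaMalloc(p, bytes) != cudaSuccess) return -1;
        cudaMemset(*p, 0, bytes);              // buf_a and buf_b start as all zeros
    }
    cudaStream_t stream1, stream2, stream3;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaStreamCreate(&stream3);
    cudaEvent_t evBufA, evBufB, evCompA, evCompB;
    cudaEvent_t* events[] = {&evBufA, &evBufB, &evCompA, &evCompB};
    for (cudaEvent_t* e : events) cudaEventCreateWithFlags(e, cudaEventDisableTiming);
    cudaDeviceSynchronize();                   // initialization finished before the loop

    for (int i = 0; i < maxIter; ++i) {
        // Waiting on an event that was never recorded (first iteration) is a no-op.
        cudaStreamWaitEvent(stream1, evBufA, 0);   // buf_a was filled by the last prefetch
        cudaStreamWaitEvent(stream2, evCompA, 0);  // buf_a1 is no longer being read
        cudaStreamWaitEvent(stream3, evCompB, 0);  // buf_b1 is no longer being read

        computeKernelA<<<blocks, threads, 0, stream1>>>(someout_a, buf_a, n);
        cudaEventRecord(evCompA, stream1);

        cudaStreamWaitEvent(stream1, evBufB, 0);   // only computeKernelB needs buf_b
        computeKernelB<<<blocks, threads, 0, stream1>>>(someout_b, buf_b, someout_a, n);
        cudaEventRecord(evCompB, stream1);

        // Prefetch the inputs of iteration i + 1 while stream1 is still computing.
        bufKernel<<<blocks, threads, 0, stream2>>>(buf_a1, n, float(i + 1));
        cudaEventRecord(evBufA, stream2);
        bufKernel<<<blocks, threads, 0, stream3>>>(buf_b1, n, 10.0f * (i + 1));
        cudaEventRecord(evBufB, stream3);

        std::swap(buf_a1, buf_a);              // host-side pointer swap, no sync needed
        std::swap(buf_b1, buf_b);
    }
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;  // one sync after the loop

    // Iteration k adds 2*k + 10*k to each element, so the total is 6*maxIter*(maxIter-1).
    std::vector<float> host(n);
    cudaMemcpy(host.data(), someout_b, bytes, cudaMemcpyDeviceToHost);
    const float expected = 6.0f * maxIter * (maxIter - 1);
    int bad = 0;
    for (float v : host) bad += (v != expected);

    for (cudaEvent_t* e : events) cudaEventDestroy(*e);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaStreamDestroy(stream3);
    for (float** p : bufs) cudaFree(*p);
    return bad;
}
```

Calling runPipeline(4) from main should return 0 if the ordering is right. The two waits on evCompA and evCompB are there because the prefetch kernels overwrite the buffers that the compute kernels of the previous iteration were reading. If the host has to read someout_b inside the loop, cudaEventSynchronize(evCompB) blocks only until that event completes instead of waiting for the whole device.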