Running several streams asynchronously


I have N matrices that should be processed.
The results in each matrix do not depend on the results from the other N-1 matrices.
They are all independent.
To use streams, I want to create N streams and then run the processing for each matrix in its own stream.

cudaStreamCreate(&stream1);
kernel_1<<<grid, block, 0, stream1>>>( …, dev2, … );
…
kernel_Last<<<grid, block, 0, stream1>>>( …, dev3, … );

Each kernel in the stream works on the same memory as the kernel before it.
How can I make sure that kernel_2 has finished running before kernel_3 is launched?

Thank you,

If the work is independent, there should be no need to ensure that one matrix's processing is finished before starting another's. Nevertheless, to answer your question: to make sure that kernel_2 has finished before kernel_3 runs, you can do exactly what you have shown, i.e. launch kernel_2 into a particular stream, and then launch kernel_3 into the same stream.

Stream semantics are really simple:

  1. Items launched into the same stream will have their execution serialized, in launch order.
  2. Items launched into separate streams have no defined ordering relationship enforced by CUDA streams.
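As a minimal sketch of rule 1 (the kernel names, launch dimensions, and buffer are placeholders I made up, not from your code), two kernels launched into the same stream serialize automatically:

```cuda
#include <cstdio>

__global__ void kernel_A(float *buf) { buf[threadIdx.x] += 1.0f; }
__global__ void kernel_B(float *buf) { buf[threadIdx.x] *= 2.0f; }

int main() {
    float *dev;
    cudaMalloc(&dev, 32 * sizeof(float));
    cudaMemset(dev, 0, 32 * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Rule 1: both launches go into stream s, so kernel_B is
    // guaranteed not to start until kernel_A has finished.
    kernel_A<<<1, 32, 0, s>>>(dev);
    kernel_B<<<1, 32, 0, s>>>(dev);

    cudaStreamSynchronize(s);   // host waits for both kernels
    cudaStreamDestroy(s);
    cudaFree(dev);
    return 0;
}
```

No event or explicit synchronization between the two launches is needed; the stream itself enforces the ordering.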

Hi Robert,

Thank you very much for the fast reply.
My code looks like the following:

for (int i = 0; i < N; i++)
    err = cudaStreamCreate(&stream[i]);

for (int i = 0; i < N; i++)
    kernel_1<<<grid, block, 0, stream[i]>>>( …, dev2, … );
    kernel_Last<<<grid, block, 0, stream[i]>>>( …, dev3, … );

for (int i = 0; i < N; i++)
    err = cudaStreamSynchronize(stream[i]);

Is this the right and fastest way to make sure that all streams have finished?

Best regards,

The middle for-loop looks strange to me: as written, without braces, only the kernel_1 launch is inside the loop body, so kernel_Last is launched just once, after the loop, into the last stream. Maybe that is what you intend. I don't know.

The final for-loop will synchronize all streams. Alternatively, a single cudaDeviceSynchronize() at that point waits for all outstanding work on the device.
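For completeness, here is a sketch of the N-stream pattern with the loop braces made explicit, so both kernels for matrix i really do go into stream[i] (the kernels, buffer sizes, and N are placeholders I invented for illustration):

```cuda
#include <cstdio>

#define N 4

__global__ void kernel_1(float *m)    { m[threadIdx.x] += 1.0f; }
__global__ void kernel_Last(float *m) { m[threadIdx.x] *= 2.0f; }

int main() {
    cudaStream_t stream[N];
    float *dev[N];

    for (int i = 0; i < N; i++) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&dev[i], 32 * sizeof(float));
        cudaMemset(dev[i], 0, 32 * sizeof(float));
    }

    // Braces make the intent explicit: both kernels for matrix i go
    // into stream[i]. They serialize within that stream (rule 1),
    // while different matrices may overlap across streams (rule 2).
    for (int i = 0; i < N; i++) {
        kernel_1<<<1, 32, 0, stream[i]>>>(dev[i]);
        kernel_Last<<<1, 32, 0, stream[i]>>>(dev[i]);
    }

    // Either synchronize each stream individually…
    for (int i = 0; i < N; i++)
        cudaStreamSynchronize(stream[i]);
    // …or wait for all outstanding device work at once:
    // cudaDeviceSynchronize();

    for (int i = 0; i < N; i++) {
        cudaStreamDestroy(stream[i]);
        cudaFree(dev[i]);
    }
    return 0;
}
```

Either synchronization form is correct here; per-stream synchronization only matters if the host wants to consume some matrices' results before the others are done.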