Hello,

I have N matrices that should be processed.

The results for each matrix do not depend on the results for the other N-1 matrices.

They are all independent.

To take advantage of streams, I want to create N streams, one per matrix, and run the processing of each matrix in its own stream.

cudaStreamCreate(&stream1);

kernel_1<<<grid, block, 0, stream1>>>(…, dev2, …);

…

kernel_Last<<<grid, block, 0, stream1>>>(…, dev3, …);

Each kernel in a stream works on the same memory as the kernel launched before it.

How can I make sure that kernel_2 has finished running before kernel_3 starts?
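
For reference, here is a minimal sketch of the setup I have in mind. As far as I understand, kernels launched into the same stream execute in the order they were issued, so a later kernel does not start until the earlier one in that stream has completed, and cudaStreamSynchronize is only needed when the host wants to read the results. The kernels add_one and times_two, the function run_independent_pipelines, and the buffer sizes are hypothetical stand-ins for my real kernel_1 … kernel_Last, not my actual code.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-ins for kernel_1 ... kernel_Last.
__global__ void add_one(float *m, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) m[i] += 1.0f;
}

__global__ void times_two(float *m, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) m[i] *= 2.0f;
}

// Processes num_matrices independent buffers of n floats, one stream each.
// Returns true if every element ends up as (0 + 1) * 2 = 2, which only
// happens when add_one ran before times_two in every stream.
// Cleanup on the error paths is omitted to keep the sketch short.
bool run_independent_pipelines(int num_matrices, int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    const int block = 256;
    const int grid = (n + block - 1) / block;

    std::vector<cudaStream_t> streams(num_matrices);
    std::vector<float *> dev(num_matrices, nullptr);
    std::vector<std::vector<float>> host(num_matrices, std::vector<float>(n));

    for (int s = 0; s < num_matrices; ++s) {
        if (cudaStreamCreate(&streams[s]) != cudaSuccess) return false;
        if (cudaMalloc(&dev[s], bytes) != cudaSuccess) return false;

        // Everything below is queued into streams[s] and runs in this order.
        cudaMemsetAsync(dev[s], 0, bytes, streams[s]);
        add_one<<<grid, block, 0, streams[s]>>>(dev[s], n);
        times_two<<<grid, block, 0, streams[s]>>>(dev[s], n);
        cudaMemcpyAsync(host[s].data(), dev[s], bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    bool ok = true;
    for (int s = 0; s < num_matrices; ++s) {
        // Block the host until all work queued in this stream has finished.
        if (cudaStreamSynchronize(streams[s]) != cudaSuccess) ok = false;
        for (int i = 0; i < n && ok; ++i) {
            if (host[s][i] != 2.0f) ok = false;
        }
        cudaFree(dev[s]);
        cudaStreamDestroy(streams[s]);
    }
    return ok;
}
```

Calling run_independent_pipelines(4, 1024) from main should return true if the in-stream ordering is what I expect. Kernels in different streams may overlap or run in any order relative to each other, which is fine here because the matrices are independent.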

Thank you,

Zvika