I have N matrices that should be processed.
The results for each matrix do not depend on the results from the other N-1 matrices.
They are all independent.
To use streams, I want to create N streams, one per matrix, and then run the processing in each of them.
cudaStream_t stream1;
cudaStreamCreate(&stream1);
kernel_1<<<grid, block, 0, stream1>>>(…, dev2, …);
kernel_Last<<<grid, block, 0, stream1>>>(…, dev3, …);
Each kernel in a stream works on the same memory as the kernel before it.
How can I make sure that kernel_2 has finished running before kernel_3 is called?
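
To make the setup above concrete, here is a minimal sketch of what I have in mind, not my real code. It relies on the documented behaviour that work issued to the same stream executes in issue order, so the second kernel in a stream does not start until the first has completed. The kernels `add_one` and `times_two`, the function `process_all`, and the zero-filled input matrices are all hypothetical stand-ins for my actual kernels and data.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stage 1: add 1 to every element of one matrix.
__global__ void add_one(float *m, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) m[i] += 1.0f;
}

// Hypothetical stage 2: double every element; it reads what stage 1 wrote.
__global__ void times_two(float *m, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) m[i] *= 2.0f;
}

// Processes num_matrices zero-filled matrices of n elements, one stream each.
// Returns true if every element ends up as (0 + 1) * 2 = 2.
// Sketch only: resources are not released on the early-return error paths.
bool process_all(int num_matrices, int n) {
    std::vector<cudaStream_t> streams(num_matrices);
    std::vector<float *> dev(num_matrices);
    int block = 256;
    int grid = (n + block - 1) / block;

    for (int k = 0; k < num_matrices; ++k) {
        if (cudaStreamCreate(&streams[k]) != cudaSuccess) return false;
        if (cudaMalloc(&dev[k], n * sizeof(float)) != cudaSuccess) return false;
        cudaMemsetAsync(dev[k], 0, n * sizeof(float), streams[k]);
        // Same stream: times_two starts only after add_one has completed.
        add_one<<<grid, block, 0, streams[k]>>>(dev[k], n);
        times_two<<<grid, block, 0, streams[k]>>>(dev[k], n);
    }

    bool ok = true;
    std::vector<float> host(n);
    for (int k = 0; k < num_matrices; ++k) {
        // Block the host until all work queued in stream k is done.
        if (cudaStreamSynchronize(streams[k]) != cudaSuccess) ok = false;
        cudaMemcpy(host.data(), dev[k], n * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) ok = ok && (host[i] == 2.0f);
        cudaFree(dev[k]);
        cudaStreamDestroy(streams[k]);
    }
    return ok;
}
```

Calling `process_all(4, 1000)` from host code would run four independent two-kernel pipelines. No explicit synchronization is placed between the two kernels of a stream; the host only waits with `cudaStreamSynchronize` before reading the results back.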