While loop with streams and concurrency

I need to know whether the following concept is going to work. On a GPU I have a set of 2D matrices in memory. For each matrix I have a while loop that must converge. The number of iterations needed for convergence can be different for each matrix, so the while loop iterates until the processing for all matrices has converged.

In the loop there are two calls to cublasCgemmBatched. The first multiplies each matrix by a vector, resulting in a new set of vectors. The new vectors are processed with a kernel function, and the result of that processing is the input to the second call to cublasCgemmBatched, where each new vector is multiplied into its corresponding matrix. The result is processed by another kernel that applies the convergence criteria and fills in a vector of bool. Another kernel then loops through the data and copies each vector to an output buffer (still on the device) if its bool is true. Finally, all the bools are copied to the host and a loop determines whether they have all converged yet. When they have all converged, the loop stops.

The loop converges for all matrices, but the output buffer is not always filled in.

Question: Is that basic concept going to work? Am I missing something like a sync?

Thanks, Roger

Yes, this can work, but it appears to require stream synchronization, and that synchronization has serialized work that previously ran with full concurrency over 4 GPUs.
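For reference, here is a minimal sketch of the loop structure described above, under stated assumptions rather than the actual code. The kernels iterateStep and checkAndCopy and the function runUntilConverged are hypothetical names, and a simple halving step stands in for the two cublasCgemmBatched calls and the processing kernel. In real code, cublasSetStream(handle, stream) would put the batched GEMMs in the same stream as the kernels. The sketch issues everything to one stream, copies the flags to pinned host memory with cudaMemcpyAsync, and waits with cudaStreamSynchronize on that stream only, before the host reads the flags.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the two cublasCgemmBatched calls plus the
// processing kernel: each "matrix" is a single float that is halved per pass.
__global__ void iterateStep(float *state, const bool *converged, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && !converged[i]) state[i] *= 0.5f;
}

// Applies the convergence test, sets the flag, and copies the result to the
// output buffer the first time an item converges.
__global__ void checkAndCopy(const float *state, bool *converged,
                             float *out, float tol, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && !converged[i] && state[i] < tol) {
        out[i] = state[i];
        converged[i] = true;
    }
}

// Runs the loop on one stream and returns the number of iterations used.
int runUntilConverged(float *hOut, int n) {
    const float tol = 1.0f;
    const int maxIters = 1000;              // safety cap, an assumption
    std::vector<float> hInit(n);
    for (int i = 0; i < n; ++i) hInit[i] = float(1 << i);  // differing iteration counts

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *dState = nullptr, *dOut = nullptr;
    bool *dFlags = nullptr, *hFlags = nullptr;
    cudaMalloc((void **)&dState, n * sizeof(float));
    cudaMalloc((void **)&dOut, n * sizeof(float));
    cudaMalloc((void **)&dFlags, n * sizeof(bool));
    cudaMallocHost((void **)&hFlags, n * sizeof(bool));  // pinned host memory

    cudaMemcpyAsync(dState, hInit.data(), n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    cudaMemsetAsync(dOut, 0, n * sizeof(float), stream);
    cudaMemsetAsync(dFlags, 0, n * sizeof(bool), stream);

    const int threads = 32;
    const int blocks = (n + threads - 1) / threads;
    int iters = 0;
    bool allDone = false;
    while (!allDone && iters < maxIters) {
        // Work in the same stream executes in issue order.
        iterateStep<<<blocks, threads, 0, stream>>>(dState, dFlags, n);
        checkAndCopy<<<blocks, threads, 0, stream>>>(dState, dFlags, dOut, tol, n);
        cudaMemcpyAsync(hFlags, dFlags, n * sizeof(bool),
                        cudaMemcpyDeviceToHost, stream);
        // Wait for this stream only; the host must not read hFlags before this.
        cudaStreamSynchronize(stream);
        allDone = true;
        for (int i = 0; i < n; ++i) allDone = allDone && hFlags[i];
        ++iters;
    }

    cudaMemcpyAsync(hOut, dOut, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFree(dState);
    cudaFree(dOut);
    cudaFree(dFlags);
    cudaFreeHost(hFlags);
    cudaStreamDestroy(stream);
    return iters;
}

int main() {
    const int n = 8;
    float out[n];
    int iters = runUntilConverged(out, n);
    printf("iterations: %d, last error: %s\n", iters,
           cudaGetErrorString(cudaGetLastError()));
    for (int i = 0; i < n; ++i) printf("out[%d] = %f\n", i, out[i]);
    return 0;
}
```

The only host-side wait here is on the stream that owns the flags, not a device-wide cudaDeviceSynchronize. With several GPUs, one such stream per device would keep each wait local to its device. Whether that restores concurrency also depends on how the host threads drive the devices, which is not described here.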