In my application, I need to iteratively calculate a vector “A” from another vector “B”. After finishing one iteration, I start the next one (inside a “for” loop) after setting B_k = A_(k-1).

I use multiple streams for my application.

Different kernels, assigned to different streams, each calculate one part of A from one part of B.

I use cudaStreamSynchronize() for synchronization.

The problem is that I get correct results for the first iteration (that is, when “iter=0” in the code below).

However, I get wrong results from the second iteration onward.

I don’t know the reason. Is my synchronization method wrong?

The pseudocode is as follows:

    cudaStream_t streams[num_streams];
    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);
    }

    for (int iter = 0; iter < Max_iter; ++iter)
    {
        cudaMemset(); // reset “A” to zero

        // different kernels are assigned to different streams;
        // each kernel calculates one part of A from one part of B
        Kernel1<<<…, streams[0]>>>(…);
        Kernel2<<<…, streams[1]>>>(…);
        …
        KernelN<<<…, streams[num_streams-1]>>>(…);

        // wait for every stream to finish before copying
        for (int i = 0; i < num_streams; i++) {
            cudaStreamSynchronize(streams[i]);
        }

        cudaMemcpy(); // copy A to B
    }

    for (int i = 0; i < num_streams; i++) {
        cudaStreamDestroy(streams[i]);
    }
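To make the pattern above concrete, here is a minimal self-contained sketch of the same loop. It is not my real code: the kernel `scale_chunk`, the helper `run_iterations`, the `CHECK` macro, and the chunk layout are hypothetical stand-ins, and it assumes that each kernel writes a disjoint chunk of A and reads only the matching chunk of B. Two things differ from the pseudocode, and either could matter for the wrong results: the reset of A is issued in the same stream as the kernel that follows it, so the ordering does not rely on default-stream behavior, and the device-to-device copy is followed by an explicit `cudaDeviceSynchronize()`, since such a copy may return to the host before it has finished. Every call is also checked for errors.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-checking macro: abort on the first failing CUDA call.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for Kernel1..KernelN: a[i] += 2 * b[i] on one chunk.
__global__ void scale_chunk(float* a, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 2.0f * b[i];
}

// Runs max_iter rounds of "A = f(B); B = A" and returns the final B.
std::vector<float> run_iterations(std::vector<float> host_b,
                                  int num_streams, int max_iter) {
    const int n = static_cast<int>(host_b.size());
    const int chunk = (n + num_streams - 1) / num_streams;
    const size_t bytes = n * sizeof(float);

    float *d_a = nullptr, *d_b = nullptr;
    CHECK(cudaMalloc(&d_a, bytes));
    CHECK(cudaMalloc(&d_b, bytes));
    CHECK(cudaMemcpy(d_b, host_b.data(), bytes, cudaMemcpyHostToDevice));

    std::vector<cudaStream_t> streams(num_streams);
    for (auto& s : streams) CHECK(cudaStreamCreate(&s));

    for (int iter = 0; iter < max_iter; ++iter) {
        for (int s = 0; s < num_streams; ++s) {
            const int offset = s * chunk;
            const int len = std::min(chunk, n - offset);
            if (len <= 0) continue;
            // Reset this chunk of A in the same stream as the kernel,
            // so the memset is ordered before the kernel that uses it.
            CHECK(cudaMemsetAsync(d_a + offset, 0, len * sizeof(float),
                                  streams[s]));
            scale_chunk<<<(len + 255) / 256, 256, 0, streams[s]>>>(
                d_a + offset, d_b + offset, len);
            CHECK(cudaGetLastError());  // catches launch configuration errors
        }
        // Wait for all chunks of A before overwriting B.
        for (auto& s : streams) CHECK(cudaStreamSynchronize(s));
        CHECK(cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice));
        // A device-to-device copy may return before it completes, so wait
        // explicitly before the next iteration starts reading B.
        CHECK(cudaDeviceSynchronize());
    }

    CHECK(cudaMemcpy(host_b.data(), d_b, bytes, cudaMemcpyDeviceToHost));
    for (auto& s : streams) CHECK(cudaStreamDestroy(s));
    CHECK(cudaFree(d_a));
    CHECK(cudaFree(d_b));
    return host_b;
}
```

With this stand-in kernel, each iteration doubles every element, so calling `run_iterations({1, 2, 3, 4, 5}, 2, 3)` should give 8, 16, 24, 32, 40. If a version like this is correct for every iteration while the real code is not, the likely suspects are kernels writing overlapping parts of A, a kernel reading a part of B that another stream has not finished with, or an unchecked error from an earlier call.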