Erroneous results from streams in a "for" loop

In my application, I need to iteratively calculate a vector "A" from another vector "B". After finishing one iteration, I start the next one (via a "for" loop) after doing B_k = A_(k-1).
I use multiple streams for my application.
Different kernels assigned to different streams each calculate a part of A from a part of B.
I use cudaStreamSynchronize() for synchronization.
The problem is that I get accurate results for the first iteration (that is, when "iter=0" in the code below).
However, I get wrong results from the second iteration onward.
I don't know the reason. Is my synchronization method wrong?

The pseudo-code is as follows:

cudaStream_t streams[num_streams];

for (int i = 0; i < num_streams; i++)
    cudaStreamCreate(&streams[i]);

for (int iter = 0; iter < Max_iter; ++iter) {

    cudaMemset(...);    // reset "A" to zero

    Kernel1<<<grid, block, 0, streams[0]>>>(...);
    Kernel2<<<grid, block, 0, streams[1]>>>(...);
    ...
    KernelN<<<grid, block, 0, streams[num_streams - 1]>>>(...);
    // different kernels are assigned to different streams;
    // each kernel calculates a part of A from a part of B

    for (int i = 0; i < num_streams; i++)
        cudaStreamSynchronize(streams[i]);

    cudaMemcpy(...);    // copy A to B
}

Stream synchronization does not necessarily guarantee block or grid synchronization.
Are you sure your kernels are independent, i.e. that they work on independent data and cannot create races?

Thank you very much for your comments.

The basic idea is that I want to finish the calculation of all the elements of "A" from "B" in each iteration using different streams.
I use cudaStreamSynchronize() to wait for all the streams. If all the streams have finished in each iteration, the calculation of all the elements of "A" should be done, right?

If the calculation goes on to the next loop iteration, what is the difference from the first iteration, that is, "iter=0"?

What is the meaning of “stream synchronization does not necessarily guarantee block or grid synchronization”?

“What is the meaning of “stream synchronization does not necessarily guarantee block or grid synchronization”?”

It means there are 3 possible levels of synchronization: block, grid, and stream.
Depending on the code and the data, you may need to synchronize at one, several, or all of them.
This is to prevent data races: reading before writing, etc.

Within a block, all threads must commit their writes before threads are allowed to read, etc.
Across blocks, blocks may need to wait for each other to prevent races.
A stream may need to wait for another stream before it can continue processing.

You may perhaps have to let all streams wait for the memory set first:

cudaMemset() //reset “A” to zero

I am not sure about cudaMemset (I have not used it before),
but it may constitute a memory copy in principle, such that all subsequent streams (kernels) referencing that memory should synchronize on it first; otherwise you have a race: you are reading before you have completed writing.

I realized that I need to wait for cudaMemset() and cudaMemcpy() to finish before the next step,
so I put cudaDeviceSynchronize() after cudaMemset() and cudaMemcpy().
It doesn't work.

If it works correctly on the 1st run, but not the 2nd, you may have races elsewhere.

Run cuda-memcheck, including its racecheck tool.

If that does not help, catch the kernel on the 2nd run, or dump its input data on the 2nd run, in order to see which value diverges where.

Thank you very much, little_jimmy.

I found the problem. It is not from stream synchronization.
As I said, I have different kernels assigned to different streams.
In total, I have 11 kernels. For the first 10 kernels, blockDim.z = 2; for the 11th kernel, blockDim.z = 1. The other parameters are the same.

One more interesting thing I didn't mention is the running time. For one iteration, the running time is 5 s, so if I run 5 iterations, the total running time should be about 25 s. If I don't run the 11th kernel, the running time is indeed very close to 25 s. However, if I run all 11 kernels, the running time is almost halved, about 12 s.

I changed blockDim.z to 2 for the 11th kernel and everything is OK. However, if I change blockDim.z back, the problem returns. I don't know the reason.

For now, I have increased the number of kernels and set blockDim.z = 1 for each kernel. The results are correct.

Any suggestion for this weird problem?

And what does the 11th kernel do?
What does blockDim.z do such that you can change it like that?

All 11 kernels are the same. The reason I need to split along blockDim.z is that the total dimension is more than 1024. I have to split blockDim.z to reduce the dimension of the block.

Do the kernels have functions? What do the kernels do, other than compute?
Do you have conditionality in the kernels as functions, or not?

I didn't call any functions. Anyway, if I have time, I will take a further look at it and keep you posted.
Thanks a lot.

Functionality: I was referring to functionality, not functions.