error results from Stream in "for" loop

recrusader · January 27, 2016, 3:22am

In my application, I need to iteratively calculate a vector “A” from another vector “B”. After each iteration finishes, I set B_k = A_(k-1) and start the next iteration of a “for” loop.
I use multiple streams for my application.
Different kernels, assigned to different streams, each calculate part of A from part of B.
I use cudaStreamSynchronize() for synchronization.
The problem is that I get correct results for the first iteration (that is, when “iter=0” in the code below).
However, I get wrong results from the second iteration onward.
I don’t know the reason. Is my synchronization method wrong?

The pseudo-code is as follows:

cudaStream_t streams[num_streams];

for (int i = 0; i < num_streams; i++) {
    cudaStreamCreate(&streams[i]);
}

for (int iter = 0; iter < Max_iter; ++iter)
{
    cudaMemset(); // reset “A” to zero

    // different kernels are assigned to different streams;
    // each kernel calculates part of A from part of B
    Kernel1<<<…, streams[0]>>>(…);
    Kernel2<<<…, streams[1]>>>(…);
    …
    KernelN<<<…, streams[num_streams-1]>>>(…);

    for (int i = 0; i < num_streams; i++) {
        cudaStreamSynchronize(streams[i]);
    }

    cudaMemcpy(); // copy A to B
}

for (int i = 0; i < num_streams; i++) {
    cudaStreamDestroy(streams[i]);
}

little_jimmy · January 27, 2016, 5:42am

stream synchronization does not necessarily guarantee block or grid synchronization
are you sure your kernels are independent - work on independent data and can not create races?

recrusader · January 27, 2016, 2:32pm

Thank you very much for your comments.

The basic idea is that I want to finish calculating all the elements of “A” from “B” in each iteration, using different streams.
I use cudaStreamSynchronize() to wait for all the streams. If all the streams have finished in an iteration, the calculation of all the elements of “A” should be done, right?

When the calculation moves on to the next iteration, what is different from the first iteration, that is, “iter=0”?

What is the meaning of “stream synchronization does not necessarily guarantee block or grid synchronization”?

little_jimmy · January 27, 2016, 5:37pm

“What is the meaning of “stream synchronization does not necessarily guarantee block or grid synchronization”?”

it means there are 3 possible levels of synchronization: block, grid, stream
depending on the code and the data, you may need to synchronize on one, multiple or all
this is to prevent data races - reading before writing, etc

within blocks, all threads must commit writes, before threads are allowed to read, etc
across blocks, blocks may need to wait for each other to prevent races
a stream may need to wait for another stream, before it can continue processing

you may perhaps have to let all streams wait for the memory set first

cudaMemset() //reset “A” to zero

i am not sure about cudaMemset - i have not used it before
but it may constitute a memory copy in principle, such that all subsequent streams (kernels) referencing it should synchronize on it first; otherwise you have a race - you are reading before you have completed writing
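
The ordering described above can be made explicit with an event. What follows is only a sketch under stated assumptions, not the poster's code: the kernel partial_update, the function iterate, the buffer layout and the block size of 256 are all hypothetical. The streams are created as non-blocking so that nothing is ordered implicitly through the default stream; the reset is issued with cudaMemsetAsync on streams[0], and every other stream waits on an event recorded right behind it. With streams created by plain cudaStreamCreate(), the legacy default stream already orders a plain cudaMemset() against them, so this pattern matters mostly for non-blocking streams or per-thread default stream builds. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical kernel: accumulates into one slice of A from the same slice of B.
__global__ void partial_update(float* A, const float* B, int offset, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) A[offset + i] += 2.0f * B[offset + i];
}

// Runs max_iter iterations over n elements (B starts as all ones) and
// returns the final B. Every operation is ordered through streams and events.
std::vector<float> iterate(int n, int num_streams, int max_iter) {
    assert(n > 0 && num_streams > 0);
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);
    float *A = nullptr, *B = nullptr;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);

    std::vector<cudaStream_t> streams(num_streams);
    for (auto& s : streams) cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    cudaEvent_t reset_done;
    cudaEventCreateWithFlags(&reset_done, cudaEventDisableTiming);

    cudaMemcpyAsync(B, host.data(), bytes, cudaMemcpyHostToDevice, streams[0]);
    const int chunk = (n + num_streams - 1) / num_streams;
    for (int iter = 0; iter < max_iter; ++iter) {
        // Reset A in streams[0] and record an event right behind the reset.
        cudaMemsetAsync(A, 0, bytes, streams[0]);
        cudaEventRecord(reset_done, streams[0]);
        for (int s = 0; s < num_streams; ++s) {
            // Other streams must not touch A (or read B) before the reset is done.
            if (s > 0) cudaStreamWaitEvent(streams[s], reset_done, 0);
            const int offset = s * chunk;
            const int count = std::min(chunk, n - offset);
            if (count <= 0) continue;
            partial_update<<<(count + 255) / 256, 256, 0, streams[s]>>>(A, B, offset, count);
        }
        // Every slice of A must be finished before A is copied into B.
        for (auto& s : streams) cudaStreamSynchronize(s);
        // The copy goes into streams[0], so the next reset is ordered behind it.
        cudaMemcpyAsync(B, A, bytes, cudaMemcpyDeviceToDevice, streams[0]);
    }
    cudaMemcpyAsync(host.data(), B, bytes, cudaMemcpyDeviceToHost, streams[0]);
    cudaStreamSynchronize(streams[0]);

    cudaEventDestroy(reset_done);
    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(A);
    cudaFree(B);
    return host;
}
```

Because the kernel accumulates with +=, a missed or late reset would show up directly as wrong values in the second and later iterations.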

recrusader · January 27, 2016, 5:47pm

I realized that cudaMemset() and cudaMemcpy() need to finish their work before the next step can start.
I put cudaDeviceSynchronize() after cudaMemset() and cudaMemcpy().
It doesn’t work.

little_jimmy · January 28, 2016, 6:29am

if it works correctly the 1st run, but not the 2nd, you may have races elsewhere

run memcheck and racecheck

if that does not help, catch the kernel the 2nd run, or dump its input data the 2nd run, in order to see which value diverges, and where
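
One way to "catch the kernel" is to check the status of every launch. A kernel that fails to launch returns immediately, so unchecked failures can show up as both wrong results and unexpectedly short run times. The sketch below is illustrative only: noop_kernel and checked_launch are hypothetical names standing in for the real per-stream kernels and their launch sites.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stand-in for one of the per-stream kernels.
__global__ void noop_kernel() {}

// Launches the kernel and returns the first error observed, printing it if any.
cudaError_t checked_launch(dim3 grid, dim3 block, cudaStream_t stream) {
    noop_kernel<<<grid, block, 0, stream>>>();
    // A bad launch configuration (for example too many threads per block)
    // is reported here, and the kernel never runs.
    cudaError_t err = cudaGetLastError();
    // Errors raised while the kernel executes surface at the next synchronization.
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess)
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

Calling this once per kernel and per iteration narrows down which launch, if any, misbehaves in the second pass.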

recrusader · January 28, 2016, 3:02pm

Thank you very much, little_jimmy.

I found the problem. It does not come from stream synchronization.
As I said, I have different kernels assigned to different streams.
In total, I have 11 kernels. For the first 10 kernels, blockDim.z=2; for the 11th kernel, blockDim.z=1. The other parameters are the same.

One more interesting thing I didn’t mention is the running time. One iteration takes 5s, so 5 iterations should take about 25s in total. If I don’t run the 11th kernel, the running time is very close to 25s. However, if I run all 11 kernels, the running time is almost half of that, about 12s.

When I change blockDim.z to 2 for the 11th kernel, everything is OK. However, if I change blockDim.z back to 1, the problem comes back. I don’t know the reason.

Now I have increased the number of kernels and set blockDim.z=1 for each kernel. The results are correct.

Any suggestions for this weird problem?

little_jimmy · January 28, 2016, 5:02pm

and what does the 11th kernel do?
what does blockdim.z do such that you can change it like that?

recrusader · January 28, 2016, 5:27pm

All 11 kernels are the same. The reason I need to split blockDim.z is that the total block dimension is more than 1024. I have to split blockDim.z to reduce the size of the block.
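
When one z extent is split across several launches, a kernel that derives its z coordinate only from threadIdx.z and blockDim.z will touch different elements whenever blockDim.z changes for one launch. The sketch below shows one way to avoid that, assuming a hypothetical kernel mark_z that receives an explicit z offset; the names, the single-block launch and the 16 x 16 x 21 volume in the usage note are made up for illustration and are not taken from the poster's code. It also assumes nx * ny * z_per_launch stays within the 1024 threads-per-block limit.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Hypothetical kernel: writes each thread's global z index into out.
// z_offset makes the result independent of how the z extent was split.
__global__ void mark_z(int* out, int nx, int ny, int nz, int z_offset) {
    int x = threadIdx.x;
    int y = threadIdx.y;
    int z = z_offset + threadIdx.z;
    if (x < nx && y < ny && z < nz) out[(z * ny + y) * nx + x] = z;
}

// Covers an nx * ny * nz volume with launches of at most z_per_launch slices each.
std::vector<int> fill_volume(int nx, int ny, int nz, int z_per_launch) {
    const int total = nx * ny * nz;
    int* d_out = nullptr;
    cudaMalloc(&d_out, total * sizeof(int));
    cudaMemset(d_out, 0xFF, total * sizeof(int));  // every int becomes -1 (untouched)
    for (int z0 = 0; z0 < nz; z0 += z_per_launch) {
        const int dz = std::min(z_per_launch, nz - z0);  // the last launch may be thinner
        mark_z<<<1, dim3(nx, ny, dz)>>>(d_out, nx, ny, nz, z0);
    }
    std::vector<int> host(total);
    cudaMemcpy(host.data(), d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return host;
}
```

For example, fill_volume(16, 16, 21, 2) issues ten launches with blockDim.z=2 and an eleventh with blockDim.z=1, and every element still receives its own z index.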

little_jimmy · January 29, 2016, 6:27am

do the kernels have functions? what do the kernels do, other than compute?
do you have conditionality in the kernels as functions, or not?

recrusader · January 29, 2016, 2:50pm

I didn’t call any function. Anyway, if I have time, I will take a further look at it and keep you posted.
Thanks a lot.

little_jimmy · January 29, 2016, 3:17pm

functionality - i was referring to functionality
