I am new to CUDA and have a question. When we call a kernel function in a loop (for a fixed number of iterations), is it guaranteed that all the threads launched by the previous call complete their work before the kernel is called again in the next iteration?
In my case the data flow is: Global Memory (I) → Shared Memory (II) → Global Memory (III)
The threads perform some manipulation on the data in shared memory and store it back to global memory. The data now in global memory is then used by the threads launched in the next kernel call. I have used __syncthreads() as the last statement in the kernel function (which I thought would ensure completion of work by all the threads), but I'm getting wrong output after 18 iterations. Is this due to improper synchronization of the threads, or something else?
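To make the pattern concrete, here is a minimal sketch of what I mean (the kernel name, block size, and the doubling operation are placeholders, not my actual code):

```cuda
// Hypothetical sketch of the pattern described above;
// names and the manipulation step are placeholders.
__global__ void processStep(float *data, int n)
{
    __shared__ float tile[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // (I) -> (II): load from global memory into shared memory
    if (idx < n)
        tile[threadIdx.x] = data[idx];
    __syncthreads();

    // manipulate the data in shared memory (placeholder operation)
    if (idx < n)
        tile[threadIdx.x] *= 2.0f;
    __syncthreads();  // the barrier I currently have at the end

    // (II) -> (III): store the result back to global memory
    if (idx < n)
        data[idx] = tile[threadIdx.x];
}

// Host-side loop: each iteration is supposed to consume
// the previous iteration's output in d_data.
// for (int it = 0; it < NUM_ITERATIONS; ++it)
//     processStep<<<numBlocks, 256>>>(d_data, n);
```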