I am new to CUDA programming. I am working on a project (optimization) where I launch a kernel with 64 blocks of 32 threads. The application is fully parallelized on the GPU. For debugging, when I run with 1 block of 32 threads the output is correct, but with 64 blocks the output fluctuates. I think I am not able to synchronize the blocks. The application is of a divergent type, so making another kernel would not be a good idea. Am I correct?
I also tried calling cudaDeviceSynchronize() after the kernel launch in the host code, but that did not help either.
Any idea how to synchronize the blocks?
How to synchronize the blocks depends very much on what the blocks (or each block) output — something you did not mention.
What does a block do, and what is “the output”?
What do you mean by “synchronizing blocks”? You want block 1 to be executed only after block 0 has finished? That is not possible within a single kernel launch: each block executes independently of the other blocks. (For your case, you could call the kernel multiple times, but launching 64 kernels with 1 block of 32 threads each is definitely not a good idea.) You should try to rewrite your code so that blocks do not depend on other blocks…
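To illustrate the “call the kernel multiple times” idea: a kernel launch boundary acts as an implicit grid-wide barrier within a stream, so if phase 2 needs results that *all* blocks produced in phase 1, you can split the work into two kernels. A minimal sketch with hypothetical kernels (`phase1`/`phase2` and the doubling/increment operations are placeholders, not your actual code):

```
#include <cuda_runtime.h>

// Phase 1: each block works on its own slice, independently of the others.
__global__ void phase1(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

// Phase 2: launched only after phase1 has completely finished, so it may
// safely read any element that any block of phase1 wrote.
__global__ void phase2(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] + 1.0f;
}

void run(float *d_data, int n) {
    // Two launches in the same (default) stream: phase2 cannot start
    // before every block of phase1 has finished.
    phase1<<<64, 32>>>(d_data, n);
    phase2<<<64, 32>>>(d_data, n);
}
```

(On newer hardware, cooperative groups offer a grid-wide `sync()` inside a single kernel launched with `cudaLaunchCooperativeKernel`, but splitting into separate launches is the simpler and more portable pattern.)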
I don’t understand what you are trying to say.
What do you expect cudaDeviceSynchronize() to do? It only blocks execution on the host until all previously issued device operations have finished — it does not synchronize blocks within a kernel.
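That said, checking the return value of cudaDeviceSynchronize() is still useful for debugging, because it surfaces errors from the preceding kernel launch. A minimal host-side sketch (`myKernel`, `d_data`, and `n` are hypothetical placeholders for your own kernel and arguments):

```
#include <cstdio>
#include <cuda_runtime.h>

// ... after the kernel launch:
// myKernel<<<64, 32>>>(d_data, n);

void checkAfterLaunch() {
    // Host blocks here until the device is idle; the return code reports
    // any asynchronous error from the kernel that just ran.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));
}
```

Note that this only tells you *whether* the kernel failed; it cannot fix a race between blocks — that has to be removed in the kernel design itself.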