I have a large algorithm that I have broken down into 10-12 smaller kernels. The algorithm loops many times (300-500 iterations), and each iteration calls all the kernels one after another.
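For context, here is a rough sketch of the structure (kernel names, launch configurations, and the buffer `d_data` are placeholders, not my real code):

```cuda
// Hypothetical outline of the main loop; kernelA..kernelC stand in
// for the 10-12 real kernels, all operating on shared device buffers.
void runAlgorithm(float *d_data, int numIterations)
{
    dim3 grid(64), block(256);  // placeholder launch configuration

    for (int iter = 0; iter < numIterations; ++iter) {  // 300-500 times
        kernelA<<<grid, block>>>(d_data);
        kernelB<<<grid, block>>>(d_data);  // consumes kernelA's output
        // ... remaining kernels, each reading results of earlier ones ...
        kernelC<<<grid, block>>>(d_data);
    }
}
```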
The problem is that some of the kernels produce bad results (I'm getting #QNAN values), but not predictably. I believe I have traced it to a synchronization issue, since when I run in debug mode and call CUT_CHECK_ERROR after each kernel invocation, the problem goes away (no errors are reported, but the CUT_CHECK_ERROR() macro calls cudaThreadSynchronize()).
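If it helps, my understanding is that in a debug build the macro expands to something roughly like this (my paraphrase of the cutil header, not copied verbatim):

```cuda
#include <cstdio>

// Rough equivalent of what CUT_CHECK_ERROR does in debug mode:
// check for a launch error, then synchronize and check again.
#define CHECK_KERNEL(msg)                                                  \
    do {                                                                   \
        cudaError_t err = cudaGetLastError();                              \
        if (err != cudaSuccess)                                            \
            fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));     \
        err = cudaThreadSynchronize();  /* blocks until kernel finishes */ \
        if (err != cudaSuccess)                                            \
            fprintf(stderr, "%s (sync): %s\n", msg,                        \
                    cudaGetErrorString(err));                              \
    } while (0)
```

So in debug mode every kernel launch is followed by a full device synchronization, which is presumably why the behavior changes.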
So my question is: when is it necessary to call cudaThreadSynchronize()? Many of the kernels rely on data computed by previous kernels, so I imagine cudaThreadSynchronize() would be needed. But although I understand that kernel launches from C are asynchronous, I thought that if multiple kernels were launched, they would block if the GPU was busy. Or do you think some of the host-to-device memory operations (mainly cudaMemcpyToSymbol) are occurring when they shouldn't?
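Concretely, the pattern I'm worried about looks something like this (the constant symbol and kernel name are placeholders for illustration):

```cuda
__constant__ float c_params[16];  // placeholder constant-memory symbol

// Each iteration updates constants on the device, then launches a
// kernel that reads them. My worry is whether this copy can land
// while a previously launched kernel is still running and reading
// the old values.
void updateAndLaunch(const float *h_params, float *d_data)
{
    dim3 grid(64), block(256);  // placeholder launch configuration

    cudaMemcpyToSymbol(c_params, h_params, 16 * sizeof(float));
    kernelNext<<<grid, block>>>(d_data);  // reads c_params
}
```

Is an explicit cudaThreadSynchronize() needed before the cudaMemcpyToSymbol() here, or does the runtime already order the copy after earlier kernel launches?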
Is there a large overhead for calling cudaThreadSynchronize()? I'm just wondering how it might affect overall performance.
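In case it matters, I was planning to estimate the cost with simple host-side wall-clock timing around one synchronization point, along these lines (a minimal sketch, not production timing code):

```cuda
#include <cstdio>
#include <ctime>

// Rough host-side measurement of one synchronization point: everything
// queued before this call must finish before the second timestamp.
void timeOneSync(void)
{
    clock_t t0 = clock();
    cudaThreadSynchronize();  // waits for all queued device work
    clock_t t1 = clock();
    printf("sync took %f ms\n",
           1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC);
}
```

Of course this measures both the pending kernel work and the call overhead itself, so I'd time it once with an empty queue as a baseline.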
Thanks for any clarification on this issue!