cudaThreadSynchronize() and multiple kernels: when is it necessary to sync?

pixelhead · June 20, 2008, 6:57pm

Hi -

I have a large algorithm which I have broken down into 10-12 smaller kernels. The algorithm loops many times (300-500 times) and each iteration calls all the kernels one after another.

The problem is that some of the kernels fail (I’m getting #QNAN results), but not predictably. I believe I have traced it to a synchronization issue: when I run in debug mode and call CUT_CHECK_ERROR after each kernel invocation, the problems go away (no errors are reported, but the CUT_CHECK_ERROR() macro calls cudaThreadSynchronize()).
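For reference, a check of this kind can be sketched as below. This is a minimal illustration, not the actual cutil macro: `checkKernel`, `scale`, and `runScaleChecked` are hypothetical names, and `cudaDeviceSynchronize()` is used because it later replaced `cudaThreadSynchronize()` in the runtime API (swap it back when building against a toolkit as old as the one in this thread). The point is that the check itself waits for the GPU, so a build with the check and a build without it do not have the same timing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not the cutil macro): reports launch errors first,
// then waits for the GPU so that errors raised during execution surface too.
static bool checkKernel(const char* label)
{
    cudaError_t err = cudaGetLastError();      // launch/configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();         // errors raised while the kernel ran
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical kernel, present only to give the helper something to check.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launches the kernel once and returns true if no error was reported.
bool runScaleChecked(int n)
{
    float* d = nullptr;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    bool ok = checkKernel("scale");
    cudaFree(d);
    return ok;
}
```

Calling `runScaleChecked(1024)` from `main` exercises the helper once; in a real code base the check would typically be compiled in only for debug builds.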

So my question is: when is it necessary to call cudaThreadSynchronize()? Many of the kernels rely on data computed by previous kernels, so I imagine cudaThreadSynchronize() would be needed. But although I understand that launching kernels from C is asynchronous, I thought that if multiple kernels are launched, each one would block while the GPU was busy. Or do you think some of the host-to-device memory operations (mainly cudaMemcpyToSymbol) are occurring when they shouldn’t?
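The situation being asked about can be sketched as follows. To my understanding of the runtime, kernels launched into the same stream (here the default stream) execute in the order they were issued, and a blocking `cudaMemcpy` does not start until the kernels queued before it have finished. The kernel names `fill` and `addOne` and the wrapper `runDependentKernels` are made up for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical producer kernel: writes `value` to every element.
__global__ void fill(float* data, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Hypothetical consumer kernel: depends on the output of fill().
__global__ void addOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Runs fill() then addOne() with no explicit sync in between and
// returns the first element, or -1.0f if the allocation failed.
float runDependentKernels(int n, float value)
{
    float* d = nullptr;
    float first = -1.0f;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) return first;

    int blocks = (n + 255) / 256;
    fill<<<blocks, 256>>>(d, n, value);  // returns to the host immediately
    addOne<<<blocks, 256>>>(d, n);       // queued behind fill() in the same stream

    // Blocking copy: waits for both kernels before reading the result.
    cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return first;
}
```

If that ordering holds, an explicit sync between the two launches is not what makes the second kernel see the first kernel’s output; it mainly matters when the host needs to know the GPU has finished, for example for error checking or timing.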

Is there a large overhead for calling cudaThreadSynchronize()? Just wondering how it might affect overall performance.

Thanks for any clarification on this issue!

senorbum · June 20, 2008, 7:22pm

I thought cudaMemcpy calls made their own cudaThreadSynchronize() call. Also, I could be wrong, but I believe that a thread sync shouldn’t hurt performance unless you are calling cudaThreadSynchronize() before running a CPU-side function that doesn’t require data from the threads. If future kernels require computed data, then a cudaThreadSynchronize() should be necessary.

Shovelbird · June 20, 2008, 7:23pm

Synchronization is dependent on the code. If you’re depending on a certain order, you will need to synchronize; if the kernels are independent, you do not need to synchronize. As for the performance cost, you really can’t think of it that way: if a kernel requires an order of operations, then you need to ensure it (unless you can sacrifice stability/accuracy for performance). The cost depends on the latency in your kernels and how full the GPU is, so the only real way to know is to test.
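One way to run that test is to time the same sequence of launches with and without a sync after each one, using CUDA events. The sketch below is only an illustration: `touch` and `timeLaunches` are hypothetical names, the kernel is a stand-in for real work, and `cudaDeviceSynchronize()` stands in for `cudaThreadSynchronize()` on newer toolkits.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for a real kernel.
__global__ void touch(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Returns the elapsed time in milliseconds for `iters` launches,
// or -1.0f if the allocation failed. When syncEach is true the host
// waits for the GPU after every launch.
float timeLaunches(int n, int iters, bool syncEach)
{
    float* d = nullptr;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int k = 0; k < iters; ++k) {
        touch<<<(n + 255) / 256, 256>>>(d, n);
        if (syncEach) cudaDeviceSynchronize();  // host waits here
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                 // wait until `stop` is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}
```

Comparing `timeLaunches(n, iters, true)` against `timeLaunches(n, iters, false)` for kernel sizes close to the real workload gives a rough idea of what the extra syncs cost on a given card.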