I’m trying to benchmark operations in cufft and cublas. I have found that I have to call cudaThreadSynchronize() after cublasSgemm, to wait for that function call to complete… ok, this makes sense.
However, I have seen an FFT example (convolutionFFT2D) that does a forward FFT, then a multiply, then an inverse FFT, without any cudaThreadSynchronize() calls. This suggests to me that the synchronization happens inside the CUDA library calls and is not required afterwards.
The documentation doesn’t seem to address this clearly.
CUDA operations are guaranteed to run in order, with the results of the previous operation finished before the next operation starts, unless you assign them to different streams. Kernel launches, for example, are asynchronous and return to the CPU thread before completion. However, if you launch many kernels in a row, they are queued up and run sequentially. If you follow that with a cudaMemcpy(), the memory copy waits for all previous kernel launches to finish, then runs.
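To make the ordering concrete, here is a minimal sketch that launches two kernels back to back in the default stream and then copies the result to the host, with no explicit synchronization call. The kernels add_one and times_two and the helper run_pipeline are hypothetical names made up for this illustration; they are not from the convolutionFFT2D sample or from any library.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels, for illustration only.
__global__ void add_one(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1.0f;
}

__global__ void times_two(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

// Computes (x + 1) * 2 for every element on the GPU and returns element 0.
float run_pipeline(int n, float x) {
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = x;

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (n + block - 1) / block;

    // Both launches return to the CPU immediately, but they are queued in
    // the same (default) stream, so times_two starts only after add_one
    // has finished.
    add_one<<<grid, block>>>(d, n);
    times_two<<<grid, block>>>(d, n);

    // No explicit synchronize here: this blocking copy waits for the
    // preceding kernels in the stream before it reads the data.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    float result = h[0];
    cudaFree(d);
    delete[] h;
    return result;
}

int main() {
    // (3 + 1) * 2 = 8 if the kernels ran in launch order.
    printf("%f\n", run_pipeline(1024, 3.0f));
    return 0;
}
```

Error checking is left out to keep the sketch short; real code should check the return value of every runtime call.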
For the most part, cudaThreadSynchronize() is only useful for benchmarking, to ensure you are timing the full kernel execution. It is almost never needed for program correctness.
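A rough sketch of the benchmarking case follows. It times the same launch twice, once stopping the clock right after the launch and once after a synchronize. The kernel busy_kernel and the helper time_kernel_ms are hypothetical, and the sketch calls cudaDeviceSynchronize(), the newer name for the now-deprecated cudaThreadSynchronize(); substitute the older call if your toolkit predates it.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that does enough arithmetic to take measurable time.
__global__ void busy_kernel(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = v[i];
        for (int k = 0; k < 1000; ++k) x = x * 1.0001f + 0.5f;
        v[i] = x;
    }
}

// Returns wall-clock milliseconds around one launch. With
// sync_before_stop == false only the launch overhead is measured.
double time_kernel_ms(int n, bool sync_before_stop) {
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    int block = 256;
    int grid = (n + block - 1) / block;

    // Warm-up launch so one-time initialization cost is not timed.
    busy_kernel<<<grid, block>>>(d, n);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    busy_kernel<<<grid, block>>>(d, n);
    if (sync_before_stop) cudaDeviceSynchronize();  // wait for completion
    auto t1 = std::chrono::steady_clock::now();

    cudaDeviceSynchronize();  // make sure the kernel is done before freeing
    cudaFree(d);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    printf("launch only:   %f ms\n", time_kernel_ms(1 << 22, false));
    printf("launch + sync: %f ms\n", time_kernel_ms(1 << 22, true));
    return 0;
}
```

The first number is typically much smaller than the second, because without the synchronize the timer stops while the kernel is still running. CUDA events (cudaEventRecord and cudaEventElapsedTime) are another common way to time GPU work.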
I have a somewhat similar issue with consecutive kernel launches: I expect the threads within each kernel to run in parallel, but the kernels themselves to run in sequence (in the order they were launched). Right?
I actually have a sequence of kernel launches like: (each kernel operates on some vector in GPU memory)
kernel_a<<<dimGrid, dimBlock>>>(…);
kernel_b<<<dimGrid, dimBlock>>>(…);
kernel_c<<<dimGrid, dimBlock>>>(…);
but the resulting vector in GPU memory (copied back using cublasGetVector and printed on the screen) is not what I expect it to be (the result is usually garbage). But if I intersperse print calls (which involve cublasGetVector) between the kernel launches, I mysteriously get the correct result that I am expecting! Every single time!!
There are two things about this behavior that I cannot understand:
1. Why aren’t the kernel launches sequential? (OR: Where could I be messing up to get garbage output? I don’t get garbage output with the print functions in between the kernel calls!)
2. How do memory copy functions affect the kernel launches? (Specifically, how do they affect my program and yield the correct result? Without the memory copies I get wrong results…)
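One way to narrow this down is to check for errors after every launch, since a kernel that fails to launch or crashes gives no visible sign unless you ask. This is only a debugging sketch and one possible line of investigation; nothing in the post confirms that a failed launch is the cause. The kernel scale and the helper check_kernel are hypothetical names, while cudaGetLastError(), cudaDeviceSynchronize() and cudaGetErrorString() are standard CUDA runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for kernel_a, kernel_b, kernel_c.
__global__ void scale(float *v, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

// Returns true if the preceding launch was accepted and ran to completion.
// The synchronize is here for debugging only, not for correctness.
bool check_kernel(const char *label) {
    cudaError_t err = cudaGetLastError();        // launch/configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();           // errors during execution
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    const int n = 1024;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Valid configuration: 256 threads per block.
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    bool ok_valid = check_kernel("scale with 256 threads per block");

    // Deliberately invalid: 4096 threads per block exceeds the hardware
    // limit, so the launch is rejected and the kernel never runs.
    scale<<<1, 4096>>>(d, n, 2.0f);
    bool ok_invalid = check_kernel("scale with 4096 threads per block");

    printf("valid launch ok: %d, invalid launch ok: %d\n", ok_valid, ok_invalid);
    cudaFree(d);
    return 0;
}
```

Placing a call like check_kernel after each of kernel_a, kernel_b and kernel_c would show whether all three actually ran before the final cublasGetVector.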