What is the difference between cudaThreadSynchronize and cudaDeviceSynchronize? It seems like a lot of example programs use cudaThreadSynchronize. But recent NVIDIA documentation says
So, I used cudaDeviceSynchronize in my program. After seeing that it crashes, I switched to cudaThreadSynchronize and now it doesn’t crash.
Anyway, with no explicit synchronization line, I get occasional bad results. With cudaThreadSynchronize, I get reliable results. With cudaDeviceSynchronize, it crashes later (eventually I get an error from GpuToHost()).
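To make the question concrete, here is a minimal sketch of the launch / sync / copy-back pattern I am talking about, with the return value of every call checked. It is not my actual program: the kernel `scale`, the helper `check`, and the function `run_scale` are hypothetical names made up for this illustration. Since kernel launches are asynchronous, an error raised inside a kernel can be reported by a later call, so checking each step separately may help narrow down where a failure first shows up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (illustration only): doubles each element in place.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: prints the error and returns false if a call failed.
static bool check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Runs the kernel on n ones and verifies the result.
// Returns 0 on success, 1 if any CUDA call fails or the data is wrong.
int run_scale(int n)
{
    size_t bytes = n * sizeof(float);
    float *host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev = NULL;
    bool ok = check(cudaMalloc((void **)&dev, bytes), "cudaMalloc")
           && check(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice),
                    "cudaMemcpy (host to device)");

    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(dev, n);

        // Launch-time errors (bad configuration, etc.) show up here.
        ok = check(cudaGetLastError(), "kernel launch")
          // Errors raised while the kernel runs show up at the sync.
          && check(cudaDeviceSynchronize(), "cudaDeviceSynchronize")
          && check(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost),
                   "cudaMemcpy (device to host)");
    }

    // Every element started at 1.0f, so each should now be exactly 2.0f.
    for (int i = 0; ok && i < n; ++i) {
        if (host[i] != 2.0f) ok = false;
    }

    if (dev != NULL) cudaFree(dev);
    delete[] host;
    return ok ? 0 : 1;
}
```

Calling `run_scale(1024)` from `main` and looking at which message gets printed (if any) shows which step first reported the error. Swapping `cudaDeviceSynchronize()` for `cudaThreadSynchronize()` in this sketch is the only change between the two variants I am comparing.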
This is all with toolkit 4.0 and a recent driver installed, using a GTX 560 Ti. On other machines, with older cards, toolkits, and drivers, I get along fine without any explicit sync.
I guess this was a tougher question than I thought. Setting aside the issue of why my program crashed, does anybody know what the intended difference between these two CUDA calls is? The documentation is not clear on this point.
Also, I have seen pages that say that cudaMemcpy from device to host is always synchronous. Yet, I think I’ve seen other pages that say it may be asynchronous below a certain transfer size. Which one is true?