Interesting question, as I’ve just come across an issue using “cudaThreadSynchronize()” after converting some of my memory transfers to async memcpys. As far as I understood from the Programming Guide, cudaThreadSynchronize() makes sure “all streams” are finished before proceeding further. In my case, execution completed without any errors, but the final output image was missing some data. Switching to “cudaStreamSynchronize(0)” appeared to fix the issue. I’m only using the default stream 0 for everything.
[Edit/Update]: My mistake. cudaThreadSynchronize() is working as expected for me. I had inadvertently moved some code around during my recent changes and hadn’t fully tested the result. To answer sawan83’s question, cudaThreadSynchronize() should be fine, as it blocks the host until all outstanding work in all streams is complete. Alternatively, you can use cudaStreamSynchronize(stream) if you only need to wait on a specific stream. The choice is yours.
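For anyone who wants to see the two options side by side, here is a minimal sketch of an async copy / kernel / async copy sequence on the default stream. This is not my actual code: the kernel `scale`, the helper `run_demo`, and the buffer size are made up for illustration. I've also used cudaDeviceSynchronize() in the commented-out line, since as far as I know it replaces the now-deprecated cudaThreadSynchronize() in newer toolkits. The point is that the host must not read the output buffer until one of the synchronize calls has returned.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element in place.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: returns 0 on success, 1 on a CUDA error or wrong result.
int run_demo()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = NULL, *d = NULL;

    // Pinned host memory, so the async copies can actually run asynchronously.
    if (cudaMallocHost((void **)&h, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) { cudaFreeHost(h); return 1; }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t stream = 0;  // default stream, as in my case

    // All three operations are queued on the same stream, so they run in order.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

    // Option 1: wait only for the work queued on this stream.
    cudaError_t err = cudaStreamSynchronize(stream);
    // Option 2: wait for all outstanding work on the device instead.
    // err = cudaDeviceSynchronize();

    // Only now is it safe to read h on the host.
    int ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0f);

    cudaFree(d);
    cudaFreeHost(h);
    return ok ? 0 : 1;
}

int main()
{
    int rc = run_demo();
    printf("%s\n", rc == 0 ? "result verified" : "error or mismatch");
    return rc;
}
```

If the host loop that checks `h` is moved above the synchronize call, you can end up reading the buffer before the device-to-host copy has finished, which produces the kind of "missing data" symptom I described above.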