Performance tests and cudaThreadSynchronize

I’m running some performance tests on my CUDA app.
There’s a frustrating lack of information about what the
cudaThreadSynchronize function actually does:
In the guide I found something like
“Blocks until the device has completed all preceding requested tasks.”

But I would like to know exactly HOW the blocking is implemented.
I have a few ideas, but I don’t know which of them is the most plausible:

  • the kernel itself unblocks the synchronization function when it finishes
  • the synchronization function polls some condition flag (software or hardware)
  • the synchronization function sleeps in a “ready job” queue inside the GPU scheduler’s data structures

Thanks

I think it is a pretty simple polling loop/spinlock against a status flag on the GPU. A few releases ago the CUDA API exposed the ability to choose the synchronization behavior: cudaSetDeviceFlags() accepts cudaDeviceScheduleSpin, cudaDeviceScheduleYield, or cudaDeviceScheduleBlockingSync, which select whether the host spins, yields its time slice while spinning, or blocks on a synchronization primitive until the device finishes.