cudaDeviceSynchronize - blocks only GPU for the host (CPU) thread in which it is called, or does it

I have a question regarding the function cudaDeviceSynchronize.
We currently use the (now deprecated) function ‘cudaThreadSynchronize’, which blocks the GPU only for the host (CPU) thread in which it is called.
Does ‘cudaDeviceSynchronize’ (which NVIDIA proposes as the replacement for cudaThreadSynchronize) have exactly the same behaviour, i.e. does it also block the GPU only for the host thread in which it is called? Or does it block the GPU for all host threads (which would be bad)?

I don’t think either of them cares about the host thread. The cudaThreadSynchronize manual mentions that it doesn’t do what its name suggests. Both of them simply block the caller until the device has finished all outstanding work, regardless of which host thread submitted that work.

What shaklee3 wrote is correct:
I tested this some time ago with a multithreaded CUDA app: two identical, long kernels, each with enough resources to “fill” the whole device, launched simultaneously from two independent CPU threads. With cudaThreadSynchronize, control was returned only after BOTH kernels had finished, so it synchronizes the whole device regardless of CPU threads. The solution is to assign each kernel to a different stream and use cudaStreamSynchronize. That synchronizes only the kernels in the given stream, which is probably what you want.
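For illustration, here is a minimal sketch of the per-stream approach described above. It is not the original test program: the kernel `scale`, the buffer names and sizes, and the `CHECK` macro are all made up for this example, and for brevity both streams are driven from a single host thread (in a multithreaded app each host thread would create and wait on its own stream).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: scales an array in place.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Minimal error check (assumed helper): prints the error and returns from main.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            return 1;                                                      \
        }                                                                  \
    } while (0)

int main()
{
    const int n = 1 << 20;
    float *a = nullptr, *b = nullptr;
    CHECK(cudaMalloc(&a, n * sizeof(float)));
    CHECK(cudaMalloc(&b, n * sizeof(float)));
    CHECK(cudaMemset(a, 0, n * sizeof(float)));
    CHECK(cudaMemset(b, 0, n * sizeof(float)));

    // One stream per independent piece of work.
    cudaStream_t s1, s2;
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaStreamCreate(&s2));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads, 0, s1>>>(a, n, 2.0f);  // queued in stream s1
    scale<<<blocks, threads, 0, s2>>>(b, n, 3.0f);  // queued in stream s2
    CHECK(cudaGetLastError());

    // Waits only for the work queued in s1; work in s2 may still be running.
    CHECK(cudaStreamSynchronize(s1));

    // Waits only for the work queued in s2.
    // cudaDeviceSynchronize() here would instead wait for ALL streams.
    CHECK(cudaStreamSynchronize(s2));

    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaStreamDestroy(s2));
    CHECK(cudaFree(a));
    CHECK(cudaFree(b));
    return 0;
}
```

Whether the two kernels actually overlap depends on the device and on how many resources each kernel needs; the stream only guarantees that each wait is limited to its own queue.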

Note: beware of global variables/structures/textures, which are allocated per device and thus NOT “thread-safe” with toolkit 4.0 and later. See discussion below:

Thanks for the answers. I am aware of the issue with global variables and texture objects, see my posting at

In fact, I still don’t have a solution for this on Fermi-architecture GPUs (on Kepler, one can use texture objects).