About __threadfence...

While browsing the net I found this comment…

[url="http://stackoverflow.com/questions/1786152...a-device-kernel"]cryptography - Generate all combinations of a char array inside of a CUDA __device__ kernel - Stack Overflow[/url]

In the first comment, 6th bullet, this person mentions that __threadfence only acts as a fence for the currently executing blocks, and hence does not enforce the "flush memory" behavior across all blocks in the grid, only across the currently active ones. This struck me as extremely odd, since I had never seen this mentioned anywhere else, not to mention that it would break a lot of examples that use __threadfence for inter-block communication.

Can someone with more knowledge about the function comment on this please?
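For concreteness, here's the kind of example I mean - a sketch of the "last block done" idiom in the spirit of the threadFenceReduction sample from the CUDA SDK. The names (blocksDone, sumPerBlock) and the naive per-block loop are my own simplifications, not code from the sample:

[code]
// Counter of blocks that have published their partial result.
__device__ unsigned int blocksDone = 0;

__global__ void sumPerBlock(const float *in, float *partial, float *total, int n)
{
    __shared__ bool isLastBlock;

    if (threadIdx.x == 0) {
        // Naive per-block partial sum, just to have something to publish.
        float s = 0.0f;
        int begin = blockIdx.x * blockDim.x;
        int end   = min(begin + (int)blockDim.x, n);
        for (int i = begin; i < end; ++i)
            s += in[i];
        partial[blockIdx.x] = s;

        // Make this block's write to partial[] visible device-wide
        // before announcing completion.
        __threadfence();

        // The block that sees the counter at gridDim.x - 1 is the last one.
        unsigned int done = atomicInc(&blocksDone, gridDim.x);
        isLastBlock = (done == gridDim.x - 1);
    }
    __syncthreads();

    // Only the last block combines the partials; it reads values written
    // by blocks that may have long since retired.
    if (isLastBlock && threadIdx.x == 0) {
        float s = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            s += partial[b];
        *total = s;
        blocksDone = 0;  // reset for the next launch
    }
}
[/code]

If the claim in that comment were true, a pattern like this would be broken whenever the grid has more blocks than can be resident at once.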

To my understanding there's no way for the fence to work across the whole grid - there's only a limited number of multiprocessors, and each can keep only a fixed number of warps resident. Even if a multiprocessor juggled all of its resident warps so that each reached the fence instruction, there could still be unscheduled blocks waiting (for big grids), in which case the multiprocessor would have to somehow hand its current warps back to the global scheduler (while somehow retaining their instruction counters, shared memory state and such) and receive different ones to run them up to the fence instruction…

If __threadfence were a global, grid-wide synchronization instruction, whole-kernel synchronization would be trivial - and it's not, partially because of this behavior.

So yes, AFAIK the fence flushes the memory writes of the currently running warp - after the fence you can be sure that the writes the warp issued before it have reached global memory and are visible to the rest of the device.
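To illustrate the distinction, here's a minimal sketch of what the fence does and does not buy you. All the names (flag, data, producerConsumer) are made up for illustration:

[code]
// The fence orders the two writes, so a reader that observes flag == 1
// will also observe data == 42. Volatile keeps the reads from being
// cached; this is a sketch, not a production pattern.
__device__ volatile int flag = 0;
__device__ volatile int data = 0;

__global__ void producerConsumer(int *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        data = 42;
        // Writes before the fence are observed by all threads on the
        // device before writes after it - that ordering is ALL it does.
        __threadfence();
        flag = 1;
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        // CAUTION: this spin assumes block 1 is resident at the same time
        // as block 0. __threadfence() does not schedule or synchronize
        // blocks, so on a grid bigger than the device can keep resident a
        // waiting block can spin forever - which is exactly why the fence
        // can't serve as a grid-wide barrier.
        while (flag == 0) { }
        *out = data;  // sees 42, because the fence ordered the writes
    }
}
[/code]

Note that the "last block done" idiom above never makes one block wait for another - it only orders writes before an atomic - which is why it works regardless of scheduling, while the spin-wait here does not.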