I’m currently running a loop within a kernel that sets a shared bool value whether to repeat again. I’m getting some unstable behavior in that the bool is randomly not changing when it should be and the loop is failing to repeat.
Whats the best way to lock a shared variable so only one thread can access it at once?
Atomic functions (see Appendix B.11 of the Programming Guide).
However, your example should work without atomic memory access: clear the flag in shared memory, __syncthreads(), set flag in any thread that wants another iteration, __syncthreads(), test flag and iterate accordingly.
Atomic functions (see Appendix B.11 of the Programming Guide).
However, your example should work without atomic memory access: clear the flag in shared memory, __syncthreads(), set flag in any thread that wants another iteration, __syncthreads(), test flag and iterate accordingly.
You need a __syncthreads() after your loop, before you test the flag again. You also need to syncthreads() at the START of the loop so that you don’t change the flag in the loop while other threads are still at the end of the previous iteration and haven’t tested the flag yet.
__shared__ bool flag;
if (0==threadIdx.x) flag=false;
__syncthreads();
while (!flag) {
__syncthreads();
/* do work here, maybe setting flag */
__syncthreads();
}
You need a __syncthreads() after your loop, before you test the flag again. You also need to syncthreads() at the START of the loop so that you don’t change the flag in the loop while other threads are still at the end of the previous iteration and haven’t tested the flag yet.
__shared__ bool flag;
if (0==threadIdx.x) flag=false;
__syncthreads();
while (!flag) {
__syncthreads();
/* do work here, maybe setting flag */
__syncthreads();
}
After the __syncthreads(), all the warps are released to run. Suppose warp 0 goes very quickly. It tests the while(), and goes back to the top of the loop, doing its condition test, then thread 0 resets the “old” flag to 0. But slowpoke warp 1 may still be released from the __syncthreads(), but still unadvanced. It now tests the while() condition, which is now FALSE because warp 0 reset it… and warp 1 now exits the loop inappropriately.
Such a race would be rare because warp 0 needs to get pretty far ahead, but it’s still a race. It would likely (properly) hang the 2.3/3.0 emulator, for example.
Edit: You’re right that a array of 4 flags would solve this… I think even just 3 would.
After the __syncthreads(), all the warps are released to run. Suppose warp 0 goes very quickly. It tests the while(), and goes back to the top of the loop, doing its condition test, then thread 0 resets the “old” flag to 0. But slowpoke warp 1 may still be released from the __syncthreads(), but still unadvanced. It now tests the while() condition, which is now FALSE because warp 0 reset it… and warp 1 now exits the loop inappropriately.
Such a race would be rare because warp 0 needs to get pretty far ahead, but it’s still a race. It would likely (properly) hang the 2.3/3.0 emulator, for example.
Edit: You’re right that a array of 4 flags would solve this… I think even just 3 would.