Controlling access to shared variables

I’m currently running a loop within a kernel that sets a shared bool value whether to repeat again. I’m getting some unstable behavior in that the bool is randomly not changing when it should be and the loop is failing to repeat.

Whats the best way to lock a shared variable so only one thread can access it at once?

Thanks in adv

Atomic functions (see Appendix B.11 of the Programming Guide).
However, your example should work without atomic memory access: clear the flag in shared memory, __syncthreads(), set flag in any thread that wants another iteration, __syncthreads(), test flag and iterate accordingly.

Atomic functions (see Appendix B.11 of the Programming Guide).
However, your example should work without atomic memory access: clear the flag in shared memory, __syncthreads(), set flag in any thread that wants another iteration, __syncthreads(), test flag and iterate accordingly.

You need a __syncthreads() after your loop, before you test the flag again. You also need to syncthreads() at the START of the loop so that you don’t change the flag in the loop while other threads are still at the end of the previous iteration and haven’t tested the flag yet.

__shared__ bool flag;

if (0==threadIdx.x) flag=false;

__syncthreads();

while (!flag) {

	 __syncthreads();

	 /* do work here, maybe setting flag */

	__syncthreads();

}

You need a __syncthreads() after your loop, before you test the flag again. You also need to syncthreads() at the START of the loop so that you don’t change the flag in the loop while other threads are still at the end of the previous iteration and haven’t tested the flag yet.

__shared__ bool flag;

if (0==threadIdx.x) flag=false;

__syncthreads();

while (!flag) {

	 __syncthreads();

	 /* do work here, maybe setting flag */

	__syncthreads();

}

This problem has an alternative solution by alternating between two variables for odd and even iterations.

This problem has an alternative solution by alternating between two variables for odd and even iterations.

How would that look as pseudocode?

How would that look as pseudocode?

Something like this:

[codebox] shared short flag[2] = {0,0}; // loop control

int i = 0;

do {

i = !i; 

... 

if(condition)

  flag[i] = true;  

if(!tid) // thread ID zero …

  flag[!i] = 0;     //  .. resets the other flag

__syncthreads();

} while(flag[i]);

[/codebox]

EDIT: Although it looks like a char flag[4] is what is really needed to be sure not to overwrite anything…

Something like this:

[codebox] shared short flag[2] = {0,0}; // loop control

int i = 0;

do {

i = !i; 

... 

if(condition)

  flag[i] = true;  

if(!tid) // thread ID zero …

  flag[!i] = 0;     //  .. resets the other flag

__syncthreads();

} while(flag[i]);

[/codebox]

EDIT: Although it looks like a char flag[4] is what is really needed to be sure not to overwrite anything…

There’s a race in that code.

After the __syncthreads(), all the warps are released to run. Suppose warp 0 goes very quickly. It tests the while(), and goes back to the top of the loop, doing its condition test, then thread 0 resets the “old” flag to 0. But slowpoke warp 1 may still be released from the __syncthreads(), but still unadvanced. It now tests the while() condition, which is now FALSE because warp 0 reset it… and warp 1 now exits the loop inappropriately.

Such a race would be rare because warp 0 needs to get pretty far ahead, but it’s still a race. It would likely (properly) hang the 2.3/3.0 emulator, for example.

Edit: You’re right that a array of 4 flags would solve this… I think even just 3 would.

There’s a race in that code.

After the __syncthreads(), all the warps are released to run. Suppose warp 0 goes very quickly. It tests the while(), and goes back to the top of the loop, doing its condition test, then thread 0 resets the “old” flag to 0. But slowpoke warp 1 may still be released from the __syncthreads(), but still unadvanced. It now tests the while() condition, which is now FALSE because warp 0 reset it… and warp 1 now exits the loop inappropriately.

Such a race would be rare because warp 0 needs to get pretty far ahead, but it’s still a race. It would likely (properly) hang the 2.3/3.0 emulator, for example.

Edit: You’re right that a array of 4 flags would solve this… I think even just 3 would.

Yes 3 would be enough, but they come handily in boxes of 4 …

:)

Yes 3 would be enough, but they come handily in boxes of 4 …

:)