Many threads updating a single flag in global memory

Hi All,

This is my first post and I’m completely new to CUDA :)

I’m working on porting a certain application to CUDA in order to leverage its benefits. So far everything has checked out, except that I’m having some trouble understanding the limits of memory access performance.

In my application some threads need to update a common boolean flag in global memory. Initially the flag is set to false, and any updating threads (if there are any) would only ever set it to true. Since many threads might attempt to set this flag to true at the same time (without any atomic operation), should I be concerned about performance, or about correctness?

Sorry I could not figure out an answer for this from the guide or anywhere on the web (yet).

Thanks a bunch :)

I suppose in your context the only thing that matters is whether the flag is true or false, not exactly which thread updated the value. If that’s the case, in theory you wouldn’t even have to use an atomic operation. Just use normal reads and writes and you will get the expected result and performance. If reading/writing the flag is the most frequent global memory operation you have, you could disable the L1 cache to allow the use of 32-byte cache lines and save some bandwidth.
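For what it’s worth, here is a minimal sketch of that pattern; the kernel name, the d_flag pointer and the data[i] > 0.0f test are just placeholders for whatever your application actually checks. (If I remember correctly, on 2.x devices L1 caching of global loads can be disabled with nvcc’s -Xptxas -dlcm=cg option.)

__global__ void raise_flag(const float *data, int n, int *d_flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)   // placeholder condition
        *d_flag = 1;               // plain store; every writer stores the same value
}

// Host side: clear the flag before the launch and read it back afterwards, e.g.
// cudaMemset(d_flag, 0, sizeof(int));
// raise_flag<<<grid, block>>>(d_data, n, d_flag);
// cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);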

Awesome! I previously suspected that multiple threads accessing the same global variable might cause contention, but now I see it’s not the case.

Thanks again!

When threads from the same warp read the same address, the value is actually broadcast to all the threads. Writing is another story, though. I’m not sure whether only one of the warp’s 32 writes somehow survives (and the other 31 are eliminated) or whether all 32 writes are serialized. Perhaps you could do a little microbenchmarking to find that out.
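If it helps, a microbenchmark along those lines can be as simple as timing two store-only kernels against each other, one where every thread hits the same word and one where the stores are fully coalesced (the names and launch configuration below are just placeholders):

__global__ void store_same_word(int *out)
{
    *out = threadIdx.x;                                          // every thread in the grid writes the same word
}

__global__ void store_distinct_words(int *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x;    // one word per thread, coalesced
}

// Host side: time each kernel with CUDA events and compare, e.g.
// cudaEventRecord(start);
// store_same_word<<<1024, 256>>>(d_out);
// cudaEventRecord(stop);
// cudaEventSynchronize(stop);
// cudaEventElapsedTime(&ms, start, stop);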

Yesterday I went through the programming guide more closely and I found:

  • In section 4.1 (SIMT Architecture), it says:

“If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device (see Sections G.3.2, G.3.3, G.4.2, and G.4.3) and which thread performs the final write is undefined.”

  • In section G.4.2 (Compute Capability 2.x, Global Memory), it says:

“If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined.”

I could not find a similar remark about global memory access on 1.x devices; however, I believe it must be much the same (unpredictable).

Still, I believe the nature of my application allows me not to worry about this issue, since it doesn’t matter which thread, or how many threads, perform the write: as long as at least one thread changes the flag, I get the correct result.

I initially thought of staging the flag update, using a shared memory flag as an intermediate step. But soon I realized that this is not possible, because there is no way of selecting one of the threads in a warp to transfer the flag from shared memory to global memory. I hope I have understood that correctly.

That is indeed a worthwhile optimization, and it’s quite easy to have just one thread in a block update the global flag:

__syncthreads();
if ((threadIdx.x == 0) && (threadIdx.y == 0) && (threadIdx.z == 0) && shared_flag)
    global_flag = 1;
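For completeness, a whole kernel using this staging pattern could look roughly like the sketch below (1D blocks assumed; shared_flag, d_global_flag and the data[i] > 0.0f condition are placeholders):

__global__ void flag_kernel(const float *data, int n, int *d_global_flag)
{
    __shared__ int shared_flag;

    if (threadIdx.x == 0)
        shared_flag = 0;                  // one thread initializes the block-local flag
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)          // placeholder condition
        shared_flag = 1;                  // many threads may write, but they all write the same value

    __syncthreads();
    if (threadIdx.x == 0 && shared_flag)  // block leader publishes the result
        *d_global_flag = 1;
}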

Wow! This is great!

So we introduce a data-dependent branch that is followed by only one thread while the rest of the threads wait (kind of suspended) until that particular thread finishes its job. I was sort of obsessed with traditional synchronization methods and never thought of something like this. Is this a general practice in SIMT architectures to achieve turnstile-like thread behaviour? (It feels like it.)

Pretty clever / awesome!

This is a fairly common trick for kernels that reduce a lot of input data into a few outputs. You obviously lose a lot of throughput in a single-thread kernel section, but if the goal is to write a few words after a long calculation, then this is a straightforward way to do it.

Thanks for the clarification!

To go a little off-topic, is it possible to extend the same idea to all the threads? I mean with this approach only one thread gets to proceed while the rest of the threads in the same block have to wait until this particular thread is done. However, threads belonging to other blocks will simply skip this conditional and go on with their duties. What if I want to make sure that nobody (absolutely nobody) else gets to continue until one thread finishes off some small house-keeping task?

Right now my guess is that, for this kind of situation, I have to break up the kernel execution and get some support from the host thread. But that means I have to make multiple kernel launches just because of that tiny little house-keeping task.

Thanks for any opinions / comments!

It’s not that easy to do, as the order in which blocks execute is undefined. So usually you would just launch another kernel for that; kernel launches aren’t that expensive. However, appendix B.5 “Memory Fence Functions” of the 3.2 Programming Guide gives example code showing how to achieve this with a single kernel launch.
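In rough outline, that example works like this: every block writes its partial result, issues __threadfence() so the write is visible to the other blocks, then atomically increments a global counter; the one block that sees the counter reach gridDim.x - 1 knows it is the last to finish and performs the final step. The sketch below only mirrors that structure; the names (count, partial) and the serial per-block sum are placeholders, while the guide’s example does a proper parallel reduction.

__device__ unsigned int count = 0;

__global__ void sum_then_finalize(const float *in, volatile float *partial, float *out, int n)
{
    __shared__ bool is_last_block;

    if (threadIdx.x == 0) {
        // Placeholder partial result: thread 0 sums the block's slice serially.
        float s = 0.0f;
        int begin = blockIdx.x * blockDim.x;
        int end = begin + blockDim.x;
        if (end > n) end = n;
        for (int i = begin; i < end; ++i)
            s += in[i];
        partial[blockIdx.x] = s;

        __threadfence();                                     // make partial[] visible to all other blocks

        unsigned int ticket = atomicInc(&count, gridDim.x);  // one ticket per finished block
        is_last_block = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (is_last_block && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            total += partial[b];                             // safe to read thanks to the fence
        *out = total;
        count = 0;                                           // reset for the next launch
    }
}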

Global synchronization within a kernel is a very, very bad idea.

So does that mean you even discourage use of the example code from the Programming Guide?

Well, it appears fence functions allow us to select one block of threads to perform some house-keeping task at the end of a calculation. However, all the other threads will still go on with their duties past the house-keeping code block (in the given example they simply exit).

My requirement was to be able to suspend all other threads until one thread performs some house-keeping task, and only then would the rest of the world get to proceed. I guess this is what tmurray called a “very, very bad idea”.

Thanks a lot for your opinions; they helped me broaden my understanding of CUDA!

Are you talking about the __threadfence example? I think that’s fine. The problem is when you have code where the correctness assumes that some blocks will be running concurrently and you’re waiting within a kernel to try and enforce that.