Many threads updating a single flag in global memory

Hi All,

This is my first post and I’m completely new to CUDA :)

I’m working on porting a certain application to CUDA in order to leverage its benefits. So far everything has checked out, except that I’m having some trouble understanding the limits of memory access performance.

In my application some threads need to update a common boolean flag in global memory. Initially the flag is set to false, and any updating threads (if there are any) would only ever set it to true. Since many threads might attempt to set this flag to true at the same time (without any atomic operation), should I be concerned about performance, or about correctness?

Sorry I could not figure out an answer for this from the guide or anywhere on the web (yet).

Thanks a bunch :)

I suppose in your context the only thing that matters is whether the flag is true or false, not exactly which thread updated the value. If that’s the case, in theory you wouldn’t even have to use an atomic operation. Just use normal reads and writes and you will get the expected result and performance. If reading/writing the flag is the most frequent global memory operation you have, you could disable the L1 cache to allow the use of 32-byte cache lines and save some bandwidth.
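For what it’s worth, here is a minimal sketch of that pattern; the kernel name, the d_flag pointer and the data[i] > 0.0f test are just placeholders for whatever your application actually checks. (If I remember correctly, on 2.x devices L1 caching of global loads can be disabled with nvcc’s -Xptxas -dlcm=cg option.)

__global__ void raise_flag(const float *data, int n, int *d_flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)   // placeholder condition
        *d_flag = 1;               // plain store; every writer stores the same value
}

// Host side: clear the flag before the launch and read it back afterwards, e.g.
// cudaMemset(d_flag, 0, sizeof(int));
// raise_flag<<<grid, block>>>(d_data, n, d_flag);
// cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);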

Awesome! I previously suspected that multiple threads accessing the same global variable might cause contention, but now I see it’s not the case.

Thanks again!

When threads from the same warp read the same address, the value is actually broadcast to all the threads. Writing is another story, though. I’m not sure whether only one of the warp’s 32 writes somehow survives (and the other 31 are eliminated) or whether all 32 writes are serialized. Perhaps you could do a little microbenchmarking to find that out.
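If it helps, a microbenchmark along those lines can be as simple as timing two store-only kernels against each other, one where every thread hits the same word and one where the stores are fully coalesced (the names and launch configuration below are just placeholders):

__global__ void store_same_word(int *out)
{
    *out = threadIdx.x;                                          // every thread in the grid writes the same word
}

__global__ void store_distinct_words(int *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x;    // one word per thread, coalesced
}

// Host side: time each kernel with CUDA events and compare, e.g.
// cudaEventRecord(start);
// store_same_word<<<1024, 256>>>(d_out);
// cudaEventRecord(stop);
// cudaEventSynchronize(stop);
// cudaEventElapsedTime(&ms, start, stop);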

Yesterday I went through the programming guide more closely and I found:

  • In section 4.1 (SIMT Architecture), it says:

“If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device (see Sections G.3.2, G.3.3, G.4.2, and G.4.3) and which thread performs the final write is undefined.”

  • In section G.4.2 (Compute Capability 2.x, Global Memory), it says:

“If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined.”

I could not find a similar remark about global memory access on 1.x devices; however, I believe it must be much the same (unpredictable).

Still, I believe the nature of my application allows me not to worry about this issue, since it doesn’t matter which thread, or how many threads, perform the write: as long as at least one thread changes the flag, I get the correct result.

I initially thought of staging the flag update, using a shared memory flag as an intermediate step. But soon I realized that this is not possible, because there is no way of selecting one of the threads in a warp to transfer the flag from shared memory to global memory. I hope I have understood that correctly.

That is indeed a worthwhile optimization, and it’s quite easy to have just one thread in a block update the global flag:

__syncthreads();
if ((threadIdx.x == 0) && (threadIdx.y == 0) && (threadIdx.z == 0) && shared_flag)
    global_flag = 1;
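For completeness, a whole kernel using this staging pattern could look roughly like the sketch below (1D blocks assumed; shared_flag, d_global_flag and the data[i] > 0.0f condition are placeholders):

__global__ void flag_kernel(const float *data, int n, int *d_global_flag)
{
    __shared__ int shared_flag;

    if (threadIdx.x == 0)
        shared_flag = 0;                  // one thread initializes the block-local flag
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)          // placeholder condition
        shared_flag = 1;                  // many threads may write, but they all write the same value

    __syncthreads();
    if (threadIdx.x == 0 && shared_flag)  // block leader publishes the result
        *d_global_flag = 1;
}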

Wow! This is great!

So we introduce a data-dependent branch that is followed by only one thread while the rest of the threads wait (kind of suspended) until that particular thread finishes its job. I was sort of obsessed with traditional synchronization methods and never thought of something like this. Is this a general practice in SIMT architectures to achieve turnstile-like thread behaviour? (It feels like it.)

Pretty clever / awesome!

This is a fairly common trick for kernels that reduce a lot of input data into a few outputs. You obviously lose a lot of throughput in a single-thread kernel section, but if the goal is to write a few words after a long calculation, then this is a straightforward way to do it.

Thanks for the clarification!

To go a little off-topic, is it possible to extend the same idea to all the threads? I mean with this approach only one thread gets to proceed while the rest of the threads in the same block have to wait until this particular thread is done. However, threads belonging to other blocks will simply skip this conditional and go on with their duties. What if I want to make sure that nobody (absolutely nobody) else gets to continue until one thread finishes off some small house-keeping task?

Right now my guess is that, for this kind of situation, I have to break up the kernel execution and get some support from the host thread. But that means I have to make multiple kernel launches just because of that tiny little house-keeping task.

Thanks for any opinions / comments!

It’s not that easy to do, as the order in which blocks execute is undefined. So usually you would just launch another kernel for that; kernel launches aren’t that expensive. However, appendix B.5 “Memory Fence Functions” of the 3.2 Programming Guide gives example code showing how to achieve this with a single kernel launch.
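In rough outline, that example works like this: every block writes its partial result, issues __threadfence() so the write is visible to the other blocks, then atomically increments a global counter; the one block that sees the counter reach gridDim.x - 1 knows it is the last to finish and performs the final step. The sketch below only mirrors that structure; the names (count, partial) and the serial per-block sum are placeholders, while the guide’s example does a proper parallel reduction.

__device__ unsigned int count = 0;

__global__ void sum_then_finalize(const float *in, volatile float *partial, float *out, int n)
{
    __shared__ bool is_last_block;

    if (threadIdx.x == 0) {
        // Placeholder partial result: thread 0 sums the block's slice serially.
        float s = 0.0f;
        int begin = blockIdx.x * blockDim.x;
        int end = begin + blockDim.x;
        if (end > n) end = n;
        for (int i = begin; i < end; ++i)
            s += in[i];
        partial[blockIdx.x] = s;

        __threadfence();                                     // make partial[] visible to all other blocks

        unsigned int ticket = atomicInc(&count, gridDim.x);  // one ticket per finished block
        is_last_block = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (is_last_block && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            total += partial[b];                             // safe to read thanks to the fence
        *out = total;
        count = 0;                                           // reset for the next launch
    }
}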

Global synchronization within a kernel is a very, very bad idea.

So does that mean you even discourage use of the example code from the Programming Guide?

Well, it appears fence functions allow us to select one block of threads to perform some house-keeping task at the end of a calculation. However, all the other threads will still go on with their duties past the house-keeping code block (in the given example they simply exit).

My requirement was to be able to suspend all other threads until one thread performs some house-keeping task, and only then would the rest of the world get to proceed. I guess this is what tmurray called a “very, very bad idea”.

Thanks a lot for your opinions; they helped me broaden my understanding of CUDA!

Are you talking about the __threadfence example? I think that’s fine. The problem is when you have code where the correctness assumes that some blocks will be running concurrently and you’re waiting within a kernel to try and enforce that.