Global thread barrier

I’m working on an algorithm which periodically requires that all threads be synchronized at a barrier, such that no thread continues until all threads have reached that point. __syncthreads() won’t do it, because that only applies to the threads in a block. I need to synchronize all threads across all blocks.

One obvious solution is to break the algorithm into multiple kernel invocations. That would work, but it would be really slow, since it introduces the latency of a kernel invocation for every synchronization.

A potentially faster solution is to implement a global thread barrier using atomic operations on global memory. Has anyone tried doing this? Did it work well? Are there any tips I should know about to get good performance?

Are there any other approaches to solving this problem I should consider?

Thanks!

Peter


Many in the forums have tried and posted their results.

No, it works horribly. As an example of the simplest problem encountered: not all blocks will run concurrently. Therefore, your global barrier will deadlock.

Yes. Just use multiple kernel invocations. Is 10 microseconds that bad an overhead to pay?

Unless I’m reading the description wrong, the __threadfence() function in CUDA 2.2 sounds like a global synchronization function.

Can you point me to any of them? I’ve searched the forums, but didn’t find any.

Yes it is. If I have to synchronize ten times, that’s 100 us of overhead for a kernel that probably does less than 10 us of real work, and that gets executed >1000x per second.

Other options I thought about:

  • Run the entire kernel as one thread block so I can use __syncthreads(). That limits it to running on one SM, but might still end up being faster.

  • Download the input data to the CPU, run the algorithm there, then upload the results back to the GPU. :(

Peter

I haven’t looked at CUDA 2.2 at all. Thanks for the pointer - I’ll take a look.

Peter

That’s what I thought initially too, but apparently it’s not really that at all. I promise, __threadfence() is the most confusing function in CUDA. It looks like a global synchronization barrier, but I think it’s something closer to “stall all or part of the memory controller (depending on whether you use __threadfence() or __threadfence_block()) from issuing more transactions until all transactions in flight have completed.”

This may be slightly off (I’m trying to remember a description given several months ago), but it’s really not a global synchronization barrier/primitive. It won’t break if you’ve got more blocks than SMs, for example. (or so I remember, maybe I’m wrong!)

I was a bit worried about that from the description. I think it would be helpful to expand the documentation on these functions in the programming guide to make clear, for example, what the difference between __threadfence_block() and __syncthreads() is, and what exactly the threadfence functions do and can be used for.

Yeah, I was right. It’s not a global synchronization barrier at all, and it will not break anything if you have more blocks than can fit onto the GPU at once. Call __threadfence() from a warp with in-flight writes to make those writes visible everywhere else in the GPU.

Yeah, it’s a tricky concept! I view it as a cache buffer flush. Kind of like fsync() in Linux.

Hey, Tim, when you next talk to the driver guys, you might mention to them that since there’s a threadfence() now, it opens the possibility of writable textures! Currently textures in CUDA are read-only, since CUDA can’t guarantee the texture caches would see any writes; there would have to be fancy hardware synchronization between the memory controller and the caches to tell them when they go stale, etc. But with a threadfence(), you know all writes are committed, and you just need to mark all the caches as invalid (as it does at the start of a kernel). And boom: we can now write to textures and use threadfence() to make the new texture contents safely visible.

Can’t you do this now with texturing from pitchlinear (e.g., normal) memory and __threadfence()?

But the texture caches might still be populated with stale values, so your queries would be incorrect. You need to flush the texture caches during the threadfence() halt.

Stupid internal server errors! I lost the entire reply to the ether… here is the short version.

Google is better:

http://www.google.com/search?hl=en&cli…art=10&sa=N

http://forums.nvidia.com/index.php?showtop…amp;mode=linear

http://forums.nvidia.com/lofiversion/index.php?t43673.html

http://forums.nvidia.com/index.php?showtopic=40227

There were a couple more good hits, but the internal server error ate them.

Unfortunate.

I do this for a final pass in a reduction. It works effectively when the amount of work to be done is tiny.

Have you benchmarked this? I’ve got a few algorithms in HOOMD where this is actually faster, copy times included. Given that you are in a region where 10 us is a stiff penalty, it may be the case for you too if the data to be copied isn’t too large.

Thanks! None of those is really the same as what I have in mind, though. Assume the number of blocks equals the number of SMs, so there is one block running on each SM, and that syncCounter points to a location in global memory which has been initialized to 0. I would then do something like the following:

[codebox]
__syncthreads();
if (threadIdx.x == 0)
    atomicInc(syncCounter, gridDim.x);
while (*syncCounter < gridDim.x)
    __syncthreads();
[/codebox]

(I’m just writing this off the top of my head, so please forgive any mistakes.) Do you see any reason that wouldn’t work?

Peter

From a first look, the __syncthreads() in the while loop is asking for trouble.

In principle what you’re asking for should be possible. I don’t see why the atomics can’t give you global synchronization.

I have to agree that this is bad, because it violates the basic principle that each block should be an independent unit of work. Once you go against the model, all sorts of bad things can happen to you.

You really should try running multiple kernels before you discount that approach. It’s probably not as bad as you think, and you can get higher occupancy by running more blocks than SMs, so it could be substantially faster. CUDA gets good performance even on kernels where each thread does only a few operations, so the fact that your units of work are lightweight might not be as significant as you think.

Your specific approach (if I may paraphrase below) has a problem: after the last block increments syncCounter and wraps it back to zero, one block might reach the next sync point before all the other threads have finished evaluating “while (*syncCounter)”, so some blocks can get stuck on previous steps. If you used a syncCounter1 and a syncCounter2 then it might work…

if (threadIdx.x == 0) {
    atomicInc(syncCounter, gridDim.x);
    while (*syncCounter); // empty body of while loop
}

__syncthreads();

Good point. It’s not really needed, and could cause deadlocks. The __syncthreads() call in the first line really is needed, though. Without that, the counter could get incremented even though not all threads in this block were done.

I thought that occupancy was based solely on the number of warps per SM? For example, a single block per SM with 320 threads should give exactly the same occupancy as 10 blocks per SM with 32 threads each.

No, the last block sets it to gridDim.x, not to 0. According to the description of atomicInc() in the CUDA programming guide, the new value stored in memory is ((old >= val) ? 0 : (old+1)).

Anyway, I’m going to go ahead and try out this approach. I’ll post back to let you know how it goes.

Peter

Yes, I think you’re right on all three counts! My mistake.

I am actually trying to implement the same thing. Basically, I also need global synchronization between all the threads. Your approach seems reasonable to me as long as the number of blocks is not more than SMs.

Any luck getting this working? Did you successfully test this?
Thanks,
Amir

I’ve tried a lot of things, but so far have been unable to make it work reliably. At first I thought it was working exactly as described above. Then I discovered that it only worked if I compiled my program in debug mode, but deadlocked if I compiled in release mode. (I have no idea why this should make any difference - how I compile my CPU code shouldn’t affect what works on the GPU - but there it is.) Then I found something that worked consistently on my 8600M GT, but when I tried it on a GTX280, it didn’t properly block all threads.

Here’s what worked on the 8600M GT, in case someone else wants to experiment with it further:

[codebox]
__device__ void syncAllThreads(unsigned int* syncCounter)
{
    __syncthreads();
    if (threadIdx.x == 0)
        atomicInc(syncCounter, gridDim.x-1);
    __shared__ int counterValue;
    do
    {
        if (threadIdx.x == 0)
            counterValue = *syncCounter;
    } while (counterValue > 0);
}
[/codebox]

Peter

Somewhat related discussion here.

At least two lessons. One is that volatile can make a big difference. The other, which I didn’t mention but observed along the way when inspecting the PTX code, is that an infinite loop like while(*p); can be removed entirely by the compiler! I guess if *p is not volatile, the compiler statically determines the loop has no effect and removes it (or something like that).

Another lesson is that if some threads are spinning waiting for the action of other threads within the same warp, they can easily deadlock. But that’s not the problem you’re facing because you have only one thread per block doing the synchronizing.

But aside from that, what you described before sounded like it would work.

Does this work? (I’m not near my dev machine so I can’t test right now)

__device__ volatile unsigned int syncCounter1 = 0;
__device__ volatile unsigned int syncCounter2 = 0;

__device__ void syncAllThreads() {
    __syncthreads();
    if (threadIdx.x == 0) {
        // cast away volatile: atomicInc() takes a plain unsigned int*
        atomicInc((unsigned int *)&syncCounter1, gridDim.x-1);
        while (syncCounter1); // empty body of while loop
        atomicInc((unsigned int *)&syncCounter2, gridDim.x-1);
        while (syncCounter2); // empty body of while loop
    }
    __syncthreads();
}

(I’ve got basically two barriers to prevent one block from racing ahead and incrementing syncCounter1 too fast, before other blocks get past while(syncCounter1);)

Thanks, I had forgotten that CUDA supported the volatile keyword! Using that, I was able to get it working reliably on the GTX280. Here it is:

[codebox]
__device__ void syncAllThreads(unsigned int* syncCounter)
{
    __syncthreads();
    if (threadIdx.x == 0)
        atomicInc(syncCounter, gridDim.x-1);
    volatile unsigned int* counter = syncCounter;
    do
    {
    } while (*counter > 0);
}
[/codebox]

Note that I’m only using one counter, not two like you are. This makes it faster, but it also means that if you call it multiple times in the same kernel, you must not use the same counter twice in a row.

Now the bad news: each call to this function is taking about 10 us, which means it’s actually slightly slower than just calling multiple kernels. For example, my test application takes 38 seconds with this function and 35 seconds when using multiple kernels instead. Any ideas for how to make it faster, or even to figure out what the bottleneck is? I assume it’s either the atomicInc (I’ve heard atomic operations are slow in the presence of contention) or the loop (repeated memory reads), but I don’t see any obvious solution to either one.

Peter