Named barriers limit occupancy?

This is on a Quadro RTX 6000.
I recently modified a kernel to use named barriers (via inline PTX) instead of __syncthreads(), as this better fits my synchronization pattern, where I only need pairs of warps to synchronize. This gave me about a 20% performance boost in the configuration I happened to be targeting: 2 blocks of 12 warps each.
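The wrapper looks roughly like this; a minimal sketch of the pattern rather than my exact kernel code, with the function name just for illustration:

// sketch: sync the two warps that arrive at the same named barrier
// (bar.sync with a thread count of 64 = 2 warps); the ID is passed in a register
__device__ __forceinline__ void pairSync(int barrierId) {
        asm volatile("bar.sync %0, %1;" : : "r"(barrierId), "r"(64) : "memory");
}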

I then wanted to optimize my block sizes, and discovered I’m limited to an achieved occupancy of 2 blocks. That is, by varying other parameters (launch_bounds, shared memory usage, # warps/block) I can target various occupancies, and Nsight Compute agrees I should theoretically be able to achieve them, both in the per-resource constraints it reports (registers, warps, shared mem, SM) and in “theoretical active warps per SM”. But the achieved occupancy reported by a profiling run always corresponds to just under 2 blocks’ worth of warps.
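For reference, the theoretical limit I’m comparing against can also be queried on the host. A minimal sketch (my_kernel, blockSize, and dynamicSmemBytes stand in for my actual kernel and launch configuration):

#include <cstdio>

__global__ void my_kernel();  // placeholder for my actual kernel

void report_theoretical_occupancy(int blockSize, size_t dynamicSmemBytes) {
        int maxBlocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, my_kernel,
                                                      blockSize, dynamicSmemBytes);
        std::printf("theoretical blocks/SM: %d\n", maxBlocksPerSM);
}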

Putting __syncthreads() back in place of the named barriers, and changing nothing else, restores the expected occupancy.

I don’t see anything in the PTX documentation implying that named barriers are a shared per-SM resource. And even if they are, what I’m seeing looks like a hard block limit rather than a resource limit.

Is this a known issue/limitation? I can go make a reproducer but wanted to check first.

There are limits to the number of named barriers. I don’t know whether that limit applies in your case.

This and this may be of interest.

Note that an RTX 6000 is a Turing device, so the maximum possible occupancy is 32 warps per SM.

Hi Robert, thanks for the quick reply! Everything is under the 32 warp/SM limit.

It would be nice to know the per-SM limit for named barriers. I’m also curious how that could actually be enforced by the scheduler if it’s a per-SM resource rather than a fixed per-block allocation, because the named barrier instruction can take its ID in a register. So I don’t see how you could statically analyze how many barriers a block is going to use in order to bound occupancy for the block scheduler.

Regardless, I don’t think that’s the limit I’m hitting here. For example, one version of the kernel has 12 warps/block, uses 6 named barriers per block, and targets an occupancy of 2 blocks, so 24 warps and 12 named barriers per SM in total. It achieves 23.06 “achieved active warps per SM” out of a “theoretical active warps per SM” of 24. But another variant has 6 warps/block, uses 3 named barriers per block, and targets an occupancy of 4 blocks (so again 24 warps and 12 barriers per SM in total), yet it only achieves 11.8 of 24.

Some other datapoints:
2 blocks, 8 warps/block (16 warps, 8 barriers/SM) achieves 29.94 / 32.
4 blocks, 4 warps/block (16 warps, 8 barriers/SM) achieves only 7.89 / 16, which is 2 blocks.
3 blocks, 8 warps/block (24 warps, 12 barriers/SM), achieves 15.65 / 24, again 2 blocks.

So it really seems like a block limit, rather than a function of the number of barriers used. Maybe that’s how you get around the static analysis problem: each SM has two sets of 16 named barriers, and any named barrier usage limits you to two blocks. But it’s surprising that this wouldn’t be documented if that were the case.

I agree there seems to be a kind of blocks-per-SM limit. I’m still studying it.

Based on my own testing, I can observe an occupancy effect, more or less as you describe. The exact effect seems to vary with 1. the exact usage (e.g. whether the barrier ID is compile-time visible or not), 2. the number of barriers used (in the compile-time-visible case), and 3. the GPU architecture. The “worst” case I was able to observe lines up with your observation: 2 blocks per SM on Turing (the limit seems to be 4 blocks per SM on Volta, for example). This is what I see when the barrier ID is not discoverable at compile time. For a barrier ID that is discoverable at compile time, more blocks per SM are possible, but the limit will probably depend on exactly how many compile-time-visible IDs you are using.

I don’t know that any of this is documented, but I have filed an internal RFE to see if we can get it documented. I can’t give any forward-looking statements about if or when that may happen. If you’d like some visibility, you should probably file your own bug. I used the following code to make my observations, and my comments here are based on them:

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

// returns elapsed microseconds since `start` (pass 0 to get a timestamp)
unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// sync `numThreads` threads on named barrier `name` (ID passed in a register)
__device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
        asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");}
#define nTPB 64
__global__ void k(unsigned long long TDELAY){
        for (int i = 0; i < 1; i++){
          unsigned long long start = clock64();
          while (clock64() < start+TDELAY);  // spin for ~TDELAY clocks
#ifndef SKIP_SYNC
#ifdef USE_FIXED
          // barrier ID is a compile-time constant
          namedBarrierSync(0, nTPB);
#else
          // barrier ID comes from a runtime value (always 0 here, but not visible to the compiler)
          namedBarrierSync(threadIdx.x/64, 64);
#endif
#endif
        }
}

int main(int argc, char *argv[]){
        // GTX960: 8 SMs, GT640: 2 SMs
        cudaDeviceProp deviceProp;
        cudaError_t err = cudaGetDeviceProperties(&deviceProp, 0);
        if (err != cudaSuccess) {std::cout << "cudaGetDeviceProperties: " << cudaGetErrorString(err) << std::endl; return 1;}
        std::cout << "device: " << deviceProp.name << std::endl;
        std::cout << "compute capability: " << deviceProp.major << "." << deviceProp.minor << std::endl;
        int numSMs = deviceProp.multiProcessorCount;
        std::cout << "number of SMs: " << numSMs << std::endl;
        unsigned long long tdelay = 1000ULL * deviceProp.clockRate;  // ~1 second of clocks (clockRate is in kHz)
        int block_count;
        // sweep block counts alternating just above / exactly at multiples of the SM count,
        // so the elapsed time exposes how many blocks are resident per SM
        for (int i = 0; i < 16; i++){
          if (i%2 == 0) block_count = (i/2)*numSMs+1;
          else block_count = ((i+1)/2)*numSMs;
          std::cout << "block count: " << block_count;
          unsigned long long dt = dtime_usec(0);
          k<<<block_count,64>>>(tdelay);
          cudaDeviceSynchronize();
          dt = dtime_usec(dt);
          std::cout << " elapsed time: " << dt << "us" << std::endl;}
}

If you can (perhaps via templating) make your barrier ID usage visible at compile time, you may be able to get the best tradeoff between the number of barriers used and the number of resident blocks per SM. Sorry, I don’t have a precise formula to give you, but it seems to be discoverable via careful benchmarking. The above code should allow you to begin to explore the compile-time vs. non-compile-time case and witness the difference for that simple example. If you know that your barrier usage is a fixed pairwise relation between warps (e.g. warps 0 and 1 will always use barrier 0, etc.), it may be possible to convert to compile-time-visible ID usage. Using most of the barriers may allow you to get close to max occupancy: 16 barriers, 32 warps (for Turing, anyway). I haven’t explored that envelope in detail.
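For example, something along these lines (a sketch; namedBarrierSyncFixed is just a name I’m using here) exposes the barrier ID to the compiler as an immediate:

// sketch: barrier ID as a template parameter, emitted as an immediate via the "n"
// inline-PTX constraint, so the tools can see exactly which barriers are used
template <int BarrierId>
__device__ __forceinline__ void namedBarrierSyncFixed(int numThreads) {
        asm volatile("bar.sync %0, %1;" : : "n"(BarrierId), "r"(numThreads) : "memory");
}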

Thanks greatly for the information. I’ll try the compile-time approach. Currently my blocks look like [32, kNumPairs, 2], so my named barriers take threadIdx.y. I’ll try throwing that into a switch statement or something similar, at the cost of a little (non-divergent) control logic, which might make the compiler/scheduler happy enough.
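Something like this is what I have in mind (a sketch; kNumPairs is from my kernel, and namedBarrierSyncFixed is the templated wrapper from your post):

// sketch: map threadIdx.y (the warp-pair index) onto compile-time-visible barrier IDs;
// the switch is uniform within each warp pair, so it introduces no divergence
__device__ __forceinline__ void pairSyncFixed(int pair) {
        switch (pair) {
          case 0: namedBarrierSyncFixed<0>(64); break;
          case 1: namedBarrierSyncFixed<1>(64); break;
          case 2: namedBarrierSyncFixed<2>(64); break;
          case 3: namedBarrierSyncFixed<3>(64); break;
          // ... extend up to kNumPairs - 1 (at most 16 named barriers per SM)
        }
}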

I would also like to point out a new feature. In CUDA 11.1 you can now create threadblock tiles in cooperative groups that are larger than 32 threads. This means that by creating such tiles/groups you can use the .sync() method to provide an execution barrier for just the threads in that group. This would be the preferred/“modern” way (although admittedly still “experimental” at this point) to accomplish synchronization within a partial threadblock. Having said that, I don’t know the exact mechanisms this depends on, and there may still be performance issues, but I would give it a try if you can. In the future, any attempt to mitigate performance issues would probably be exposed here.
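Roughly, the experimental usage looks like the sketch below (based on my reading of the CUDA 11.1 cooperative groups documentation; treat the block_tile_memory template arguments as placeholders and check them against the docs):

#define _CG_ABI_EXPERIMENTAL           // opt in to the experimental CG API (CUDA 11.1)
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void pair_sync_kernel() {
        // scratch storage that the experimental >32-thread tiles require
        __shared__ cg::experimental::block_tile_memory<4, 256> scratch;
        cg::thread_block block = cg::experimental::this_thread_block(scratch);
        // a 64-thread tile spans one pair of warps; .sync() is the execution barrier
        auto pair = cg::experimental::tiled_partition<64>(block);
        pair.sync();
}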

Interesting! I see it needs memory, apparently both for synchronization and collective operations. If you only need synchronization, I wonder how it compares to the new aw_barrier. Probably neither performs well on Turing, but worth a try.