Relaxed __syncthreads() proposal.

Hello

I propose to relax the __syncthreads() specification and allow it to work inside conditional branches, synchronizing only those threads which take that branch. This would potentially allow the hardware to drop some warps of the thread block out of the computation entirely. For example, during a "reduce" operation via shared memory, at some point most of the warps stop contributing to the final result, but since you have to synchronize shared memory access, you have to call __syncthreads() in all the warps, even those which don't do anything.

What do you think of this idea?

Do you really think this has a big impact on performance? Idle warps simply sync with the other warps and shouldn't contribute any significant overhead in the present model. Do you have any comparisons to share?

That's just not the best use case I've imagined, performance-wise. It of course depends on the number of __syncthreads() calls you do in your kernel, the number of inactive threads which are forced to execute them, and the complexity of the branch condition used to deactivate the threads after the __syncthreads() call.

I don't have any numbers; it is just an idea.

Whatever the branch condition may be, you can always store its result in a flag variable.
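For example, a rough sketch of what I mean (expensive_condition() is just a made-up placeholder for any complex predicate): the condition is evaluated once into a register, so repeating the branch around each __syncthreads() costs only a single flag test.

__device__ bool expensive_condition(int tid); // hypothetical, stands in for any complex predicate

__global__ void example_kernel(float *val)
{
   bool active = expensive_condition(threadIdx.x); // evaluated once, kept in a register
   if (active) { val[threadIdx.x] *= 2.0f; }       // phase 1 (arbitrary work)
   __syncthreads();
   if (active) { val[threadIdx.x] += 1.0f; }       // phase 2: only the cheap flag test repeats
}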

The best way to get people to consider an idea like this is to provide a case for its usefulness. Start by developing a simple analytical model that shows how it will reduce the number of dynamic instructions required for several common operations like reductions or prefix sums. Try to look at possible problems with the idea, for example its implications for code transformations done by the compiler, or how hard it would be to design hardware to implement it. Try writing it up after this point and people will be much more likely to take a closer look at it.

OK, I'll post later then, when I get some real numbers.

Hm, does this reduce to the halting problem? I think it does! (How do you know a priori the number of warps that have to reach the barrier?)

A tight reduction loop may be the best example of this helping. It's the best case for this kind of optimization I can think of: very tight, with __syncthreads() dominating.

For the best contrast, assume the worst case of 512 threads (16 warps) per block to maximize the potential savings.

You could transform this code:

tid=threadIdx.x;
if (tid<256) val[tid]+=val[tid+256];
__syncthreads();
if (tid<128) val[tid]+=val[tid+128];
__syncthreads();
if (tid<64) val[tid]+=val[tid+64];
__syncthreads();
if (tid<32) { // no need for syncthreads within the same warp, which is always synced with itself
   val[tid]+=val[tid+32];
   val[tid]+=val[tid+16];
   val[tid]+=val[tid+8];
   val[tid]+=val[tid+4];
   val[tid]+=val[tid+2];
   val[tid]+=val[tid+1];
}
__syncthreads();

into this code:

tid=threadIdx.x;
if (tid<256) {
   val[tid]+=val[tid+256];
   __syncthreads();
   if (tid<128) {
      val[tid]+=val[tid+128];
      __syncthreads();
      if (tid<64) {
         val[tid]+=val[tid+64];
         __syncthreads();
         if (tid<32) { // no need for syncthreads within the same warp, which is always synced with itself
            val[tid]+=val[tid+32];
            val[tid]+=val[tid+16];
            val[tid]+=val[tid+8];
            val[tid]+=val[tid+4];
            val[tid]+=val[tid+2];
            val[tid]+=val[tid+1];
         }
      }
   }
}
__syncthreads();

What's the savings? The warps that were disabled by the conditionals don't need to participate in the __syncthreads() overhead, and they don't need to do the extra tests and compares.

Is this much of a savings? A bit. With 16 warps, if __syncthreads() costs 2 clocks per warp, this saves 8+12+14 = 34 warp calls to __syncthreads(), that's 68 clocks. It also saves 8+12+14 = 34 "if (tid<x)" tests, which is another 34 or 68 clocks. (I am just guessing at exact clock counts here.)

Let's try a simple cost model for a quick analysis. Say an if-test, an add, and a __syncthreads() all have a similar cost of 1 "op". Let's assume there are plenty of blocks, so no latency or pipeline costs. Then the first reduction code, annotated with warp-op counts:

if (tid<256) val[tid]+=val[tid+256]; // 16 + 8
__syncthreads(); // 16
if (tid<128) val[tid]+=val[tid+128]; // 16 + 4
__syncthreads(); // 16
if (tid<64) val[tid]+=val[tid+64]; // 16 + 2
__syncthreads(); // 16
if (tid<32) { // 16
   val[tid]+=val[tid+32]; // 1
   val[tid]+=val[tid+16]; // 1
   val[tid]+=val[tid+8];  // 1
   val[tid]+=val[tid+4];  // 1
   val[tid]+=val[tid+2];  // 1
   val[tid]+=val[tid+1];  // 1
}
__syncthreads(); // 16

The total is 148 warp ops.
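(Summing the annotations: (16+8) + 16 + (16+4) + 16 + (16+2) + 16 + 16 + 6 + 16 = 148.)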

The “relaxed syncthreads” cost:

if (tid<256) { // 16
   val[tid]+=val[tid+256]; // 8
   __syncthreads(); // 8
   if (tid<128) { // 8
      val[tid]+=val[tid+128]; // 4
      __syncthreads(); // 4
      if (tid<64) { // 4
         val[tid]+=val[tid+64]; // 2
         __syncthreads(); // 2
         if (tid<32) { // 2
            val[tid]+=val[tid+32]; // 1
            val[tid]+=val[tid+16]; // 1
            val[tid]+=val[tid+8];  // 1
            val[tid]+=val[tid+4];  // 1
            val[tid]+=val[tid+2];  // 1
            val[tid]+=val[tid+1];  // 1
         }
      }
   }
}
__syncthreads(); // 16

The total is 80 warp-ops, versus 148. So that's really nice… about 46% fewer ops.
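(Summing the annotations: 16 + 8 + 8 + 8 + 4 + 4 + 4 + 2 + 2 + 2 + 6 + 16 = 80, a saving of 68 warp-ops over the 148 above.)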

Of course this is a best-case analysis of the function best placed to benefit, and my assumptions about op costs may not be valid, but it's a quick analysis anyway.

However, to be honest, I don't see many cases other than reduction where this optimization would help noticeably.

Perhaps easier for NVIDIA (and users) would be to leave __syncthreads() alone but add a new, super-optimized __syncthreads_reduce() primitive which acts like __syncthreads_count() but does full reductions on the major types: int, float, long, and double. (Such a routine might require dynamic shared memory, which would need to be allocated as the example above implies anyway.)
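To illustrate the kind of primitive I mean, here is just a sketch of a plausible signature with a naive software fallback; no such intrinsic actually exists, and the name, the power-of-two block size, and the dynamic shared memory allocation are all my own assumptions:

__device__ float syncthreads_reduce_sum(float v) // hypothetical name and signature
{
   extern __shared__ float red_buf[];            // assumes blockDim.x floats of dynamic shared memory
   red_buf[threadIdx.x] = v;
   __syncthreads();
   // naive tree reduction; assumes blockDim.x is a power of two
   for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
      if (threadIdx.x < s) red_buf[threadIdx.x] += red_buf[threadIdx.x + s];
      __syncthreads();
   }
   return red_buf[0];                            // every thread returns the block-wide sum
}

A hardware-accelerated version would presumably do this in one step rather than with a loop of barriers.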

I actually have another use case where this optimization could help:

Suppose you have two tight kernel launches with different block configurations. Because the kernels are tight, kernel launch overhead becomes a problem, especially when your algorithm is iterative and you have to call your kernels many, many times. How much time would you save if you called one bigger kernel that runs the iteration loop, instead of calling two kernels in a loop a thousand times? I assume a lot. So, to merge two kernels (with different block configurations) into one, I have to do some sort of dynamic block-shape reconfiguration. I plan to do that by preallocating as many warps per block as possible and using conditional branching to cancel out the warps that do not need to contribute to the particular sub-kernel.

In this situation, __syncthreads() has to be called by all the warps, including inactive ones, and depending on the number of inactive warps and the number of __syncthreads() calls in a particular sub-kernel, this optimization might not matter at all, or might matter (a little bit). I'll try to get some numbers when I actually get this dynamic reconfiguration working.
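Roughly what I have in mind (just a sketch; phase_a/phase_b and the thread-count constants are placeholders for the real sub-kernels and their shapes):

__device__ void phase_a(float *a, float *b); // placeholder for the body of former kernel 1
__device__ void phase_b(float *a, float *b); // placeholder for the body of former kernel 2

#define THREADS_FOR_PHASE_A 256  // placeholder thread counts for each sub-kernel
#define THREADS_FOR_PHASE_B 128

__global__ void fused_iteration(float *a, float *b, int iters)
{
   for (int it = 0; it < iters; ++it) {
      if (threadIdx.x < THREADS_FOR_PHASE_A)
         phase_a(a, b);
      __syncthreads();   // today: every warp in the block must arrive here, active or not
      if (threadIdx.x < THREADS_FOR_PHASE_B)
         phase_b(a, b);
      __syncthreads();
   }
}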

In some cases (the reduce example mentioned here) you know exactly how many warps contribute at compile time; in that case a simpler version, __syncthreads(numwarps_to_sync), would work. In other, completely dynamic cases a simple counter could hold the number of warps taking a particular branch (maybe there is something similar going on in hardware already).

If you’re merging two kernels like this to run simultaneously, you’d switch behaviors based on block ID, not thread ID, so you’d never have divergence.

Some of my own apps use this kind of "megakernel" approach.

But your example still doesn't make much sense, since you said your kernels have different configurations. The problem with merged kernels isn't launch overhead, it's the fact that the merged configuration is the worst case of both. If kernel1 uses 10 registers and 8K of shared memory, and kernel2 uses 63 registers and no shared memory, your final kernel will unfortunately use 8K of shared memory and 63 registers, and therefore likely have fewer active blocks per SM than simply running the two kernels simultaneously.

In my case I don't have that big a resource divergence, and I have dependencies between the kernels (i.e., I have to run one after another). I will post some numbers when I finish it.

Let me make a claim, which you can then support or rebut: you want a way to specify an arbitrary warp size within a kernel.

I don't think I got your claim right; let me put it a different way:

I think it would make sense for __syncthreads() to accept an integer parameter specifying how many threads (or warps) should reach this point in the code before all of them are released to continue execution. But this only makes sense if the original proposal is difficult to implement, or if implementing it would cause additional overhead.

Given that, if your branch condition is based on the thread index (for example, if (tid.x < 256) { … }), then you know how many threads (warps) take that branch, and thus you can do a __syncthreads() only on the threads inside that branch by using that parameter. In some cases this parameter may be known at compile time; in others it can be computed at run time.

There is already hardware support for this; see the barrier instructions in the PTX ISA manual.

You could implement a device function along the lines of __scoped_syncthreads(), containing inline PTX, to get this working in about 5-10 lines of code.
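A minimal sketch, assuming the bar.sync form that takes an explicit thread count (the count must be a multiple of the warp size, and every participating thread must pass the same value); the function name follows the suggestion above and is otherwise my own:

__device__ __forceinline__ void __scoped_syncthreads(unsigned int nthreads)
{
   // bar.sync <barrier>, <thread count>: named barrier 1 is used so it does not
   // interfere with barrier 0, which the regular __syncthreads() uses
   asm volatile("bar.sync 1, %0;" : : "r"(nthreads) : "memory");
}

Threads outside the branch simply never execute the barrier, e.g. if (tid < 256) { ...; __scoped_syncthreads(256); }.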

Thanks for the tip. Then I should be able to try it out and see what the performance gain actually is.