Threads diverging in a loop: when does a loop cause divergence?

I understand how threads in a warp can diverge in the case of an if or switch statement; however, I am a little unclear about how threads may diverge because of a loop. If we have a simple for, do or while loop in which the exit condition is met at the same time for each thread in a warp, will the threads diverge? Or will they diverge if and only if the exit condition ends up being met at different times for different threads?

For example:

int i = 0;
do
{
    somestuff();
    i++;
} while (i < 10);

assuming ‘i’ is never touched other than to increment it at the end of the loop, will this code ever diverge? Does it work the same way for ‘for’ loops and ‘while’ loops?

Thanks.

The threads will not diverge if somestuff() doesn’t break or continue. Inside somestuff() there may be divergent parts, but they will be reunited once somestuff() finishes.
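For example, if the exit from the loop is thread-dependent (a sketch with invented names and data), the threads do diverge at the break:

__global__ void early_exit(const float *data, float *out)
{
    int i = 0;
    float sum = 0.0f;
    do
    {
        // Thread-dependent break: threads in the warp leave the
        // loop at different iterations, so the warp diverges here.
        if (data[threadIdx.x * 10 + i] < 0.0f)
            break;
        sum += data[threadIdx.x * 10 + i];
        i++;
    } while (i < 10);
    out[threadIdx.x] = sum;
}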

Peter

I wonder if someone could clarify the exact conditions under which two divergent subsets of threads in a warp will coalesce? Does it happen when they end up with the same PC while blocked (e.g. for a global memory access), or does one need to issue a synchronise?
TIA, Eric
(I would put money on it not being optimal - picking up stalled threads on the first cycle that the PCs match would have to cost silicon.)

The manual says the threads are “serialized in hardware”. I think this is just the same method we are used to from previous GPU generations (output register write masks).

Peter

As the iteration count (10) is independent of the thread, this will never diverge. If you counted to the thread id instead of 10, it would.

You don’t have to resynchronize manually if threads do diverge; the hardware handles this by just blocking part of the threads until they coalesce again.
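A minimal sketch of the contrast (kernel and buffer names invented):

__global__ void loop_bounds(float *out)
{
    // Trip count is the same for every thread in the warp:
    // all 32 threads run exactly 10 iterations, no divergence.
    for (int i = 0; i < 10; ++i)
        out[threadIdx.x] += 1.0f;

    // Trip count depends on the thread index: threads exit at
    // different iterations, so the warp diverges until the
    // longest-running thread finishes.
    for (int i = 0; i < threadIdx.x; ++i)
        out[threadIdx.x] += 1.0f;
}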

Maybe I should have started another thread… my reading of section 6.1.1.2 gives lots of detail on the conditions under which serialisation will occur, but no detail on exactly when coalescing happens. The worst case, where all 32 threads are divergent (and NOT using branch predication), would require parallel compares on all 32 PCs every clock cycle if it were to be picked up automatically - and this might not be the most desirable approach anyway; better performance might be had by the compiler inserting something into the instruction stream to help drag subsets back together. It seems another paragraph is needed at the end of section 6.1.1.2.

I am trying to estimate the hit I am going to take in a recursive routine with half a dozen paths that are selected at random for each thread. Now I need to recode recursion into a local stack and then work out what sort of thread utilisation I am likely to get.
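Something like this is what I have in mind (a sketch only; the tree layout and all names are invented):

__device__ void traverse(const int *left, const int *right,
                         int root, int *visit_count)
{
    // Explicit local stack replacing recursion, since the hardware
    // has no call stack for recursive device functions.
    int stack[32];                  // assumed maximum tree depth
    int top = 0;
    stack[top++] = root;
    while (top > 0)                 // threads with deeper trees keep
    {                               // looping: a divergence point
        int node = stack[--top];
        (*visit_count)++;           // per-thread counter
        if (left[node] >= 0)  stack[top++] = left[node];
        if (right[node] >= 0) stack[top++] = right[node];
    }
}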

Thx, Eric

ed: There is another thread with lots more info - hiding behind an uninformative subject! [url=“http://forums.nvidia.com/index.php?showtopic=29863”]http://forums.nvidia.com/index.php?showtopic=29863[/url]

This leaves one more question - given that the compiler does output what sound like partial syncs at merge points, does it always get it exactly right? The compiler does not know at compile time how many threads are going to end up at any one merge point… (If we could use them explicitly too, that might be useful.)
Thx, Eric

As I said above, the diverging threads are serialized/rejoined in hardware. The compiler does not care about that at all.

Peter

Sorry Peter, I was referring to [url=“http://forums.nvidia.com/index.php?showtopic=29863”]post #5 in that thread[/url], where Mark Harris says “The threads in your new example will converge after the return from runb(). The compiler provides information in the code to help the hardware know when threads may diverge and converge.”

I am trying to understand how it works so that I don’t program myself into a corner without realising it. Is there a whitepaper on this feature? I feel like I have a model for everything else except this. Whichever way I look at it, either some threads never pick up, or code that does not need to be executed by any thread gets stepped over by everyone (especially in the presence of “goto”).
Thx, Eric

I think Mark was talking about the compiler in the driver, not the high-level C compiler, i.e. you cannot influence that anyway.

Peter

I wonder if a G80 engineer from Nvidia can answer my original question… I am trying to see how it can work other than this: a warp gradually fragments more and more at each thread-dependent branch, until it reaches a point where the compiler knows, using path analysis, that all threads should be back together (such as a return, or the last join before a return if there is only one), whence it inserts a warp-level sync. There could possibly be accidental rejoins before that, should PCs happen to match. If there is a whitepaper, or even a patent number, explaining how it works (if it is not the obvious), that would be most appreciated.
Eric

ed: refining my guess: at the entry to every control structure which may cause divergence (i.e. the decision variable is thread-dependent), the current running mask of threads is saved, and at the end of the structure, where all threads should be together, a sync with that same thread mask is performed. The compiler inserts the save and sync opcodes, as it can determine these points using path analysis.
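In toy form, the guess looks like this (everything here is invented to illustrate the idea, not real hardware):

/* A small stack of active masks: "save" runs at the entry to a
   potentially divergent construct, "sync" at its join point, and
   the sync completes only once every thread in the saved mask has
   arrived. */
typedef unsigned Mask;

static Mask mask_stack[8];
static int  sp = 0;

void op_save(Mask active)
{
    mask_stack[sp++] = active;      /* remember who should rejoin */
}

int op_sync(Mask arrived)
{
    if (arrived != mask_stack[sp - 1])
        return 0;                   /* still waiting for stragglers */
    sp--;                           /* everyone is back: release */
    return 1;
}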

further ed: I think this would mean that if you use “goto” you could shoot yourself in the foot… as the compiler would not be able to do its thing for all nested control structures between the source and destination levels of the jump. Who uses “goto”, I hear you say; well, personally I think break(n) and continue(n) are missing from the C language spec and it is OK to use goto for these. Using “return” from within nested structures is equivalent to a “goto” to the outermost “return”, which is much more common.
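For the record, the idiom I mean - goto standing in for the missing break(2) (invented names):

/* Search a 2D space and bail out of both loops on the first match. */
int find_first(int n, int m, int (*match)(int, int))
{
    int i, j;
    for (i = 0; i < n; ++i)
        for (j = 0; j < m; ++j)
            if (match(i, j))
                goto done;          /* the missing break(2) */
    return -1;                      /* no match found */
done:
    return i * m + j;
}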

The silence is deafening… the hardware does it, just does not do it for me

The reason seems to be that the G80 architecture is not ideal in this regard and no one is allowed to talk about it. It looks like the algorithm is a heuristic that gets it exactly right some of the time and degrades the rest of the time. The only info the hardware seems to have is a bra.uni (presumably “branch and unify”) instruction to the convergence point, placed in all converging paths of a conditional that could cause divergence. One path does not get one - the one that just runs straight on through.

Given this info, the only algorithm I could come up with is: all threads go into a wait state at the target address of the bra.uni. If the PCs match at any time in the future then we have convergence and these threads pick up. If the whole warp stalls (some threads reach a sync and everyone else ends up in a wait state) then, if there are any waiting threads, one lot is let go until everyone is up to the sync.

Can anyone think of another way that this could work?

Eric

Thought that would get a nibble…

Is it as simple as the semantics of bra.uni being to set that thread’s PC to the target and pick up with those threads that are waiting with the lowest PC? It seems like that could get it right, even in the presence of forward and backward gotos. If so, is there any overhead in finding those threads?
Eric

a final round: originally this question started off similar to the “Putting GPU to work” topic…
Just came back to this topic and my next design is: once you have hardware that can select the minimum PC to resume within a clock, then the semantics of ALL branch instructions would be to store the target and then pick up with the lowest PC. That gets it right all the time, without any compiler input except the correct ordering of multiple paths in memory. Everything being inlined makes this work nicely.
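In toy form (all names invented; a warp modelled as an array of PCs, not real hardware):

#define WARP_SIZE 32

/* After any branch each thread stores its target PC; the warp then
   resumes the subset of threads whose PC is the minimum over the
   warp. */
unsigned next_active_mask(const int pc[WARP_SIZE])
{
    int min_pc = pc[0];
    for (int t = 1; t < WARP_SIZE; ++t)
        if (pc[t] < min_pc)
            min_pc = pc[t];

    unsigned mask = 0;
    for (int t = 0; t < WARP_SIZE; ++t)
        if (pc[t] == min_pc)
            mask |= 1u << t;
    return mask;                    /* threads to issue next */
}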

Anyone have a better idea? Or a way to design an experiment to see if the G80 works this way?
Thanks, Eric