Threads diverging in a loop: when does a loop cause divergence?

I understand how threads in a warp can diverge in the case of an if or switch statement; however, I am a little unclear about how threads may diverge because of a loop. If we have a simple for, do or while loop in which the exit condition is met at the same time for each thread in a warp, will the threads diverge? Or will they diverge if and only if the exit condition ends up being met at different times for different threads?

For example:

int i = 0;
do
{
    somestuff();
    i++;
} while (i < 10);

assuming ‘i’ is never touched other than to increment it at the end of the loop, will this code ever diverge? Does it work the same way for ‘for’ loops and ‘while’ loops?

Thanks.

The threads will not diverge if somestuff() doesn’t break or continue. Inside somestuff() there may be divergent parts, but they will be reunited once somestuff() finishes.
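For example, if the exit from the loop is thread-dependent (a sketch with invented names and data), the threads do diverge at the break:

__global__ void early_exit(const float *data, float *out)
{
    int i = 0;
    float sum = 0.0f;
    do
    {
        // Thread-dependent break: threads in the warp leave the
        // loop at different iterations, so the warp diverges here.
        if (data[threadIdx.x * 10 + i] < 0.0f)
            break;
        sum += data[threadIdx.x * 10 + i];
        i++;
    } while (i < 10);
    out[threadIdx.x] = sum;
}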

Peter

I wonder if someone could clarify the exact conditions under which two divergent subsets of threads in a warp will coalesce? Does it happen when they end up with the same PC while blocked (e.g. for a global memory access), or does one need to issue a synchronise?
TIA, Eric
(I would put money on it not being optimal - picking up stalled threads on the first cycle that the PCs match would have to cost silicon.)

The manual says the threads are “serialized in hardware”. I think this is just the same method we are used to from previous GPU generations (output register write masks).

Peter

As the iteration count (10) is independent of the thread, this will never diverge. If you counted to the thread id instead of 10, it would.

You don’t have to resynchronize manually if threads do diverge; the hardware handles this by just blocking part of the threads until they coalesce again.
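A minimal sketch of the contrast (kernel and buffer names invented):

__global__ void loop_bounds(float *out)
{
    // Trip count is the same for every thread in the warp:
    // all 32 threads run exactly 10 iterations, no divergence.
    for (int i = 0; i < 10; ++i)
        out[threadIdx.x] += 1.0f;

    // Trip count depends on the thread index: threads exit at
    // different iterations, so the warp diverges until the
    // longest-running thread finishes.
    for (int i = 0; i < threadIdx.x; ++i)
        out[threadIdx.x] += 1.0f;
}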

Maybe I should have started another thread… my reading of section 6.1.1.2 gives lots of detail on the conditions under which serialisation will occur, but no detail on exactly when coalescing happens. The worst case, where all 32 threads are divergent (and NOT using branch predication), would require parallel compares on all 32 PCs every clock cycle if it were to be picked up automatically - and this might not be the most desirable approach anyway; better performance might be had by the compiler inserting something into the instruction stream to help drag subsets back together. It seems another paragraph is needed at the end of section 6.1.1.2.

I am trying to estimate the hit I am going to take in a recursive routine with half a dozen paths that are selected at random for each thread. Now I need to recode recursion into a local stack and then work out what sort of thread utilisation I am likely to get.
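Something like this is what I have in mind (a sketch only; the tree layout and all names are invented):

__device__ void traverse(const int *left, const int *right,
                         int root, int *visit_count)
{
    // Explicit local stack replacing recursion, since the hardware
    // has no call stack for recursive device functions.
    int stack[32];                  // assumed maximum tree depth
    int top = 0;
    stack[top++] = root;
    while (top > 0)                 // threads with deeper trees keep
    {                               // looping: a divergence point
        int node = stack[--top];
        (*visit_count)++;           // per-thread counter
        if (left[node] >= 0)  stack[top++] = left[node];
        if (right[node] >= 0) stack[top++] = right[node];
    }
}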

Thx, Eric

ed: There is another thread with lots more info - hiding behind an uninformative subject! [url=“http://forums.nvidia.com/index.php?showtopic=29863”]http://forums.nvidia.com/index.php?showtopic=29863[/url]

This leaves one more question - given that the compiler does output what sound like partial syncs at merge points, does it always get it exactly right? The compiler does not know at compile time how many threads are going to end up at any one merge point… (If we could use them explicitly too, that might be useful.)
Thx, Eric

As I said above, the diverging threads are serialized/rejoined in hardware. The compiler does not care about that at all.

Peter

Sorry Peter, I was referring to [url=“http://forums.nvidia.com/index.php?showtopic=29863”]post #5 in that thread[/url], where Mark Harris says “The threads in your new example will converge after the return from runb(). The compiler provides information in the code to help the hardware know when threads may diverge and converge.”

I am trying to understand how it works so that I don’t program myself into a corner without realising it. Is there a whitepaper on this feature? I feel like I have a model for everything else except this. Whichever way I look at it, either some threads never pick up, or code that does not need to be executed by any thread gets stepped over by everyone (especially in the presence of “goto”).
Thx, Eric

I think Mark was talking about the compiler in the driver, not the high-level C compiler, i.e. you cannot influence that anyway.

Peter

I wonder if a G80 engineer from Nvidia can answer my original question… I am trying to see how it can work other than this: a warp gradually fragments more and more at each thread-dependent branch, until it reaches a point where the compiler knows, using path analysis, that all threads should be back together (such as a return, or the last join before a return if there is only one), whence it inserts a warp-level sync. There could possibly be accidental rejoins before that, should PCs happen to match. If there is a whitepaper, or even a patent number, explaining how it works (if it is not the obvious), that would be most appreciated.
Eric

ed: refining my guess: at the entry to every control structure which may cause divergence (i.e. the decision variable is thread-dependent), the current running mask of threads is saved, and at the end of the structure, where all threads should be together, a sync with that same thread mask is performed. The compiler inserts the save and sync opcodes, as it can determine these points using path analysis.
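In toy form, the guess looks like this (everything here is invented to illustrate the idea, not real hardware):

/* A small stack of active masks: "save" runs at the entry to a
   potentially divergent construct, "sync" at its join point, and
   the sync completes only once every thread in the saved mask has
   arrived. */
typedef unsigned Mask;

static Mask mask_stack[8];
static int  sp = 0;

void op_save(Mask active)
{
    mask_stack[sp++] = active;      /* remember who should rejoin */
}

int op_sync(Mask arrived)
{
    if (arrived != mask_stack[sp - 1])
        return 0;                   /* still waiting for stragglers */
    sp--;                           /* everyone is back: release */
    return 1;
}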

further ed: I think this would mean that if you use “goto” you could shoot yourself in the foot… as the compiler would not be able to do its thing for all nested control structures between the source and destination levels of the jump. Who uses “goto”, I hear you say; well, personally I think break(n) and continue(n) are missing from the C language spec and it is OK to use goto for these. Using “return” from within nested structures is equivalent to a “goto” to the outermost “return”, which is much more common.
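For the record, the idiom I mean - goto standing in for the missing break(2) (invented names):

/* Search a 2D space and bail out of both loops on the first match. */
int find_first(int n, int m, int (*match)(int, int))
{
    int i, j;
    for (i = 0; i < n; ++i)
        for (j = 0; j < m; ++j)
            if (match(i, j))
                goto done;          /* the missing break(2) */
    return -1;                      /* no match found */
done:
    return i * m + j;
}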

The silence is deafening… the hardware does it, just does not do it for me

The reason seems to be that the G80 architecture is not ideal in this regard and no one is allowed to talk about it. It looks like the algorithm is a heuristic that gets it exactly right some of the time and degrades the rest of the time. The only info the hardware seems to have is a bra.uni (presumably “branch and unify”) instruction to the convergence point, placed in all converging paths of a conditional that could cause divergence. One path does not get one - the one that just runs straight on through.

Given this info, the only algorithm I could come up with is: all threads go into a wait state at the target address of the bra.uni. If the PCs match at any time in the future then we have convergence and these threads pick up. If the whole warp stalls (some threads reach a sync and everyone else ends up in a wait state) then, if there are any waiting threads, one lot is let go until everyone is up to the sync.

Can anyone think of another way that this could work?

Eric

Thought that would get a nibble…

Is it as simple as the semantics of bra.uni being to set that thread’s PC to the target and pick up with those threads that are waiting with the lowest PC? It seems like that could get it right, even in the presence of forward and backward gotos. If so, is there any overhead in finding those threads?
Eric

a final round: originally this question started off similar to the “Putting GPU to work” topic…
Just came back to this topic and my next design is: once you have hardware that can select the minimum PC to resume within a clock, then the semantics of ALL branch instructions would be to store the target and then pick up with the lowest PC. That gets it right all the time, without any compiler input except the correct ordering of multiple paths in memory. Everything being inlined makes this work nicely.
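In toy form (all names invented; a warp modelled as an array of PCs, not real hardware):

#define WARP_SIZE 32

/* After any branch each thread stores its target PC; the warp then
   resumes the subset of threads whose PC is the minimum over the
   warp. */
unsigned next_active_mask(const int pc[WARP_SIZE])
{
    int min_pc = pc[0];
    for (int t = 1; t < WARP_SIZE; ++t)
        if (pc[t] < min_pc)
            min_pc = pc[t];

    unsigned mask = 0;
    for (int t = 0; t < WARP_SIZE; ++t)
        if (pc[t] == min_pc)
            mask |= 1u << t;
    return mask;                    /* threads to issue next */
}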

Anyone have a better idea? Or a way to design an experiment to see if the G80 works this way?
Thanks, Eric