I wonder whether a G80 engineer from Nvidia could answer my original question… I can't see how it could work other than like this: a warp fragments further at each thread-dependent branch until it reaches a point where the compiler knows, from path analysis, that all threads should be back together (such as a return, or the last join before a return if there is only one), and there it inserts a warp-level sync. Threads might also rejoin by accident earlier, should their PCs happen to match. If it does not work in the obvious way, a whitepaper or even a patent number explaining the mechanism would be most appreciated.
Edit: refining my guess. At the entry to every control structure that may cause divergence (i.e. the decision variable is thread dependent), the mask of currently running threads is saved; at the end of the structure, where all of those threads should be together again, a sync with that same mask is performed. The compiler can insert the save and sync opcodes because path analysis tells it where these points are.
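To make the guess above concrete, here is a minimal host-side sketch in plain C that simulates one warp running a thread-dependent if/else. Everything in it is hypothetical: `warp_t`, `warp_save`, `warp_sync` and `run_if_else` are names invented for illustration, and the save/sync "opcodes" model my guess, not anything Nvidia has documented about the hardware.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32
#define MAX_DEPTH 16

/* Hypothetical model of one warp's divergence state. */
typedef struct {
    uint32_t active;            /* one bit per thread currently executing */
    uint32_t saved[MAX_DEPTH];  /* masks saved at divergence points */
    int depth;                  /* number of masks currently saved */
} warp_t;

/* Guessed "save" opcode: remember which threads entered the structure. */
static void warp_save(warp_t *w) {
    assert(w->depth < MAX_DEPTH);
    w->saved[w->depth++] = w->active;
}

/* Guessed "sync" opcode: rejoin every thread that entered the structure. */
static void warp_sync(warp_t *w) {
    assert(w->depth > 0);
    w->active = w->saved[--w->depth];
}

/* Simulates, for a whole warp:
 *     if (x[tid] & 1) x[tid] *= 3; else x[tid] += 1;
 * Returns the active mask after the rejoin. */
uint32_t run_if_else(warp_t *w, int x[WARP_SIZE]) {
    uint32_t entry = w->active;
    uint32_t cond = 0;

    /* Evaluate the thread-dependent condition for each active thread. */
    for (int t = 0; t < WARP_SIZE; t++)
        if (((entry >> t) & 1u) && (x[t] & 1))
            cond |= 1u << t;

    warp_save(w);               /* inserted at entry to the structure */

    w->active = entry & cond;   /* "then" path: threads with an odd value */
    for (int t = 0; t < WARP_SIZE; t++)
        if ((w->active >> t) & 1u) x[t] *= 3;

    w->active = entry & ~cond;  /* "else" path: the remaining threads */
    for (int t = 0; t < WARP_SIZE; t++)
        if ((w->active >> t) & 1u) x[t] += 1;

    warp_sync(w);               /* inserted at the join point */
    return w->active;
}
```

Starting from a zeroed save stack, the mask returned by `run_if_else` equals the mask the warp entered with, whichever way the individual threads branched, and threads that were inactive on entry are never touched. Nesting works the same way: each inner structure pushes and pops its own mask.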
Further edit: I think this would mean that you could shoot yourself in the foot with “goto”, as the compiler would not be able to do its thing for the nested control structures between the source and destination levels of the jump. “Who uses goto?” I hear you say. Well, personally I think break(n) and continue(n) are missing from the C language spec, and it is OK to use “goto” for these. A “return” from within nested control structures is equivalent to a “goto” to the outermost “return”, and that is much more common.
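For reference, this is the kind of “goto” I mean, sketched in plain C. The function `find_first` and the label `done` are made up for the example. The jump leaves both loops at once, so under my guessed scheme it would skip whatever sync point sits at the end of the inner loop; whether the real compiler copes with that is exactly what I don't know.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the flat index of the first element equal to key in a
 * rows x cols row-major matrix, or -1 if the key is absent. */
int find_first(const int *m, int rows, int cols, int key) {
    int found = -1;
    assert(m != NULL);

    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            if (m[r * cols + c] == key) {
                found = r * cols + c;
                goto done;  /* stands in for the missing break(2) */
            }
        }
    }

done:
    return found;
}
```

Writing `return r * cols + c;` inside the inner loop instead would be the equivalent early “return” from nested structures.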