handling thread divergence, Volta and Turing

Martini · January 19, 2020, 9:15pm

I am not sure I am interpreting the handling of thread divergence on Volta and later correctly, and I would greatly appreciate some hand-holding.

I’m looking at the Volta white paper, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf, specifically Figs. 22 and 23.

Assume that I have an if-else statement, with the if branch executing the instructions A1, A2, A3, etc., while the else branch sees instructions B1, B2, etc.

Assume that in a warp, 16 threads take the if, and the other 16 take the else branch.
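To make the scenario concrete, here is a minimal sketch of such a kernel. The kernel name `divergent`, the arithmetic standing in for A1..A3 and B1..B2, and the single-warp launch are all illustrative assumptions, not taken from the white paper. The result is the same whatever order the scheduler issues the two paths in, because neither path reads anything the other writes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: lanes 0-15 take the "if" path (A1, A2, A3),
// lanes 16-31 take the "else" path (B1, B2).
__global__ void divergent(const float* in, float* out)
{
    int lane = threadIdx.x % warpSize;
    float x = in[threadIdx.x];

    if (lane < 16) {
        x = x * 2.0f;   // A1
        x = x + 1.0f;   // A2
        x = x * x;      // A3
    } else {
        x = x - 3.0f;   // B1
        x = x * 0.5f;   // B2
    }

    // The two paths are independent, so the relative issue order of
    // A and B instructions does not affect what is stored here.
    out[threadIdx.x] = x;
}

int main()
{
    const int n = 32;  // one warp
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    divergent<<<1, n>>>(d_in, d_out);

    // This copy waits for the kernel to finish before reading results.
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("lane 0 (if path): %f, lane 16 (else path): %f\n", h_out[0], h_out[16]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```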

Figure 23 seems to imply that A1 is executed, after which B1 is executed, after which A2 is executed, then B2, etc. That is, instructions get interleaved.

Is this really the case? For instance, if A2 hits a global memory read, will the 16 threads about to execute B2 wait there until A2 has executed? Also, why A1 then B1, and then A2 then B2? Why not B1 then A1, and then B2 then A2?

I would expect the scheduler to issue whichever instruction has all of its operands available, regardless of whether it is an “A” family or a “B” family instruction. This would also scale nicely to, for instance, four-way divergence, where there are A-, B-, C-, and D-type instructions.

I apologize if this was answered before; I don’t quite know how to search for whether a specific question like this has been answered (other than reading the manual, which does a good but not perfect job of explaining things).

Thanks for your time.

Robert_Crovella · January 19, 2020, 9:21pm

There isn’t any low-level description or specification of instruction issue order when the SM schedulers have multiple options (which they would, on Volta and beyond, in the presence of conditional code). If your code depends on a particular scheduling order, and you have taken no steps to make that happen explicitly, your code is broken.

By extension of the above statements, then, there is no statement that instructions from separate conditional execution paths get interleaved.
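As one illustration of "taking steps to make that happen explicitly", the sketch below uses `__syncwarp()` (available since CUDA 9) so that lanes on both sides of a branch have finished their shared-memory writes before any lane reads another lane's value. The kernel name `exchange`, the buffer layout, and the single-warp launch are assumptions made for this example only. Without the `__syncwarp()`, the read would depend on an unspecified issue order.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, launched with exactly one warp of 32 threads.
__global__ void exchange(int* out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x;

    if (lane < 16) {
        buf[lane] = lane * 10;   // "A" path write
    } else {
        buf[lane] = -1;          // "B" path write
    }

    // Explicit warp-level barrier: all 32 lanes arrive here, and the
    // shared-memory writes above are ordered before the reads below.
    __syncwarp();

    // Each lane reads a value written by a lane from the other path.
    out[lane] = buf[(lane + 16) % 32];
}

int main()
{
    const int n = 32;
    int h_out[n];
    int* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));

    exchange<<<1, n>>>(d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Check against what the writes above imply for each lane.
    bool ok = true;
    for (int lane = 0; lane < n; ++lane) {
        int expected = (lane < 16) ? -1 : (lane - 16) * 10;
        if (h_out[lane] != expected) ok = false;
    }
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(d_out);
    return 0;
}
```

The barrier sits after the `if`/`else` rather than inside it, so every lane named in the default full mask actually reaches it.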

Martini · January 19, 2020, 10:17pm

Thanks, Robert - that clarifies it.

The documentation might benefit from an explanation like yours, perhaps added right after this blurb from the doc: “Statements from the if and else branches in the program can now be interleaved in time as shown in Figure 22”. Without your explanation, my mental image of what goes on was different and inaccurate.
