I am not sure I correctly understand how thread divergence is handled on Volta and later architectures, and I would greatly appreciate some hand-holding.
I’m looking at the Volta white paper, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf, specifically Figs. 22 and 23.
Assume that I have an if-else statement, with the if branch executing instructions A1, A2, A3, etc., while the else branch executes instructions B1, B2, etc.
Assume that in a warp, 16 threads take the if branch and the other 16 take the else branch.
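To make the setup concrete, here is a minimal sketch of the kind of code I have in mind. The kernel name, the host helper, the arithmetic, and the single 32-thread block are all made up for illustration; only the 16/16 split and the A/B labels correspond to the description above. I also realize the compiler may turn branches this short into predicated instructions, so this is only meant to pin down the labels, not to show what the hardware actually does.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: lanes 0-15 take the "if" path (A1, A2, A3),
// lanes 16-31 take the "else" path (B1, B2). Assumes one warp of 32 threads.
__global__ void divergent_warp(const float* in, float* out)
{
    int lane = threadIdx.x;
    float x;
    if (lane < 16) {
        x = lane * 2.0f;   // A1: arithmetic
        x += in[lane];     // A2: global memory read
        x *= 0.5f;         // A3: arithmetic
    } else {
        x = lane + 1.0f;   // B1: arithmetic
        x -= 3.0f;         // B2: arithmetic
    }
    out[lane] = x;         // both paths store their result here
}

// Hypothetical host helper: launches one warp and returns its 32 results.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_divergent_warp()
{
    const int n = 32;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n, 1.0f), h_out(n, 0.0f);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    divergent_warp<<<1, n>>>(d_in, d_out);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Built with nvcc and called from a main(), run_divergent_warp() just returns the per-lane results; the question below is about the order in which the A and B instructions are issued, which this code does not reveal by itself.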
Figure 23 seems to imply that A1 is executed, after which B1 is executed, after which A2 is executed, then B2, etc. That is, instructions get interleaved.
Is this really the case? For instance, if A2 is a global memory read, will the 16 threads about to execute B2 wait until A2 has been executed? Also, why A1 then B1, and then A2 then B2, etc.? Why not B1 then A1, and then B2 then A2, etc.?
I would expect the scheduler to issue whichever instruction has all of its operands available, regardless of whether it is an “A” family or a “B” family instruction. This would also scale nicely to, for instance, four-way divergence, where there are A-, B-, C-, and D-type instructions.
I apologize if this has been answered before. I don’t quite know how to search for whether a specific question like this has been answered (other than reading the manual, which does a good but not perfect job of explaining things).
Thanks for your time.