[Solved] PTX ISA predicated execution and the warp divergence issue

Hello everybody.

I have a question for the experts in low-level CUDA programming.

As far as I know, a warp is, put simply, a fixed-size (architecture-dependent) group of threads that execute in parallel as long as each of them performs exactly the same instruction at the same time. Otherwise they are executed sequentially.

For brevity, let's consider the following simple pieces of code in CUDA PTX and Intel x86 assembly:

  • x86 code with branching:
    ...
    cmp eax, 123
    je label1
    mov ebx, eax
    
    label1:
    ...
    
  • x86 code without branching:
    ...
    cmp eax, 123
    cmovne ebx, eax
    ...
    
  • CUDA PTX code with branching:
    ...
    setp.eq.b32 p, r1, 123;
    @p bra label1;
    mov.b32 r2, r1;
    
    label1:
    ...
    
  • CUDA PTX code without branching:
    ...
    setp.eq.b32 p, r1, 123;
    @!p mov.b32 r2, r1;
    ...
    
    My question is: does the second line of the second CUDA PTX snippet cause the threads in a warp to diverge, or is the whole line (like cmovne on the x86 architecture) treated as one "atomic" instruction? If so, it would be an extremely powerful trick for speeding up code, since it avoids the penalty associated with warp divergence.
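
    For context, here is the kind of CUDA C source I have in mind; I would expect the compiler to turn the conditional assignment into either a branch or a predicated/select instruction (the kernel and variable names below are just placeholders):

      __global__ void copy_if_not_123(int *r2, const int *r1, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) {
              int v   = r1[i];
              int out = r2[i];
              if (v != 123)       // the data-dependent condition from the snippets above
                  out = v;        // corresponds to the guarded mov.b32 r2, r1
              r2[i] = out;
          }
      }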

    Thanks in advance.

    Your intuition is correct. The CUDA C Programming Guide states that:

    The GPU hardware supports predication of almost all instructions. This is not a concept unique to GPUs; you may have heard that the ARM CPU architecture also allows predication of most instructions.

    Conceptually, you can think of predication as causing an instruction to be executed but the writing of the result to be suppressed if the predicate the instruction is associated with evaluates to FALSE. The reality is more complicated, but the preceding should suffice for code generation discussions. The currently shipping GPUs have multiple predicate registers for this purpose, so different instructions can be predicated using different predicate registers.
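
    In other words, for a predicated instruction such as @!p mov.b32 r2, r1; each thread behaves roughly like this (a conceptual C-level sketch, not a description of the actual hardware):

      // Conceptual per-thread model of the predicated PTX line
      //     @!p mov.b32 r2, r1;
      // The instruction is issued for the whole warp; a thread whose guard
      // predicate is false simply suppresses the write-back of the result.
      __device__ int predicated_mov_model(int r1, int r2)
      {
          bool p = (r1 == 123);   // result of the preceding setp.eq.b32
          if (!p)                 // guard predicate !p
              r2 = r1;            // write-back occurs only where the guard is true
          return r2;
      }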

    When looking at these kinds of code generation issues, you would always want to examine the machine code (SASS) rather than PTX. PTX is simply an intermediate language, and it is compiled to machine code. This second stage of compilation applies many optimizations, one of which is if-conversion, that is, the conversion of conditionally executed code into either a sequence of predicated instructions or select-type instructions (CMOV on x86 is a select-type instruction). You can dump the machine code by running cuobjdump --dump-sass.
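
    For example (file and kernel names below are just placeholders, and the -arch flag depends on your target), a small test case can be compiled and inspected like this:

      // predtest.cu -- throwaway test kernel
      __global__ void predtest(int *out, const int *in)
      {
          int v = in[threadIdx.x];
          if (v != 123)                // candidate for if-conversion
              out[threadIdx.x] = v;
      }
      // Build and dump the machine code, e.g.:
      //     nvcc -arch=sm_20 -cubin predtest.cu
      //     cuobjdump --dump-sass predtest.cubin
      // In the SASS listing, look for guarded instructions (such as a store
      // predicated with @P0) or select instructions (SEL) instead of branches.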

    To first order, you would want to let the compiler figure out the best way to translate branchy code, as there are many trade-offs. For example, every instruction in a sequence of predicated instructions must be executed by all threads, while this may not be necessary when a branch is used, meaning the predicated code may actually result in an increase in execution time. On modern GPUs (>= sm_20), the compiler may therefore combine predication with uniform branches (BRA.U) for best overall results. Similarly, there are trade-offs between predicated instructions and select-type instructions. In looking at a recent optimization case on Kepler, I was wondering why the compiler used a mixture of predication and select-type instructions (instead of 100% predication), only to find that this did indeed provide the best performance.
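
    To illustrate the trade-off (the functions below are my own examples, not taken from any particular case): with a small guarded region, predication or a select is essentially free, while for a large guarded region a branch lets a warp whose threads all fail the condition skip the work entirely, which predication cannot do:

      // Small guarded region: a natural candidate for a select or a
      // single predicated instruction.
      __device__ int small_region(int v)
      {
          return (v != 123) ? v : 0;    // likely compiled to SEL or a predicated MOV
      }

      // Large guarded region: a branch may win. If no thread in a warp takes
      // the path, the whole warp skips it; with predication, every thread
      // would have to step through all of these instructions.
      __device__ float large_region(int v, float x)
      {
          float r = 0.0f;
          if (v != 123)
              r = sinf(x) * expf(x) + sqrtf(x);   // stands in for a long instruction sequence
          return r;
      }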

    If you write your code at PTX level, you will find that writing the code with predication does not always result in predication being used at the machine code level: sometimes a conversion to select-type instructions takes place. In some optimization cases where I disagreed with such a decision by the compiler, I have found that if I write the code with branches at PTX level I have a better chance of getting predicated instructions in the machine code. Note that this is an observation of a code generation artifact, not a guaranteed way to generate certain machine code.
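
    For what it is worth, here is a sketch of how one can write the predicated sequence from the question at PTX level from within CUDA C, via inline assembly (an illustration only; as noted above, the final SASS may still come out differently):

      // Sketch: emit the predicated PTX sequence directly via inline assembly.
      // Functionally equivalent to: if (r1 != 123) r2 = r1;
      __device__ int cond_move(int r1, int r2)
      {
          asm("{\n\t"
              ".reg .pred p;\n\t"
              "setp.eq.b32 p, %1, 123;\n\t"
              "@!p mov.b32 %0, %1;\n\t"
              "}"
              : "+r"(r2)
              : "r"(r1));
          return r2;
      }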

    One more question…

    I googled a lot, but so far I haven't found the convincing, definitive piece of information about the behavior of warps that I was looking for.

    As far as I know, the compiler uses several strategies to avoid branching, such as replacing divergent code with predicated or select-type instructions, provided that certain conditions are met. Let's suppose for a moment that somehow a non-uniform bra instruction sneaked into the final SASS code, and consider what happens when the FIRST non-uniform bra is executed. My question is: how exactly are the individual threads of the warp handled? What does "sequentiality" mean here, exactly?

    There are two possible options:

  • CUDA executes each thread one by one in a somewhat "naive" fashion, which results in a 32x slowdown regardless of how complicated the structure of each branch is (i.e. whether it contains sub-branches, those sub-branches contain further sub-branches, and so on);
  • The threads are split into two smaller groups that run in parallel like "sub-warps", and these groups are partitioned further in the same way depending on how complicated the structure of each piece is, so there is a 2x slowdown in the best case (neither branch has sub-branches) and a 32x slowdown in the worst case (e.g. a switch block with 32 cases).
    The description in the CUDA Programming Guide (section 4.1 in the version 5.0 manual) is this:

    "A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. "

    I read this as being equivalent to your second option.
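
    To make this concrete, here is roughly how it plays out for a simple data-dependent branch (the kernel below is just my own illustration):

      __global__ void divergent(int *data)
      {
          int tid = threadIdx.x;
          // If the threads of a warp disagree on this condition, the warp first
          // executes the "then" path with the threads whose condition is false
          // disabled, then the "else" path with the other threads disabled, and
          // the two groups reconverge after the if/else. For this region that is
          // roughly a 2x slowdown, not 32x.
          if (data[tid] % 2 == 0)
              data[tid] += 1;   // path A: some threads of the warp masked off
          else
              data[tid] -= 1;   // path B: the complementary threads masked off
      }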

    As seibert points out with reference to the Programming Guide, it works like your second option. In terms of code optimization, two practical questions are: (1) is branchy or branchless code more likely to result in higher performance, and (2) where should the convergence point be after a divergent branch?

    Fairly complex heuristics are used inside the compiler to resolve these questions. As with any set of heuristics, they usually provide good performance on average, but may be suboptimal for any one particular case. For example, in the absence of execution profile information fed back to the compiler, the compiler may need to make assumptions about the branch probability of particular branches.

    As with all code generation issues, I would encourage the filing of bugs with repro code attached for cases where there is a proven, significant, negative performance impact. No conclusions can be drawn from PTX code in this regard, as mentioned earlier.

    Many thanks to all for your help.