[Solved] PTX ISA predicated execution and the warp divergence issue

Hello everybody.

I have a question for the experts in low-level CUDA programming.

As far as I know, a warp is, put simply, a fixed-size (architecture-dependent) group of threads that execute in parallel as long as each of them performs exactly the same instruction at the same time. Otherwise they are executed sequentially.

For brevity, let's consider the following simple pieces of code in CUDA PTX and Intel x86 assembly:

  • x86 code with branching:
    ...
    cmp eax, 123
    je label1
    mov ebx, eax
    
    label1:
    ...
    
  • x86 code without branching:
    ...
    cmp eax, 123
    cmovne ebx, eax
    ...
    
  • CUDA PTX code with branching:
    ...
    setp.eq.b32 p, r1, 123;
    @p bra label1;
    mov.b32 r2, r1;
    
    label1:
    ...
    
  • CUDA PTX code without branching:
    ...
    setp.eq.b32 p, r1, 123;
    @!p mov.b32 r2, r1;
    ...
    
    My question is: does the second line of the second CUDA PTX snippet cause the threads in a warp to diverge, or is the whole line (like cmovne on the x86 architecture) treated as one "atomic" instruction? If so, it would be an extremely powerful trick for speeding up code, since it avoids the penalty associated with warp divergence.
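
    For context, here is the kind of CUDA C source I have in mind; I would expect the compiler to turn the conditional assignment into either a branch or a predicated/select instruction (the kernel and variable names below are just placeholders):

      __global__ void copy_if_not_123(int *r2, const int *r1, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) {
              int v   = r1[i];
              int out = r2[i];
              if (v != 123)       // the data-dependent condition from the snippets above
                  out = v;        // corresponds to the guarded mov.b32 r2, r1
              r2[i] = out;
          }
      }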

    Thanks in advance.

    Your intuition is correct. The CUDA C Programming Guide states that:

    The GPU hardware supports predication of almost all instructions. This is not a concept unique to GPUs; you may have heard that the ARM CPU architecture also allows predication of most instructions.

    Conceptually, you can think of predication as causing an instruction to be executed but the writing of the result to be suppressed if the predicate the instruction is associated with evaluates to FALSE. The reality is more complicated, but the preceding should suffice for code generation discussions. The currently shipping GPUs have multiple predicate registers for this purpose, so different instructions can be predicated using different predicate registers.
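
    In other words, for a predicated instruction such as @!p mov.b32 r2, r1; each thread behaves roughly like this (a conceptual C-level sketch, not a description of the actual hardware):

      // Conceptual per-thread model of the predicated PTX line
      //     @!p mov.b32 r2, r1;
      // The instruction is issued for the whole warp; a thread whose guard
      // predicate is false simply suppresses the write-back of the result.
      __device__ int predicated_mov_model(int r1, int r2)
      {
          bool p = (r1 == 123);   // result of the preceding setp.eq.b32
          if (!p)                 // guard predicate !p
              r2 = r1;            // write-back occurs only where the guard is true
          return r2;
      }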

    When looking at these kinds of code generation issues, you would always want to examine the machine code (SASS) rather than PTX. PTX is simply an intermediate language, and it is compiled to machine code. This second stage of compilation applies many optimizations, one of which is if-conversion, that is, the conversion of conditionally executed code into either a sequence of predicated instructions or select-type instructions (CMOV on x86 is a select-type instruction). You can dump the machine code by running cuobjdump --dump-sass.
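
    For example (file and kernel names below are just placeholders, and the -arch flag depends on your target), a small test case can be compiled and inspected like this:

      // predtest.cu -- throwaway test kernel
      __global__ void predtest(int *out, const int *in)
      {
          int v = in[threadIdx.x];
          if (v != 123)                // candidate for if-conversion
              out[threadIdx.x] = v;
      }
      // Build and dump the machine code, e.g.:
      //     nvcc -arch=sm_20 -cubin predtest.cu
      //     cuobjdump --dump-sass predtest.cubin
      // In the SASS listing, look for guarded instructions (such as a store
      // predicated with @P0) or select instructions (SEL) instead of branches.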

    To first order, you would want to let the compiler figure out the best way to translate branchy code, as there are many trade-offs. For example, every instruction in a sequence of predicated instructions must be executed by all threads, while this may not be necessary when a branch is used, meaning the predicated code may actually result in an increase in execution time. On modern GPUs (>= sm_20), the compiler may therefore combine predication with uniform branches (BRA.U) for best overall results. Similarly, there are trade-offs between predicated instructions and select-type instructions. In looking at a recent optimization case on Kepler, I was wondering why the compiler used a mixture of predication and select-type instructions (instead of 100% predication), only to find that this did indeed provide the best performance.
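
    To illustrate the trade-off (the functions below are my own examples, not taken from any particular case): with a small guarded region, predication or a select is essentially free, while for a large guarded region a branch lets a warp whose threads all fail the condition skip the work entirely, which predication cannot do:

      // Small guarded region: a natural candidate for a select or a
      // single predicated instruction.
      __device__ int small_region(int v)
      {
          return (v != 123) ? v : 0;    // likely compiled to SEL or a predicated MOV
      }

      // Large guarded region: a branch may win. If no thread in a warp takes
      // the path, the whole warp skips it; with predication, every thread
      // would have to step through all of these instructions.
      __device__ float large_region(int v, float x)
      {
          float r = 0.0f;
          if (v != 123)
              r = sinf(x) * expf(x) + sqrtf(x);   // stands in for a long instruction sequence
          return r;
      }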

    If you write your code at PTX level, you will find that writing the code with predication does not always result in predication being used at the machine code level: sometimes a conversion to select-type instructions takes place. In some optimization cases where I disagreed with such a decision by the compiler, I have found that if I write the code with branches at PTX level I have a better chance of getting predicated instructions in the machine code. Note that this is an observation of a code generation artifact, not a guaranteed way to generate certain machine code.
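
    For what it is worth, here is a sketch of how one can write the predicated sequence from the question at PTX level from within CUDA C, via inline assembly (an illustration only; as noted above, the final SASS may still come out differently):

      // Sketch: emit the predicated PTX sequence directly via inline assembly.
      // Functionally equivalent to: if (r1 != 123) r2 = r1;
      __device__ int cond_move(int r1, int r2)
      {
          asm("{\n\t"
              ".reg .pred p;\n\t"
              "setp.eq.b32 p, %1, 123;\n\t"
              "@!p mov.b32 %0, %1;\n\t"
              "}"
              : "+r"(r2)
              : "r"(r1));
          return r2;
      }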

    One more question…

    I googled a lot, but so far I haven't found the convincing, definitive piece of information about the behavior of warps that I was looking for.

    As far as I know, the compiler uses several strategies to avoid branching, such as replacing divergent code with predicated or select-type instructions, provided that certain conditions are met. Let's suppose for a moment that somehow a non-uniform bra instruction sneaked into the final SASS code, and consider what happens when the FIRST non-uniform bra is executed. My question is: how exactly are the individual threads of the warp handled? What does "sequentiality" mean here, exactly?

    There are two possible options:

  • CUDA executes each thread one by one in a somewhat "naive" fashion, which results in a 32x slowdown regardless of how complicated the structure of each branch is (i.e. whether it contains sub-branches, those sub-branches contain further sub-branches, and so on);
  • The threads are split into two smaller groups that run in parallel like "sub-warps", and these groups are partitioned further in the same way depending on how complicated the structure of each piece is, so there is a 2x slowdown in the best case (neither branch has sub-branches) and a 32x slowdown in the worst case (e.g. a switch block with 32 cases).
    The description in the CUDA Programming Guide (section 4.1 in the version 5.0 manual) is this:

    "A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. "

    I read this as being equivalent to your second option.
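
    To make this concrete, here is roughly how it plays out for a simple data-dependent branch (the kernel below is just my own illustration):

      __global__ void divergent(int *data)
      {
          int tid = threadIdx.x;
          // If the threads of a warp disagree on this condition, the warp first
          // executes the "then" path with the threads whose condition is false
          // disabled, then the "else" path with the other threads disabled, and
          // the two groups reconverge after the if/else. For this region that is
          // roughly a 2x slowdown, not 32x.
          if (data[tid] % 2 == 0)
              data[tid] += 1;   // path A: some threads of the warp masked off
          else
              data[tid] -= 1;   // path B: the complementary threads masked off
      }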

    As seibert points out with reference to the Programming Guide, it works like your second option. In terms of code optimization, two practical questions are: (1) is branchy or branchless code more likely to result in higher performance, and (2) where should the convergence point be after a divergent branch?

    Fairly complex heuristics are used inside the compiler to resolve these questions. As with any set of heuristics, they usually provide good performance on average, but may be suboptimal for any one particular case. For example, in the absence of execution profile information fed back to the compiler, the compiler may need to make assumptions about the branch probability of particular branches.

    As with all code generation issues, I would encourage the filing of bugs with repro code attached for cases where there is a proven, significant, negative performance impact. No conclusions can be drawn from PTX code in this regard, as mentioned earlier.

    Many thanks to all for your help.