Predicate propagation

I have been reading the PTX manual and find the predicate portion of the assembler instructions very useful. I can see how easy it would be to use a predicate register to avoid a branch around 5 instructions, so every thread stays concurrent. But I am wondering whether the nvcc compiler can generate this. The standard syntax of if (a < b) blah; if (a >= b) bleh; generates single lines of code, but if “blah” and “bleh” are several instructions each, will the compiler be smart enough to predicate all the instructions within the block of code?


It depends on how many instructions there are. IIRC it should always predicate when a conditional block has <= 4 instructions, or even <= 7 if it decides that would be better (dark compiler magic). I’m not a compiler guru though.
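To make the question concrete, here is a minimal sketch of a conditional with short bodies, the kind the thresholds above suggest would be predicated. The helper name blah_or_bleh, the HD macro and the kernel apply_blah_or_bleh are all made up for illustration. Whether ptxas really predicates this depends on the toolkit version, so treat it as something to compile and inspect, not as a guarantee.

```cpp
// Compiles as plain C++ or, with nvcc, as CUDA. The HD macro is an
// illustrative assumption and expands to nothing on a host-only compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: two short, mutually exclusive conditional bodies,
// mirroring "if (a < b) blah; if (a >= b) bleh;" from the question.
HD inline float blah_or_bleh(float a, float b, float x)
{
    float r = x;
    if (a < b) {            // "blah": a multiply and an add
        r = r * 2.0f + a;
    }
    if (a >= b) {           // "bleh": a single subtract
        r = r - b;
    }
    return r;
}

#ifdef __CUDACC__
// Hypothetical kernel: one element per thread, no loops, so the only
// data-dependent control flow is inside blah_or_bleh.
__global__ void apply_blah_or_bleh(const float* a, const float* b, float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] = blah_or_bleh(a[i], b[i], x[i]);
    }
}
#endif
```

Both bodies are only one or two arithmetic instructions, so this is a reasonable candidate for predication; padding the bodies with more work and recompiling is one way to see where the compiler switches to a jump.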

If there are more instructions in a block, there’ll be a conditional jump. It’s actually better to jump when you have a large block of code. If you’re predicating, all threads will always execute all control paths (masking themselves off for non-taken branches). If you’re jumping, it may turn out that all threads in the warp take the same control path, so there’ll only be a single jump and the other path is never executed. It’s a balancing act between paying for a jump instruction (potentially several) and paying for several arithmetic/logical instructions that may be executed even when there’s no need.

For highly divergent code, when a warp will have to execute most control paths sequentially anyway, it’s better to predicate because you don’t get the jump instruction overhead. For rarely diverging code with big conditional blocks it can be better to jump, because what you pay in jump instruction overhead you might gain in not following non-taken control paths.
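The balancing act can be put into numbers with a toy model. This is only a sketch: the Costs struct, the function names and the idea of counting issued instructions per warp are assumptions made for illustration, not measured hardware behaviour.

```cpp
#include <cstdint>

// Toy cost model, counted in instructions issued per warp.
// All fields are illustrative assumptions, not real hardware costs.
struct Costs {
    int then_len;   // instructions in the "if" body
    int else_len;   // instructions in the "else" body
    int jump_cost;  // assumed total overhead of the jump instructions
};

// Predication: the warp issues both bodies every time; threads whose
// predicate is false simply discard the results.
inline int predicated_cost(const Costs& c)
{
    return c.then_len + c.else_len;
}

// Branching: the warp pays the jump overhead, then runs a body only if at
// least one of its 32 threads needs it. Bit i of taken_mask is the
// condition evaluated by thread i.
inline int branched_cost(const Costs& c, std::uint32_t taken_mask)
{
    const bool any_then = taken_mask != 0u;
    const bool any_else = taken_mask != 0xFFFFFFFFu;
    return c.jump_cost
         + (any_then ? c.then_len : 0)
         + (any_else ? c.else_len : 0);
}
```

With assumed costs of {3, 3, 4}, predication comes to 6 instructions against 10 for a branch in a warp where half the threads take each path (mask 0x0000FFFF). With {40, 40, 4} and a warp where every thread takes the branch (mask 0xFFFFFFFF), the jump version costs 44 against 80 for predication, which is the case where jumping pays off.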

Analyze the actual machine code using decuda. I asked this same question once, very long ago, and was told that the conversion of if-jumps to predicated instructions is done at the ptxas stage, so I would never observe it in the PTX output from nvcc. That was in reference to CUDA 1.0 or so; I have no idea if it still holds in the CUDA 3.0 era.

Thank you for the answers. Sounds like the compiler is pretty smart, so as long as I keep the number of statements in a predicate zone small I should be OK. I definitely want to try to write each thread to be sequential and have any decision blocks be in control sections. It seems to be very DSP-like, and that is something I’m used to.