The GPU hardware supports predication of almost all instructions. This is not a concept unique to GPUs; you may have heard that the ARM CPU architecture also allows predication of most instructions.
Conceptually, you can think of predication as causing an instruction to be executed, but with the write of its result suppressed if the predicate associated with the instruction evaluates to FALSE. The reality is more complicated, but this model should suffice for code generation discussions. The currently shipping GPUs have multiple predicate registers for this purpose, so different instructions can be predicated using different predicate registers.
When looking at this kind of code generation issue, you always want to examine the machine code (SASS) rather than PTX. PTX is simply an intermediate language, which is in turn compiled to machine code. This second stage of compilation applies many optimizations, one of which is if-conversion, that is, the conversion of conditionally executed code into either a sequence of predicated instructions or select-type instructions (CMOV on x86 is a select-type instruction). You can dump the machine code by running cuobjdump --dump-sass.
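As a minimal sketch of that workflow, here is a small kernel with a short conditional body, which is the kind of code if-conversion typically targets. The file name, function names, and the sm_80 target are my own assumptions for illustration; substitute the architecture of your GPU. What the compiler actually emits depends on the architecture and toolkit version, so treat this as something to inspect, not as a guaranteed result.

```cuda
// predicate_demo.cu (hypothetical file name, used only for this sketch)
//
// Build a cubin and dump its machine code. sm_80 is only an example target:
//   nvcc -arch=sm_80 -cubin -o predicate_demo.cubin predicate_demo.cu
//   cuobjdump --dump-sass predicate_demo.cubin
// To generate the intermediate PTX for comparison:
//   nvcc -arch=sm_80 -ptx -o predicate_demo.ptx predicate_demo.cu

// Short conditional body: a likely candidate for if-conversion.
// Marked __host__ __device__ so the logic can also be checked on the CPU.
__host__ __device__ inline float add_if_positive(float acc, float x)
{
    if (x > 0.0f) {
        acc += x;   // only taken when x is positive
    }
    return acc;
}

// One element per thread; the bounds check is a second conditional
// whose translation can also be examined in the SASS.
__global__ void accumulate_positive(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = add_if_positive(out[i], in[i]);
    }
}
```

In the SASS dump, predicated instructions show up with a guard such as @P0 in front of the opcode, so you can see at a glance whether the conditional became predicated instructions, a select-type instruction, or an actual branch.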
To first order, you want to let the compiler figure out the best way to translate branchy code, as there are many trade-offs. For example, every instruction in a sequence of predicated instructions must be executed by all threads, while this may not be necessary when a branch is used, meaning the predicated code may actually result in an increase in execution time. On modern GPUs (>= sm_20), the compiler may therefore combine predication with uniform branches (BRA.U) for best overall results. Similarly, there are trade-offs between predicated instructions and select-type instructions. In looking at a recent optimization case on Kepler, I was wondering why the compiler used a mixture of predication and select-type instructions (instead of 100% predication), only to find that this did indeed provide the best performance.
If you write your code at PTX level, you will find that writing the code with predication does not always result in predication being used at the machine code level: sometimes a conversion to select-type instructions takes place. In some optimization cases where I disagreed with such a decision by the compiler, I have found that if I write the code with branches at PTX level I have a better chance of getting predicated instructions in the machine code. Note that this is an observation of a code generation artifact, not a guaranteed way to generate certain machine code.
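If you want to experiment with this yourself, the following sketch puts a hand-predicated form next to a branchy form so the two can be compared in the SASS dump. The function and kernel names are hypothetical, and the branchy variant is written in CUDA C++ instead of PTX to keep the example short, which is a simplification of the approach described above. Neither variant is guaranteed to produce any particular machine code.

```cuda
// Predicated form written in inline PTX.
// Conceptually: acc = (x > 0) ? acc + x : acc
__device__ int add_if_positive_pred(int acc, int x)
{
    asm("{\n\t"
        " .reg .pred p;\n\t"
        " setp.gt.s32 p, %1, 0;\n\t"    // p = (x > 0)
        " @p add.s32 %0, %0, %1;\n\t"   // acc += x, guarded by p
        "}"
        : "+r"(acc) : "r"(x));
    return acc;
}

// Branchy form; the compiler decides whether this becomes a branch,
// predicated instructions, or a select-type instruction.
__host__ __device__ inline int add_if_positive_branch(int acc, int x)
{
    if (x > 0) {
        acc += x;
    }
    return acc;
}

// Uses both variants so each one appears in the generated machine code.
__global__ void compare_forms(const int *in, int *out_pred,
                              int *out_branch, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_pred[i]   = add_if_positive_pred(out_pred[i], in[i]);
        out_branch[i] = add_if_positive_branch(out_branch[i], in[i]);
    }
}
```

Compile this to a cubin and run cuobjdump --dump-sass on it, then check which of the two forms, if either, ends up as predicated instructions on your particular architecture and toolkit version.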