I expected this to be translated into a select instruction, not a branch instruction. Yet the resulting PTX shows a bra for the assignment (not for dot_3d, which is inlined).
I rewrote it with an intermediate variable like so:
Does the code actually execute faster after you made this change? If so, how much faster?
Branchless code is not automatically faster than code with branches. The CUDA compiler has, for years now, had pretty good and robust heuristics for deciding when to prefer select instructions or predicated code over a branch. However, being heuristics, they can’t and don’t do the “right thing” in all circumstances.
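To answer the "how much faster" question with numbers rather than intuition, a minimal timing sketch with CUDA events could look like the following. The kernel `clamp_kernel` and the helper `time_kernel_ms` are hypothetical stand-ins, not code from this thread; swap in the two variants you actually want to compare.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace it with the variant under test.
__global__ void clamp_kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];
        out[i] = (v < 0.0f) ? 0.0f : v;
    }
}

// Hypothetical helper: returns the average time per launch in milliseconds,
// or a negative value if any CUDA call failed.
float time_kernel_ms(int n, int reps)
{
    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(in);
        return -1.0f;
    }
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time initialization cost is not measured.
    clamp_kernel<<<blocks, threads>>>(in, out, n);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        clamp_kernel<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    const bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ok ? ms / reps : -1.0f;
}
```

Calling something like `time_kernel_ms(1 << 20, 100)` once per variant gives a first-order comparison. For small kernels the difference may well be within measurement noise, so it is worth repeating the measurement several times.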
I understand your point, but I think there is still something to consider:
Regardless of whether the branch or the select is faster, shouldn't the compiler at least be consistent? The code is essentially the same. Why would it generate different code based purely on the notation in C? The computations in C are the same, yet different instructions are generated.
But yeah, maybe I am too obsessed with branchless code. I did a lot of work on Cell SPU and x86 AVX, and not branching made all the difference there.
Presumably there is a difference in the respective internal representations of these two code snippets, which in turn results in different machine code being generated. You might want to check the intermediate PTX representation to verify that this is the case.
Code comparisons should always be performed at the machine code (SASS) level. The reason for this is that the compiler backend, ptxas, is an optimizing compiler in its own right (not an assembler, contrary to what the name might suggest). PTX is basically a hybrid of a virtual instruction set architecture and an internal compiler representation in SSA form.
Besides general optimizations, ptxas performs all the machine-specific optimizations based on the target architecture. Because the trade-offs differ by architecture, the final decision whether to use a branch or generate branchless code is made at that stage of the compilation.
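As a rough illustration of how to compare two notations at the SASS level, here is a small sketch. The original snippets from this thread are not reproduced here, so the kernels `variant_if` and `variant_select`, the helper `dot3` (standing in for dot_3d), and the host check `variants_agree` are all assumptions made for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper standing in for the thread's dot_3d.
__device__ __forceinline__ float dot3(float3 a, float3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Variant A: conditional assignment written as an if statement.
__global__ void variant_if(const float3* a, const float3* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float best = out[i];
    if (dot3(a[i], b[i]) > best)
        best = dot3(a[i], b[i]);
    out[i] = best;
}

// Variant B: same computation with an intermediate variable and a ternary.
__global__ void variant_select(const float3* a, const float3* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float d = dot3(a[i], b[i]);
    float best = out[i];
    out[i] = (d > best) ? d : best;
}

// Hypothetical host check: runs both variants on the same input and
// reports whether they produced identical results.
bool variants_agree(int n)
{
    std::vector<float3> ha(n), hb(n);
    std::vector<float> init(n, 0.0f), ra(n), rb(n);
    for (int i = 0; i < n; ++i) {
        ha[i] = make_float3(float(i % 7) - 3.0f, 1.0f, 0.5f);
        hb[i] = make_float3(1.0f, float(i % 5) - 2.0f, 2.0f);
    }

    float3 *da = nullptr, *db = nullptr;
    float *oa = nullptr, *ob = nullptr;
    cudaMalloc(&da, n * sizeof(float3));
    cudaMalloc(&db, n * sizeof(float3));
    cudaMalloc(&oa, n * sizeof(float));
    cudaMalloc(&ob, n * sizeof(float));
    cudaMemcpy(da, ha.data(), n * sizeof(float3), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), n * sizeof(float3), cudaMemcpyHostToDevice);
    cudaMemcpy(oa, init.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(ob, init.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    variant_if<<<blocks, threads>>>(da, db, oa, n);
    variant_select<<<blocks, threads>>>(da, db, ob, n);

    // These blocking copies also wait for both kernels to finish.
    cudaMemcpy(ra.data(), oa, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(rb.data(), ob, n * sizeof(float), cudaMemcpyDeviceToHost);
    const bool ok = (cudaGetLastError() == cudaSuccess) && (ra == rb);

    cudaFree(da); cudaFree(db); cudaFree(oa); cudaFree(ob);
    return ok;
}
```

Assuming the file is saved as `variants.cu` and the target is sm_80 (both are placeholders), `nvcc -arch=sm_80 -cubin -o variants.cubin variants.cu` followed by `cuobjdump -sass variants.cubin` dumps the machine code for both kernels, while `nvcc -arch=sm_80 -ptx variants.cu` shows the intermediate PTX. Searching the SASS for `BRA` versus select or min/max style instructions shows what ptxas actually decided. The outcome may differ between architectures and toolkit versions, and the two variants may well end up with identical SASS even if their PTX differs.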