CUDA compiler needs too much help in order to use select instead of branch

So I have this code:

const float t0 = 0.6f - x0*x0 - y0*y0 - z0*z0;
const float n0 = t0<0 ? 0 : t0*t0*t0*t0 * dot_3d(grad0_x, grad0_y, grad0_z, x0, y0, z0);

I expect this to be translated into a select instruction, not a branch instruction. Yet the resulting PTX shows a bra for the assignment (not for dot_3d, which is inlined).

I rewrote it with an intermediate variable like so:

const float t0 = 0.6f - x0*x0 - y0*y0 - z0*z0;
const float p0 = t0*t0*t0*t0 * dot_3d(grad0_x, grad0_y, grad0_z, x0, y0, z0);
const float n0 = t0<0 ? 0 : p0;

… which is the same code, but the temp value is now assigned to a named variable p0.

The CUDA compiler now does the right thing: selp.f32 and no branching.

Shouldn’t it have been able to see this without the rewrite? Or am I using the compiler wrong?

$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

Does the code actually execute faster after you made this change? If so, how much faster?

Branchless code is not automatically faster than code with branches. The CUDA compiler has, for years now, had pretty good and robust heuristics for deciding when to prefer select instructions or predicated code over the use of a branch. However, being heuristics, they can’t and don’t do the “right thing” in all circumstances.

Thank you,

I understand your point, but I think there is still something to consider:

Regardless of whether the branch or the select is faster, shouldn’t the compiler at least be consistent? The code is essentially the same. Why would it generate different code based purely on how the C is written? The computations are the same in both versions, yet different instructions are generated.

But yeah, maybe I am too obsessed with branchless code. I did a lot on Cell SPU and x86 AVX, and not branching made all the difference there.

Presumably there is a difference in the respective internal representations of these two code snippets, which in turn results in different machine code being generated. You might want to check the intermediate PTX representation to verify that this is the case.

Thanks. It was in the PTX that I spotted the difference. I did not look at the actual binaries.

Code comparisons should always be performed at the machine code (SASS) level. The reason for this is that the compiler backend, ptxas, is an optimizing compiler in its own right (not an assembler, contrary to what the name might suggest). PTX is basically a hybrid of a virtual instruction set architecture and an internal compiler representation in SSA form.

Besides general optimizations, ptxas performs all the machine-specific optimizations based on the target architecture. Because the trade-offs differ by architecture, the final decision whether to use a branch or generate branchless code is made at that stage in the compilation.
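
For anyone who wants to try both suggestions from this thread (measure the two variants, and compare them at the SASS level rather than in PTX), here is a minimal sketch. Several things in it are assumptions: dot_3d is never shown in the thread, so the plain dot product below is a guess. The file name noise_select.cu, the kernel names, the input values and the sm_70 target are made up for illustration. Whether it reproduces the bra/selp difference at all will depend on the toolkit version and the target architecture.

```cuda
// noise_select.cu (hypothetical file name)
//
// Inspect the generated code (sm_70 is only an example target):
//   nvcc -arch=sm_70 -ptx   noise_select.cu -o noise_select.ptx    # PTX
//   nvcc -arch=sm_70 -cubin noise_select.cu -o noise_select.cubin
//   cuobjdump -sass noise_select.cubin                             # SASS
#include <cstdio>
#include <cuda_runtime.h>

// ASSUMPTION: the thread does not show dot_3d; a plain dot product is guessed.
__host__ __device__ inline float dot_3d(float gx, float gy, float gz,
                                        float x, float y, float z)
{
    return gx * x + gy * y + gz * z;
}

// Variant from the first snippet: the product is written inside the ternary.
__host__ __device__ inline float noise_inline(float gx, float gy, float gz,
                                              float x0, float y0, float z0)
{
    const float t0 = 0.6f - x0*x0 - y0*y0 - z0*z0;
    return t0 < 0 ? 0 : t0*t0*t0*t0 * dot_3d(gx, gy, gz, x0, y0, z0);
}

// Variant from the second snippet: the product goes through a named temporary.
__host__ __device__ inline float noise_named(float gx, float gy, float gz,
                                             float x0, float y0, float z0)
{
    const float t0 = 0.6f - x0*x0 - y0*y0 - z0*z0;
    const float p0 = t0*t0*t0*t0 * dot_3d(gx, gy, gz, x0, y0, z0);
    return t0 < 0 ? 0 : p0;
}

// Made-up inputs derived from the thread index, so that the sign of t0
// varies between neighbouring threads.
__global__ void kernel_inline(float* out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float x0 = (i % 64) * 0.03f - 1.0f;
    const float y0 = ((i / 64) % 64) * 0.03f - 1.0f;
    out[i] = noise_inline(0.5f, -0.25f, 1.0f, x0, y0, 0.2f);
}

__global__ void kernel_named(float* out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float x0 = (i % 64) * 0.03f - 1.0f;
    const float y0 = ((i / 64) % 64) * 0.03f - 1.0f;
    out[i] = noise_named(0.5f, -0.25f, 1.0f, x0, y0, 0.2f);
}

// Average time per launch in milliseconds, measured with CUDA events.
float time_ms(void (*kernel)(float*, int), float* d_out, int n)
{
    const int reps  = 100;
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    kernel<<<grid, block>>>(d_out, n);            // warm-up launch, not timed
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        kernel<<<grid, block>>>(d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main()
{
    const int n = 1 << 24;
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }
    std::printf("ternary with inline product: %.4f ms\n",
                time_ms(kernel_inline, d_out, n));
    std::printf("ternary with named temp:     %.4f ms\n",
                time_ms(kernel_named, d_out, n));
    const cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Build the timing binary with something like nvcc -O3 -arch=sm_70 noise_select.cu -o noise_select. Each launch writes 64 MB, so the kernels are probably limited by memory bandwidth and the two timings may come out nearly identical even if the SASS differs. Treat the numbers as a rough first check, not a benchmark.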
