So I have this code:
const float t0 = 0.6f - x0*x0 - y0*y0 - z0*z0;
const float n0 = t0<0 ? 0 : t0*t0*t0*t0 * dot_3d(grad0_x, grad0_y, grad0_z, x0, y0, z0);
I expected this to be translated into a select instruction, not a branch. Yet the resulting PTX shows a bra for the assignment (not for the dot_3d call, which is inlined).
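A minimal, self-contained reproduction of this first form might look like the sketch below. The body of dot_3d, the names contribution_ternary and contribution_kernel, and the kernel itself are hypothetical, since the post does not show them; dot_3d is assumed to be an ordinary 3-component dot product. Compiling it with `nvcc -ptx` and looking at the code around the `t0 < 0` comparison should show whether a bra or a selp was emitted (the kernel's bounds check contributes a branch of its own). PTX is not the final machine code, since ptxas optimizes it further, so the SASS from `cuobjdump -sass` on the compiled binary may differ.

```cpp
// Hypothetical reproduction of the first form from the post.
// When not compiling with nvcc, define the CUDA qualifiers away so the
// helpers also build as plain host C++.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Assumed definition: the post only says that dot_3d gets inlined.
__host__ __device__ inline float dot_3d(float gx, float gy, float gz,
                                        float x, float y, float z)
{
    return gx * x + gy * y + gz * z;
}

// Original form: the multiply chain sits inside the conditional operator.
__host__ __device__ inline float contribution_ternary(float gx, float gy, float gz,
                                                      float x0, float y0, float z0)
{
    const float t0 = 0.6f - x0*x0 - y0*y0 - z0*z0;
    const float n0 = t0 < 0 ? 0 : t0*t0*t0*t0 * dot_3d(gx, gy, gz, x0, y0, z0);
    return n0;
}

#ifdef __CUDACC__
// Minimal kernel so that `nvcc -ptx` emits code for the function above.
__global__ void contribution_kernel(const float* x, const float* y, const float* z,
                                    float gx, float gy, float gz,
                                    float* out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // bounds check: produces its own branch in PTX
        out[i] = contribution_ternary(gx, gy, gz, x[i], y[i], z[i]);
}
#endif
```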
I rewrote it with an intermediate variable like so:
const float t0 = 0.6f - x0*x0 - y0*y0 - z0*z0;
const float p0 = t0*t0*t0*t0 * dot_3d(grad0_x, grad0_y, grad0_z, x0, y0, z0);
const float n0 = t0<0 ? 0 : p0;
… which is semantically the same code, except that the product is now assigned to a named temporary, p0.
The CUDA compiler now does the right thing: selp.f32 and no branching.
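For finite inputs both forms return the same value. One plausible reading of the different code generation, not a statement about nvcc internals: in C++ the conditional operator evaluates only the operand it selects, so in the first form the multiply chain is conditional work that the compiler would have to decide to speculate, whereas in the second form p0 is computed unconditionally in the source and choosing between 0 and p0 maps directly onto selp. The host-side sketch below checks that the two forms agree over a grid of points; as before, the dot_3d body and the names n0_ternary, n0_with_temp and forms_agree are hypothetical.

```cpp
#include <cmath>

// Allow the same helpers to build as plain host C++ outside nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Assumed definition of dot_3d (not shown in the post).
__host__ __device__ inline float dot_3d(float gx, float gy, float gz,
                                        float x, float y, float z)
{
    return gx * x + gy * y + gz * z;
}

// First form: product written inside the conditional operator.
__host__ __device__ inline float n0_ternary(float gx, float gy, float gz,
                                            float x0, float y0, float z0)
{
    const float t0 = 0.6f - x0*x0 - y0*y0 - z0*z0;
    return t0 < 0 ? 0 : t0*t0*t0*t0 * dot_3d(gx, gy, gz, x0, y0, z0);
}

// Second form: product computed unconditionally into p0, then selected.
__host__ __device__ inline float n0_with_temp(float gx, float gy, float gz,
                                              float x0, float y0, float z0)
{
    const float t0 = 0.6f - x0*x0 - y0*y0 - z0*z0;
    const float p0 = t0*t0*t0*t0 * dot_3d(gx, gy, gz, x0, y0, z0);
    return t0 < 0 ? 0 : p0;
}

// Host-only check over a 21x21x21 grid in [-1, 1]^3 with an arbitrary gradient.
bool forms_agree()
{
    const float gx = 1.0f, gy = -1.0f, gz = 0.5f;
    for (int i = -10; i <= 10; ++i)
        for (int j = -10; j <= 10; ++j)
            for (int k = -10; k <= 10; ++k) {
                const float x = 0.1f * i, y = 0.1f * j, z = 0.1f * k;
                const float a = n0_ternary(gx, gy, gz, x, y, z);
                const float b = n0_with_temp(gx, gy, gz, x, y, z);
                // Tolerance only guards against differing FMA contraction.
                if (std::fabs(a - b) > 1e-6f * std::fmax(1.0f, std::fabs(a)))
                    return false;
            }
    return true;
}
```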
Shouldn’t the compiler be able to see this in the first version as well? Or am I using it wrong?
$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89