CUDA compiler needs too much help in order to use select instead of branch

stolk · June 11, 2020, 8:33pm

So I have this code:

const float t0 = 0.6f - x0*x0 - y0*y0 - z0*z0;
const float n0 = t0<0 ? 0 : t0*t0*t0*t0 * dot_3d(grad0_x, grad0_y, grad0_z, x0, y0, z0);

I expect this to be translated into a select instruction, not a branch instruction. Yet the resulting PTX shows a bra for the assignment (not for dot_3d, which is inlined).

I rewrote it with an intermediate variable like so:

const float t0 = 0.6f - x0*x0 - y0*y0 - z0*z0;
const float p0 = t0*t0*t0*t0 * dot_3d(grad0_x, grad0_y, grad0_z, x0, y0, z0);
const float n0 = t0<0 ? 0 : p0;

… which is the same code, but the temp value is now assigned to a named variable p0.

The CUDA compiler now does the right thing: selp.f32 and no branching.

Shouldn’t it have been capable of seeing this in the first version? Or am I using the compiler wrong?

$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

njuffa · June 11, 2020, 9:37pm

Does the code actually execute faster after you made this change? If so, how much faster?

Branchless code is not automatically faster than code with branches. The CUDA compiler has, for years now, had pretty good and robust heuristics for when to prefer select instructions or predicated code over the use of a branch. However, being heuristics, they can’t and don’t do the “right thing” in all circumstances.

stolk · June 11, 2020, 10:35pm

Thank you,

I understand your point, but I think there is still something to consider:

Regardless of whether the branch or the select is faster, couldn’t the compiler at least be consistent? The code is essentially the same. Why would it generate different code based purely on how the C is written? The computations are the same in C, yet different instructions are generated.

But yeah, maybe I am too obsessed with branchless code. I did a lot of work on Cell SPU and x86 AVX, and not branching made all the difference there.

njuffa · June 11, 2020, 10:41pm

Presumably there is a difference in the respective internal representations of these two code snippets, which in turn results in different machine code being generated. You might want to check the intermediate PTX representation to verify that is the case.

stolk · June 11, 2020, 11:11pm

Thanks. It was in the PTX that I spotted the difference. I did not look at the actual binaries.

njuffa · June 11, 2020, 11:16pm

Code comparisons should always be performed at the machine code (SASS) level. The reason for this is that the compiler backend, ptxas, is an optimizing compiler in its own right (not an assembler, contrary to what the name might suggest). PTX is basically a hybrid of a virtual instruction set architecture and an internal compiler representation in SSA form.

Besides general optimizations, ptxas performs all the machine-specific optimizations based on the target architecture. Because the trade-offs differ by architecture, the final decision whether to use a branch or generate branchless code is made at that stage in the compilation.
