What methods are there to tell the compiler that an if statement is likely to be false, thus compiling more optimized code for better performance?

SparkHu · May 9, 2024, 11:17am

I have a piece of code that looks similar to the following:

constexpr uint32_t NUM = 70;
uint32_t data[NUM];
// Initialize data in another for loop
// Calculate
for (uint32_t i = 0; i < NUM; i++) {
    if (data[i] != 0) {
        // Calculate
    }
}

The body of the if statement compiles to 5 IMAD instructions, 3 LOP3 instructions, and 1 STL instruction. Because the for loop has a fixed trip count, it is fully unrolled. In the unrolled code, the first several iterations are compiled into the following form:

00007f18 83299ec0	      ISETP.NE.AND P1, PT, R105, RZ, PT 
00007f18 83299ed0	      BSSY B0, 0x7f1883299fe0 
00007f18 83299ee0	      ISETP.NE.AND P2, PT, R104, RZ, PT 
00007f18 83299ef0	      IMAD.IADD R2, R13, 0x1, -R4 
00007f18 83299f00	      ISETP.NE.AND P3, PT, R103, RZ, PT 
00007f18 83299f10	      ISETP.NE.AND P6, PT, R102, RZ, PT 
00007f18 83299f20	      ISETP.NE.AND P0, PT, R87, RZ, PT 
00007f18 83299f30	@!P1  BRA 0x7f1883299fd0 
00007f18 83299f40	      IMAD.IADD R88, R14, 0x1, R15 
00007f18 83299f50	      IMAD R89, R2, 0x10, R105 
00007f18 83299f60	      LOP3.LUT R105, R105, 0xf, RZ, 0xc0, !PT 
00007f18 83299f70	      IMAD.U32 R88, R88, 0x10000, RZ 
00007f18 83299f80	      IMAD R91, R8, 0x4, R1 
00007f18 83299f90	      IADD3 R8, R8, 0x1, RZ 
00007f18 83299fa0	      IMAD.IADD R6, R6, 0x1, R105 
00007f18 83299fb0	      LOP3.LUT R88, R89, R88, RZ, 0xfc, !PT 
00007f18 83299fc0	      STL [R91], R88 
00007f18 83299fd0	      BSYNC B0 
00007f18 83299fe0	      BSSY B0, 0x7f188329a0a0 
00007f18 83299ff0	@!P2  BRA 0x7f188329a090

The SASS shows that each of these iterations completes before the next one begins. After several iterations, however, the compiler switches to predicate registers to conditionally execute the instructions of the loop body, and it interleaves the instructions of multiple iterations, as shown below:

00007f18 8329a6d0	@P0   IMAD R87, R2, 0x10, R87 
00007f18 8329a6e0	@P3   LOP3.LUT R15, R84, 0xf, RZ, 0xc0, !PT 
00007f18 8329a6f0	@P4   IMAD R86, R2, 0x10, R86 
00007f18 8329a700	@P5   IADD3 R91, R8, 0x1, RZ 
00007f18 8329a710	@P5   IMAD R92, R8, 0x4, R1 
00007f18 8329a720	@P0   LOP3.LUT R87, R87, R88, RZ, 0xfc, !PT 
00007f18 8329a730	@P3   IMAD.IADD R6, R6, 0x1, R15 
00007f18 8329a740	@P4   LOP3.LUT R86, R86, R88, RZ, 0xfc, !PT 
00007f18 8329a750	@P5   IMAD.MOV.U32 R8, RZ, RZ, R91 
00007f18 8329a760	@P2   LOP3.LUT R15, R83, 0xf, RZ, 0xc0, !PT 
00007f18 8329a770	@P0   STL [R90], R87

Because these P registers are false in the vast majority of cases, the GPU ends up executing a large number of useless instructions, which reduces performance. My understanding is that even if the P register is false for every thread in a warp, the instruction still has to be issued and still occupies the pipeline, just as if the P register were true for at least one thread. Is this understanding correct? And is there any way to prevent the compiler from applying the predication optimization described above?

CUDA toolkit version: Cuda compilation tools, release 12.3, V12.3.107

Greg · May 9, 2024, 4:36pm

You are likely to get a better response in the CUDA NVCC Compiler category of the NVIDIA Developer Forums.

The compiler has heuristics to decide between divergent branches and predicated execution. In quick testing, the C++20 [[likely]] and [[unlikely]] attributes compiled but did not change the generated SASS. The compiler team is more likely to be able to provide information on attributes or pragmas.

On fixed-latency pipes such as fma and alu, an instruction takes the same number of cycles whether all threads in the warp are active and predicated true or all threads are predicated off. On variable-latency pipes such as lsu and texture, pipe utilization is reduced, but not eliminated, when all threads are inactive or predicated off.
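For reference, the attribute form mentioned above could look roughly like the sketch below. It assumes C++20 is enabled (for nvcc, `-std=c++20`). The `process` helper and the function name are hypothetical stand-ins for the "Calculate" body in the question, and the function is written as plain host C++ so it can be compiled anywhere; in a kernel it would carry a `__device__` or `__global__` qualifier. As noted above, the attribute is only a hint and may leave the SASS unchanged.

```cpp
#include <cstdint>

constexpr uint32_t NUM = 70;

// Hypothetical stand-in for the "Calculate" body in the question.
inline uint32_t process(uint32_t value) {
    return (value << 4) | (value & 0xFu);
}

// Accumulates process(data[i]) over the non-zero entries.
// [[unlikely]] tells the compiler the branch is rarely taken (hint only).
uint32_t sum_nonzero_unlikely(const uint32_t (&data)[NUM]) {
    uint32_t acc = 0;
    for (uint32_t i = 0; i < NUM; i++) {
        if (data[i] != 0) [[unlikely]] {
            acc += process(data[i]);
        }
    }
    return acc;
}
```

Whether the hint had any effect can only be judged by inspecting the generated SASS again, for example with `cuobjdump -sass`.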

SparkHu · May 22, 2024, 2:48am

Does anyone know how to solve this problem?

paleonix · April 24, 2025, 12:52pm

__builtin_expect() seems like the right tool for this, but like [[likely]], it is only a hint and the compiler may ignore it.
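A minimal sketch of that suggestion, applied to a loop shaped like the one in the question, might look as follows. The `UNLIKELY` macro and the function name are hypothetical, and the loop body is a placeholder rather than the original calculation. The code is plain host C++ for GCC or Clang; nvcc also accepts `__builtin_expect()` in device code, where the function would carry a `__device__` or `__global__` qualifier.

```cpp
#include <cstdint>

constexpr uint32_t NUM = 70;

// Hypothetical convenience macro; not part of CUDA or the C++ standard.
// __builtin_expect(expr, 0) hints that expr is expected to evaluate to 0.
#define UNLIKELY(cond) (__builtin_expect(!!(cond), 0))

// Adds up the low nibble of every non-zero entry (placeholder work).
uint32_t sum_low_nibbles(const uint32_t (&data)[NUM]) {
    uint32_t acc = 0;
    for (uint32_t i = 0; i < NUM; i++) {
        if (UNLIKELY(data[i] != 0)) {
            acc += data[i] & 0xFu;
        }
    }
    return acc;
}
```

The hint does not change the result of the condition, only the compiler's assumption about which path is common, so the SASS still has to be checked to see whether branches or predication were chosen.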
