What methods are there to tell the compiler that an if statement's condition is likely to be false, so that it generates better-optimized code?

I have a piece of code that looks similar to the following:

constexpr uint32_t NUM = 70;
uint32_t data[NUM];
// Initialize data in another for loop
// Calculate
for (uint32_t i = 0; i < NUM; i++) {
    if (data[i] != 0) {
        // Calculate
    }
}

The body of the if statement compiles to 5 IMAD instructions, 3 LOP3 instructions, and 1 STL instruction. Since the for loop executes a fixed number of times, it is fully unrolled. Within the unrolled loop, the code for the first few iterations is compiled into the following form:

00007f18 83299ec0	      ISETP.NE.AND P1, PT, R105, RZ, PT 
00007f18 83299ed0	      BSSY B0, 0x7f1883299fe0 
00007f18 83299ee0	      ISETP.NE.AND P2, PT, R104, RZ, PT 
00007f18 83299ef0	      IMAD.IADD R2, R13, 0x1, -R4 
00007f18 83299f00	      ISETP.NE.AND P3, PT, R103, RZ, PT 
00007f18 83299f10	      ISETP.NE.AND P6, PT, R102, RZ, PT 
00007f18 83299f20	      ISETP.NE.AND P0, PT, R87, RZ, PT 
00007f18 83299f30	@!P1  BRA 0x7f1883299fd0 
00007f18 83299f40	      IMAD.IADD R88, R14, 0x1, R15 
00007f18 83299f50	      IMAD R89, R2, 0x10, R105 
00007f18 83299f60	      LOP3.LUT R105, R105, 0xf, RZ, 0xc0, !PT 
00007f18 83299f70	      IMAD.U32 R88, R88, 0x10000, RZ 
00007f18 83299f80	      IMAD R91, R8, 0x4, R1 
00007f18 83299f90	      IADD3 R8, R8, 0x1, RZ 
00007f18 83299fa0	      IMAD.IADD R6, R6, 0x1, R105 
00007f18 83299fb0	      LOP3.LUT R88, R89, R88, RZ, 0xfc, !PT 
00007f18 83299fc0	      STL [R91], R88 
00007f18 83299fd0	      BSYNC B0 
00007f18 83299fe0	      BSSY B0, 0x7f188329a0a0 
00007f18 83299ff0	@!P2  BRA 0x7f188329a090 

The compiled SASS code shows that each iteration of the loop completes before the next one begins. However, after several iterations, the compiler starts using predicate registers to conditionally execute the instructions of the loop body and interleaves the instructions of multiple iterations, as shown below:

00007f18 8329a6d0	@P0   IMAD R87, R2, 0x10, R87 
00007f18 8329a6e0	@P3   LOP3.LUT R15, R84, 0xf, RZ, 0xc0, !PT 
00007f18 8329a6f0	@P4   IMAD R86, R2, 0x10, R86 
00007f18 8329a700	@P5   IADD3 R91, R8, 0x1, RZ 
00007f18 8329a710	@P5   IMAD R92, R8, 0x4, R1 
00007f18 8329a720	@P0   LOP3.LUT R87, R87, R88, RZ, 0xfc, !PT 
00007f18 8329a730	@P3   IMAD.IADD R6, R6, 0x1, R15 
00007f18 8329a740	@P4   LOP3.LUT R86, R86, R88, RZ, 0xfc, !PT 
00007f18 8329a750	@P5   IMAD.MOV.U32 R8, RZ, RZ, R91 
00007f18 8329a760	@P2   LOP3.LUT R15, R83, 0xf, RZ, 0xc0, !PT 
00007f18 8329a770	@P0   STL [R90], R87 

Since these P registers are false in the vast majority of cases, the GPU ends up executing a large number of useless instructions, which reduces performance. My understanding is that even if the P register is false for every thread in a warp, each instruction must still be issued and still occupies the pipeline, just as if the P register were true for at least one thread. Is this understanding correct? Is there any way to prevent the compiler from applying the optimization described above?

CUDA toolkit version: Cuda compilation tools, release 12.3, V12.3.107

You are likely to get a better response on the CUDA NVCC Compiler - NVIDIA Developer Forums.

The compiler uses heuristics to decide between divergent branches and predicated execution. In a quick test, the C++20 [[likely]] and [[unlikely]] attributes compiled but had no effect on the generated SASS. The compiler team is more likely to be able to provide information on relevant attributes or pragmas.
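
For reference, here is roughly how the [[unlikely]] attribute would be attached to the branch in the question. This is only a minimal sketch in plain C++20: the helper calculate() and the wrapper process() are hypothetical stand-ins for the elided "Calculate" body, not the original code, and in a CUDA kernel the functions would additionally need a __device__ qualifier and a C++20 language mode. As noted above, the attribute is only a hint and did not change the generated SASS in a quick test.

```cpp
#include <cstdint>

constexpr uint32_t NUM = 70;

// Hypothetical stand-in for the "Calculate" body: accumulates the low nibble.
inline uint32_t calculate(uint32_t value, uint32_t acc) {
    return acc + (value & 0xFu);
}

// Hypothetical wrapper around the loop from the question.
uint32_t process(const uint32_t (&data)[NUM]) {
    uint32_t acc = 0;
    for (uint32_t i = 0; i < NUM; i++) {
        // The attribute follows the condition and marks the taken path as rare.
        if (data[i] != 0) [[unlikely]] {
            acc = calculate(data[i], acc);
        }
    }
    return acc;
}
```

The attribute changes nothing about the program's result; it only gives the optimizer a hint that it is free to ignore.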

Fixed-latency pipes such as fma and alu are occupied for the same number of cycles whether a warp executes with all threads active and predicated true or with all threads predicated off. For variable-latency pipes such as lsu and texture, pipe utilization is reduced, but not eliminated, when all threads are inactive or predicated off.

Does anyone know how to solve this problem?