I have a piece of code that looks similar to the following:
constexpr uint32_t NUM = 70;
uint32_t data[NUM];
//Initialize data in another for loop
//Calculate
for (uint32_t i = 0; i < NUM; i++) {
    if (data[i] != 0) {
        //Calculate
    }
}
The body of the if statement compiles to 5 IMAD instructions, 3 LOP3 instructions, and 1 STL instruction. Because the loop has a fixed trip count, the compiler fully unrolls it. In the unrolled code, the first iterations of the loop are compiled into the following form:
00007f18 83299ec0 ISETP.NE.AND P1, PT, R105, RZ, PT
00007f18 83299ed0 BSSY B0, 0x7f1883299fe0
00007f18 83299ee0 ISETP.NE.AND P2, PT, R104, RZ, PT
00007f18 83299ef0 IMAD.IADD R2, R13, 0x1, -R4
00007f18 83299f00 ISETP.NE.AND P3, PT, R103, RZ, PT
00007f18 83299f10 ISETP.NE.AND P6, PT, R102, RZ, PT
00007f18 83299f20 ISETP.NE.AND P0, PT, R87, RZ, PT
00007f18 83299f30 @!P1 BRA 0x7f1883299fd0
00007f18 83299f40 IMAD.IADD R88, R14, 0x1, R15
00007f18 83299f50 IMAD R89, R2, 0x10, R105
00007f18 83299f60 LOP3.LUT R105, R105, 0xf, RZ, 0xc0, !PT
00007f18 83299f70 IMAD.U32 R88, R88, 0x10000, RZ
00007f18 83299f80 IMAD R91, R8, 0x4, R1
00007f18 83299f90 IADD3 R8, R8, 0x1, RZ
00007f18 83299fa0 IMAD.IADD R6, R6, 0x1, R105
00007f18 83299fb0 LOP3.LUT R88, R89, R88, RZ, 0xfc, !PT
00007f18 83299fc0 STL [R91], R88
00007f18 83299fd0 BSYNC B0
00007f18 83299fe0 BSSY B0, 0x7f188329a0a0
00007f18 83299ff0 @!P2 BRA 0x7f188329a090
The SASS shows that, in this part of the code, each iteration of the loop completes before the next one begins. After several iterations, however, the compiler switches to predication: it uses predicate registers to conditionally execute the instructions of the loop body and interleaves the instructions of several iterations, as shown below:
00007f18 8329a6d0 @P0 IMAD R87, R2, 0x10, R87
00007f18 8329a6e0 @P3 LOP3.LUT R15, R84, 0xf, RZ, 0xc0, !PT
00007f18 8329a6f0 @P4 IMAD R86, R2, 0x10, R86
00007f18 8329a700 @P5 IADD3 R91, R8, 0x1, RZ
00007f18 8329a710 @P5 IMAD R92, R8, 0x4, R1
00007f18 8329a720 @P0 LOP3.LUT R87, R87, R88, RZ, 0xfc, !PT
00007f18 8329a730 @P3 IMAD.IADD R6, R6, 0x1, R15
00007f18 8329a740 @P4 LOP3.LUT R86, R86, R88, RZ, 0xfc, !PT
00007f18 8329a750 @P5 IMAD.MOV.U32 R8, RZ, RZ, R91
00007f18 8329a760 @P2 LOP3.LUT R15, R83, 0xf, RZ, 0xc0, !PT
00007f18 8329a770 @P0 STL [R90], R87
In the vast majority of cases these predicates are false, so the GPU executes a large number of useless instructions, which reduces performance. My understanding is that even if the predicate is false for every thread in a warp, the instruction must still be issued and occupies a pipeline, just as if the predicate were true for one of the threads. Is this understanding correct? And is there any way to prevent the compiler from applying the optimization described above?
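For reference, one documented knob that affects this code shape is `#pragma unroll 1`, which asks the compiler not to unroll the loop at all. The sketch below is only an illustration under stated assumptions: `compact_nonzero`, `kernel`, and the masking in the loop body are hypothetical stand-ins, not the real calculation. Whether this actually avoids the predicated code, or is faster overall, would have to be checked in the generated SASS and by measurement. A rolled loop indexes `data` with a runtime index, so a per-thread array may end up in local memory instead of registers.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

constexpr uint32_t NUM = 70;

// Hypothetical stand-in for the real loop body: stores the low 4 bits of every
// nonzero element into out[] and returns the number of elements written.
__host__ __device__ uint32_t compact_nonzero(const uint32_t* data, uint32_t* out) {
    uint32_t n = 0;
    // Ask the compiler to keep this loop rolled (unroll factor 1).
    #pragma unroll 1
    for (uint32_t i = 0; i < NUM; i++) {
        if (data[i] != 0) {
            out[n++] = data[i] & 0xFu;
        }
    }
    return n;
}

// Single-thread kernel, just enough to produce SASS for the loop.
__global__ void kernel(const uint32_t* in, uint32_t* out, uint32_t* count) {
    *count = compact_nonzero(in, out);
}

int main() {
    uint32_t *in, *out, *count;
    cudaMallocManaged(&in, NUM * sizeof(uint32_t));
    cudaMallocManaged(&out, NUM * sizeof(uint32_t));
    cudaMallocManaged(&count, sizeof(uint32_t));

    // Mostly-zero input, similar to the case described above.
    for (uint32_t i = 0; i < NUM; i++) {
        in[i] = (i % 7 == 0) ? (i + 1) : 0;
    }

    kernel<<<1, 1>>>(in, out, count);
    cudaDeviceSynchronize();
    printf("nonzero elements: %u\n", *count);

    cudaFree(in);
    cudaFree(out);
    cudaFree(count);
    return 0;
}
```

After compiling with nvcc, `cuobjdump -sass` on the resulting binary shows whether the loop stayed rolled and whether the predicated sequences are still present.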
CUDA toolkit version: Cuda compilation tools, release 12.3, V12.3.107