I have code similar to the following. The kernel completes 10x quicker, without any error or hang, when the result variable mostly holds bit patterns that quickly drive the mask variable to 0xFFFFFFFF:
unsigned int mask = 0;
for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
    ...
    unsigned int k = __ballot_sync(0xFFFFFFFF, m);
    ...
    unsigned int result = __match_any_sync(0xFFFFFFFF, data);
    ...
    mask = mask | result; // this is no-op if mask is 0xFFFFFFFF
}
output[id] = mask;
When the “mask” variable has all of its bits set early in the loop, the kernel completes 10x faster. When “mask” stays zero or only rarely becomes non-zero, the kernel takes the longest. The program runs without any error in both cases.
It is 10x faster in the same way with:
mask = result | mask;
so the order of the operands does not affect anything here.
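One way to narrow this down is to time the same accumulation loop with data that is identical across the warp against data that differs in every lane. The following is a minimal sketch under assumptions: the kernel name `accumulate`, the `distinct` switch, the placeholder data, the launch configuration, and the value of NUM_CHUNKS are all hypothetical, and the `__ballot_sync` call and the elided `...` parts of the original loop are left out. It needs a GPU of compute capability 7.0 or later for `__match_any_sync`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NUM_CHUNKS = 10000;  // assumed value, not from the original post

// Hypothetical kernel: ORs __match_any_sync results into mask.
// distinct == 0: every lane passes the same value (result is 0xFFFFFFFF).
// distinct != 0: every lane passes a different value (result is 1 << lane).
__global__ void accumulate(unsigned int *output, int distinct) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int lane = threadIdx.x & 31;
    unsigned int mask = 0;
    for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
        unsigned int data = distinct ? (lane + chunk) : (unsigned int)chunk;
        unsigned int result = __match_any_sync(0xFFFFFFFF, data);
        mask = mask | result;
    }
    output[id] = mask;
}

// Times one launch with CUDA events and returns the elapsed milliseconds.
static float time_ms(unsigned int *d_out, int distinct) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    accumulate<<<64, 256>>>(d_out, distinct);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    unsigned int *d_out = nullptr;
    cudaMalloc(&d_out, 64 * 256 * sizeof(unsigned int));
    time_ms(d_out, 0);  // warm-up launch, result discarded
    printf("uniform data:  %.3f ms\n", time_ms(d_out, 0));
    printf("distinct data: %.3f ms\n", time_ms(d_out, 1));
    cudaFree(d_out);
    return 0;
}
```

If the two timings differ even though neither loop can exit early, the difference comes from the data passed to the warp primitive and not from skipped iterations.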
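Does the thread participation rule that applies to CUDA warp shuffles (every thread named in the mask must execute the call) also apply when the compiler considers an early-exit optimization for this loop?
The loop compiles to something like this: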
28 0000000a 00c7cab0 ISETP.GE.U32.AND P0, PT, R19, 0x10, PT
29 0000000a 00c7cac0 ISETP.GT.U32.AND P1, PT, R19, 0xf, PT
30 0000000a 00c7cad0 P2R R0, PR, RZ, 0x2
31 0000000a 00c7cae0 @P0 LDG.E R2, [R16.64+-0x40]
32 0000000a 00c7caf0 ISETP.NE.AND P0, PT, R25, RZ, PT
33 0000000a 00c7cb00 SEL R29, R26, 0xffff0000, !P0
34 0000000a 00c7cb10 **BRA.DIV** ~URZ, 0xa00c7cdf0
35 0000000a 00c7cb20 ISETP.GT.U32.AND P0, PT, R19, 0xf, PT
36 0000000a 00c7cb30 **VOTE.ANY** R18, PT, P0
37 0000000a 00c7cb40 **BRA.DIV** ~URZ, 0xa00c7ce60
38 0000000a 00c7cb50 **MATCH.ANY** R4, R2
39 0000000a 00c7cb60 ISETP.GE.U32.AND P0, PT, R19, 0x10, PT
40 0000000a 00c7cb70 IADD3 R3, R23, 0x1, RZ
41 0000000a 00c7cb80 @P0 LDG.E R2, [R16.64]
42 0000000a 00c7cb90 ISETP.NE.AND P0, PT, R24, R3, PT
43 0000000a 00c7cba0 LOP3.LUT R3, R18, R4, R29, 0x80, !PT
44 0000000a 00c7cbb0 SEL R31, R26, 0xffff0000, !P0
45 0000000a 00c7cbc0 LOP3.LUT R29, R3, R28, RZ, 0xfc, !PT
46 0000000a 00c7cbd0 **BRA.DIV** ~URZ, 0xa00c7cec0
47 0000000a 00c7cbe0 **MATCH.ANY** R4, R2
48 0000000a 00c7cbf0 ISETP.GT.U32.AND P0, PT, R19, 0xf, PT
49 0000000a 00c7cc00 **VOTE.ANY** R18, PT, P0
50 0000000a 00c7cc10 IADD3 R23, R23, 0x2, RZ
51 0000000a 00c7cc20 IADD3 R16, P1, R16, 0x80, RZ
52 0000000a 00c7cc30 ISETP.NE.AND P0, PT, R23, 0x186a, PT
53 0000000a 00c7cc40 LOP3.LUT R4, R18, R4, R31, 0x80, !PT
54 0000000a 00c7cc50 IMAD.X R17, RZ, RZ, R17, P1
55 0000000a 00c7cc60 IADD3 R25, R25, 0x2, RZ
56 0000000a 00c7cc70 LOP3.LUT R28, R4, R29, RZ, 0xfc, !PT
57 0000000a 00c7cc80 @P0 BRA 0xa00c7cab0
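There are multiple branch instructions. The last one is the loop back-edge; the other three, the BRA.DIV instructions, do not come from my CUDA code. The compiler adds them. The listing also contains two VOTE.ANY and two MATCH.ANY instructions, whereas my source has one vote and one match_any call per iteration, so each loop iteration seems to be doing something extra with match_any.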
Edit:
Since the program ran fine, I was assuming that the for loop does not exit early, but then I saw the BRA.DIV instructions in the disassembly and started wondering whether they add an early exit.
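For comparison, a hand-written early exit that respects the participation requirement of the `_sync` primitives could look like the sketch below. All non-exited threads named in the 0xFFFFFFFF mask have to execute `__match_any_sync`, so the exit decision is taken with a warp-wide `__all_sync` vote and all lanes leave the loop together. This is an illustration only, not a statement about what the BRA.DIV instructions do; `accumulate_early_exit`, `uniform_from`, `iterations`, the placeholder data, and the value of NUM_CHUNKS are assumptions, and the `...` parts of the original loop are left out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NUM_CHUNKS = 10000;  // assumed value, not from the original post

// Hypothetical kernel: chunks before uniform_from use per-lane data,
// later chunks use warp-uniform data, which fills mask in a single step.
__global__ void accumulate_early_exit(unsigned int *output, int *iterations,
                                      int uniform_from) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int lane = threadIdx.x & 31;
    unsigned int mask = 0;
    int done = 0;  // number of loop iterations this thread executed
    for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
        unsigned int data = (chunk < uniform_from) ? lane : 0u;
        unsigned int result = __match_any_sync(0xFFFFFFFF, data);
        mask = mask | result;
        done++;
        // Warp-wide vote: break only when every lane's mask is full, so no
        // lane skips a __match_any_sync that the other lanes still execute.
        if (__all_sync(0xFFFFFFFF, mask == 0xFFFFFFFF)) break;
    }
    output[id] = mask;
    iterations[id] = done;
}

int main() {
    unsigned int *d_out = nullptr;
    int *d_iter = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(unsigned int));
    cudaMalloc(&d_iter, 32 * sizeof(int));

    // One full warp; data becomes warp-uniform from chunk 3 onwards.
    accumulate_early_exit<<<1, 32>>>(d_out, d_iter, 3);

    unsigned int h_out[32];
    int h_iter[32];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_iter, d_iter, sizeof(h_iter), cudaMemcpyDeviceToHost);
    printf("lane 0: mask=0x%08X after %d iterations\n", h_out[0], h_iter[0]);

    cudaFree(d_out);
    cudaFree(d_iter);
    return 0;
}
```

A per-thread `if (mask == 0xFFFFFFFF) break;` placed in the same spot would let some lanes leave while others still call `__match_any_sync` with the full mask, which is why the sketch votes first.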