By splitting warps, I mean what the literature calls serializing warps: upon hitting a control decision, a SIMD group is split into an on part and an off part, and the two parts are run one at a time.
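A minimal sketch of that split, assuming a warp is represented as a 32-bit active mask and the branch predicate as a second bitmask with one bit per thread. The names `split_warp`, `FULL_MASK`, and `popcount` are hypothetical and are not taken from the emulator; which bit corresponds to which thread is also an assumption.

```python
WARP_SIZE = 32
FULL_MASK = (1 << WARP_SIZE) - 1


def popcount(mask: int) -> int:
    """Number of threads switched on in a mask."""
    return bin(mask).count("1")


def split_warp(active: int, predicate: int) -> list:
    """Split an active mask at a branch (hypothetical helper).

    Returns the non-empty parts, on part first, that would then be
    run one at a time. A uniform branch yields a single part.
    """
    on_part = active & predicate
    off_part = active & ~predicate & FULL_MASK
    return [part for part in (on_part, off_part) if part]


# Example: the predicate holds for 11 of the 32 threads.
parts = split_warp(FULL_MASK, (1 << 11) - 1)
sizes = [popcount(part) for part in parts]
```

Here `sizes` holds the thread counts of the two serialized parts; a branch on which all threads agree produces a single part and no serialization.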
Also, for our emulator we considered implementing it both ways, and could not come up with a correct solution that just masks some threads off while keeping a single thread of control. You run into massive problems with branches whose immediate post-dominator is not the branch target. See the attached figure:
Blocks in the figure are basic blocks; black edges represent fall-through paths and blue edges represent branch targets. Assume that you use the mechanism that you described, starting with a warp of 32 threads, all on. Consider the following execution:
entry - [11111111111111111111111111111111]
_Z12k_n_sequencePiii - [11111111111111111111111111111111]
fall through - [11111111111111111111111111111111]
$Lt_1_10 - [11111111111000000000000000000000]
Now, at basic block $Lt_1_10, we have a divergent branch and some threads need to be masked off. How do I know when to turn them back on? What if I encounter another back edge (loop) on the not-taken path? How do I recursively mask some threads off and then selectively mask only some of them back on at specific points? How do I know which ones to mask back on?
The following is a dynamic instruction trace of the same program running on our emulator with the warp-splitting mechanism that I described. As in Fung's paper, we use compiler analysis to insert explicit reconverge instructions at the post-dominators of divergent branches. "active" is the number of active threads in the current warp, and "stack" is the number of split warps:
CooperativeThreadArray.cpp:273: [PC: 0, counter: 0] ld.param.s32 %r1, [__cudaparm__Z12k_n_sequencePiii_K + 12] [stack 1] [active 32]
CooperativeThreadArray.cpp:273: [PC: 1, counter: 1] mov.u32 %r2, 0 [stack 1] [active 32]
CooperativeThreadArray.cpp:273: [PC: 2, counter: 2] setp.le.s32 %p1, %r1, %r2 [stack 1] [active 32]
CooperativeThreadArray.cpp:273: [PC: 3, counter: 3] @%p1 bra $Lt_1_8 [stack 1] [active 32]
CooperativeThreadArray.cpp:273: [PC: 4, counter: 4] mov.u16 %rh1, %ctaid.x [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 5, counter: 5] mov.u16 %rh2, %ntid.x [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 6, counter: 6] mul.wide.u16 %r3, %rh1, %rh2 [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 7, counter: 7] cvt.u32.u16 %r4, %tid.x [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 8, counter: 8] add.u32 %r5, %r4, %r3 [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 9, counter: 9] mov.s32 %r6, %r1 [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 10, counter: 10] ld.param.s32 %r7, [__cudaparm__Z12k_n_sequencePiii_N + 8] [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 11, counter: 11] setp.gt.s32 %p2, %r7, %r5 [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 12, counter: 12] mov.s32 %r8, 0 [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 13, counter: 13] mov.s32 %r9, %r6 [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 14, counter: 14] @!%p2 bra $Lt_1_11 [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 15, counter: 15] mul.lo.s32 %r10, %r1, %r5 [stack 4] [active 11]
CooperativeThreadArray.cpp:273: [PC: 16, counter: 16] mul.lo.s32 %r11, %r10, 2 [stack 4] [active 11]
CooperativeThreadArray.cpp:273: [PC: 17, counter: 17] add.s32 %r12, %r8, %r11 [stack 4] [active 11]
CooperativeThreadArray.cpp:273: [PC: 18, counter: 18] add.s32 %r13, %r12, 1 [stack 4] [active 11]
CooperativeThreadArray.cpp:273: [PC: 19, counter: 19] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 4] [active 11]
CooperativeThreadArray.cpp:273: [PC: 20, counter: 20] add.s32 %r14, %r10, %r8 [stack 4] [active 11]
CooperativeThreadArray.cpp:273: [PC: 21, counter: 21] cvt.s64.s32 %rd2, %r14 [stack 4] [active 11]
CooperativeThreadArray.cpp:273: [PC: 22, counter: 22] mul.lo.u64 %rd3, %rd2, 4 [stack 4] [active 11]
CooperativeThreadArray.cpp:273: [PC: 23, counter: 23] add.u64 %rd4, %rd1, %rd3 [stack 4] [active 11]
CooperativeThreadArray.cpp:273: [PC: 24, counter: 24] st.global.s32 [%rd4 + 0], %r13 [stack 4] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 25] reconverge [stack 4] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 26] reconverge [stack 3] [active 21]
CooperativeThreadArray.cpp:273: [PC: 26, counter: 27] add.s32 %r8, %r8, 1 [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 27, counter: 28] setp.ne.s32 %p3, %r1, %r8 [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 28, counter: 29] @%p3 bra $Lt_1_10 [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 14, counter: 30] @!%p2 bra $Lt_1_11 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 15, counter: 31] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 16, counter: 32] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 17, counter: 33] add.s32 %r12, %r8, %r11 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 18, counter: 34] add.s32 %r13, %r12, 1 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 19, counter: 35] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 20, counter: 36] add.s32 %r14, %r10, %r8 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 21, counter: 37] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 22, counter: 38] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 23, counter: 39] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 24, counter: 40] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 41] reconverge [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 42] reconverge [stack 4] [active 21]
CooperativeThreadArray.cpp:273: [PC: 26, counter: 43] add.s32 %r8, %r8, 1 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 27, counter: 44] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 28, counter: 45] @%p3 bra $Lt_1_10 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 14, counter: 46] @!%p2 bra $Lt_1_11 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 15, counter: 47] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 16, counter: 48] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 17, counter: 49] add.s32 %r12, %r8, %r11 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 18, counter: 50] add.s32 %r13, %r12, 1 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 19, counter: 51] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 20, counter: 52] add.s32 %r14, %r10, %r8 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 21, counter: 53] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 22, counter: 54] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 23, counter: 55] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 24, counter: 56] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 57] reconverge [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 58] reconverge [stack 4] [active 21]
CooperativeThreadArray.cpp:273: [PC: 26, counter: 59] add.s32 %r8, %r8, 1 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 27, counter: 60] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 28, counter: 61] @%p3 bra $Lt_1_10 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 14, counter: 62] @!%p2 bra $Lt_1_11 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 15, counter: 63] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 16, counter: 64] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 17, counter: 65] add.s32 %r12, %r8, %r11 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 18, counter: 66] add.s32 %r13, %r12, 1 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 19, counter: 67] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 20, counter: 68] add.s32 %r14, %r10, %r8 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 21, counter: 69] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 22, counter: 70] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 23, counter: 71] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 24, counter: 72] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 73] reconverge [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 74] reconverge [stack 4] [active 21]
CooperativeThreadArray.cpp:273: [PC: 26, counter: 75] add.s32 %r8, %r8, 1 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 27, counter: 76] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 28, counter: 77] @%p3 bra $Lt_1_10 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 14, counter: 78] @!%p2 bra $Lt_1_11 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 15, counter: 79] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 16, counter: 80] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 17, counter: 81] add.s32 %r12, %r8, %r11 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 18, counter: 82] add.s32 %r13, %r12, 1 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 19, counter: 83] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 20, counter: 84] add.s32 %r14, %r10, %r8 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 21, counter: 85] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 22, counter: 86] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 23, counter: 87] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 24, counter: 88] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 89] reconverge [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 90] reconverge [stack 4] [active 21]
CooperativeThreadArray.cpp:273: [PC: 26, counter: 91] add.s32 %r8, %r8, 1 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 27, counter: 92] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 28, counter: 93] @%p3 bra $Lt_1_10 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 14, counter: 94] @!%p2 bra $Lt_1_11 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 15, counter: 95] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 16, counter: 96] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 17, counter: 97] add.s32 %r12, %r8, %r11 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 18, counter: 98] add.s32 %r13, %r12, 1 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 19, counter: 99] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 20, counter: 100] add.s32 %r14, %r10, %r8 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 21, counter: 101] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 22, counter: 102] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 23, counter: 103] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 24, counter: 104] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 105] reconverge [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 106] reconverge [stack 4] [active 21]
CooperativeThreadArray.cpp:273: [PC: 26, counter: 107] add.s32 %r8, %r8, 1 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 27, counter: 108] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 28, counter: 109] @%p3 bra $Lt_1_10 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 14, counter: 110] @!%p2 bra $Lt_1_11 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 15, counter: 111] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 16, counter: 112] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 17, counter: 113] add.s32 %r12, %r8, %r11 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 18, counter: 114] add.s32 %r13, %r12, 1 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 19, counter: 115] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 20, counter: 116] add.s32 %r14, %r10, %r8 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 21, counter: 117] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 22, counter: 118] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 23, counter: 119] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 24, counter: 120] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 121] reconverge [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 122] reconverge [stack 4] [active 21]
CooperativeThreadArray.cpp:273: [PC: 26, counter: 123] add.s32 %r8, %r8, 1 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 27, counter: 124] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 28, counter: 125] @%p3 bra $Lt_1_10 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 14, counter: 126] @!%p2 bra $Lt_1_11 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 15, counter: 127] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 16, counter: 128] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 17, counter: 129] add.s32 %r12, %r8, %r11 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 18, counter: 130] add.s32 %r13, %r12, 1 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 19, counter: 131] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 20, counter: 132] add.s32 %r14, %r10, %r8 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 21, counter: 133] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 22, counter: 134] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 23, counter: 135] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 24, counter: 136] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 137] reconverge [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 138] reconverge [stack 4] [active 21]
CooperativeThreadArray.cpp:273: [PC: 26, counter: 139] add.s32 %r8, %r8, 1 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 27, counter: 140] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 28, counter: 141] @%p3 bra $Lt_1_10 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 14, counter: 142] @!%p2 bra $Lt_1_11 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 15, counter: 143] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 16, counter: 144] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 17, counter: 145] add.s32 %r12, %r8, %r11 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 18, counter: 146] add.s32 %r13, %r12, 1 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 19, counter: 147] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 20, counter: 148] add.s32 %r14, %r10, %r8 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 21, counter: 149] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 22, counter: 150] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 23, counter: 151] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 24, counter: 152] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 153] reconverge [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 154] reconverge [stack 4] [active 21]
CooperativeThreadArray.cpp:273: [PC: 26, counter: 155] add.s32 %r8, %r8, 1 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 27, counter: 156] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 28, counter: 157] @%p3 bra $Lt_1_10 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 14, counter: 158] @!%p2 bra $Lt_1_11 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 15, counter: 159] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 16, counter: 160] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 17, counter: 161] add.s32 %r12, %r8, %r11 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 18, counter: 162] add.s32 %r13, %r12, 1 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 19, counter: 163] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 20, counter: 164] add.s32 %r14, %r10, %r8 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 21, counter: 165] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 22, counter: 166] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 23, counter: 167] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 24, counter: 168] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 169] reconverge [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 170] reconverge [stack 4] [active 21]
CooperativeThreadArray.cpp:273: [PC: 26, counter: 171] add.s32 %r8, %r8, 1 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 27, counter: 172] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 28, counter: 173] @%p3 bra $Lt_1_10 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 14, counter: 174] @!%p2 bra $Lt_1_11 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 15, counter: 175] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 16, counter: 176] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 17, counter: 177] add.s32 %r12, %r8, %r11 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 18, counter: 178] add.s32 %r13, %r12, 1 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 19, counter: 179] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 20, counter: 180] add.s32 %r14, %r10, %r8 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 21, counter: 181] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 22, counter: 182] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 23, counter: 183] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 24, counter: 184] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 185] reconverge [stack 5] [active 11]
CooperativeThreadArray.cpp:273: [PC: 25, counter: 186] reconverge [stack 4] [active 21]
CooperativeThreadArray.cpp:273: [PC: 26, counter: 187] add.s32 %r8, %r8, 1 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 27, counter: 188] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 28, counter: 189] @%p3 bra $Lt_1_10 [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 29, counter: 190] reconverge [stack 3] [active 32]
CooperativeThreadArray.cpp:273: [PC: 30, counter: 191] reconverge [stack 2] [active 32]
CooperativeThreadArray.cpp:273: [PC: 31, counter: 192] exit [stack 1] [active 32]
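The 32 / 11 / 21 / 32 pattern of active threads in the trace can be reproduced with a small stack model. This is only a sketch under stated assumptions, not the emulator's code: the `Context` class, the toy `program` encoding, and the `run` function are all hypothetical, each branch is assumed to carry the PC of its compiler-inserted reconverge instruction, and a uniform branch does not push anything here, so the stack depths will not match the numbers printed above.

```python
from dataclasses import dataclass
from typing import Optional

WARP_SIZE = 32
FULL_MASK = (1 << WARP_SIZE) - 1


@dataclass
class Context:
    pc: int
    mask: int
    join_pc: Optional[int] = None  # reconverge PC that ends this split part


def popcount(mask: int) -> int:
    return bin(mask).count("1")


def run(program, predicates, mask=FULL_MASK):
    """Run a toy program; return a list of (pc, active, stack depth)."""
    stack = [Context(pc=0, mask=mask)]
    trace = []
    while stack:
        ctx = stack[-1]  # only the top context executes
        op = program[ctx.pc]
        trace.append((ctx.pc, popcount(ctx.mask), len(stack)))
        if op[0] == "exit":
            stack.pop()
        elif op[0] == "bra":
            _, pred, target, join_pc = op
            taken = ctx.mask & predicates[pred]
            fall = ctx.mask & ~predicates[pred]
            if taken and fall:
                # Divergent: the parent waits just past the reconverge
                # point while the two parts run one at a time.
                branch_pc = ctx.pc
                ctx.pc = join_pc + 1
                stack.append(Context(target, taken, join_pc))
                stack.append(Context(branch_pc + 1, fall, join_pc))
            else:
                ctx.pc = target if taken else ctx.pc + 1
        elif op[0] == "reconverge" and ctx.join_pc == ctx.pc:
            stack.pop()  # this split part is done
        else:
            ctx.pc += 1
    return trace


# Toy stand-in for one loop iteration: a branch that skips the body.
program = [
    ("bra", "not_p2", 3, 3),  # 0: branch to the reconverge point
    ("op",),                  # 1: body
    ("op",),                  # 2: body
    ("reconverge",),          # 3: post-dominator of the branch
    ("op",),                  # 4: code after reconvergence
    ("exit",),                # 5
]
# Assumption: 21 threads take the branch, 11 fall through into the body.
predicates = {"not_p2": FULL_MASK & ~((1 << 11) - 1)}
trace = run(program, predicates)
```

In this model the body runs with 11 active threads, the reconverge instruction is visited once by each part (11, then 21), and the parent context resumes with all 32 threads, which is the same shape as PCs 15 through 26 in the trace.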