continue statement

Hi,

I have some kernel code with the following structure

[codebox]while (…) {
    blah;
    if (…) {
        blah2;
        continue;
    }
    blah3;
}[/codebox]

How does branching work here? Suppose five threads in a warp tripped the if condition. Does the entire warp go to the top of the while loop right away, or do the threads that didn’t hit the if condition continue executing blah3? Is this information somewhere in the reference manual?

Thanks,

Anatoliy

During such a branch divergence, all threads will execute blah3. The 5 threads that want to "continue" to the next loop iteration will be masked off, so the results of any operations they execute will not actually be applied to any data.
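As a toy model of this masking (a Python sketch, purely illustrative — the warp size, the mask representation, and the choice of which five lanes take the branch are assumptions for illustration, not hardware detail):

```python
WARP_SIZE = 32

def run_iteration(data, takes_continue):
    """Simulate one while-loop iteration of a warp under an active mask.

    takes_continue[lane] is True for lanes that hit the `continue`.
    Every lane steps through blah3 in lock-step, but the result is
    only written back for lanes whose mask bit is still on."""
    active = [True] * WARP_SIZE            # all lanes start the iteration enabled
    for lane in range(WARP_SIZE):
        if takes_continue[lane]:           # "if (…) { blah2; continue; }"
            active[lane] = False           # masked off until the loop back-edge
    for lane in range(WARP_SIZE):
        result = 100 + lane                # "blah3", issued for the whole warp
        if active[lane]:                   # …but applied only where the mask is on
            data[lane] = result
    return data

data = run_iteration([0] * WARP_SIZE, [lane < 5 for lane in range(WARP_SIZE)])
```

Here lanes 0–4 take the continue, so their slots stay untouched while lanes 5–31 see the effect of blah3.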

Thanks, that’s what I thought too. I tried optimizing this code by changing blah3; to if(__all(…)) {blah3; }, and I was very surprised to find out that my optimization failed. So I thought it may be because the compiler somehow already knew what to do from the continue statement. I guess not.
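For what it's worth, the guard can be modeled like this (a Python sketch of what an __all()-style warp vote computes; warp_all and wants_blah3 are made-up names, and this says nothing about how the hardware implements the vote):

```python
def warp_all(active, pred):
    """Model of an __all()-style vote: True iff pred holds for every
    currently active lane in the warp."""
    return all(pred[lane] for lane in range(len(active)) if active[lane])

def blah3_with_vote(data, active, wants_blah3):
    """The attempted optimization: run blah3 only when the whole warp
    agrees it is needed, i.e. if (__all(…)) { blah3; }."""
    if warp_all(active, wants_blah3):
        for lane, on in enumerate(active):
            if on:
                data[lane] += 1            # stand-in for blah3
    return data
```

The vote only pays off when every active lane agrees; under divergence the warp still has to cover both paths anyway, which may be why the change made no measurable difference.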

How would I find answers to questions like these on my own? I didn’t find anything specific enough in the manual, only that “conditionals are serially executed”.

This is not quite right. Upon hitting the if(…) a warp will be split into 2 warps. One of the warps will have threads on the taken path masked on and the other will have threads on the not taken path masked on. The taken warp will hit the continue statement and continue looping in this fashion without ever executing the instructions in blah3. The not taken warp will jump over the branch and execute blah3.

If all threads execute the same number of iterations in the while loop and the code in blah is significant, you might consider adding a __syncthreads() immediately before blah.

Are you sure about this? Everything I’ve seen so far indicates that there is indeed a masking going on, not a creation of two new warps, which would require extra register/shared-mem space.

Where did you get this? If this is part of the CUDA architecture, then it’s certainly new to me.

So now you’re also saying that a __syncthreads() will merge the two different warps back together?

That makes your theoretical branch-management system even more dubious. Threads place data into various locations based on the value of threadIdx. If warps were dynamically created and merged based on branches, that would completely disrupt this data-management technique. Additionally, your proposed system of creating a new warp on a branch divergence is really aimed at optimizing for latency rather than bandwidth, when we know CUDA is designed for optimization of bandwidth, and not latency.

-Raystonn

I was going off of information in this paper. http://www.google.com/url?sa=U&start=1…-J1I670rwV0IQSw

Edit – Also, now that I read your comments in more detail, I think you misunderstood me. When I said that warps would be split, I just meant that they would have different masks and program counters, they would still use the same registers, run on the same processor, and use the same shared memory. Most literature describes them as being run serially, but I don’t see why a barrel processor could not interleave their execution.

By splitting warps, I mean what the literature calls serializing warps: upon hitting a control decision, a SIMD group is split into an on part and an off part, which are run one at a time.

Also, for our emulator we considered implementing it both ways, and could not come up with a correct solution that just masks some threads off while keeping a single thread of control. You run into massive problems with branches whose immediate post-dominator is not the branch target. See the attached figure:

Blocks in the figure are basic blocks, black edges represent fall-through paths, and blue edges represent branch targets. Assume that you use the mechanism that you described, starting with a warp of 32 threads all on. Consider the following execution:

entry - [11111111111111111111111111111111]

_Z12k_n_sequencePiii - [11111111111111111111111111111111]

fall through - [11111111111111111111111111111111]

$Lt_1_10 - [11111111111000000000000000000000]

Now, at basic block $Lt_1_10 , we have a divergent branch and some threads need to be masked off. How do I know when to turn them back on? What if I encounter another back edge (loop) on the not taken path? How do I recursively mask some off and then selectively mask only some others back on at specific points? How do I know which ones to mask back on?
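To make the bookkeeping concrete, here is a Python sketch of a stack-based scheme of this general kind (the entry format, function names, and the rule that the merge point is the branch's post-dominator are my assumptions for illustration):

```python
def diverge(stack, branch_taken, taken_pc, fallthrough_pc, reconverge_pc):
    """On a divergent branch, replace the running entry with a merge
    marker plus one entry per side; the top of stack runs next."""
    _pc, mask = stack.pop()
    taken = [m and t for m, t in zip(mask, branch_taken)]
    not_taken = [m and not t for m, t in zip(mask, branch_taken)]
    stack.append((reconverge_pc, mask))        # where the two sides merge again
    stack.append((taken_pc, taken))            # taken side, runs second
    stack.append((fallthrough_pc, not_taken))  # not-taken side, runs first
    return stack

def reconverge(stack):
    """At a reconverge instruction, pop the side that just finished;
    whatever is now on top (the other side, or the saved merge entry
    with the full mask) resumes execution."""
    stack.pop()
    return stack[-1]

stack = [(0, [True] * 4)]                      # tiny 4-lane warp for illustration
diverge(stack, [True, True, False, False],
        taken_pc=7, fallthrough_pc=4, reconverge_pc=9)
```

Nested divergence just pushes more entries, which is how the [stack N] column in the trace grows and shrinks.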

The following is a dynamic instruction trace of the same program running on our emulator with the warp-splitting mechanism that I described. As in Fung's paper, we use compiler analysis to insert explicit reconverge instructions at the post-dominators of divergent branches. Active is the number of active threads in the current warp, and stack is the number of split warps:

CooperativeThreadArray.cpp:273:  [PC: 0, counter: 0] ld.param.s32 %r1, [__cudaparm__Z12k_n_sequencePiii_K + 12] [stack 1] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 1, counter: 1] mov.u32 %r2, 0 [stack 1] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 2, counter: 2] setp.le.s32 %p1, %r1, %r2 [stack 1] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 3, counter: 3] @%p1 bra $Lt_1_8 [stack 1] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 4, counter: 4] mov.u16 %rh1, %ctaid.x [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 5, counter: 5] mov.u16 %rh2, %ntid.x [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 6, counter: 6] mul.wide.u16 %r3, %rh1, %rh2 [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 7, counter: 7] cvt.u32.u16 %r4, %tid.x [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 8, counter: 8] add.u32 %r5, %r4, %r3 [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 9, counter: 9] mov.s32 %r6, %r1 [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 10, counter: 10] ld.param.s32 %r7, [__cudaparm__Z12k_n_sequencePiii_N + 8] [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 11, counter: 11] setp.gt.s32 %p2, %r7, %r5 [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 12, counter: 12] mov.s32 %r8, 0 [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 13, counter: 13] mov.s32 %r9, %r6 [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 14, counter: 14] @!%p2 bra $Lt_1_11 [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 15, counter: 15] mul.lo.s32 %r10, %r1, %r5 [stack 4] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 16, counter: 16] mul.lo.s32 %r11, %r10, 2 [stack 4] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 17, counter: 17] add.s32 %r12, %r8, %r11 [stack 4] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 18, counter: 18] add.s32 %r13, %r12, 1 [stack 4] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 19, counter: 19] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 4] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 20, counter: 20] add.s32 %r14, %r10, %r8 [stack 4] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 21, counter: 21] cvt.s64.s32 %rd2, %r14 [stack 4] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 22, counter: 22] mul.lo.u64 %rd3, %rd2, 4 [stack 4] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 23, counter: 23] add.u64 %rd4, %rd1, %rd3 [stack 4] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 24, counter: 24] st.global.s32 [%rd4 + 0], %r13 [stack 4] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 25] reconverge [stack 4] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 26] reconverge [stack 3] [active 21]

CooperativeThreadArray.cpp:273:  [PC: 26, counter: 27] add.s32 %r8, %r8, 1 [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 27, counter: 28] setp.ne.s32 %p3, %r1, %r8 [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 28, counter: 29] @%p3 bra $Lt_1_10 [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 14, counter: 30] @!%p2 bra $Lt_1_11 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 15, counter: 31] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 16, counter: 32] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 17, counter: 33] add.s32 %r12, %r8, %r11 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 18, counter: 34] add.s32 %r13, %r12, 1 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 19, counter: 35] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 20, counter: 36] add.s32 %r14, %r10, %r8 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 21, counter: 37] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 22, counter: 38] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 23, counter: 39] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 24, counter: 40] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 41] reconverge [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 42] reconverge [stack 4] [active 21]

CooperativeThreadArray.cpp:273:  [PC: 26, counter: 43] add.s32 %r8, %r8, 1 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 27, counter: 44] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 28, counter: 45] @%p3 bra $Lt_1_10 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 14, counter: 46] @!%p2 bra $Lt_1_11 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 15, counter: 47] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 16, counter: 48] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 17, counter: 49] add.s32 %r12, %r8, %r11 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 18, counter: 50] add.s32 %r13, %r12, 1 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 19, counter: 51] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 20, counter: 52] add.s32 %r14, %r10, %r8 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 21, counter: 53] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 22, counter: 54] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 23, counter: 55] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 24, counter: 56] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 57] reconverge [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 58] reconverge [stack 4] [active 21]

CooperativeThreadArray.cpp:273:  [PC: 26, counter: 59] add.s32 %r8, %r8, 1 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 27, counter: 60] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 28, counter: 61] @%p3 bra $Lt_1_10 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 14, counter: 62] @!%p2 bra $Lt_1_11 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 15, counter: 63] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 16, counter: 64] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 17, counter: 65] add.s32 %r12, %r8, %r11 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 18, counter: 66] add.s32 %r13, %r12, 1 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 19, counter: 67] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 20, counter: 68] add.s32 %r14, %r10, %r8 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 21, counter: 69] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 22, counter: 70] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 23, counter: 71] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 24, counter: 72] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 73] reconverge [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 74] reconverge [stack 4] [active 21]

CooperativeThreadArray.cpp:273:  [PC: 26, counter: 75] add.s32 %r8, %r8, 1 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 27, counter: 76] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 28, counter: 77] @%p3 bra $Lt_1_10 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 14, counter: 78] @!%p2 bra $Lt_1_11 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 15, counter: 79] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 16, counter: 80] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 17, counter: 81] add.s32 %r12, %r8, %r11 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 18, counter: 82] add.s32 %r13, %r12, 1 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 19, counter: 83] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 20, counter: 84] add.s32 %r14, %r10, %r8 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 21, counter: 85] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 22, counter: 86] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 23, counter: 87] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 24, counter: 88] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 89] reconverge [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 90] reconverge [stack 4] [active 21]

CooperativeThreadArray.cpp:273:  [PC: 26, counter: 91] add.s32 %r8, %r8, 1 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 27, counter: 92] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 28, counter: 93] @%p3 bra $Lt_1_10 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 14, counter: 94] @!%p2 bra $Lt_1_11 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 15, counter: 95] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 16, counter: 96] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 17, counter: 97] add.s32 %r12, %r8, %r11 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 18, counter: 98] add.s32 %r13, %r12, 1 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 19, counter: 99] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 20, counter: 100] add.s32 %r14, %r10, %r8 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 21, counter: 101] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 22, counter: 102] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 23, counter: 103] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 24, counter: 104] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 105] reconverge [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 106] reconverge [stack 4] [active 21]

CooperativeThreadArray.cpp:273:  [PC: 26, counter: 107] add.s32 %r8, %r8, 1 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 27, counter: 108] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 28, counter: 109] @%p3 bra $Lt_1_10 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 14, counter: 110] @!%p2 bra $Lt_1_11 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 15, counter: 111] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 16, counter: 112] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 17, counter: 113] add.s32 %r12, %r8, %r11 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 18, counter: 114] add.s32 %r13, %r12, 1 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 19, counter: 115] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 20, counter: 116] add.s32 %r14, %r10, %r8 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 21, counter: 117] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 22, counter: 118] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 23, counter: 119] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 24, counter: 120] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 121] reconverge [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 122] reconverge [stack 4] [active 21]

CooperativeThreadArray.cpp:273:  [PC: 26, counter: 123] add.s32 %r8, %r8, 1 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 27, counter: 124] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 28, counter: 125] @%p3 bra $Lt_1_10 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 14, counter: 126] @!%p2 bra $Lt_1_11 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 15, counter: 127] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 16, counter: 128] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 17, counter: 129] add.s32 %r12, %r8, %r11 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 18, counter: 130] add.s32 %r13, %r12, 1 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 19, counter: 131] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 20, counter: 132] add.s32 %r14, %r10, %r8 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 21, counter: 133] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 22, counter: 134] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 23, counter: 135] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 24, counter: 136] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 137] reconverge [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 138] reconverge [stack 4] [active 21]

CooperativeThreadArray.cpp:273:  [PC: 26, counter: 139] add.s32 %r8, %r8, 1 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 27, counter: 140] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 28, counter: 141] @%p3 bra $Lt_1_10 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 14, counter: 142] @!%p2 bra $Lt_1_11 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 15, counter: 143] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 16, counter: 144] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 17, counter: 145] add.s32 %r12, %r8, %r11 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 18, counter: 146] add.s32 %r13, %r12, 1 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 19, counter: 147] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 20, counter: 148] add.s32 %r14, %r10, %r8 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 21, counter: 149] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 22, counter: 150] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 23, counter: 151] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 24, counter: 152] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 153] reconverge [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 154] reconverge [stack 4] [active 21]

CooperativeThreadArray.cpp:273:  [PC: 26, counter: 155] add.s32 %r8, %r8, 1 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 27, counter: 156] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 28, counter: 157] @%p3 bra $Lt_1_10 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 14, counter: 158] @!%p2 bra $Lt_1_11 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 15, counter: 159] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 16, counter: 160] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 17, counter: 161] add.s32 %r12, %r8, %r11 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 18, counter: 162] add.s32 %r13, %r12, 1 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 19, counter: 163] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 20, counter: 164] add.s32 %r14, %r10, %r8 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 21, counter: 165] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 22, counter: 166] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 23, counter: 167] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 24, counter: 168] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 169] reconverge [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 170] reconverge [stack 4] [active 21]

CooperativeThreadArray.cpp:273:  [PC: 26, counter: 171] add.s32 %r8, %r8, 1 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 27, counter: 172] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 28, counter: 173] @%p3 bra $Lt_1_10 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 14, counter: 174] @!%p2 bra $Lt_1_11 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 15, counter: 175] mul.lo.s32 %r10, %r1, %r5 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 16, counter: 176] mul.lo.s32 %r11, %r10, 2 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 17, counter: 177] add.s32 %r12, %r8, %r11 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 18, counter: 178] add.s32 %r13, %r12, 1 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 19, counter: 179] ld.param.u64 %rd1, [__cudaparm__Z12k_n_sequencePiii_A] [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 20, counter: 180] add.s32 %r14, %r10, %r8 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 21, counter: 181] cvt.s64.s32 %rd2, %r14 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 22, counter: 182] mul.lo.u64 %rd3, %rd2, 4 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 23, counter: 183] add.u64 %rd4, %rd1, %rd3 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 24, counter: 184] st.global.s32 [%rd4 + 0], %r13 [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 185] reconverge [stack 5] [active 11]

CooperativeThreadArray.cpp:273:  [PC: 25, counter: 186] reconverge [stack 4] [active 21]

CooperativeThreadArray.cpp:273:  [PC: 26, counter: 187] add.s32 %r8, %r8, 1 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 27, counter: 188] setp.ne.s32 %p3, %r1, %r8 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 28, counter: 189] @%p3 bra $Lt_1_10 [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 29, counter: 190] reconverge [stack 3] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 30, counter: 191] reconverge [stack 2] [active 32]

CooperativeThreadArray.cpp:273:  [PC: 31, counter: 192] exit [stack 1] [active 32]

_Z12k_n_sequencePiii_cfg.dot.png

More references:

http://www.freepatentsonline.com/6947047.html - NVIDIA patent describing serialization

"In a class of embodiments, the invention is a programmable, pipelined graphics processor (a vertex processor, in preferred embodiments) having at least two processing pipelines. Each pipeline processes data in accordance with a program, including by executing branch instructions. The processor is operable in at least one parallel processing mode in which N data values (where N≧2) to be processed in parallel in accordance with the same program are launched simultaneously into N pipelines, and in at least one fully serialized mode in which only one pipeline at a time processes data values in accordance with the program (and operation of each other pipeline is frozen). During operation in each parallel processing mode, mode control circuitry recognizes and resolves branch instructions to be executed (before processing of data in accordance with each branch instruction starts) and causes the processor to operate in a fully serialized mode when (and preferably only for as long as) necessary to prevent any conflict between the pipelines due to branching. Typically, the mode control circuitry causes the processor to operate in the fully serialized mode when it determines that a branch will be taken in at least one pipeline and that the branch will not be taken to the same target in all the pipelines. Preferably the mode control circuitry also recognizes other instructions whose execution is predicted to cause conflicts between pipelines (e.g., instructions that would require conflicting access to a shared memory during the parallel processing mode to process data in multiple pipelines) and causes the processor to enter another serialized mode (e.g., a limited serialized mode) to prevent such conflicts. When the processor operates in a fully serialized mode or limited serialized mode it has a lower throughput rate than in the parallel processing mode. 
Typically, the processor receives multiple threads of data values (each thread typically comprising data for a different vertex) and processes different data threads in “lock-step” in different pipelines in the parallel processing mode (so that during each clock cycle, all pipelines receive data to be processed in accordance with the same instruction but each pipeline receives data belonging to a different thread) unless and until the processor recognizes that an instruction to be executed requires that the processor enter a serialized mode. "

RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing

http://graphics.cs.uni-sb.de/~woop/rpu/RPU_SIGGRAPH05.pdf - Ray tracing architecture describing the stack based warp splitting mechanism that we use. It says that the idea originally came from Sun in 1987:

Control Flow and Recursion: In order to allow for complex control flow even in an SIMD environment the architecture supports conditional branching and full recursion using masked execution [Slotnick et al. 1962] and a hardware-maintained register stack [Sun Microsystems 1987]. The top most part of the register stack is easily available through the SPU register file (see Section 3.1). Recursively tracing rays from any location in a shader is required to offer maximum flexibility to shader writers. Other approaches that only allow for tail-recursion or impose other restrictions [Kajiya 1986] are too limiting for practical use. A different control stack, hidden from the user, is used explicitly through function calls/returns and through a special trace instruction for recursively tracing new rays. This stack is implicitly also used for executing conditional branches, by splitting a chunk into a sub-chunk that performs the branch and one that does not. One of these sub-chunks is pushed onto the stack while waiting for the other to finish its execution.

Very informative. Related to some deadlock results I described here.

It sounds like what you are describing is a particular implementation of a CUDA device, rather than what is described by the CUDA programming model. Expect device implementation details not described in the general CUDA programming model to change across architectures. Generally speaking though, I would expect the compiler to generate code that manages which threads are masked. That may or may not be how it is currently done. But again… out of scope really. You can’t rely on any of this device-specific detail. It will change.

-Raystonn