Single Branch Divergence? [beginner question]

Hi Guys,

This is sort of a “CUDA Basics” question, but I was wondering something about Warps and divergence.

It was my understanding that if you have the following situation:

// N threads total in this warp (probably 32), L+M=N

if (cond1){
    // L threads end up here
    ....
}
else{
    // M threads end up here
    ....
}

return;

Where L+M = N, then the L threads where cond1 == true must complete before the remaining M threads can do anything. However, I’m less clear on what happens in this case:

// N threads total in this warp (probably 32), L+M=N

if (cond1){
    // L threads end up here
    ....
}

// The remaining M threads do nothing

return;

Where there is no “else” condition, and the threads where cond1 == false do nothing (and return after the branch.)

I can see some Stack Overflow posts that imply the M threads are flagged as null, but does that mean they are free to do other things? Or do they just sit on their hands till the L threads that made it into the branch complete?

More succinctly, does the second case where there is only one branch of divergence incur a performance penalty?

you can cast an if, else as 2 ifs

if (A)  { ... }
else    { ... }

// is equivalent to:

if (A)  { ... }
if (!A) { ... }

the SM schedules a finite number of instructions at a time to warps, on a per-warp basis
for a thread that fails the condition, the instruction pointer is generally just incremented; the thread is masked off and does no work inside the branch
what the non-participating threads do, or where they end up, generally depends on the depth of the branch and on whether there are pending synchronization calls ahead
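for example, with a barrier ahead: here's a made-up kernel of my own (assuming a 256-thread block) - the non-participating threads have nothing to do inside the branch, but they still have to wait at the barrier until the branch-takers reach it

__global__ void branch_then_sync(const float* in, float* out)
{
    __shared__ float buf[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    float v = 0.0f;
    if (in[i] > 0.0f) {      // the L threads of each warp end up here
        v = in[i] * in[i];   // ...and do the only real work
    }                        // the M other threads skip straight past this
    buf[tid] = v;

    __syncthreads();         // but they cannot leave before the L threads
                             // arrive at this barrier too

    if (tid == 0) {
        float s = 0.0f;
        for (int k = 0; k < 256; ++k) s += buf[k];
        out[blockIdx.x] = s;
    }
}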

Fair enough. Suppose that the non-participating threads do no more work, and return (as in the kernel in my original post). There are no __syncthreads() calls, and no more work for the thread to do. The branch depth is as in my original post as well: just one if and that’s it. In other words, if a thread realizes (for sure) that there is no more work for it to do, does it move on to something else?

As I type that I feel the answer is no, since it violates my concept of what a warp is. I also suppose I should just test this myself; I was just wondering if anyone knew offhand what the “scheduler”, if there is such a thing, does in this case.
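For reference, here is roughly the test I have in mind (kernel names and the busy-work loop are just made up for illustration): one kernel where only the even lanes of each warp do the work, and one where every lane does it. If the idle lanes freed the warp up for something else, the first one should finish much faster; my expectation is they’ll take about the same time.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void half_warp_works(float* data, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x & 1) == 0) {                  // only even lanes take the branch
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.000001f + 0.5f;
        data[i] = v;
    }
    // odd lanes fall through and return - the question is whether they cost anything
}

__global__ void all_threads_work(float* data, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];
    for (int k = 0; k < iters; ++k) v = v * 1.000001f + 0.5f;
    data[i] = v;
}

int main()
{
    const int n = 1 << 20, iters = 1000;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    // warm up, then time the divergent version
    half_warp_works<<<n / 256, 256>>>(d, iters);
    cudaEventRecord(t0);
    half_warp_works<<<n / 256, 256>>>(d, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float msHalf = 0.0f;
    cudaEventElapsedTime(&msHalf, t0, t1);

    // warm up, then time the fully-active version
    all_threads_work<<<n / 256, 256>>>(d, iters);
    cudaEventRecord(t0);
    all_threads_work<<<n / 256, 256>>>(d, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float msAll = 0.0f;
    cudaEventElapsedTime(&msAll, t0, t1);

    printf("even lanes only: %.3f ms   all lanes: %.3f ms\n", msHalf, msAll);
    cudaFree(d);
    return 0;
}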

they’re just marketing names. the “warp” is really a thread, the “thread” is a SIMD lane, and nvidia gpus have dozens of cores with 32-wide SIMD engines.

when you execute e.g. “if (x>0)”, this x lives in a 32-wide SIMD register, so it’s actually a small array “float x[32]”. the “x>0” comparison generates another small array “bool flags[32]” in a second SIMD register. now the thread (i mean the real thread!) executes the operations in the “then” part, all masked by the flags register. once those commands are finished, it executes the commands in the “else” part with the inverted mask. if you haven’t learned AVX-512 yet, i suggest looking into it - it can do the same things. so the entire code for “if(x>0) x+=y; else x-=z; …” looks like:

cmpgt x,r0,flags – compares x[i], i=0…31 to r0[i] (r0 holding the zeros) and saves results in flags[i]
add(flags) x,y – x[i]+=y[i], writes results only to lanes enabled in the flags
sub(!flags) x,z – x[i]-=z[i], writes results only to lanes disabled in the flags

as you see, even without an “else” part, the thread still needs to execute all the commands in the “then” part before it can move on to the “…” part
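since i mentioned AVX-512, here is the same “if(x>0) x+=y; else x-=z;” written with its mask registers (my own host-side sketch, 16 float lanes instead of 32, but the masking idea is identical):

#include <immintrin.h>

// if (x > 0) x += y; else x -= z;   done for 16 float lanes at once
__m512 branchy(__m512 x, __m512 y, __m512 z)
{
    // "cmpgt x,r0,flags": one bit per lane
    __mmask16 flags = _mm512_cmp_ps_mask(x, _mm512_setzero_ps(), _CMP_GT_OQ);

    // "add(flags) x,y": only lanes whose flag is 1 are updated, the rest keep x
    x = _mm512_mask_add_ps(x, flags, x, y);

    // "sub(!flags) x,z": same thing with the inverted mask for the "else" part
    x = _mm512_mask_sub_ps(x, _mm512_knot(flags), x, z);

    return x;
}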

for more complex code, gpus have a mechanism not yet seen on cpus - the gpu itself keeps a per-thread (i mean real thread!) 32-bit mask. the “cmpgt” command sets up this mask, and all the remaining code is masked by this condition, updating only those lanes of the 32-wide registers that correspond to 1s in the mask. once the “then” part is finished, the gpu executes a command that inverts the special mask register, and the new inverted mask is used to execute all the commands in the “else” part. once that is finished too, the gpu restores the original mask from a special stack (since the calling code may already be using this technique)
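if you want to watch that mask yourself, here is a small experiment sketch of my own using the __activemask() intrinsic (launch it as show_masks<<<1, 32>>> so there is exactly one full warp; the exact values can vary by architecture, since volta and later don’t guarantee where reconvergence happens):

#include <cstdio>

__global__ void show_masks()
{
    unsigned before = __activemask();     // usually 0xffffffff: all 32 lanes active

    unsigned inside = 0;
    if (threadIdx.x < 16) {               // lower half of the warp diverges
        inside = __activemask();          // expect 0x0000ffff on these lanes
    }

    unsigned after = __activemask();      // mask restored once the warp reconverges

    if (threadIdx.x == 0)
        printf("before=%08x inside=%08x after=%08x\n", before, inside, after);
}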