Single Branch Divergence? [beginner question]

Hi Guys,

This is sort of a “CUDA Basics” question, but I was wondering something about Warps and divergence.

It was my understanding that if you have the following situation:

// N threads total in this warp (probably 32), L+M=N

if (cond1){
    // L threads end up here
    ....
}
else{
    // M threads end up here
    ....
}

return;

Where L+M = N, then the L threads where cond1 == true must complete before the remaining M threads can do anything. However, I’m less clear on what happens in this case:

// N threads total in this warp (probably 32), L+M=N

if (cond1){
    // L threads end up here
    ....
}

// The remaining M threads do nothing

return;

Where there is no “else” condition, and the threads where cond1 == false do nothing (and return after the branch.)

I can see some Stack Overflow posts that imply the M threads are flagged as null, but does that mean they are free to do other things? Or do they just sit on their hands till the L threads that made it into the branch complete?

More succinctly, does the second case where there is only one branch of divergence incur a performance penalty?

you can cast an if, else as 2 ifs

if (A)  { ... }
else    { ... }

// is equivalent to:

if (A)  { ... }
if (!A) { ... }

the SM schedules a finite number of instructions at a time to warps, on a per-warp basis
for a thread that fails the condition, the instruction pointer is generally just incremented; the thread is masked off and does no work inside the branch
what the non-participating threads do, or where they end up, generally depends on the depth of the branch and on whether there are pending synchronization calls ahead
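for example, with a barrier ahead: here's a made-up kernel of my own (assuming a 256-thread block) - the non-participating threads have nothing to do inside the branch, but they still have to wait at the barrier until the branch-takers reach it

__global__ void branch_then_sync(const float* in, float* out)
{
    __shared__ float buf[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    float v = 0.0f;
    if (in[i] > 0.0f) {      // the L threads of each warp end up here
        v = in[i] * in[i];   // ...and do the only real work
    }                        // the M other threads skip straight past this
    buf[tid] = v;

    __syncthreads();         // but they cannot leave before the L threads
                             // arrive at this barrier too

    if (tid == 0) {
        float s = 0.0f;
        for (int k = 0; k < 256; ++k) s += buf[k];
        out[blockIdx.x] = s;
    }
}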

Fair enough. Suppose that the non-participating threads do no more work, and return (as in the kernel in my original post). There are no __syncthreads() calls, and no more work for the thread to do. The branch depth is as in my original post as well: just one if and that’s it. In other words, if a thread realizes (for sure) that there is no more work for it to do, does it move on to something else?

As I type that I feel the answer is no, since it violates my concept of what a warp is. I also suppose I should just test this myself; I was just wondering if anyone knew offhand what the “scheduler”, if there is such a thing, does in this case.
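For reference, here is roughly the test I have in mind (kernel names and the busy-work loop are just made up for illustration): one kernel where only the even lanes of each warp do the work, and one where every lane does it. If the idle lanes freed the warp up for something else, the first one should finish much faster; my expectation is they’ll take about the same time.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void half_warp_works(float* data, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x & 1) == 0) {                  // only even lanes take the branch
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.000001f + 0.5f;
        data[i] = v;
    }
    // odd lanes fall through and return - the question is whether they cost anything
}

__global__ void all_threads_work(float* data, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];
    for (int k = 0; k < iters; ++k) v = v * 1.000001f + 0.5f;
    data[i] = v;
}

int main()
{
    const int n = 1 << 20, iters = 1000;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    // warm up, then time the divergent version
    half_warp_works<<<n / 256, 256>>>(d, iters);
    cudaEventRecord(t0);
    half_warp_works<<<n / 256, 256>>>(d, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float msHalf = 0.0f;
    cudaEventElapsedTime(&msHalf, t0, t1);

    // warm up, then time the fully-active version
    all_threads_work<<<n / 256, 256>>>(d, iters);
    cudaEventRecord(t0);
    all_threads_work<<<n / 256, 256>>>(d, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float msAll = 0.0f;
    cudaEventElapsedTime(&msAll, t0, t1);

    printf("even lanes only: %.3f ms   all lanes: %.3f ms\n", msHalf, msAll);
    cudaFree(d);
    return 0;
}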

they’re just marketing names. the “warp” is really a thread, the “thread” is a SIMD lane, and nvidia gpus have dozens of cores with 32-wide SIMD engines.

when you execute e.g. “if (x>0)”, this x lives in a 32-wide SIMD register, so it’s actually a small array “float x[32]”. the “x>0” comparison generates another small array “bool flags[32]” in a second SIMD register. now the thread (i mean the real thread!) executes the operations in the “then” part, all masked by the flags register. once those commands are finished, it executes the commands in the “else” part with the inverted mask. if you haven’t learned AVX-512 yet, i suggest looking into it - it can do the same things. so the entire code for “if(x>0) x+=y; else x-=z; …” looks like:

cmpgt x,r0,flags – compares x[i], i=0…31 to r0[i] (r0 holding the zeros) and saves results in flags[i]
add(flags) x,y – x[i]+=y[i], writes results only to lanes enabled in the flags
sub(!flags) x,z – x[i]-=z[i], writes results only to lanes disabled in the flags

as you see, even without an “else” part, the thread still needs to execute all the commands in the “then” part before it can move on to the “…” part
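since i mentioned AVX-512, here is the same “if(x>0) x+=y; else x-=z;” written with its mask registers (my own host-side sketch, 16 float lanes instead of 32, but the masking idea is identical):

#include <immintrin.h>

// if (x > 0) x += y; else x -= z;   done for 16 float lanes at once
__m512 branchy(__m512 x, __m512 y, __m512 z)
{
    // "cmpgt x,r0,flags": one bit per lane
    __mmask16 flags = _mm512_cmp_ps_mask(x, _mm512_setzero_ps(), _CMP_GT_OQ);

    // "add(flags) x,y": only lanes whose flag is 1 are updated, the rest keep x
    x = _mm512_mask_add_ps(x, flags, x, y);

    // "sub(!flags) x,z": same thing with the inverted mask for the "else" part
    x = _mm512_mask_sub_ps(x, _mm512_knot(flags), x, z);

    return x;
}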

for more complex code, gpus have a mechanism not yet seen on cpus - the gpu itself keeps a per-thread (i mean real thread!) 32-bit mask. the “cmpgt” command sets up this mask, and all the remaining code is masked by this condition, updating only those lanes of the 32-wide registers that correspond to 1s in the mask. once the “then” part is finished, the gpu executes a command that inverts the special mask register, and the new inverted mask is used to execute all the commands in the “else” part. once that is finished too, the gpu restores the original mask from a special stack (since the calling code may already be using this technique)
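if you want to watch that mask yourself, here is a small experiment sketch of my own using the __activemask() intrinsic (launch it as show_masks<<<1, 32>>> so there is exactly one full warp; the exact values can vary by architecture, since volta and later don’t guarantee where reconvergence happens):

#include <cstdio>

__global__ void show_masks()
{
    unsigned before = __activemask();     // usually 0xffffffff: all 32 lanes active

    unsigned inside = 0;
    if (threadIdx.x < 16) {               // lower half of the warp diverges
        inside = __activemask();          // expect 0x0000ffff on these lanes
    }

    unsigned after = __activemask();      // mask restored once the warp reconverges

    if (threadIdx.x == 0)
        printf("before=%08x inside=%08x after=%08x\n", before, inside, after);
}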