Branch divergence with if/while: same as one of the threads in a warp returning?

I understand that control-flow constructs like if/while can cause branch divergence if threads in a warp take different paths, and that this serializes execution of the threads in the warp until all of them return to a common path.

What if one of the threads in a warp returns from the kernel function, meaning that particular thread has finished execution? Would this be equivalent to the branch divergence we see in the if/while case, and serialize the other threads in the same warp?

Yes. But having one branch that does something and one that does nothing is okay.

You should be careful about returning early from a kernel, as it means the behavior of any subsequent __syncthreads() is undefined.

For example this:

int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx >= N) return;   // out-of-range threads exit the kernel early

// do something

__syncthreads();        // undefined: not all threads of the block reach this

is a bug and should not be done. Instead do this:

int idx = threadIdx.x + blockIdx.x * blockDim.x;

if (idx < N) {
    // do something
}

__syncthreads();        // fine: every thread of the block reaches this

Also, I have seen cases where an early return from a device function caused the threads to diverge completely for the rest of the execution. In general I have therefore avoided early returns inside kernels and device functions.

@pasoleatis @Justin,

Right. Returning early from the kernel function serializes the other threads in the same block.

An if/while divergence of one thread resulting in serialization of the other threads in the same block sounds normal, but serializing threads in a block just because one thread completed execution sounds slightly bad, since some threads might hit conditions in the data that make them return prematurely. Is this a software design decision that might change in the future, or are there hardware restrictions that result in serialization of threads in a block when one thread in that block has returned early?

What exactly do you mean by “serialize other threads”? The other threads will still execute in parallel. Divergence is not like coalescing in compute capability 1.0 and 1.1 devices, where a single thread with a different address causes all other memory accesses in the warp to be performed serially.

By serialize I mean this, from the CUDA Best Practices Guide:

"Any flow control instruction (if, switch, do, for, while) can significantly affect the

instruction throughput by causing threads of the same warp to diverge; that is, to follow

different execution paths. If this happens, the different execution paths must be

serialized, increasing the total number of instructions executed for this warp. When all

the different execution paths have completed, the threads converge back to the same

execution path."

I think the threads within each branch get executed together, but different branches in a warp are executed one after the other. I tried branches of the form:

if(r<rcut)
{
do something
}
else
{
do nothing
}

In this case, taking away the if and executing "do something" for all threads was slower than keeping the if.
The best case scenario is when you can organize your branches in such a way that one warp or more executes the same branch.
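
Something along these lines (just a rough sketch; the kernel name and the operations are made up) keeps whole warps on the same path by branching on a value that is constant across each warp:

__global__ void warpAlignedBranch(float *data)
{
    int idx  = threadIdx.x + blockIdx.x * blockDim.x;
    int warp = threadIdx.x / warpSize;   // same value for all 32 threads of a warp

    if (warp % 2 == 0)
        data[idx] *= 2.0f;               // the whole warp takes this path together
    else
        data[idx] += 1.0f;               // or the whole warp takes this one

    // Branching on (idx % 2 == 0) instead would force every warp to execute
    // both paths one after the other.
}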

Right. So then if I have a block of 16 threads, and 6 threads in the warp return from the kernel function prematurely, it shouldn't serialize the other 10 threads, right? The other 10 threads should run in parallel without being serialized (as long as they don't hit any branch divergence), shouldn't they?

So is it advisable to put a __syncthreads() right before the return in all kernel functions (as long as there is only one return in the kernel function and that return isn't inside any if/while branch)?

You should NOT use __syncthreads() unless your algorithm really needs it, as in the case of sums or histograms! Unnecessary __syncthreads() will only slow your code down!
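
For example, a block-level sum in shared memory is one of the cases where __syncthreads() really is required (a rough sketch, assuming a block size of 256):

__global__ void blockSum(const float *in, float *out, int N)
{
    __shared__ float partial[256];                 // assumes blockDim.x == 256

    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    partial[threadIdx.x] = (idx < N) ? in[idx] : 0.0f;
    __syncthreads();                               // required: the loop below reads values written by other threads

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();                           // required after every reduction step
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];              // one partial sum per block
}

Note that every thread of the block reaches every __syncthreads() here, which is exactly why early returns would break it.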

[quote name='bbox' post='1341142']
Right. So then if I have a block of 16 threads, and 6 threads in the warp return from the kernel function prematurely, it shouldn't serialize the other 10 threads, right? The other 10 threads should run in parallel without being serialized (as long as they don't hit any branch divergence), shouldn't they?
[/quote]

My understanding is that it will run the 10 threads in one warp pass while the other 6 run in another, so it runs 1 extra warp pass. If you can arrange the branches to fall in different warps (for example, half of the threads take one branch and the other half take the other), there should be no penalty.

But since those 6 threads have finished execution and returned from the kernel function, they shouldn't be executed at all, even as part of a separate warp, right?

Or even if the GPU puts those 6 threads in a different warp pass and tries to execute them, they should return immediately since all 6 are dead, in which case the only overhead would be scheduling that pass of 6 already-completed threads.

Yes. In the above case the warp will be executed in 2 passes instead of 1, but the overhead should be small. The worst case is when every thread in a warp takes a different branch (thread 0 takes branch 0, thread 1 takes branch 1, thread 2 takes branch 2, and so on), in which case the warp has to run all 32 paths one after the other. This is why the branches should be grouped in such a way that one group of threads executes one branch while another group executes another.
Just my 2 cents :)

Right. Thanks. Such grouping is unpredictable in my case, since it depends on the data supplied. I was mainly trying to handle some threads in a block returning early from the kernel function.

I have a similar problem with a random branch. I tried both cases, with the if and without it. In my case the code was faster with the if. I suspect that was because I had only 1 or 2 threads branching, so it depends on the case.

In the case I saw, the control flow was inside a loop within a device function which had a return condition for some threads. It was debugging code that was never triggered, but the fact that the early return was there caused every thread within a warp to operate serially; that is, each warp was 32-way serialized. This may have been fixed in a more recent compiler build, but from what I have seen it is best to avoid early returns and instead use ifs to mask out threads.
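
Roughly, the masking approach looks like this (just a sketch; the function name and the failure condition are made up for illustration):

// Instead of:  if (value < 0.0f) return 0.0f;   // early return inside the loop
__device__ float safeAccumulate(const float *data, int n)
{
    float sum = 0.0f;
    bool  ok  = true;                   // mask flag replaces the early return

    for (int i = 0; i < n; ++i) {
        float value = data[i];
        if (value < 0.0f)               // the condition that used to trigger the return
            ok = false;
        if (ok)
            sum += value;               // masked-out threads simply skip the work
    }
    return ok ? sum : 0.0f;             // single exit point at the end of the function
}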

How did you determine that the threads in the warp were 32-way serialized in your case? Just so I can use your technique to trace similar issues in my code.

It was a while ago so the details are fuzzy, but I think I noticed it in the debugger when stepping through the threads.