Branch divergence with if/while: same as one of the threads in a warp returning?

I understand that control-flow constructs like if/while can cause branch divergence if threads in a warp take different paths, and that this serializes execution of the threads in the warp until all of them return to a common path.

What if one of the threads in a warp returns from the kernel function, meaning that particular thread has finished execution? Would this be equivalent to the branch divergence we see in the if/while case, and serialize the other threads in the same warp?

Yes. But having one branch that does something and one that does nothing is okay.

You should be careful about returning early from a kernel, as it means the behavior of any subsequent __syncthreads() is undefined.

For example this:

int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx >= N) return;   // out-of-range threads exit the kernel early

// do something

__syncthreads();        // undefined: not all threads of the block reach this

is a bug and should not be done. Instead do this:

int idx = threadIdx.x + blockIdx.x * blockDim.x;

if (idx < N) {
    // do something
}

__syncthreads();        // fine: every thread of the block reaches this

Also, I have seen cases where an early return from a device function caused the threads to diverge completely for the rest of the execution. In general I have therefore avoided early returns inside kernels and device functions.

@pasoleatis @Justin,

Right. Returning early from the kernel function serializes the other threads in the same block.

An if/while divergence of one thread resulting in serialization of the other threads in the same block sounds normal, but serializing threads in a block just because one thread completed execution sounds slightly bad, since some threads might hit conditions in the data that make them return prematurely. Is this a software design decision that might change in the future, or are there hardware restrictions that result in serialization of threads in a block when one thread in that block has returned early?

What exactly do you mean by “serialize other threads”? The other threads will still execute in parallel. Divergence is not like coalescing in compute capability 1.0 and 1.1 devices, where a single thread with a different address causes all other memory accesses in the warp to be performed serially.

By serialize I mean this, from the CUDA Best Practices Guide:

"Any flow control instruction (if, switch, do, for, while) can significantly affect the

instruction throughput by causing threads of the same warp to diverge; that is, to follow

different execution paths. If this happens, the different execution paths must be

serialized, increasing the total number of instructions executed for this warp. When all

the different execution paths have completed, the threads converge back to the same

execution path."

I think the threads within each branch get executed together, but different branches in a warp are executed one after the other. I tried branches of the form:

if(r<rcut)
{
do something
}
else
{
do nothing
}

In this case, taking away the if and executing "do something" for all threads was slower than keeping the if.
The best case scenario is when you can organize your branches in such a way that one warp or more executes the same branch.
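
Something along these lines (just a rough sketch; the kernel name and the operations are made up) keeps whole warps on the same path by branching on a value that is constant across each warp:

__global__ void warpAlignedBranch(float *data)
{
    int idx  = threadIdx.x + blockIdx.x * blockDim.x;
    int warp = threadIdx.x / warpSize;   // same value for all 32 threads of a warp

    if (warp % 2 == 0)
        data[idx] *= 2.0f;               // the whole warp takes this path together
    else
        data[idx] += 1.0f;               // or the whole warp takes this one

    // Branching on (idx % 2 == 0) instead would force every warp to execute
    // both paths one after the other.
}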

Right. So then if I have a block of 16 threads, and 6 threads in the warp return from the kernel function prematurely, it shouldn't serialize the other 10 threads, right? The other 10 threads should run in parallel without being serialized (as long as they don't hit any branch divergence), shouldn't they?

So is it advisable to put a __syncthreads() right before the return in all kernel functions (as long as there is only one return in the kernel function and that return isn't inside any if/while branch)?

You should NOT use __syncthreads() unless your algorithm really needs it, as in the case of sums or histograms! Unnecessary __syncthreads() will only slow your code down!
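
For example, a block-level sum in shared memory is one of the cases where __syncthreads() really is required (a rough sketch, assuming a block size of 256):

__global__ void blockSum(const float *in, float *out, int N)
{
    __shared__ float partial[256];                 // assumes blockDim.x == 256

    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    partial[threadIdx.x] = (idx < N) ? in[idx] : 0.0f;
    __syncthreads();                               // required: the loop below reads values written by other threads

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();                           // required after every reduction step
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];              // one partial sum per block
}

Note that every thread of the block reaches every __syncthreads() here, which is exactly why early returns would break it.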

[quote name='bbox' post='1341142']
Right. So then if I have a block of 16 threads, and 6 threads in the warp return from the kernel function prematurely, it shouldn't serialize the other 10 threads, right? The other 10 threads should run in parallel without being serialized (as long as they don't hit any branch divergence), shouldn't they?
[/quote]

My understanding is that it will run the 10 threads in one warp pass while the other 6 run in another, so it runs 1 extra warp pass. If you can arrange the branches to fall in different warps (for example, half of the threads take one branch and the other half take the other), there should be no penalty.

But since those 6 threads have finished execution and returned from the kernel function, they shouldn't be executed at all, even as part of a separate warp, right?

Or even if the GPU puts those 6 threads in a different warp pass and tries to execute them, they should return immediately since all 6 are dead, in which case the only overhead would be scheduling that pass of 6 already-completed threads.

Yes. In the above case the warp will be executed in 2 passes instead of 1, but the overhead should be small. The worst case is when every thread in a warp takes a different branch (thread 0 takes branch 0, thread 1 takes branch 1, thread 2 takes branch 2, and so on), in which case the warp has to run all 32 paths one after the other. This is why the branches should be grouped in such a way that one group of threads executes one branch while another group executes another.
Just my 2 cents :)

Right. Thanks. Such grouping is unpredictable in my case, since it depends on the data supplied. I was mainly trying to handle some threads in a block returning early from the kernel function.

I have a similar problem with a random branch. I tried both cases, with the if and without it. In my case the code was faster with the if. I suspect that was because I had only 1 or 2 threads branching, so it depends on the case.

In the case I saw, the control flow was inside a loop within a device function which had a return condition for some threads. It was debugging code that was never triggered, but the fact that the early return was there caused every thread within a warp to operate serially; that is, each warp was 32-way serialized. This may have been fixed in a more recent compiler build, but from what I have seen it is best to avoid early returns and instead use ifs to mask out threads.
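
Roughly, the masking approach looks like this (just a sketch; the function name and the failure condition are made up for illustration):

// Instead of:  if (value < 0.0f) return 0.0f;   // early return inside the loop
__device__ float safeAccumulate(const float *data, int n)
{
    float sum = 0.0f;
    bool  ok  = true;                   // mask flag replaces the early return

    for (int i = 0; i < n; ++i) {
        float value = data[i];
        if (value < 0.0f)               // the condition that used to trigger the return
            ok = false;
        if (ok)
            sum += value;               // masked-out threads simply skip the work
    }
    return ok ? sum : 0.0f;             // single exit point at the end of the function
}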

How did you determine that the threads in the warp were 32-way serialized in your case? Just so I can use your technique to trace similar issues in my code.

It was a while ago so the details are fuzzy, but I think I noticed it in the debugger when stepping through the threads.