Problem with correct branching within a warp

I have a problem with the order of branching within a warp. Or maybe something else is going wrong there? Consider the following code:

[codebox]__global__ void doSomething(float *c) {

	__shared__ volatile float u;

	u = 0;  // (assumed initialization; the forum ate several code lines here)

	if (threadIdx.x==12) {
		u = 5;
		return;
	}

	c[0] = u;
}[/codebox]

I am running it with <<<1,32>>> configuration.

As a result I get 0 in c[0].

My guess is that the c[0] assignment is actually executed first, and only later the threadIdx.x==12 branch.

Even adding a __syncthreads() after the if statement isn’t helping.

A partial solution would be:

[codebox]__global__ void doSomething(float *c) {

	__shared__ volatile float u;

	u = 0;  // (assumed initialization; the posted listing was incomplete)

	if (threadIdx.x==12) u=5;

	if (threadIdx.x==12) return;

	c[0] = u;
}[/codebox]



probably because now the branch is resolved through a predicate. Although this solution works in this simple example, it would be painfully long or even undoable in a real-life situation where the branch is much longer.

My question is: is there a way to somehow force the correct order of branch evaluation?

Or some other, better workaround for this problem?

I recommend initializing your c[0] to a known value in your application (like 0xdeadbeef).
Check your kernel launch and cudaMemcpy for errors.

Then, we will be in a better position to tell what is happening…

It's an interesting experiment though. If you could publish the complete source code, it would be useful.

__global__ void doSomething(float *c) 
{
	__shared__ volatile float u;

	if (threadIdx.x!=12) 
	{
		c[0]= u;  
	} else {
		u = 5;   // (reconstructed; the posted listing was incomplete)
	}
}
Reversing the condition this way gives the desired result. (TESLA C1060)

Anyway, one should NOT base logic on such hardware quirks!

Here is the PTX for the code above:

cvt.u32.u16 	%r1, %tid.x;	 	// 

	mov.u32 	%r2, 12;			 	// 

	setp.eq.u32 	%p1, %r1, %r2;   	// 

	@%p1 bra 	$Lt_0_1282;		 	// 

	.loc	14	11	0

	ld.volatile.shared.f32 	%f2, [u];	// id:9 u+0x0

	ld.param.u32 	%r3, [__cudaparm__Z11doSomethingPf_c];	// id:11 __cudaparm__Z11doSomethingPf_c+0x0

	st.global.f32 	[%r3+0], %f2;	// id:12

	bra.uni 	$LBB5__Z11doSomethingPf;	// 

$Lt_0_1282:
	.loc	14	13	0

	mov.f32 	%f3, 0f40a00000;	 	// 5

	st.volatile.shared.f32 	[u], %f3;	// id:9 u+0x0

$LBB5__Z11doSomethingPf:
	.loc	14	17	0

	exit;						 	//

I did a few tweaks like:

  1. Changing threadIdx.x from 12 to other values like (0, 31)
  2. Adding more complexity to the “else” part (threadId==12 path) of the code above.

None of this had any effect on the execution. It's always the same: the ELSE part finishes first, followed by the IF part.

I completely agree.

This small code is a boiled-down case of a much, much bigger project I am currently developing, in which I triggered this quirk unwillingly.

The full code, as you requested:

[codebox]#include <stdio.h>

__global__ void doSomething(float *c) {

	__shared__ volatile float u;

	u = 0;  // (assumed initialization; the forum ate several code lines)

	if (threadIdx.x==12) {
		u = 5;
		return;
	}

	c[0] = u;
}

int main() {

	float *r;

	cudaMalloc( (void**) &r, sizeof(float));

	cudaMemset( r, 127, sizeof(float));

	doSomething<<<1,32>>>( r );

	float gg;

	cudaMemcpy( &gg, r, sizeof(float), cudaMemcpyDeviceToHost);  // (read-back reconstructed; listing was incomplete)

	printf("%f\n", gg);

	return 0;
}[/codebox]


OK, Here is a hypothesis on divergent-warps:

  1. When a WARP diverges, there is a need for the hardware to create 2 warps:
    A ) One that does NOT take the branch and continues serially. (parent)
    B ) One that takes the branch (child)

  2. The child warp is linked to parent warp.
    Symbolically we can represent it as A <- B.
    A is the parent and B is the child

  3. For every warp-divergence, a point of re-convergence is arrived at by the code generator (courtesy: PTX Instruction Guide)
    I am assuming that this data will be available to the hardware at run-time. (otherwise who else needs it?)

  4. This point of re-convergence is stored as a parameter only in the child warp.

  5. Child-warp is given control of execution first.

  6. Child-warp executes until the point of re-convergence and signals the parent to run.
    The parent has access to the child warp that activated it. And, so it knows the point of re-convergence.

  7. Parent runs until point of re-convergence and merges with the child warp again.

I will back up my hypothesis with the reasoning behind it in the next post.

The problem here is synchronization. You should put in a __syncthreads() to make the change visible to all threads, and avoid using ‘return’ in conditional branches. Try the following code:

[codebox]__global__ void doSomething(float *c) {

	__shared__ volatile float u;

	u = 0;  // (body reconstructed from the suggestion above)

	if (threadIdx.x==12) {
		u = 5;
	}

	__syncthreads();

	c[0] = u;
}[/codebox]

__syncthreads() is only for synchronizing warps. In this case, there is only 1 warp.

However, this might work because there is no ELSE statement there. So the divergent part will just wait at the end of the IF statement, re-join with the rest of the warp, and continue executing.

Thus even without a __syncthreads() (and without the ELSE), this code will work. But the question is not about this behaviour (I think).



	u = 123;   // (initialization inferred from the reported result)

	if (threadIdx.x==12) 
	{
		u = 5;
	}  else {
		c[0] = u;
	}


The code above will yield 123 (with 1 block, 32 threads)



	u = 123;   // (initialization inferred from the reported result)

	if (threadIdx.x==12) 
	{
		u = 5;
	}

	c[0] = u;

This code above will yield 5 (1 block, 32 threads)

That would be __threadfence_block(), but I am using a volatile shared variable so that I don't have to think about that problem.

However, the return statement in the branch could indeed be the real source of the problem.

However, in some cases the inability to put ‘return’ in some very deep conditional statement may increase the complexity of the implementation and add runtime overhead.

Anyway, I haven't seen anywhere in the programming guide that you shouldn't put a ‘return’ statement in a conditional branch - did I miss it?

Also, from my personal experience, a conditional return does not interfere with __syncthreads(): the program does not hang.

I am curious about what exactly you want to accomplish? Maybe there is some other clean way to do it.

In this thread, I just want to understand what is happening, why and how to counteract it. I tend to attract strange behaviour of hardware/software :)

In general, there are various smaller or bigger situations in which one could trigger this behaviour (accidentally or not). In my case, I am working with my simple assertion system for kernels running on the device (not deviceemu).

For kernels hundreds of lines long, tracking a bug is not an easy task, and deviceemu has proven not very helpful. Certain parts of the code will work on the device while failing in deviceemu, most likely because ‘warpSize’ in deviceemu is 1 and not 32.

Therefore, instead of fighting with device/deviceemu compatibility, I am running my (often unfinished) code immediately on the GPU, guarded with lots of asserts. When one fails, I want the kernel to terminate as quickly as possible and report the actual problem. Currently the macro looks as follows:

[codebox]#define CUDA_ASSERT(condition , raiseErrorCode, ad1,ad2,ad3,ad4) do { \
	if (!(condition)) { \
		if (atomicAdd(&_globalError->accessSemaphor,1)==0) { /* ensure exactly one thread reports the problem */ \
			_globalError->errorCode=(raiseErrorCode); \
			_globalError->dimGrid=gridDim; \
			_globalError->dimBlock=blockDim; \
			_globalError->block=blockIdx; \
			_globalError->thread=threadIdx; \
			_globalError->a.x=(ad1); /* helper values, can be anything */ \
			_globalError->a.y=(ad2); \
			_globalError->a.z=(ad3); \
			_globalError->a.w=(ad4); \
		} \
		__threadfence(); \
		return; \
	} \
	if (_globalError->accessSemaphor>0) /* maybe some other thread ended in an error state (possibly in a different assert) */ \
		return; \
} while(0)[/codebox]


Once the kernel has terminated, I copy _globalError to the host and check whether errorCode is nonzero. If it is, I print all the data associated with the event.

Note I don’t want to use __syncthreads() in the macro, because I want it to be usable in branches as well.

The macro will work only in a __global__ function; otherwise ‘return’ won’t terminate the kernel. If I am not mistaken, there is no exit() function which could terminate the thread from anywhere, is there?

This simple mechanism has proven to be very helpful for me on several occasions already, but sometimes tricky things like those described above occur and I get confused.

You might be interested in reading this patent: US7353369

which seems to reflect NVIDIA's implementation quite accurately (it has worked fine for me so far as the default branching algorithm in Barra).

So yes, ‘else’ blocks are executed before ‘if’ blocks in current NVIDIA GPUs.

Of course, the usual disclaimer: don’t depend on this, unsupported, might change without notice, and so on…


So if I got it right, your problem is that the second ‘if’ is executed first and threads that didn’t assert or didn’t get the semaphore continue running until the next assert?

There is an exit instruction in PTX (and assembly), but it is not exposed in C for CUDA…

Well, I don’t expect all threads to stop at this assert, because I didn’t __syncthreads() them, but I was hoping it would work at least within a single warp, which is not the case.
My problem is that some threads (even from the same warp) may continue working, pass several future asserts, and change values in shared memory before the reporting thread executes atomicAdd and sets the helper variables _globalError->a (which usually depend on shared memory content).

I will read the patent you pointed me to. Thank you.

Just some quotes from the CUDA spec:

1- “The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0”

2- “A warp executes one common instruction at a time”

3- “If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.”

4- “If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location and the order in which they occur is undefined, but one of the writes is guaranteed to succeed.”

I am very concerned about the 4th quote. I am not sure how it works in the case of more than one active block per multiprocessor (I found that in that case the guarantee that “one of the writes is successful” fails).

From my understanding, the ‘volatile’ declaration is only for avoiding over-smart optimizations at compile time. It does not guarantee that, within the same warp, data written to shared memory by one thread at one instruction will be instantly visible to all other threads at the next instruction. You still have to use either __threadfence_block() or __syncthreads() for synchronization. I prefer the latter because it is compatible with all CUDA revisions and effective for the whole block.

I know all 4 statements you quoted.

I hadn’t realized the problem with the 4th one; however, it should not influence my initial problem code, as it works on exactly 1 block anyway, nor my assertion macro, as the first atomic instruction ensures that exactly one thread writes to the rest of the _globalError object.

The 3rd statement says that a warp executes one branch after another, but does not specify the order of execution. Intuitively it should be the order in which the programmer coded it, but apparently that is not a necessity, and I don’t know how to control it.

Within the same warp it should, as long as the compiler does not optimize the access away (that is why volatile is needed). Am I mistaken here?

‘volatile’ does not ensure synchronization even though there is only one warp in your kernel. ‘volatile’ makes the compiler always generate code that actually reads the shared/global memory, rather than inferring the value from previous reads in the same thread. This mechanism does not imply synchronization, for 2 reasons:

  • Pipeline structure of the multiprocessor: the reading phase of the next instruction may occur before the writing phase of the current instruction.

  • Writes to shared/global memory take time.

By default, CUDA ignores this latency when compiling code, for the sake of speed. It is the user’s responsibility to add synchronization.

Thus, always synchronizing the warp/block after writes to shared/global memory is a safe habit. You can try to minimize the number of synchronizing instructions in use, but it would be very dangerous not to use any of them at all.

Here is one more quote from the newest CUDA spec:

“Note that even if myArray is declared as volatile in the code sample above, there is no guarantee, in general, that ref2 will be equal to 2 in thread tid since thread tid might read myArray[tid] into ref2 before thread tid-1 overwrites its value by 2. Synchronization is required as mentioned in Section 5.4.”

I think that exit is only for a thread (or a block); it is not for the whole kernel. (I could be wrong here.)

Well, you can’t control it. But now we know the “else” part is executed first (on the TESLA C1060 as well), so there is no need to control it (though it might change, however remotely, in future hardware).

But at the moment, I think you can take advantage of this fact, if you would like.

Thanks for raising this interesting topic.

When we leave something undefined, it really means “it is unsafe to depend on this behavior at all.” Like the 16KB of shmem hack people are working on, don’t write any code that depends on undefined behavior because that means it is convenient to change it at the hardware/driver/compiler level and it might fall out from under you.

I emphasize “in general”! Because nothing is said about warps there.

But section 5.4 says:

"However, in the following slightly modified code sample, threads are guaranteed to belong to the same warp, so that there is no need for any __syncthreads().

[some code here]

Simply removing the __syncthreads() is not enough however; myArray also needs to be declared as volatile as described in Section B.2.4."