Bug report: Threads out of sync, branched syncthreads problem

Operating System: Windows XP 64-bit, but host and device code compiled as 32 bit

CUDA toolkit: v. 2.3

SDK: v. 2.3

Compiler: Visual Studio 2008, v. 9.0.21022.8 RTM

deCUDA: v. 0.4.1, with some quick fixes written by myself


CPU Intel Core2 Duo E8600 3.33GHz,

GPU NVIDIA GeForce 9600GT (for display only),

GPU NVIDIA GeForce GTX 285 (for CUDA computation)

Driver: 191.07

Problem: __syncthreads() barriers go “out of sync” even though they are placed in a branch that is taken either by the whole block or by no threads at all.

From the Programming Guide: “__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block.”

Consider the following source code:

__device__ void devFunction(int *output, int cnd) {
	if (threadIdx.x>13) {
		if (int(threadIdx.x)-cnd>=64) //(A)
			return;
	}

	__syncthreads(); //(B1)
	output[31]+=1;

	__syncthreads(); //(B2)
}


__global__ void myKernel4(int *output,int cnd) {
	int lane=threadIdx.x % 32;
	int warp=threadIdx.x/32;

	devFunction(output,cnd);

	__shared__ int cc;
	cc=0; //(C)
	__syncthreads();
	atomicAdd(&cc,1);
	__syncthreads();

	if (lane==0)
		output[warp]=cc; //(D)
}



void isolatedProblem() {
	const int maxBlocks=60;
	const int warpCount=4;
	int *gpuOutput;
	cudaMalloc((void**)&gpuOutput, sizeof(int)*32*warpCount);
	int cpuOutput[32*warpCount];
	for (int i=0; i<32*warpCount; ++i)
		cpuOutput[i]=-123;
	cudaMemcpy(gpuOutput,cpuOutput,sizeof(int)*32*warpCount,cudaMemcpyHostToDevice);

	myKernel4<<<1,warpCount*32>>>(gpuOutput,10000); //(E)

	cudaMemcpy(cpuOutput,gpuOutput,sizeof(int)*32*warpCount,cudaMemcpyDeviceToHost);
	for (int i=0; i<warpCount; ++i)
		printf(" %d",cpuOutput[i]);
}


At position (A) we have a conditional jump out of the function devFunction, which could theoretically lead to a situation in which (B1) and (B2) are executed by some but not all threads.

However, if we look at how the kernel is invoked (E), we see that this condition is never taken (with cnd=10000, int(threadIdx.x)-cnd is always negative), so everything should work correctly - it fulfills the all-or-none requirement quoted from the Programming Guide.

The compiler, of course, cannot know that fact at compile time.

To verify that all threads run and sync correctly, we introduce a control shared variable cc, initially set to 0 (C). After all threads are synchronised, each thread atomically increments the variable and every warp reports the value to global memory (D).

If everything works as intended, cc should be equal to the dimension of the block and we should see the same value in the output for every warp. However, if we compile and run the above code, we get:

127 95 95 96

which shows the threads are not synchronised correctly!

If we remove the ‘return’ instruction, however (which is never executed anyway), the output is correct:

128 128 128 128
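For reference, here is a sketch of the no-return variant (my reconstruction, not the exact edited source): with the early exit deleted, the never-taken branch no longer affects control flow, so every thread reaches both barriers and cc counts up to 128.

```cuda
// Sketch: same devFunction with the early 'return' removed. The dead
// branch now has an empty body, so all threads reach (B1) and (B2).
__device__ void devFunctionNoReturn(int *output, int cnd) {
	if (threadIdx.x > 13) {
		if (int(threadIdx.x) - cnd >= 64)	// (A) - condition kept, 'return' removed
			;								// intentionally empty
	}

	__syncthreads();	// (B1)
	output[31] += 1;

	__syncthreads();	// (B2)
}
```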

Further investigation of the problem:

In an attempt to track down the problem I examined the produced PTX code:

.entry _Z9myKernel4Pii (
		.param .u32 __cudaparm__Z9myKernel4Pii_output,
		.param .s32 __cudaparm__Z9myKernel4Pii_cnd)
	{


	.reg .u32 %rv1;

	.reg .u32 %r<20>;

	.reg .u64 %rd<4>;

	.reg .pred %p<5>;

	.shared .s32 __cuda_cc0;


	cvt.u32.u16 	%r1, %tid.x;

	mov.u32 	%r2, 13;

	setp.le.u32 	%p1, %r1, %r2;

	@%p1 bra 	$Lt_2_2562; 		//if (threadIdx.x<=13) skip to $Lt_2_2562

	ld.param.s32 	%r3, [__cudaparm__Z9myKernel4Pii_cnd];

	sub.s32 	%r4, %r1, %r3;   	//%r4:=threadIdx.x-cnd

	mov.u32 	%r5, 63;

	setp.gt.s32 	%p2, %r4, %r5;  

	@%p2 bra 	$Lt_2_258;   		//if (%r4>63) skip to $Lt_2_258


	bar.sync 	0;

	ld.param.u32 	%r6, [__cudaparm__Z9myKernel4Pii_output];

	ld.global.s32 	%r7, [%r6+124];	//output[31]+=1

	add.s32 	%r8, %r7, 1;

	st.global.s32 	[%r6+124], %r8;


									//end of devFunction

	bar.sync 	0;

	mov.s32 	%r9, 0;

	st.shared.s32 	[__cuda_cc0], %r9;	//cc=0

	bar.sync 	0;

	mov.u32 	%r10, __cuda_cc0;

	mov.s32 	%r11, 1;

	atom.shared.add.s32 	%rv1, [%r10], %r11;	//atomicAdd(&cc,1)

	bar.sync 	0;

	and.b32 	%r12, %r1, 31;

	mov.u32 	%r13, 0;

	setp.ne.s32 	%p3, %r12, %r13;

	@%p3 bra 	$Lt_2_3074;			//if (lane!=0) skip to $Lt_2_3074

	ld.shared.s32 	%r14, [__cuda_cc0];

	ld.param.u32 	%r15, [__cudaparm__Z9myKernel4Pii_output];

	shr.u32 	%r16, %r1, 5;

	cvt.u64.s32 	%rd1, %r16;

	mul.lo.u64 	%rd2, %rd1, 4;

	cvt.s32.u64 	%r17, %rd2;

	add.u32 	%r18, %r15, %r17;

	st.global.s32 	[%r18+0], %r14;
$Lt_2_3074:
	exit;
	}





I added some comments for readability and removed the .loc debug directives. I didn’t find any compiler errors at this point, but if you have time, please do double-check my statement :)

Further investigation involved digging into the produced cubin file. To that end I used decuda. It is not perfect - on rare occasions some instructions are misinterpreted - but the important part seems to be parsed correctly:

.entry _Z9myKernel4Pii


.lmem 0

.smem 28

.reg 3

.bar 1

cvt.u32.u16 $r0, $r0.lo						  //$r0:=int(threadIdx.x)


	set.le.u32 $p0|$o127, $r0, c1[0x0000]		//$p0:=(threadIdx.x<=13)

	join.label label1

	@$p0.ne bra.label label0

		subr.u32 $r1, s[0x0014], $r0			//$r1:=-cnd+threadIdx.x

		set.gt.s32 $p0|$o127, $r1, c1[0x0004]	//$p0:=(threadIdx.x-cnd>63)

		@$p0.ne bra.label label1				//exit devFunction


	bar.sync.u32 0x00000000						//__syncthreads

	add.b32 $r2, s[0x0010], 0x0000007c			//$r2:=output+31*sizeof(int)

	mov.u32 $r1, g[$r2]							

	add.b32 $r1, $r1, 0x00000001

	mov.u32 g[$r2], $r1

												//end of devFunction

label1: nop.join 

bar.sync.u32 0x00000000

mov.b32 s[0x0018], $r124

bar.sync.u32 0x00000000

join.label label4

label2: mov.b32 $p0|$r1, s[0x0018]// (unk1 00800000)

@$p0.lt bra.label label3

bra.label label2

label3: add.b32 $r1, $r1, 0x00000001

mov.b32 s[0x0018], $r1

mov.?5?.b32 s[0x0018], $r1

label4: nop.join 

bar.sync.u32 0x00000000

and.b32 $p0|$o127, $r0, c1[0x0008]

@$p0.ne return 

shr.u32 $r0, $r0, 0x00000005

shl.u32 $r1, $r0, 0x00000002

mov.half.b32 $r0, s[0x0018]

add.half.b32 $r1, s[0x0010], $r1

mov.end.u32 g[$r1], $r0

#.constseg 1:0x0000 const


#d.32 0x0000000d, 0x0000003f, 0x0000001f // 0000



And again the code seems to be correct (please double-check).


So - where is the problem? The driver? The hardware?

Or maybe the conditions under which __syncthreads() may be used are more restrictive than documented? But how, exactly?

Even if this problem cannot be corrected, I believe it should be understood in more depth and the results made public, to help others avoid similar coding problems in the future.


Why would I use a return in dead code anyway?

I used that construction for debugging. If some condition (which should never happen) is met, I set some global flags and terminate the kernel as fast as possible.

After the problem is reported, the kernel may crash, hang or do strange things; from that point on I don’t care :) In the final version of my code I wouldn’t use those if statements.

But for the last few days I was hunting a bug whose source was the debugging mechanism itself!
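Until the root cause is clear, one defensive variant of that debugging pattern is to record the error but postpone the early exit until after the last barrier, so the control flow around every __syncthreads() stays uniform. A sketch with hypothetical names (checkedKernel and errorFlag are my additions, not part of the reported code):

```cuda
// Sketch of a barrier-safe debug pattern: flag the "impossible" condition,
// keep all threads flowing through the barriers, and only bail out once no
// __syncthreads() remains ahead.
__global__ void checkedKernel(int *output, int *errorFlag, int cnd) {
	bool bad = (threadIdx.x > 13) && (int(threadIdx.x) - cnd >= 64);
	if (bad)
		*errorFlag = 729;		// report the error for the host to inspect

	__syncthreads();			// reached by the whole block in every case
	if (!bad)
		output[31] += 1;		// normal work, skipped on error

	__syncthreads();			// still uniform across the block

	if (bad)
		return;					// safe: no barrier follows this point
	// ... rest of the kernel ...
}
```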

This is a rare and uncommon situation that no one will encounter

What I have shown is a simplified example. My concern is that after some branch statements, thread synchronisation may go completely off even when it shouldn’t. Discovering that __syncthreads() is no longer working correctly somewhere in a big piece of code may take days or even weeks!

Can you prove that the return statement was never taken, e.g. by setting a boolean variable in global memory to “true” if any thread returns?
If your kernel leaves that variable at its default of “false” it has never taken the return path.


Did as you suggested:

__device__ void devFunction(int *output, int cnd) {
	if (threadIdx.x>13) {
		if (int(threadIdx.x)-cnd>=64) { //(A)
			output[threadIdx.x/32+32]=729;
			return;
		}
	}

	__syncthreads(); //(B1)
	output[31]+=1;

	__syncthreads(); //(B2)
}
__global__ void myKernel4(int *output,int cnd) {
	int lane=threadIdx.x % 32;
	int warp=threadIdx.x/32;

	devFunction(output,cnd);

	__shared__ int cc;
	cc=0; //(C)
	__syncthreads();
	atomicAdd(&cc,1);
	__syncthreads();

	if (lane==0)
		output[warp]=cc; //(D)
}
	for (int i=0; i<warpCount; ++i)

		printf(" %d",cpuOutput[i]);


	for (int i=0; i<warpCount; ++i)

		printf(" %d",cpuOutput[i+32]);


The output is:

127 95 95 95

 -123 -123 -123 -123

If any thread had entered the if statement in question, it would have changed output[warp+32] to 729. The untouched -123 values show that the return path was never taken.