CUDA warp branching problem: strange bug due to unexpected warp branching order

Dear all,

I am having a hard time understanding the branching/synchronization behavior between threads within a warp. I thought I understood it, but yesterday I found a very unexpected bug in my code. It seems that CUDA does not actually force synchronization within a warp when it is requested with __syncthreads().

I abstracted the problem into the small artificial example shown below. In this example, there are 3 states the warp might reach. The state of the warp is recorded through atomic instructions. To see in which order the various code paths are executed, I ran the kernel on a single warp and inspected the final end state.

My expectations were:

  • No thread should reach state 3, as blockIdx.x is never -1.

  • Only thread 0 should reach state 1

  • All but thread 0 should reach state 2

These expectations seem to be met, but my final expectation was wrong:

  • Thread 0 should reach state 1 before all other threads reach state 2, hence I expect the final state to be 2.

However, to my surprise, the final state was 1!!

__global__ void
testKernel( int *glb_state )
{
	volatile __shared__ int tmp;

	if( blockIdx.x != -1 )
	{
		if( threadIdx.x == 31 )
			tmp = threadIdx.x;
		__syncthreads();
		const int thread_idx = tmp;

		// if one is found
		if( thread_idx != -1 )
		{
			// record state 1
			if( threadIdx.x == 0 )
				atomicExch( glb_state , 1 );
		}
		else
		{
			// record state 3
			atomicExch( glb_state , 3 );
			return;
		}
	}

	__syncthreads();

	// record state 2
	if( threadIdx.x != 0 )
		atomicExch( glb_state , 2 );
}

int main( void )
{
	cudaSetDevice( cutGetMaxGflopsDeviceId() );

	int hst_state = 0;
	int *glb_state;
	cudaMalloc( (void**) &glb_state, sizeof(int) );
	cudaMemcpy( glb_state, &hst_state, sizeof(int), cudaMemcpyHostToDevice );

	// setup execution parameters
	dim3 grid( 1, 1, 1 );
	dim3 threads( 32, 1, 1 );

	// execute the kernel
	testKernel<<< grid, threads >>>( glb_state );

	cudaMemcpy( &hst_state, glb_state, sizeof(int), cudaMemcpyDeviceToHost );
	printf( "End state: %i\n", hst_state );
}

int main(void ) 

{

	cudaSetDevice( cutGetMaxGflopsDeviceId() );

	int hst_state = 0; 

	int *glb_state;

	cudaMalloc( (void**) &glb_state, sizeof(int));

	cudaMemcpy( glb_state, &hst_state, sizeof(int),cudaMemcpyHostToDevice);

	// setup execution parameters

	dim3  grid( 1, 1, 1);

	dim3  threads( 32, 1, 1);

	// execute the kernel

	testKernel<<< grid, threads >>>( glb_state );

	cudaMemcpy( &hst_state, glb_state, sizeof(int),cudaMemcpyDeviceToHost);

	printf( "End state: %i\n" , hst_state );

}

Even more puzzling, when interchanging the if and else statements:

		// if one is found
		if( thread_idx == -1 )
		{
			// record state 3
			atomicExch( glb_state , 3 );
			return;
		}
		else
		{
			// record state 1
			if( threadIdx.x == 0 )
				atomicExch( glb_state , 1 );
		}

the end state changes to the expected state 2.

I inspected the corresponding PTX code but noticed nothing unusual. I also tried both CUDA compilers 2.3 and 3.1 and GPU architectures 1_1 and 2_0, but none of these gives a different result.

Can anybody shed some light on this? I am getting very frustrated and confused.

Thanks a lot.

  • Dietger

Hello,

I can’t test this code here (no atomics), but this is indeed really annoying!
Perhaps try “cuInit(0);” at the beginning of main, who knows…?

I tested your code, and besides the fact that it has a LOT of unreachable code (I don’t know where you expect the -1 to be set in any case), it returns the expected value of 2 on my machine (compute 1.1 in this case, on a laptop).

Well, that’s weird: this exact code produces an end state of 1 on my GPU, no matter what I tried. I know about the unreachable code; it seemed necessary to cause the unexpected behavior. Also, these -1 values are just there to guarantee a certain execution flow without the compiler optimizing the branches out.

Curiouser and curiouser!

I initially tested the code on a GTX 470, giving the unexpected result (state 1). I just tested the same code on a Quadro NVS 450, giving the expected result (state 2). Apparently, there is a difference in the branching behavior of newer and older GPUs, independent of the target architecture used for compiling.

Does anyone have an idea what could cause this, and whether it is normal behavior or a bug?

You are aware that you are deeply into undefined behavior and thus this can, by definition, not be a bug? All threads have to reach the same sequence of __syncthreads() calls. So if threads exit early, without reaching the final __syncthreads(), all bets are off as to what will happen.
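The rule can be illustrated with a minimal sketch (the kernels and names here are invented for illustration; they are not taken from the code above):

```cuda
// Sketch only: contrasts a legal barrier with an illegal one.

// Legal: every thread in the block reaches the same __syncthreads().
__global__ void barrierOk( int *out )
{
	__shared__ int tmp;
	if( threadIdx.x == 0 )
		tmp = 42;               // single producer
	__syncthreads();            // all threads of the block arrive here
	out[threadIdx.x] = tmp;     // all consumers read after the barrier
}

// Undefined: threads with odd index exit before the barrier,
// so "all bets are off" for the threads still waiting on it.
__global__ void barrierBroken( int *out )
{
	if( threadIdx.x & 1 )
		return;                 // early exit skips the barrier below
	__syncthreads();            // only part of the block arrives: undefined
	out[threadIdx.x] = 1;
}
```

Only the first form has guaranteed semantics; once any thread skips a __syncthreads() that other threads of the block execute, the behavior is undefined.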

But no threads are supposed to exit early (and if one did, it would mean that thread_idx == -1, so it would be a bug anyway)… no?

Have you tried to properly initialize CUDA with cuInit(0)? It reminds me of a weird bug I encountered, which I simply solved this way.

Adding cuInit(0) does not seem to do the trick.

I agree with cuda_libre that none of the threads should exit early, so all threads eventually reach both __syncthreads() calls. Also, when inspecting the compiled PTX code, all code paths (except for state 3, which is never reached anyway) still pass through both __syncthreads() calls. Therefore, it should be safe to expect that all threads first reach state 1 before reaching state 2. To make this more explicit, I added some work for the ‘other’ threads in both states 1 and 2 by recording a second state B.

__global__ void
testKernel( int *glb_stateA , int *glb_stateB )
{
	volatile __shared__ int tmp;

	if( blockIdx.x != -1 )
	{
		if( threadIdx.x == 31 )
			tmp = threadIdx.x;
		__syncthreads();
		const int thread_idx = tmp;

		// if one is found
		if( thread_idx != -1 )
		{
			// record state 1
			if( threadIdx.x == 0 )
				atomicExch( glb_stateA , 1 );
			else // threadIdx.x != 0
				atomicExch( glb_stateB , 1 );
		}
		else
		{
			// record state 3
			atomicExch( glb_stateA , 3 );
			atomicExch( glb_stateB , 3 );
			return;
		}
	}

	__syncthreads();

	// record state 2
	if( threadIdx.x != 0 )
		atomicExch( glb_stateA , 2 );
	else // threadIdx.x == 0
		atomicExch( glb_stateB , 2 );
}

int main( void )
{
	cudaSetDevice( cutGetMaxGflopsDeviceId() );

	int hst_state[2] = { 0, 0 };
	int *glb_state;
	cudaMalloc( (void**) &glb_state, sizeof(int) * 2 );
	cudaMemcpy( glb_state, hst_state, sizeof(int) * 2, cudaMemcpyHostToDevice );

	// setup execution parameters
	dim3 grid( 1, 1, 1 );
	dim3 threads( 32, 1, 1 );

	// execute the kernel
	testKernel<<< grid, threads >>>( &glb_state[0] , &glb_state[1] );

	cudaMemcpy( hst_state, glb_state, sizeof(int) * 2, cudaMemcpyDeviceToHost );
	printf( "End state A: %i\n", hst_state[0] );
	printf( "End state B: %i\n", hst_state[1] );
}

The final states are (again unless I swap the if and else parts):

End state A: 1

End state B: 2

Reasoning from this:

End state A = 1 implies that:

  • Threads 1…31 reach state 2 before thread 0 reaches state 1

End state B = 2 implies that:

  • Threads 1…31 reach state 1 before thread 0 reaches state 2

The only way to resolve this contradiction is to conclude that a state for A is not necessarily concurrent with the same state for B; in other words, at least one of these states is split into two separate states, one for thread 0 and one for threads 1…31. Again, this splitting is not explicit in the PTX code and must be a result of GPU branching decisions.

All in all, I find this behavior very confusing: it makes it impossible to safely implement synchronized algorithms that depend on order within warps/blocks for communication, for example when some shared data set in state 1 by thread 0 is needed again by all threads in state 2 (this was the actual problem in my algorithm). I do not know for sure whether this is actually a bug, but at the very least it is a definite weakness, and I do not yet see how to prevent this kind of bug.
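For concreteness, the communication pattern my algorithm relies on looks roughly like this minimal sketch (names invented; this is not the actual algorithm):

```cuda
// Sketch of the intended producer/consumer pattern within one block.
__global__ void produceConsume( int *out )
{
	__shared__ int shared_val;

	if( threadIdx.x == 0 )
		shared_val = 123;          // "state 1": thread 0 produces a value

	__syncthreads();               // barrier should order produce before consume

	out[threadIdx.x] = shared_val; // "state 2": all threads consume it
}
```

This pattern by itself is the textbook use of __syncthreads(); the surprise above is that the surrounding branches apparently break the ordering it is supposed to provide.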

Code that depends on ordering across blocks or warps (with the exception of warps within a block that use only __syncthreads for ordering requirements) is considered invalid in the CUDA execution model. I’ll look at your code a bit more later to see if I can explain what’s going on.

That’s true; the problem is that in my code example I am actually using __syncthreads to force ordering within a block (or so I hoped). I am very interested in why this is not valid behavior after all; looking forward to your explanation.

I didn’t quite catch what you mean by “ordering threads within a block” here. Do you mean that it is not valid to write things like the following?

if( threadIdx.x == number ) {
	[...]
}
__syncthreads();

Well, it seems that there is no guarantee that thread [number] will process […] first, before any thread passes the __syncthreads barrier. Your example works fine, but in the more complex example I gave above, it no longer holds. Why the one works and the other doesn’t is unclear to me.

Ok, that’s right. I didn’t look close enough before posting. This indeed seems to be an excellent test case.

After a closer look, I think this is a bug in the JIT PTX compiler. Here is a slightly modified kernel that makes sure the problem is not related to signed/unsigned or atomicExch() issues:

[codebox]
extern "C" __global__ void
testKernel( int *glb_stateA , int *glb_stateB )
{
	__shared__ int tmp;

	if( (blockIdx.x != 65537) && (threadIdx.x == 31) )
		tmp = threadIdx.x;
	__syncthreads();

	if( blockIdx.x != 65536 )
	{
		const int thread_idx = tmp;

		// if one is found
		if( thread_idx != 1111 )
		{
			// record state 1
			if( threadIdx.x == 0 )
				*glb_stateA = 1;
			else // threadIdx.x != 0
				*glb_stateB = 1;
		}
		else
		{
			// record state 3
			*glb_stateA = 3;
			*glb_stateB = 3;
			return;
		}
	}

	__syncthreads();

	// record state 2
	if( threadIdx.x != 0 )
		*glb_stateA = 2;
	else // threadIdx.x == 0
		*glb_stateB = 2;
}
[/codebox]

this compiles to the following .cubin kernel:

[codebox]
// Disassembling testKernel
000000: a0000005 04000780 cvt.rn.u32.u16 $r1, $r0.lo
000008: a0004c01 04200780 cvt.rn.u32.u16 $r0, s[0x000c]
000010: 308003fd 644087c8 set.eq.u32 $p0|$o127, $r1, c1[0x0000]
000018: 308101fd 644142c8 @$p0.ne set.ne.u32 $p0|$o127, $r0, c1[0x0004]
000020: 00001001 e4204680 @$p0.neu mov.b32 s[0x0020], $r1
000028: 861ffe03 00000000 bar.sync.u32 0x00000000
000030: 308201fd 644087c8 set.eq.u32 $p0|$o127, $r0, c1[0x0008]
000038: 1001a003 00000280 @$p0.ne bra.label label2
000040: 3083d1fd 6c6087c8 set.eq.s32 $p0|$o127, s[0x0020], c1[0x000c]
000048: 10014003 00000280 @$p0.ne bra.label label1
000050: 307c03fd 640147c8 set.ne.u32 $p0|$o127, $r1, $r124
000058: 10010003 00000280 @$p0.ne bra.label label0
000060: 10000801 4400c780 mov.b32 $r0, s[0x0010]
000068: 10018009 00000003 mov.b32 $r2, 0x00000001
000070: d00e0009 a0c00780 mov.u32 g[$r0], $r2
000078: 1001a003 00000780 bra.label label2
000080: 10000c01 4400c780 label0: mov.b32 $r0, s[0x0018]
000088: 10018009 00000003 mov.b32 $r2, 0x00000001
000090: d00e0009 a0c00780 mov.u32 g[$r0], $r2
000098: 1001a003 00000780 bra.label label2
0000a0: 10000801 4400c780 label1: mov.b32 $r0, s[0x0010]
0000a8: 10038005 00000003 mov.b32 $r1, 0x00000003
0000b0: d00e0005 a0c00780 mov.u32 g[$r0], $r1
0000b8: 10000c01 4400c780 mov.b32 $r0, s[0x0018]
0000c0: d00e0005 a0c00780 mov.u32 g[$r0], $r1
0000c8: 30000003 00000780 return
0000d0: 861ffe03 00000000 label2: bar.sync.u32 0x00000000
0000d8: 307c03fd 640087c8 set.eq.u32 $p0|$o127, $r1, $r124
0000e0: 10021003 00000280 @$p0.ne bra.label label3
0000e8: 10000801 4400c780 mov.b32 $r0, s[0x0010]
0000f0: 10028005 00000003 mov.b32 $r1, 0x00000002
0000f8: d00e0005 a0c00780 mov.u32 g[$r0], $r1
000100: 30000003 00000780 return
000108: 10000c01 4400c780 label3: mov.b32 $r0, s[0x0018]
000110: 10028005 00000003 mov.b32 $r1, 0x00000002
000118: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1
// segment: const (1:0000)
0000: 0000001f 00010001 00010000 00000457
[/codebox]

which exhibits the bad behavior.

The issue is fixed by inserting joining instructions:

[codebox]
// Disassembling testKernel
000000: a0000005 04000780 cvt.rn.u32.u16 $r1, $r0.lo
000008: a0004c01 04200780 cvt.rn.u32.u16 $r0, s[0x000c]
000010: 308003fd 644087c8 set.eq.u32 $p0|$o127, $r1, c1[0x0000]
000018: 308101fd 644142c8 @$p0.ne set.ne.u32 $p0|$o127, $r0, c1[0x0004]
000020: 00001001 e4204680 @$p0.neu mov.b32 s[0x0020], $r1
000028: 861ffe03 00000000 bar.sync.u32 0x00000000
000030: 308201fd 644087c8 set.eq.u32 $p0|$o127, $r0, c1[0x0008]
000038: a001b003 00000000 join.label label2
000040: 1001b003 00000280 @$p0.ne bra.label label2
000048: 3083d1fd 6c6087c8 set.eq.s32 $p0|$o127, s[0x0020], c1[0x000c]
000050: 10015003 00000280 @$p0.ne bra.label label1
000058: 307c03fd 640147c8 set.ne.u32 $p0|$o127, $r1, $r124
000060: 10011003 00000280 @$p0.ne bra.label label0
000068: 10000801 4400c780 mov.b32 $r0, s[0x0010]
000070: 10018009 00000003 mov.b32 $r2, 0x00000001
000078: d00e0009 a0c00780 mov.u32 g[$r0], $r2
000080: 1001b003 00000780 bra.label label2
000088: 10000c01 4400c780 label0: mov.b32 $r0, s[0x0018]
000090: 10018009 00000003 mov.b32 $r2, 0x00000001
000098: d00e0009 a0c00780 mov.u32 g[$r0], $r2
0000a0: 1001b003 00000780 bra.label label2
0000a8: 10000801 4400c780 label1: mov.b32 $r0, s[0x0010]
0000b0: 10038005 00000003 mov.b32 $r1, 0x00000003
0000b8: d00e0005 a0c00780 mov.u32 g[$r0], $r1
0000c0: 10000c01 4400c780 mov.b32 $r0, s[0x0018]
0000c8: d00e0005 a0c00780 mov.u32 g[$r0], $r1
0000d0: 30000003 00000780 return
0000d8: f0000001 e0000002 label2: nop.join
0000e0: 861ffe03 00000000 bar.sync.u32 0x00000000
0000e8: 307c03fd 640087c8 set.eq.u32 $p0|$o127, $r1, $r124
0000f0: 10023003 00000280 @$p0.ne bra.label label3
0000f8: 10000801 4400c780 mov.b32 $r0, s[0x0010]
000100: 10028005 00000003 mov.b32 $r1, 0x00000002
000108: d00e0005 a0c00780 mov.u32 g[$r0], $r1
000110: 30000003 00000780 return
000118: 10000c01 4400c780 label3: mov.b32 $r0, s[0x0018]
000120: 10028005 00000003 mov.b32 $r1, 0x00000002
000128: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1
// segment: const (1:0000)
0000: 0000001f 00010001 00010000 00000457
[/codebox]

I’m surprised though because [font=“Courier New”]if ( blockIdx.x != 65536 )[/font] should not lead to a diverging warp. However, since all this is undocumented stuff, I think only Nvidia staff can shed more light on this.

I executed your code on my machine:

End state A = 2

End state B = 2

on a Tesla C1060 and a GTX 295 under CUDA 2.3, 64-bit. The result is correct.

@tera: Thanks for confirming my findings, makes me believe I’m still sane :)

@LSChien: The problem only seems to appear on the GTX 400 series (I tried a 470 and a 480).

I hope that this is just a JIT compiler bug that will be fixed soon. It would certainly be nice if someone from NVIDIA could confirm or refute this theory.

I used a GTX 260 and CUDA 2.1 to reproduce the problem.

I think I understand now what is happening here: for some reason the JIT compiler fails to create a joining point after the [font=“Courier New”]if( threadIdx.x == 0 )[/font] conditional, as the disassembled listing shows.

According to Demystifying GPU Microarchitecture through Microbenchmarking the [font=“Courier New”]else[/font] clause of the conditional (i.e. threadIdx.x != 0) is taken first, so that [font=“Courier New”]glb_stateB[/font] is set to 1. After the missing joining point the [font=“Courier New”]__syncthreads()[/font] immediately succeeds as we are in the only warp of the block, and [font=“Courier New”]glb_stateA[/font] is set to 2 before the threads exit. Then the remaining clause (threadIdx.x == 0) is executed: [font=“Courier New”]glb_stateA[/font] gets set to 1, the [font=“Courier New”]__syncthreads()[/font] again returns immediately and [font=“Courier New”]glb_stateB[/font] is set to 2. When execution of the kernel now ends, we have the bad result [font=“Courier New”]glb_stateA==1[/font] and [font=“Courier New”]glb_stateB==2[/font].

As we have already found out, for this particular testcase the bug can be circumvented in a few ways:

  • My manually inserted joining point was not in the correct (optimal) position, but as it was still before the second [font=“Courier New”]__syncthreads()[/font], it was early enough to ensure correct execution.

  • If the [font=“Courier New”]return[/font] statement in the untaken branch is removed, the JIT compiler correctly inserts the joining point, so everything works as expected.

  • If the order of the clauses in the [font=“Courier New”]if( threadIdx.x == 0 )[/font] conditional is reversed, for some reason the compiler resorts to predicated instructions instead of branching, thus preventing divergence and leading to the correct behavior.
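As a concrete illustration of removing the [font=“Courier New”]return[/font] from the untaken branch, the early exit can be replaced by a per-thread flag so that every thread falls through to the final barrier. This is only a sketch of the restructuring, not tested code:

```cuda
// Sketch of the "no early return" workaround: record that the else branch
// was taken and fall through to the common barrier instead of returning
// before it, so every thread reaches both __syncthreads() calls.
__global__ void testKernelNoReturn( int *glb_stateA , int *glb_stateB )
{
	volatile __shared__ int tmp;
	bool done = false;              // replaces the early return

	if( blockIdx.x != -1 )
	{
		if( threadIdx.x == 31 )
			tmp = threadIdx.x;
		__syncthreads();
		const int thread_idx = tmp;

		if( thread_idx != -1 )
		{
			// record state 1
			if( threadIdx.x == 0 )
				atomicExch( glb_stateA , 1 );
			else
				atomicExch( glb_stateB , 1 );
		}
		else
		{
			// record state 3, but keep going to the barrier below
			atomicExch( glb_stateA , 3 );
			atomicExch( glb_stateB , 3 );
			done = true;
		}
	}

	__syncthreads();                // now reached by every thread

	if( !done )
	{
		// record state 2
		if( threadIdx.x != 0 )
			atomicExch( glb_stateA , 2 );
		else
			atomicExch( glb_stateB , 2 );
	}
}
```

Whether the compiler then emits the joining point correctly would still need to be verified in the disassembly.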

Thanks, Dietepiet, for providing this excellent testcase. I think it’s up to Nvidia now to find and fix the bug that leads to the missed reconvergence point (unless it is already fixed in the latest drivers, maybe someone can try this out).

I’ll test it tomorrow with the latest compiler and file a bug if it’s still broken. I suspected a bug initially, although I didn’t want to confirm it as such without more investigation. When people have brought this kind of issue up in the past, the problem was always related to race conditions in which part of a warp executes an intra-warp branch first. This didn’t look like that because this test case is straightforward, but I wanted to be sure.

Note that my previous statement about scheduling still holds.