Dear all,
I am having a hard time understanding the branching/synchronization behavior between threads within a warp. I thought I understood it, but yesterday I found a very unexpected bug in my code. It seems that CUDA does not actually force synchronization within a warp when requested with __syncthreads.
I abstracted the problem into the small artificial example shown below. In this example, there are 3 states the warp might reach. The state of the warp is recorded through atomic instructions. To see in which order the various code paths are executed, I ran the kernel on a single warp and inspected the final state.
My expectations were:
- No thread should reach state 3, as blockIdx.x is never -1.
- Only thread 0 should reach state 1.
- All but thread 0 should reach state 2.
These expectations seem to be met, but my final expectation was wrong:
- Thread 0 should reach state 1 before the other threads reach state 2, hence I expected the final state to be 2.
However, to my surprise, the final state was 1!!
#include <stdio.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>   // CUDA SDK helper, provides cutGetMaxGflopsDeviceId()

__global__ void testKernel( int *glb_state )
{
    volatile __shared__ int tmp;
    if( blockIdx.x != -1 )
    {
        // thread 31 publishes its index through shared memory
        if( threadIdx.x == 31 )
            tmp = threadIdx.x;
        __syncthreads();
        const int thread_idx = tmp;
        // if one is found
        if( thread_idx != -1 )
        {
            // record state 1
            if( threadIdx.x == 0 )
                atomicExch( glb_state, 1 );
        }
        else
        {
            // record state 3
            atomicExch( glb_state, 3 );
            return;
        }
    }
    __syncthreads();
    // record state 2
    if( threadIdx.x != 0 )
        atomicExch( glb_state, 2 );
}
int main( void )
{
    cudaSetDevice( cutGetMaxGflopsDeviceId() );

    int hst_state = 0;
    int *glb_state;
    cudaMalloc( (void**) &glb_state, sizeof(int) );
    cudaMemcpy( glb_state, &hst_state, sizeof(int), cudaMemcpyHostToDevice );

    // setup execution parameters
    dim3 grid( 1, 1, 1 );
    dim3 threads( 32, 1, 1 );

    // execute the kernel
    testKernel<<< grid, threads >>>( glb_state );

    cudaMemcpy( &hst_state, glb_state, sizeof(int), cudaMemcpyDeviceToHost );
    printf( "End state: %i\n", hst_state );
    return 0;
}
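(For what it is worth, this is roughly the error checking I would wrap around the launch while debugging; it is only a sketch, not part of the run above, and uses the standard runtime calls cudaGetLastError / cudaThreadSynchronize:)

    // launch, then check for configuration and execution errors
    testKernel<<< grid, threads >>>( glb_state );
    cudaError_t err = cudaGetLastError();
    if( err != cudaSuccess )
        printf( "Launch error: %s\n", cudaGetErrorString( err ) );
    err = cudaThreadSynchronize();
    if( err != cudaSuccess )
        printf( "Execution error: %s\n", cudaGetErrorString( err ) );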
Even more puzzling, when interchanging the if and else statements:
// if none is found
if( thread_idx == -1 )
{
    // record state 3
    atomicExch( glb_state, 3 );
    return;
}
else
{
    // record state 1
    if( threadIdx.x == 0 )
        atomicExch( glb_state, 1 );
}
the end state changes to the expected state 2.
I inspected the corresponding PTX code but noticed nothing unusual. I also tried both CUDA 2.3 and 3.1 as well as the compute architectures sm_11 and sm_20, but none of these gives a different result.
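For reference, I generate the PTX roughly like this (the exact flags differ a bit between the two toolkits, and testKernel.cu is just a placeholder name for the source file):

    nvcc -ptx -arch=sm_11 testKernel.cu -o testKernel_sm11.ptx
    nvcc -ptx -arch=sm_20 testKernel.cu -o testKernel_sm20.ptx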
Can anybody shed some light on this? I am getting very frustrated and confused.
Thanks a lot.
- Dietger