Operating System: Windows XP 64-bit, but host and device code compiled as 32-bit
CUDA toolkit: v. 2.3
SDK: v. 2.3
Compiler: Visual Studio 2008, v. 9.0.21022.8 RTM
decuda: v. 0.4.1, with a few quick fixes of my own
Hardware:
CPU Intel Core2 Duo E8600 3.33GHz,
GPU NVIDIA GeForce 9600GT (for display only),
GPU NVIDIA GeForce GTX 285 (for CUDA computation)
Driver: 191.07
Problem: __syncthreads() barriers go “out of sync” even though they are placed in a branch that is either taken by the whole block or by no threads at all.
From the Programming Guide: “__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block.”
Consider the following source code:
#include <cstdio>

__device__ void devFunction(int *output, int cnd) {
    if (threadIdx.x > 13) {
        if (int(threadIdx.x) - cnd >= 64) // (A)
            return;
    }
    __syncthreads(); // (B1)
    output[31] += 1;
    __syncthreads(); // (B2)
}
__global__ void myKernel4(int *output, int cnd) {
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    devFunction(output, cnd);
    __syncthreads();
    __shared__ int cc;
    cc = 0; // (C)
    __syncthreads();
    atomicAdd(&cc, 1);
    __syncthreads();
    if (lane == 0)
        output[warp] = cc; // (D)
    __syncthreads();
}
void isolatedProblem() {
    const int maxBlocks = 60;
    const int warpCount = 4;
    int *gpuOutput;
    cudaMalloc((void**)&gpuOutput, sizeof(int)*32*warpCount);
    int cpuOutput[32*warpCount];
    for (int i = 0; i < 32*warpCount; ++i)
        cpuOutput[i] = -1;
    myKernel4<<<1, warpCount*32>>>(gpuOutput, 10000); // (E)
    cudaMemcpy(cpuOutput, gpuOutput, sizeof(int)*32*warpCount, cudaMemcpyDeviceToHost);
    for (int i = 0; i < warpCount; ++i)
        printf(" %d", cpuOutput[i]);
}
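(Side note: the listing omits error checking for brevity. To rule out a failed launch as the cause of odd output, one could add a check along the lines below; a minimal sketch using runtime API calls that exist in CUDA 2.3.)

    myKernel4<<<1, warpCount*32>>>(gpuOutput, 10000); // (E)
    cudaError_t err = cudaGetLastError();   // catches launch errors
    if (err == cudaSuccess)
        err = cudaThreadSynchronize();      // catches execution errors (pre-cudaDeviceSynchronize API)
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));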
At position (A) there is a conditional jump out of devFunction, which in theory could lead to a situation where (B1) and (B2) are executed by some but not all threads.
However, looking at how the kernel is invoked at (E), the condition is never true: with cnd = 10000 and 128 threads, int(threadIdx.x) - cnd is at most 127 - 10000, far below 64, so the return is never taken. Everything should therefore work correctly, as it fulfills the all-or-none requirement quoted from the Programming Guide.
The compiler, of course, cannot know this at compile time.
To verify that all threads run and synchronise correctly, we introduce a control shared variable cc, initially set to 0 at (C). After all threads are synchronised, each thread increments the variable, and each warp reports the final value to global memory at (D).
If everything works as intended, cc should equal the block dimension (here 128) and we should see the same value in the output for every warp. However, if we compile and run the above code, we get:
127 95 95 96
which shows that the threads are not synchronised correctly!
If, however, you remove the ‘return’ statement (which is never executed anyway), the output is correct:
128 128 128 128
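For reference, the passing variant is just this, assuming the straightforward edit of commenting out the return:

__device__ void devFunction(int *output, int cnd) {
    if (threadIdx.x > 13) {
        if (int(threadIdx.x) - cnd >= 64) { // (A) condition is still evaluated...
            // return;                      // ...but the early exit is gone
        }
    }
    __syncthreads(); // (B1)
    output[31] += 1;
    __syncthreads(); // (B2)
}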
Further investigation:
In an attempt to track down the problem, I examined the PTX code produced by the compiler (obtained, e.g., with nvcc's -ptx switch):
.entry _Z9myKernel4Pii (
    .param .u32 __cudaparm__Z9myKernel4Pii_output,
    .param .s32 __cudaparm__Z9myKernel4Pii_cnd)
{
    .reg .u32 %rv1;
    .reg .u32 %r<20>;
    .reg .u64 %rd<4>;
    .reg .pred %p<5>;
    .shared .s32 __cuda_cc0;
$LBB1__Z9myKernel4Pii:
    cvt.u32.u16 %r1, %tid.x;
    mov.u32 %r2, 13;
    setp.le.u32 %p1, %r1, %r2;
    @%p1 bra $Lt_2_2562;                     // if (threadIdx.x <= 13) skip to $Lt_2_2562
    ld.param.s32 %r3, [__cudaparm__Z9myKernel4Pii_cnd];
    sub.s32 %r4, %r1, %r3;                   // %r4 := threadIdx.x - cnd
    mov.u32 %r5, 63;
    setp.gt.s32 %p2, %r4, %r5;
    @%p2 bra $Lt_2_258;                      // if (%r4 > 63) skip to $Lt_2_258
$Lt_2_2562:
    bar.sync 0;
    ld.param.u32 %r6, [__cudaparm__Z9myKernel4Pii_output];
    ld.global.s32 %r7, [%r6+124];            // output[31] += 1
    add.s32 %r8, %r7, 1;
    st.global.s32 [%r6+124], %r8;
$Lt_2_258:                                   // end of devFunction
    bar.sync 0;
    mov.s32 %r9, 0;
    st.shared.s32 [__cuda_cc0], %r9;         // cc = 0
    bar.sync 0;
    mov.u32 %r10, __cuda_cc0;
    mov.s32 %r11, 1;
    atom.shared.add.s32 %rv1, [%r10], %r11;  // atomicAdd(&cc, 1)
    bar.sync 0;
    and.b32 %r12, %r1, 31;
    mov.u32 %r13, 0;
    setp.ne.s32 %p3, %r12, %r13;
    @%p3 bra $Lt_2_3074;                     // if (lane != 0) skip to $Lt_2_3074
    ld.shared.s32 %r14, [__cuda_cc0];
    ld.param.u32 %r15, [__cudaparm__Z9myKernel4Pii_output];
    shr.u32 %r16, %r1, 5;
    cvt.u64.s32 %rd1, %r16;
    mul.lo.u64 %rd2, %rd1, 4;
    cvt.s32.u64 %r17, %rd2;
    add.u32 %r18, %r15, %r17;
    st.global.s32 [%r18+0], %r14;
$Lt_2_3074:
    exit;
$LDWend__Z9myKernel4Pii:
}
I added some comments for readability and removed the .loc debug directives. I did not find any compiler error at this point, but if you have time, please do double-check my claim :)
Further investigation involved digging into the produced cubin file. To that end I used decuda. It is not perfect; on rare occasions some instructions are misinterpreted, but the important part seems to be parsed correctly:
.entry _Z9myKernel4Pii
{
    .lmem 0
    .smem 28
    .reg 3
    .bar 1
    cvt.u32.u16 $r0, $r0.lo                // $r0 := int(threadIdx.x)
    // devFunction
    set.le.u32 $p0|$o127, $r0, c1[0x0000]  // $p0 := (threadIdx.x <= 13)
    join.label label1
    @$p0.ne bra.label label0
    subr.u32 $r1, s[0x0014], $r0           // $r1 := threadIdx.x - cnd
    set.gt.s32 $p0|$o127, $r1, c1[0x0004]  // $p0 := (threadIdx.x - cnd > 63)
    @$p0.ne bra.label label1               // the "return": exit devFunction
label0:
    bar.sync.u32 0x00000000                // __syncthreads() (B1)
    add.b32 $r2, s[0x0010], 0x0000007c     // $r2 := output + 31*sizeof(int)
    mov.u32 $r1, g[$r2]                    // output[31] += 1
    add.b32 $r1, $r1, 0x00000001
    mov.u32 g[$r2], $r1
    // end of devFunction
label1: nop.join
    bar.sync.u32 0x00000000
    mov.b32 s[0x0018], $r124               // cc = 0
    bar.sync.u32 0x00000000
    join.label label4
    // atomicAdd(&cc, 1), as decoded by decuda:
label2: mov.b32 $p0|$r1, s[0x0018]         // (unk1 00800000)
    @$p0.lt bra.label label3
    bra.label label2
label3: add.b32 $r1, $r1, 0x00000001
    mov.b32 s[0x0018], $r1
    mov.?5?.b32 s[0x0018], $r1
label4: nop.join
    bar.sync.u32 0x00000000
    and.b32 $p0|$o127, $r0, c1[0x0008]     // $p0 := ((threadIdx.x & 31) != 0), c1[0x0008] = 31
    @$p0.ne return                         // if (lane != 0) skip the store
    shr.u32 $r0, $r0, 0x00000005           // warp = threadIdx.x / 32
    shl.u32 $r1, $r0, 0x00000002
    mov.half.b32 $r0, s[0x0018]
    add.half.b32 $r1, s[0x0010], $r1
    mov.end.u32 g[$r1], $r0                // output[warp] = cc
#.constseg 1:0x0000 const
#{
#d.32 0x0000000d, 0x0000003f, 0x0000001f   // 0000
#}
}
And again the code seems to be correct (please double-check).
Conclusion:
So, where is the problem? The driver? The hardware?
Or are the situations in which __syncthreads() may be used more restrictive than documented? If so, how exactly?
Even if this problem cannot be fixed, I believe it should be understood in more depth and the findings made public, to save others from similar coding problems in the future.
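In the meantime, one defensive pattern (only a sketch, not a verified fix) is to remember the exit condition instead of returning early, so that every thread executes exactly the same sequence of __syncthreads() calls; devFunctionSafe below is a hypothetical rewrite:

__device__ void devFunctionSafe(int *output, int cnd) {
    bool exited = false;
    if (threadIdx.x > 13 && int(threadIdx.x) - cnd >= 64) // (A)
        exited = true;   // note the condition instead of returning
    __syncthreads();     // (B1) now reached by every thread
    if (!exited)
        output[31] += 1;
    __syncthreads();     // (B2) now reached by every thread
}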
FAQ
Why would I use a return in dead code anyway?
I used that construction for debugging. If some condition that should never happen is nevertheless met, I set some global flags and terminate the kernel as fast as possible.
After the problem is reported, the kernel may crash, hang or do strange things; from that point on I don't care :) In the final version of my code those if statements would be gone.
But for the last few days I was hunting a bug whose source turned out to be the debugging mechanism itself!
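In code, the mechanism looks roughly like this (a simplified sketch; errorFlag and checkedStep are made-up names):

__device__ int errorFlag;            // made-up global debug flag

__device__ void checkedStep(int *data) {
    if (data[threadIdx.x] < 0) {     // condition that should never happen
        errorFlag = 1;               // report it...
        return;                      // ...and terminate as fast as possible
    }
    __syncthreads();                 // the barrier such an early exit can skip
    // ... normal work ...
}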
This is a rare and uncommon situation that no one will ever encounter.
What I have shown is a simplified example. My concern is that after some branch statements, thread synchronisation may go completely wrong even when it should not. Realising that __syncthreads() has stopped working correctly somewhere inside a big piece of code may take days or even weeks!