Update (0:54 UTC): I assumed things about __syncthreads() that were actually unspecified. Thanks to tera for spotting!
Hi all,
I think I’ve found a bug in nvcc.
Operating system: Ubuntu 10.04.3 32-bit, kernel version 2.6.32-33-generic #71-Ubuntu SMP Wed Jul 20 17:30:40 UTC 2011
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Thu_May_12_11:09:30_PDT_2011
Cuda compilation tools, release 4.0, V0.2.1221
Host compiler: gcc/g++ 4.4.3, but I’m not sure if it’s used.
System description: AMD Athlon™ 64 X2 Dual Core Processor 4400+, 2 GB RAM, 1x GeForce GTX 280 and no other video card. Chipset unknown, but retrievable if necessary.
Synopsis: __syncthreads() missing in PTX when optimization not applicable.
Detailed description: The following program doesn’t seem to run correctly. I expect it to output *hostint = 1, but it outputs *hostint = 0. Both threads set *int1 = 0. After synchronisation, the first thread sets *int1 = 1, and after another synchronization the other sets *int2 = *int1, which turns out to be 0.
#include <stdio.h>
__global__ void mv_kernel2 (volatile int *int1, volatile int *int2) {
*int1 = 0;
__syncthreads();
int odd = (threadIdx.x % 2);
if (odd) {
*int1 = 1;
__syncthreads();
}
else {
__syncthreads();
*int2 = *int1;
}
}
int main() {
int *hostint = (int*) malloc(sizeof(int));
int *int1_dev, *int2_dev;
int err;
cudaMalloc(&int1_dev,sizeof(int));
cudaMalloc(&int2_dev,sizeof(int));
mv_kernel2<<<1, 2>>>(int1_dev, int2_dev);
err = cudaMemcpy(hostint,int2_dev,1*sizeof(int),cudaMemcpyDeviceToHost);
if (err) {
printf("error: %d\n", err);
exit(1);
}
printf("*hostint = %d\n", *hostint);
cudaFree(int1_dev);
cudaFree(int2_dev);
}
Compile with just nvcc, no options. Also fails using -O0.
If you change else
to if (!odd)
, then the optimization is not applied and the code correctly outputs *hostint = 1.
It seems that the relevant bar.sync PTX instruction is mistakenly optimized away. Diff of the generated PTX (-wrong PTX, +good PTX):
- @%p1 bra $Lt_0_1282;
+ @%p1 bra $Lt_0_1794;
.loc 16 9 0
mov.s32 %r6, 1;
st.volatile.global.s32 [%r1+0], %r6;
+$Lt_0_1794:
.loc 16 10 0
- bra.uni $Lt_0_1026;
-$Lt_0_1282:
+ mov.u32 %r7, 0;
+ setp.ne.s32 %p2, %r4, %r7;
+ @%p2 bra $Lt_0_2306;
+ .loc 16 13 0
+ bar.sync 0;
.loc 16 14 0
- ld.volatile.global.s32 %r7, [%r1+0];
- ld.param.u32 %r8, [__cudaparm__Z10mv_kernel2PViS0__int2];
- st.volatile.global.s32 [%r8+0], %r7;
-$Lt_0_1026:
+ ld.volatile.global.s32 %r8, [%r1+0];
+ ld.param.u32 %r9, [__cudaparm__Z10mv_kernel2PViS0__int2];
+ st.volatile.global.s32 [%r9+0], %r8;
+$Lt_0_2306:
Does this also fail on your systems?