Can't synchronize after atomic operations: compiler removes synchronization

Hello:

The CUDA 2.2 compiler won’t let me synchronize after atomic operations. It removes the barrier and prints this warning:

“Advisory: Removed dead synchronization intrinsic”.

The code I’m using to test this is:

__global__ static void test(int* var)
{
	__shared__ int acc;
	if(threadIdx.x==0)
		acc = 0;
	__syncthreads();
	atomicAdd(&acc,1);
	__syncthreads();
	if(threadIdx.x==0)
		var[0] = acc;
}

I call this function with 256 threads. Each thread adds 1 to acc, so it should return n, 256 in this case.

The assembler removes the “__syncthreads();” after the atomicAdd, so thread 0 can store “acc” before the other threads have finished adding to it, and the result varies from run to run. In emulation mode it compiles and runs correctly.

I’m using a Tesla C1060 with Ubuntu 8.04 and CUDA 2.2.
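For anyone who wants to reproduce this, here is a minimal host-side sketch around the kernel above. The helper name count_threads and the error handling are my own additions for illustration, not part of the original test. Note that atomics on shared memory need compute capability 1.2 or higher (the C1060 is 1.3), so on toolkits of this era the file presumably has to be built with something like -arch=sm_13.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel from the post: every thread of the block adds 1 to a shared counter.
__global__ void test(int* var)
{
    __shared__ int acc;
    if (threadIdx.x == 0)
        acc = 0;
    __syncthreads();        // make the zeroed counter visible to the whole block
    atomicAdd(&acc, 1);
    __syncthreads();        // wait until every thread has added its 1
    if (threadIdx.x == 0)
        var[0] = acc;
}

// Hypothetical helper (not in the original post): launches one block of
// `threads` threads and returns the value the kernel stored, or -1 on error.
int count_threads(int threads)
{
    int* d_var = NULL;
    int h_var = -1;

    if (cudaMalloc((void**)&d_var, sizeof(int)) != cudaSuccess)
        return -1;
    cudaMemset(d_var, 0xFF, sizeof(int));   // pre-fill with -1 so a failed launch is visible

    test<<<1, threads>>>(d_var);
    cudaError_t err = cudaGetLastError();   // catches launch failures (e.g. wrong -arch)

    // cudaMemcpy on the default stream waits for the kernel to finish.
    if (err == cudaSuccess)
        err = cudaMemcpy(&h_var, d_var, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_var);
    return (err == cudaSuccess) ? h_var : -1;
}
```

If both barriers survive compilation, calling count_threads(256) from main should give 256. A value that changes between runs would suggest the second __syncthreads() was dropped.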

That compiles fine without warning using the nvcc from the 2.3 toolkit:

.entry _Z4testPi (
		.param .u64 __cudaparm__Z4testPi_var)
	{
	.reg .u32 %rv1;
	.reg .u32 %r<7>;
	.reg .u64 %rd<4>;
	.reg .pred %p<3>;
	.shared .s32 __cuda_acc0;
	.loc	16	1	0
$LBB1__Z4testPi:
	cvt.u32.u16 	%r1, %tid.x;
	mov.u32 	%r2, 0;
	setp.eq.u32 	%p1, %r1, %r2;
	@!%p1 bra 	$Lt_0_1794;
	.loc	16	6	0
	mov.s32 	%r3, 0;
	st.shared.s32 	[__cuda_acc0], %r3;
$Lt_0_1794:
	.loc	16	8	0
	bar.sync 	0;
	.loc	15	103	0
	mov.u64 	%rd1, __cuda_acc0;
	mov.s32 	%r4, 1;
	atom.shared.add.s32 	%rv1, [%rd1], %r4;
	.loc	16	12	0
	bar.sync 	0;
	@!%p1 bra 	$Lt_0_2306;
	.loc	16	15	0
	ld.shared.s32 	%r5, [__cuda_acc0];
	ld.param.u64 	%rd2, [__cudaparm__Z4testPi_var];
	st.global.s32 	[%rd2+0], %r5;
$Lt_0_2306:
	.loc	16	16	0
	exit;
$LDWend__Z4testPi:
	} // _Z4testPi

You can see that both bar.sync instructions, one before and one after the atom.shared.add, make it into the PTX.

I’ll perform an upgrade as soon as possible then.

Thank you.