Bug report: __syncthreads() mistakenly optimized away

Update (0:54 UTC): I assumed things about __syncthreads() that were actually unspecified. Thanks to tera for spotting!

Hi all,

I think I’ve found a bug in nvcc.

Operating system: Ubuntu 10.04.3 32-bit, kernel version 2.6.32-33-generic #71-Ubuntu SMP Wed Jul 20 17:30:40 UTC 2011

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2011 NVIDIA Corporation

Built on Thu_May_12_11:09:30_PDT_2011

Cuda compilation tools, release 4.0, V0.2.1221

Host compiler: gcc/g++ 4.4.3, but I’m not sure if it’s used.

System description: AMD Athlon™ 64 X2 Dual Core Processor 4400+, 2 GB RAM, 1x GeForce GTX 280 and no other video card. Chipset unknown, but retrievable if necessary.

Synopsis: __syncthreads() missing in PTX when optimization not applicable.

Detailed description: The following program doesn’t seem to run correctly. I expect it to output *hostint = 1, but it outputs *hostint = 0. Both threads set *int1 = 0. After synchronisation, the first thread sets *int1 = 1, and after another synchronization the other sets *int2 = *int1, which turns out to be 0.

#include <stdio.h>

__global__ void mv_kernel2 (volatile int *int1, volatile int *int2) {

    *int1 = 0;

    __syncthreads();

int odd = (threadIdx.x % 2);

    if (odd) {

        *int1 = 1;

        __syncthreads();

    }

    else {

        __syncthreads();

        *int2 = *int1;

    }

}

int main() {

    int *hostint = (int*) malloc(sizeof(int));

    int *int1_dev, *int2_dev;

    int err;

    cudaMalloc(&int1_dev,sizeof(int));

    cudaMalloc(&int2_dev,sizeof(int));

    mv_kernel2<<<1, 2>>>(int1_dev, int2_dev);

    err = cudaMemcpy(hostint,int2_dev,1*sizeof(int),cudaMemcpyDeviceToHost);

    if (err) {

        printf("error: %d\n", err);

        exit(1);

    }

    printf("*hostint = %d\n", *hostint);

    cudaFree(int1_dev);

    cudaFree(int2_dev);

}

Compile with just nvcc, no options. Also fails using -O0.

If you change else to if (!odd), then the optimization is not applied and the code correctly outputs *hostint = 1.

It seems that the relevant bar.sync PTX instruction is mistakenly optimized away. Diff of the generated PTX (-wrong PTX, +good PTX):

-	@%p1 bra 	$Lt_0_1282;

+	@%p1 bra 	$Lt_0_1794;

 	.loc	16	9	0

 	mov.s32 	%r6, 1;

 	st.volatile.global.s32 	[%r1+0], %r6;

+$Lt_0_1794:

 	.loc	16	10	0

-	bra.uni 	$Lt_0_1026;

-$Lt_0_1282:

+	mov.u32 	%r7, 0;

+	setp.ne.s32 	%p2, %r4, %r7;

+	@%p2 bra 	$Lt_0_2306;

+	.loc	16	13	0

+	bar.sync 	0;

 	.loc	16	14	0

-	ld.volatile.global.s32 	%r7, [%r1+0];

-	ld.param.u32 	%r8, [__cudaparm__Z10mv_kernel2PViS0__int2];

-	st.volatile.global.s32 	[%r8+0], %r7;

-$Lt_0_1026:

+	ld.volatile.global.s32 	%r8, [%r1+0];

+	ld.param.u32 	%r9, [__cudaparm__Z10mv_kernel2PViS0__int2];

+	st.volatile.global.s32 	[%r9+0], %r8;

+$Lt_0_2306:

Does this also fail on your systems?

Short answer: Your example code has undefined semantics, because according to the Programming Guide, appendix B.6

Long answer: To fully understand what is going on here, we need to look at the more precisely defined semantics of the [font=“Courier New”]bar.sync[/font] PTX instruction. While according to the above quote on the C level [font=“Courier New”]__syncthreads()[/font] just isn’t allowed within conditional code that is not uniformly executed, PTX ISA 2.3 states

This implies that in your case the [font=“Courier New”]bar.sync[/font] has no effect, because it is defined to only sync between warps, and your block has only one warp. (Threads within a warp already execute in sync, and trying to sync different code paths within a warp would lead to deadlocks. Above semantics of [font=“Courier New”]bar.sync[/font] is carefully chosen to avoid these deadlocks).

Also note that the order with which different threads of a block execute the “if” or “else” clause relative to each other is not defined. So far it just happens to be that within a warp the “else” clause is always executed first. So even if the compiler left the [font=“Courier New”]bar.sync[/font] in the PTX code, your program would not generate the result you expect.

If “[font=“Courier New”]else[/font]” is replaced with “[font=“Courier New”]if (!odd)[/font]”, the order of the two stores within a warp is reversed, and you get the results you expect.

The optimization by the compiler is also correct. As mentioned above, the Programming Guide requires for [font=“Courier New”]__syncthreads()[/font] in conditional code that the conditional evaluates identically across the entire thread block. Under this invariant, both [font=“Courier New”]__syncthreads()[/font] within the conditional code are redundant and can be optimized away. The first, because repeated [font=“Courier New”]__syncthreads()[/font] without intermediate memory accesses are redundant. The second, because [font=“Courier New”]__syncthreads()[/font] at the end of a kernel has no effect.

If “[font=“Courier New”]else[/font]” is replaced with “[font=“Courier New”]if (!odd)[/font]”, only one [font=“Courier New”]__syncthreads()[/font] is redundant because (at the level of sophistication with which the compiler analyses the code) there now potentially is an additional write to [font=“Courier New”]*int1[/font] between the [font=“Courier New”]__syncthreads()[/font]. Note that the compiler now optimizes only one of them away but not the other.

EDIT: Grammar

You’re absolutely right. Thanks for the quick and long reply!

I read that section of the CUDA C programming guide, but I guess I didn’t realize it was that strict! I guess I’ll have to refactor my code a bit.