syncthreads drops back to main in emulation mode

I’m trying to debug one of my kernels in device emulation mode, but at a certain point execution drops back to main() on a __syncthreads() call for no apparent reason. The kernel is very long and calls several other device functions; the problem occurs in one of them, add7. Not every call to __syncthreads() triggers it, nor does every call to add7, but the point at which it fails (relative to the program’s dataflow) is consistent.

Here’s add7:

#define tidx threadIdx.x

__device__ void add7(float* dest, float* a, float k2, float* b) {
    if (tidx == 0) printf("hello from add7 A+k2B\n");
    if (tidx < 7)
        dest[tidx] = a[tidx] + k2*b[tidx];
    if (tidx == 0) printf("presync from add7 A+k2B\n");
    __syncthreads();
    if (tidx == 0) printf("goodbye from add7 A+k2B\n");
}

And here is the printf output from the program’s execution (it includes tracing statements from other functions as well):

Line search iteration 2:
hello from add7 A+k2B
presync from add7 A+k2B
goodbye from add7 A+k2B
Line search iteration 3: at [-0.000000,0.000132,0.000000,0.990001,-0.000002,-0.000009,-0.000011], lambda = 0.010000, f=-356.670593
hello from add7 A+k2B
presync from add7 A+k2B
end of kernel execution in main()

Valgrind doesn’t show much untoward happening, except that immediately after the last printf from add7 (and before the one from main), I see errors of the following type:

==19526==
==19526== Conditional jump or move depends on uninitialised value(s)
==19526==    at 0x5847494: __printf_fp (in /lib/libc-2.7.so)
==19526==    by 0x58440F3: vfprintf (in /lib/libc-2.7.so)
==19526==    by 0x584BED9: printf (in /lib/libc-2.7.so)
==19526==    by 0x4035D3: main (gpuopt.cu:393)
==19526==

Any ideas why __syncthreads() would unceremoniously terminate my kernel like that? I assume it is related to me walking over memory somewhere, but I don’t quite see how.

I ran into a similar problem. It appears to happen when you call __syncthreads() from within a block of code that not all threads execute. Moving any such calls outside of the thread-restricted block fixed the issue for me, even where the restricted block did nothing more than invoke the function containing the barrier.
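To make that concrete, here is a minimal sketch of the pattern I mean. The original kernel wasn't posted, so the caller lineSearchStep, the sum output, and the host helper add7_demo_ok are all hypothetical names of mine; only add7 follows the code in the question (with the printfs dropped). The CUDA programming guide only allows __syncthreads() in conditional code when the condition evaluates identically for every thread in the block, so the guard belongs inside add7 and the call itself stays unconditional.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define tidx threadIdx.x

// Same shape as add7 in the question: only threads 0..6 write an element,
// but every thread of the block has to reach the barrier.
__device__ void add7(float* dest, const float* a, float k2, const float* b) {
    if (tidx < 7)
        dest[tidx] = a[tidx] + k2 * b[tidx];
    __syncthreads();
}

// Hypothetical caller (assumption: the real kernel was not posted).
// Problematic pattern, left as a comment on purpose:
//     if (tidx < 7) add7(dest, a, k2, b);  // threads >= 7 never reach the barrier
// Safer pattern: call add7 from all threads and let it do the guarding.
__global__ void lineSearchStep(float* dest, float* sum,
                               const float* a, float k2, const float* b) {
    add7(dest, a, k2, b);              // executed by every thread in the block

    // After the barrier, thread 0 can safely read what the others wrote.
    if (tidx == 0) {
        float s = 0.0f;
        for (int i = 0; i < 7; ++i) s += dest[i];
        *sum = s;
    }
}

// Hypothetical host helper: launches one block of 32 threads and checks
// the seven results and their sum against values computed on the host.
bool add7_demo_ok() {
    const int n = 7;
    const float k2 = 0.5f;
    float ha[n], hb[n], hdest[n], hsum = 0.0f;
    for (int i = 0; i < n; ++i) { ha[i] = float(i); hb[i] = 1.0f; }

    float *da, *db, *ddest, *dsum;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&ddest, n * sizeof(float));
    cudaMalloc(&dsum, sizeof(float));
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    lineSearchStep<<<1, 32>>>(ddest, dsum, da, k2, db);
    cudaError_t err = cudaDeviceSynchronize();  // surfaces launch/run errors

    cudaMemcpy(hdest, ddest, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&hsum, dsum, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(ddest); cudaFree(dsum);

    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    float expected = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (hdest[i] != ha[i] + k2 * hb[i]) return false;
        expected += ha[i] + k2 * hb[i];
    }
    return hsum == expected;
}
```

Calling add7_demo_ok() from main() should be enough to try it. If your kernel has a call site like the commented-out one, that would be the first place I'd look, though I can't tell from the posted code whether that is what is happening in your case.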