Apparent shared memory race condition despite using syncthreads

I appear to be getting a read-after-write race condition on a shared memory variable, despite syncing after the write. Here’s the offending snippet (apologies for the ugly code; it was generated from PGI’s CUDA Fortran):

[codebox]
if( ((tid)!=((1)))) goto _BB_210;
if( ((jeval)!=((1)))) goto _BB_208;
jeval = (0);
hratio = (1.00000000e+00f);
nslp = nsteps+(50);
drate = (6.99999988e-01f);
_BB_208: ;
l3 = (0);
_BB_210: ;
__syncthreads();
trace(indx, tid, ((signed char*)(&tcount)), jeval);
[/codebox]
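For readability, here is roughly what that generated code corresponds to. This is my own back-translation into CUDA C; the kernel wrapper, the shared declarations, and the trace() stub are guesses, so treat it as a sketch rather than what the PGI compiler actually emitted:

[codebox]
// Back-translation of the generated snippet above (a sketch, not compiler output).
// The wrapper, declarations, and trace() stub are assumptions; names come from the snippet.
__device__ void trace(int indx, int tid, signed char *buf, int value)
{
    // stand-in for the real trace(), which records `value` in a global buffer
}

__global__ void step_kernel(int indx, int nsteps)
{
    __shared__ int jeval, nslp, l3, tcount;   // set earlier in the real kernel
    __shared__ float hratio, drate;
    int tid = threadIdx.x + 1;                // 1-based thread index, as in the Fortran

    if (tid == 1) {                           // only thread 1 updates the shared state
        if (jeval == 1) {
            jeval  = 0;
            hratio = 1.0f;
            nslp   = nsteps + 50;
            drate  = 0.7f;
        }
        l3 = 0;
    }
    __syncthreads();                          // after this, every thread should see jeval == 0
    trace(indx, tid, (signed char *)&tcount, jeval);
}
[/codebox]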

trace() just writes its last argument into a global memory buffer for retrieval by the host. Examining the trace, jeval is 0 for thread 1 but still 1 for thread 2, as if jeval were not being treated as shared. However, the PTX shows the expected shared loads and stores:

[codebox]
// Part of loop body line 1908, head labeled $Lt_1_60674
.loc	2	662	0
@%p2 bra 	$Lt_1_9986;               // <--- skip if tid .ne. 1
// Part of loop body line 1908, head labeled $Lt_1_60674
.loc	2	663	0
ld.shared.s32 	%r1981, [jeval];
mov.u32 	%r1982, 1;
setp.ne.s32 	%p333, %r1981, %r1982;
@%p333 bra 	$Lt_1_10242;              // <--- skip if jeval .ne. 1
// Part of loop body line 1908, head labeled $Lt_1_60674
.loc	2	664	0
mov.s32 	%r1983, 0;
st.shared.s32 	[jeval], %r1983;      // <---- jeval stored as expected
    // other variables omitted
$Lt_1_10242:
// Part of loop body line 1908, head labeled $Lt_1_60674
.loc	2	669	0
mov.s32 	%r1986, 0;
st.shared.s32 	[l3], %r1986;
$Lt_1_9986:
// Part of loop body line 1908, head labeled $Lt_1_60674
.loc	2	671	0
bar.sync 	0;                        // <----------- barrier
.loc	2	1908	0
@%p2 bra 	$Lt_1_51458;
// Part of loop body line 1908, head labeled $Lt_1_60674
.loc	2	1909	0
add.s32 	%r1987, %r273, 1;
mov.s32 	%r273, %r1987;
.loc	2	1910	0
ld.shared.s32 	%r1988, [jeval];      // <--------- jeval loaded as expected
add.s32 	%r1989, %r6, %r1987;
[/codebox]

Has anybody ever seen anything similar to this? Any idea what the cause might be?

It’s impossible to say from the little code snippet, so I’ll just throw some general speculation in:

Is there another [font=“Courier New”]__syncthreads()[/font] after the call to trace() to ensure [font=“Courier New”]jeval[/font] is not overwritten before it’s dumped?

Are all [font=“Courier New”]__syncthreads()[/font] outside of conditional code (or at least under conditions that evaluate to the same result in all threads)? The [font=“Courier New”]goto[/font]s in the Fortran-derived code make this less obvious.
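To make the second point concrete, here is a generic sketch (not taken from your code) of the pattern to watch for: a [font=“Courier New”]__syncthreads()[/font] that only part of the block reaches is undefined behaviour, while one that every thread reaches, outside any thread-dependent branch, is fine.

[codebox]
__global__ void barrier_patterns(const int *flag, int *out)
{
    __shared__ int s;

    // BAD: the barrier sits inside a thread-dependent branch, so only
    // thread 0 reaches it -> undefined behaviour (often, but not always, a hang).
    //
    //   if (threadIdx.x == 0) {
    //       s = *flag;
    //       __syncthreads();
    //   }

    // OK: the write is conditional, but the barrier is reached by all threads.
    if (threadIdx.x == 0) {
        s = *flag;                 // one thread writes the shared value
    }
    __syncthreads();               // every thread arrives here
    out[threadIdx.x] = s;          // now everyone can read s safely
}
[/codebox]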

Putting another syncthreads in after the trace didn’t change the result.

All of the syncthreads calls in the function should be outside of any thread-dependent conditional code, but I’m double-checking them to be sure. I would have expected that kind of error to produce a deadlock, but apparently not: the inconsistent jeval causes the threads to diverge, yet they don’t hang at the mismatched synchronization points. So far this seems like the most plausible explanation.

In case anyone is interested, the culprit did turn out to be a race condition. I had covered all of the read-after-write cases, but there were some write-after-read hazards that I had missed. The weird phenomenon I described above comes from the fact that the barriers created by syncthreads are not distinguishable from one another: if warps diverge and some of them reach one barrier instruction while others reach a different one, all of them are released to continue executing, just as if they had all arrived at a single barrier. In my case, one warp reached the barrier I quoted with one value of jeval, all the warps were then released, and some time later another warp reached the same barrier, by which time jeval had changed.
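To illustrate the kind of write-after-read hazard I had missed, here is a stripped-down sketch (not my actual kernel). Without the middle barrier, thread 0 can overwrite the shared value while slower warps are still reading the old one:

[codebox]
// Sketch of a write-after-read hazard (illustrative, not my actual code).
__global__ void war_hazard(int *out)
{
    __shared__ int jeval;
    if (threadIdx.x == 0) jeval = 1;
    __syncthreads();               // read-after-write: publish the initial value

    int local = jeval;             // every thread reads the shared value
    __syncthreads();               // write-after-read: WITHOUT this barrier, the
                                   // store below can race ahead of the loads in
                                   // warps that haven't executed them yet
    if (threadIdx.x == 0) {
        jeval = 0;                 // safe only once all the reads above are done
    }
    __syncthreads();               // read-after-write: publish the new value

    out[threadIdx.x] = local + jeval;
}
[/codebox]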

The punch line to all of this is that making it through a syncthreads doesn’t necessarily mean your warps are “synchronized”. In practice this can even be useful for debugging: instead of deadlocking instantly, a badly diverged block will lurch along for a few more iterations (maybe even all the way to the end, if you’re lucky). If you collect a trace from threads in a couple of different warps, you can pinpoint the race condition causing the problem fairly easily.
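For anyone who wants to try the same trick, a minimal version of such a trace mechanism might look like this (my real trace() takes a few more arguments, but the idea is the same): each call appends a value tagged with the thread index to a global buffer, which the host copies back and decodes after the kernel finishes.

[codebox]
// Minimal per-kernel trace buffer (a sketch, not my actual implementation).
#define TRACE_CAPACITY 4096

__device__ int traceBuf[TRACE_CAPACITY];   // records written by the kernel
__device__ int traceCount = 0;             // next free slot

__device__ void trace(int tid, int value)
{
    int slot = atomicAdd(&traceCount, 1);  // grab a unique slot
    if (slot < TRACE_CAPACITY)
        traceBuf[slot] = (tid << 16) | (value & 0xffff);   // tag the value with the thread id
}

// Host side, after the kernel and a cudaDeviceSynchronize():
//   int count, buf[TRACE_CAPACITY];
//   cudaMemcpyFromSymbol(&count, traceCount, sizeof(int));
//   cudaMemcpyFromSymbol(buf, traceBuf, sizeof(buf));
//   ...decode the (tid, value) pairs in slot order to see which warp saw which value...
[/codebox]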