I appear to be getting a read-after-write race condition on a shared memory variable, despite syncing after the write. Here’s the offending snippet (apologies for the ugly code; it was generated from PGI’s CUDA Fortran):
[codebox]
if( ((tid)!=((1)))) goto _BB_210;
if( ((jeval)!=((1)))) goto _BB_208;
jeval = (0);
hratio = (1.00000000e+00f);
nslp = nsteps+(50);
drate = (6.99999988e-01f);
_BB_208: ;
l3 = (0);
_BB_210: ;
__syncthreads();
trace(indx, tid, ((signed char*)(&tcount)), jeval);
[/codebox]
trace just writes its last argument into a global memory buffer for retrieval by the host. The trace output shows jeval is 0 for thread 1, but still 1 for thread 2. It is as if jeval is not being treated as shared. However, the PTX shows the expected shared loads and stores:
[codebox]
// Part of loop body line 1908, head labeled $Lt_1_60674
.loc 2 662 0
@%p2 bra $Lt_1_9986; // <--- skip if tid .ne. 1
// Part of loop body line 1908, head labeled $Lt_1_60674
.loc 2 663 0
ld.shared.s32 %r1981, [jeval];
mov.u32 %r1982, 1;
setp.ne.s32 %p333, %r1981, %r1982;
@%p333 bra $Lt_1_10242; // <--- skip if jeval .ne. 1
// Part of loop body line 1908, head labeled $Lt_1_60674
.loc 2 664 0
mov.s32 %r1983, 0;
st.shared.s32 [jeval], %r1983; // <---- jeval stored as expected
// other variables omitted
$Lt_1_10242:
// Part of loop body line 1908, head labeled $Lt_1_60674
.loc 2 669 0
mov.s32 %r1986, 0;
st.shared.s32 [l3], %r1986;
$Lt_1_9986:
// Part of loop body line 1908, head labeled $Lt_1_60674
.loc 2 671 0
bar.sync 0; // <----------- barrier
.loc 2 1908 0
@%p2 bra $Lt_1_51458;
// Part of loop body line 1908, head labeled $Lt_1_60674
.loc 2 1909 0
add.s32 %r1987, %r273, 1;
mov.s32 %r273, %r1987;
.loc 2 1910 0
ld.shared.s32 %r1988, [jeval]; // <--------- jeval loaded as expected
add.s32 %r1989, %r6, %r1987;
[/codebox]
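For comparison, the intended pattern (one thread updates a shared flag, the block synchronizes, then every thread reads it) can be reduced to a small hand-written kernel. This is only a minimal sketch and not the generated code: flag_kernel, trace_buf and count_stale_reads are hypothetical names, the global buffer stands in for trace, and the 1-based tid is an assumption carried over from the Fortran origin.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical reduction of the snippet: thread 1 clears a shared flag,
// the block synchronizes, then every thread records the value it sees.
__global__ void flag_kernel(int *trace_buf)
{
    __shared__ int jeval;                 // one copy per thread block
    const int tid = threadIdx.x + 1;      // 1-based, assumed to match the generated code

    if (tid == 1) jeval = 1;              // initial value, written by a single thread
    __syncthreads();                      // make the initial value visible to the block

    if (tid == 1 && jeval == 1) jeval = 0;  // the guarded update from the snippet
    __syncthreads();                      // the write must complete before the reads below

    // Stand-in for trace(): every thread stores the value of jeval it observes.
    trace_buf[blockIdx.x * blockDim.x + threadIdx.x] = jeval;
}

// Launches the kernel and returns the number of threads that still saw a
// non-zero jeval after the barrier, or -1 if any CUDA call failed.
int count_stale_reads(int blocks, int threads)
{
    const int n = blocks * threads;
    int *d_buf = nullptr;
    if (cudaMalloc(&d_buf, n * sizeof(int)) != cudaSuccess) return -1;

    flag_kernel<<<blocks, threads>>>(d_buf);
    cudaError_t err = cudaGetLastError();  // catches launch failures

    std::vector<int> h_buf(n, -1);
    if (err == cudaSuccess) {
        // cudaMemcpy on the default stream waits for the kernel to finish.
        err = cudaMemcpy(h_buf.data(), d_buf, n * sizeof(int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_buf);
    if (err != cudaSuccess) return -1;

    int stale = 0;
    for (int v : h_buf) {
        if (v != 0) ++stale;
    }
    return stale;
}
```

Calling count_stale_reads(4, 64) from a host main should report no stale reads if the barrier behaves as documented. If a reduction like this behaves correctly while the generated code does not, the differences between the two would narrow down where to look.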
Has anybody ever seen anything similar to this? Any idea what the cause might be?