I appear to be hitting a read-after-write race on a shared-memory variable, even though I synchronize after the write. Here's the offending snippet:
if(tid .eq. 1) then
   if(jeval .eq. 1) then
      jeval = 0
      hratio = 1.0_rkind
      nslp = nsteps + MBETWEEN
      drate = 0.7_rkind
   end if
   l3 = 0
end if
call syncthreads()
call trace(indx,tid,tcount,jeval)
trace just writes its last argument into a global-memory buffer so the host can retrieve it. The trace output shows jeval as 0 for thread 1 but still 1 for thread 2, as if jeval were not being treated as shared. However, the PTX shows the expected shared loads and stores:
//<loop> Part of loop body line 1908, head labeled $Lt_1_60674
.loc 2 662 0
@%p2 bra $Lt_1_9986; // <--- skip if tid .ne. 1
//<loop> Part of loop body line 1908, head labeled $Lt_1_60674
.loc 2 663 0
ld.shared.s32 %r1981, [jeval];
mov.u32 %r1982, 1;
setp.ne.s32 %p333, %r1981, %r1982;
@%p333 bra $Lt_1_10242; // <--- skip if jeval .ne. 1
//<loop> Part of loop body line 1908, head labeled $Lt_1_60674
.loc 2 664 0
mov.s32 %r1983, 0;
st.shared.s32 [jeval], %r1983; // <---- jeval stored as expected
// other variables omitted
$Lt_1_10242:
//<loop> Part of loop body line 1908, head labeled $Lt_1_60674
.loc 2 669 0
mov.s32 %r1986, 0;
st.shared.s32 [l3], %r1986;
$Lt_1_9986:
//<loop> Part of loop body line 1908, head labeled $Lt_1_60674
.loc 2 671 0
bar.sync 0; // <----------- barrier
.loc 2 1908 0
@%p2 bra $Lt_1_51458;
//<loop> Part of loop body line 1908, head labeled $Lt_1_60674
.loc 2 1909 0
add.s32 %r1987, %r273, 1;
mov.s32 %r273, %r1987;
.loc 2 1910 0
ld.shared.s32 %r1988, [jeval]; // <--------- jeval loaded as expected
add.s32 %r1989, %r6, %r1987;
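For reference, here is a rough sketch of what a trace routine like the one described above might look like. It comes with a stripped-down kernel that runs the same pattern in isolation: one thread conditionally writes, there is a barrier, and then all threads read. Everything in it is hypothetical. The module name trace_mod, the device buffer tracebuf and its dimensions, and the kernel flag_test are invented for illustration and are not taken from the original code. If an isolated version like this shows every thread reading 0 while the full kernel does not, the problem would more likely lie in the surrounding code than in the write/barrier/read pattern itself.

```fortran
! Hypothetical sketch: trace_mod, tracebuf, MAXIDX/MAXTHR and flag_test
! are illustrative names, not taken from the original kernel.
module trace_mod
  use cudafor
  implicit none
  integer, parameter :: MAXIDX = 64, MAXTHR = 256
  ! One slot per (trace point, thread); copied back to the host afterwards.
  integer, device :: tracebuf(MAXIDX, MAXTHR)
contains
  ! Record the value a thread saw at trace point indx.
  ! tcount is accepted only to mirror the original call; it is unused here.
  attributes(device) subroutine trace(indx, tid, tcount, val)
    integer, value :: indx, tid, tcount, val
    tracebuf(indx, tid) = val
  end subroutine trace

  ! Same shape as the original snippet: thread 1 conditionally clears a
  ! shared flag, all threads hit a barrier, then every thread records it.
  attributes(global) subroutine flag_test()
    integer, shared :: flag
    integer :: tid
    tid = threadIdx%x
    if (tid .eq. 1) flag = 1        ! initialize the shared flag
    call syncthreads()
    if (tid .eq. 1) then
      if (flag .eq. 1) then
        flag = 0                    ! the write under suspicion
      end if
    end if
    call syncthreads()              ! all threads should now see flag == 0
    call trace(1, tid, 0, flag)
  end subroutine flag_test
end module trace_mod
```

On the host, one would launch it as a single block (for example call flag_test<<<1, 64>>>()), copy tracebuf back into a host array, and check that the first 64 entries of trace point 1 are all 0.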
Has anybody ever seen anything similar to this? Any idea what the cause might be?