Shared memory race condition despite using syncthreads

I appear to be getting a read-after-write race condition on a shared memory variable, despite synchronizing after the write. Here’s the offending snippet:

    if(tid .eq. 1) then
       if(jeval .eq. 1) then
          jeval  = 0
          hratio = 1.0_rkind
          nslp   = nsteps + MBETWEEN
          drate  = 0.7_rkind
       end if 
       l3 = 0
    end if

    call syncthreads()

    call trace(indx,tid,tcount,jeval)

The trace routine just writes its last argument into a global memory buffer for retrieval by the host. Examining the trace shows that jeval is 0 for thread 1 but still 1 for thread 2, as if jeval were not being treated as shared at all.
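For context, trace is roughly a routine of this shape (a simplified sketch only; the buffer name, size, and layout below are placeholders, not my actual code):

    module trace_mod
      use cudafor
      implicit none
      integer, parameter :: NTHREADS = 64        ! placeholder block size
      integer, device    :: tracebuf(NTHREADS)   ! global buffer the host copies back
    contains
      attributes(device) subroutine trace(indx, tid, tcount, val)
        integer, value :: indx, tid, tcount, val
        ! each thread records the value it currently sees into its own slot;
        ! indx and tcount are kept only to mirror the real argument list
        tracebuf(tid) = val
      end subroutine trace
    end module trace_mod

However, examining the PTX shows the expected shared loads and stores: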

 //<loop> Part of loop body line 1908, head labeled $Lt_1_60674
	.loc	2	662	0
	@%p2 bra 	$Lt_1_9986;           // <--- skip if tid .ne. 1
 //<loop> Part of loop body line 1908, head labeled $Lt_1_60674
	.loc	2	663	0
	ld.shared.s32 	%r1981, [jeval];  
	mov.u32 	%r1982, 1;
	setp.ne.s32 	%p333, %r1981, %r1982;
	@%p333 bra 	$Lt_1_10242;          // <--- skip if jeval .ne. 1
 //<loop> Part of loop body line 1908, head labeled $Lt_1_60674
	.loc	2	664	0
	mov.s32 	%r1983, 0;
	st.shared.s32 	[jeval], %r1983;           // <---- jeval stored as expected
        // other variables omitted
$Lt_1_10242:
 //<loop> Part of loop body line 1908, head labeled $Lt_1_60674
	.loc	2	669	0
	mov.s32 	%r1986, 0;
	st.shared.s32 	[l3], %r1986;
$Lt_1_9986:
 //<loop> Part of loop body line 1908, head labeled $Lt_1_60674
	.loc	2	671	0
	bar.sync 	0;                                        // <----------- barrier
	.loc	2	1908	0
	@%p2 bra 	$Lt_1_51458;
 //<loop> Part of loop body line 1908, head labeled $Lt_1_60674
	.loc	2	1909	0
	add.s32 	%r1987, %r273, 1;
	mov.s32 	%r273, %r1987;
	.loc	2	1910	0
	ld.shared.s32 	%r1988, [jeval];        // <--------- jeval loaded as expected
	add.s32 	%r1989, %r6, %r1987;

Has anybody ever seen anything similar to this? Any idea what the cause might be?

Hi Robert,

I’ve sent your question to several others, but unfortunately no one has seen this behavior before. One person wondered if your kernel could be failing and asked if you could add calls to ‘cudaThreadSynchronize()’ and ‘cudaGetLastError()’ after your kernel launch on the host.
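Something along these lines should do it (the dummy kernel and launch configuration below are just placeholders, not your code):

    module demo_kernel
      use cudafor
      implicit none
    contains
      attributes(global) subroutine dummy(a)
        integer, device :: a(*)
        a(threadIdx%x) = threadIdx%x
      end subroutine dummy
    end module demo_kernel

    program check_launch
      use cudafor
      use demo_kernel
      implicit none
      integer, device :: d_a(32)
      integer :: istat

      call dummy<<<1,32>>>(d_a)

      istat = cudaThreadSynchronize()   ! block until the kernel has finished
      istat = cudaGetLastError()        ! pick up any launch or execution error
      if (istat .ne. cudaSuccess) then
         write(*,*) 'kernel failed: ', cudaGetErrorString(istat)
      end if
    end program check_launch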

  • Mat

Mat,

cudaGetLastError() reported successful completion.

Nobody on the NVIDIA forums had seen similar behavior either. I’ve removed the shared variables for now and replaced them with thread-local variables, which cures the divergence problem (at the cost of a significantly higher register count). When I have more time I’ll circle back and try to figure out what was going on.
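For the record, the workaround amounts to something like the following simplified sketch (not the real kernel; MBETWEEN, rkind, and the surrounding code are placeholders):

    module stepper_mod
      use cudafor
      implicit none
      integer, parameter :: rkind    = kind(1.0d0)   ! placeholder kind
      integer, parameter :: MBETWEEN = 20            ! placeholder value
    contains
      attributes(global) subroutine stepper(nsteps)
        integer, value :: nsteps
        ! formerly declared with the shared attribute; now thread-private,
        ! so every thread keeps and updates its own copy and no syncthreads()
        ! is needed to make the new values visible (at the cost of more
        ! registers per thread)
        integer     :: jeval, nslp, l3
        real(rkind) :: hratio, drate

        jeval = 1
        if (jeval .eq. 1) then
           jeval  = 0
           hratio = 1.0_rkind
           nslp   = nsteps + MBETWEEN
           drate  = 0.7_rkind
        end if
        l3 = 0
      end subroutine stepper
    end module stepper_mod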