Shared memory updates are only guaranteed to be “visible” to the thread that issued them. To make them visible to other threads in the block, some sort of synchronization point is necessary. This is not actually a synchronization issue, but a compiler optimization issue. Specifically, this update:
s_flux[idx] = 1.0;
will not necessarily be visible to any other thread reading the value here:
d_w[id] = 7.0 + s_flux[idx+1];
unless a specific synchronization or barrier is established between those two points in the code.
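For concreteness, here is a minimal sketch of the pattern in question. Only s_flux, d_w, idx, and id come from the snippets above; the kernel name, the double data type, the fixed shared-array size, and the bounds check are assumptions for illustration:

// sketch only: a write to shared memory by one thread, followed by a read of a
// neighboring element (written by a different thread) with no barrier in between
__global__ void te_x_sketch(double *d_w)
{
    __shared__ double s_flux[256];                // assumed block size of 256
    int idx = threadIdx.x;
    int id  = blockIdx.x * blockDim.x + threadIdx.x;

    s_flux[idx] = 1.0;                            // update issued by this thread
                                                  // (no __syncthreads() here)
    if (idx + 1 < blockDim.x)                     // assumed bounds check
        d_w[id] = 7.0 + s_flux[idx + 1];          // reads an element written by another thread
}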
The compiler is free to optimize shared memory values into registers if it so chooses. Those optimizations can be performed at any time. In order to force an update from another thread to be visible, you can do one of the following things:
1. compile your code with -G, which disables most optimizations
2. add the “volatile” keyword to the shared memory declaration, which forces the compiler to forgo this optimization in general (every access to that shared memory will then be serviced by an actual shared memory transaction rather than a register copy; see the sketch after this list)
3. add a __syncthreads() after the update of shared memory:
s_flux[idx] = 1.0;
__syncthreads();
Not only is __syncthreads() a thread barrier, but it also has the effect of forcing shared memory updates to be visible to all threads in the block. Note the programming guide description of __syncthreads():
Programming Guide :: CUDA Toolkit Documentation (https://docs.nvidia.com/cuda/cuda-c-programming-guide/)
“…and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.”
So not only is it a thread execution barrier, but it is a memory barrier as well. (An explicit memory barrier function, such as __threadfence_block(), would also work here.)
4. you could also have each thread update the shared memory value it will actually use later:
s_flux[(idx+1)%blockDim.x] = 1.0;
since that value is then guaranteed to be visible to that thread.
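As a hedged illustration of options 2 and 3, the corrected kernel could look like the sketch below (same assumed names, types, and sizes as the earlier sketch):

__global__ void te_x_fixed(double *d_w)
{
    // option 2 (alternative): declare the array volatile so every access goes to
    // shared memory rather than a register copy:
    // __shared__ volatile double s_flux[256];
    __shared__ double s_flux[256];                // assumed block size of 256
    int idx = threadIdx.x;
    int id  = blockIdx.x * blockDim.x + threadIdx.x;

    s_flux[idx] = 1.0;
    __syncthreads();                              // option 3: execution barrier + shared memory visibility
    if (idx + 1 < blockDim.x)                     // assumed bounds check
        d_w[id] = 7.0 + s_flux[idx + 1];          // now guaranteed to see the other thread's write
}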
The addition or removal of the __syncthreads() that you have in your code does affect the issue, but only as a side effect of the compiler's optimization choices.
You can also spot this issue by running cuda-memcheck with the --tool racecheck option. When I do this on CUDA 6.5 it reports something like this:
========= Race reported between Write access at 0x00000090 in G_te_x(void)
========= and Read access at 0x000000a8 in G_te_x(void) [248 hazards]
========= RACECHECK SUMMARY: 1 hazard displayed (1 error, 0 warnings)
but if I add the __syncthreads() after the update to shared memory (option 3 above), or have each thread update the value it will use later (option 4 above), I get no warnings.
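For reference, racecheck is run against the built executable roughly like this (the executable name here is hypothetical):

cuda-memcheck --tool racecheck ./my_app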