Shared mem RAW without sync

I found a way to skip __syncthreads() for shared memory read-after-write (RAW). The key is that the threads of a warp execute in lockstep in hardware, so they are implicitly synchronized with each other.

So, within the same warp, we can access any memory as many times as we like without explicit synchronization. __syncthreads() is only needed for inter-warp operations.

This way, we can put more computation between a global memory access and the next __syncthreads(), so SIMT can hide global memory latency better because the warps are less tightly coupled. :lol:

JY 2009

This is indeed a useful and common optimization.
I heartily recommend the SDK examples, especially the Reduction sample. The writeup is excellent, and shows some good examples of eliminating unneeded __syncthreads() calls for intra-warp access. There’s a quite nice template optimization in the code for this too.
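To make the idea concrete, here is a minimal sketch of that pattern. It is not the SDK code itself: the kernel name block_sum, the BLOCK constant, and the host helper run_block_sum are made up for illustration. The steps whose stride crosses warp boundaries keep __syncthreads(), while the final steps run inside warp 0 only. One assumption to flag: on GPUs with independent thread scheduling (Volta and later, CUDA 9+) lockstep execution within a warp is not guaranteed, so this sketch uses the warp-level __syncwarp() where the 2009-era code used no barrier at all.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define BLOCK 128  // hypothetical block size for this sketch (multiple of 64)

// Sums BLOCK ints with one thread block.
__global__ void block_sum(const int *in, int *out)
{
    __shared__ int sdata[BLOCK];
    unsigned tid = threadIdx.x;

    sdata[tid] = in[tid];
    __syncthreads();  // inter-warp: every warp must finish its write first

    // Strides larger than a warp touch data written by other warps,
    // so these steps still need the block-wide barrier.
    for (unsigned s = BLOCK / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Final steps: only warp 0 takes part, so __syncthreads() is not used.
    if (tid < 32) {
        volatile int *v = sdata;  // force real shared-memory loads and stores
        int x = v[tid];
        for (unsigned s = 32; s > 0; s >>= 1) {
            x += v[tid + s];  // read a value written by another lane
            __syncwarp();     // all lanes have read before anyone overwrites
            v[tid] = x;
            __syncwarp();     // all lanes have written before the next read
        }
        if (tid == 0) *out = x;  // only lane 0 holds the complete sum
    }
}

// Host helper (hypothetical): fills the input with 1..BLOCK and returns the sum.
int run_block_sum()
{
    int h_in[BLOCK], h_out = 0;
    for (int i = 0; i < BLOCK; ++i) h_in[i] = i + 1;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    block_sum<<<1, BLOCK>>>(d_in, d_out);

    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Lanes 16 to 31 keep adding stale values in the later steps, which is harmless here because only lane 0's result is written out.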

One big caveat: in the emulator you still need a __syncthreads() call, since the warp size is 1 in emulation mode.

__syncthreads() also tells the compiler not to optimize away the shared memory writes and not to cache shared memory values in registers. So if you really want to rely on this unsupported behavior, you will need to make sure to use the volatile keyword where needed.
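A small sketch of what that looks like in practice, assuming a launch with exactly one warp. The kernel rotate_left and the host helper run_rotate_left are hypothetical names. Shared memory is accessed through a volatile pointer so the compiler cannot keep the values in registers. The __syncwarp() call is an addition for CUDA 9+ and GPUs with independent thread scheduling, where lanes of a warp may not run in lockstep; it is a warp-level barrier, not a __syncthreads().

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: must be launched with exactly one warp (32 threads).
// Each lane writes its own slot, then reads the slot of the next lane.
__global__ void rotate_left(int *out)
{
    __shared__ int buf[32];
    volatile int *vbuf = buf;  // volatile: every access goes to shared memory
    int lane = threadIdx.x;

    vbuf[lane] = 10 * lane;            // write own slot
    __syncwarp();                      // warp-level barrier only
    out[lane] = vbuf[(lane + 1) % 32]; // read-after-write on another lane's slot
}

// Host helper (hypothetical): runs the kernel and returns out[lane].
int run_rotate_left(int lane)
{
    int h_out[32] = {0};
    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(h_out));

    rotate_left<<<1, 32>>>(d_out);

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out[lane];
}
```

Lane 0 ends up with the value written by lane 1, and lane 31 wraps around to the value written by lane 0.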

Is it really unsupported? I mean, it is used in the reduction example, and I would expect the SDK examples to contain only supported behavior.