__syncthreads and blocking memory accesses

Hi gang,

I was wondering about placement of __syncthreads with respect to shared memory writes (e.g. loads from global memory) and was curious whether the streaming multiprocessors implement non-blocking loads for a given warp.

Assume the following sequence of events:

< calculate read addresses >
< load from global memory to shared memory >
< read shared memory >
< do work >
< calculate write addresses >
< store to global memory from shared memory >

My question is whether I should move as much work as possible above the __syncthreads to overlap any long load latencies. For example, if calculating the write addresses does not depend on the shared memory reads, then I should be able to move that step above the __syncthreads call. If a thread is allowed to keep working while its loads are outstanding, this strategy makes sense, because the threads in the warp can still make progress. But if the thread is stalled anyway until the loads complete, it seems better to get the __syncthreads out of the way.

I guess the question boils down to whether it's better to place the __syncthreads closer to a write of shared memory or closer to a read. My hunch is that you want to delay it as long as possible, but I suppose it depends on the implementation.
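
To make the placement concrete, here is a minimal sketch of the sequence above with the write-address calculation hoisted above the barrier. The kernel name reverse_tiles, the TILE size, and the host helper run_reverse_demo are all made up for illustration. Whether the hoisting hides any latency in practice depends on the hardware and on how the compiler schedules the instructions, since it is free to reorder independent arithmetic on its own.

```cuda
#include <cuda_runtime.h>

#define TILE 128  // threads per block (hypothetical size for this sketch)

// Reverses the whole input array and doubles each element.
// Launch with gridDim.x * TILE == array length and blockDim.x == TILE.
__global__ void reverse_tiles(const float *in, float *out)
{
    __shared__ float tile[TILE];

    // < calculate read addresses >
    int read_idx = blockIdx.x * TILE + threadIdx.x;

    // < load from global memory to shared memory >
    tile[threadIdx.x] = in[read_idx];

    // < calculate write addresses >, hoisted above the barrier. It uses only
    // block and thread indices, so it does not depend on shared memory.
    int write_idx = (gridDim.x - 1 - blockIdx.x) * TILE + threadIdx.x;

    // Barrier placed right before the first read of a slot that another
    // thread wrote.
    __syncthreads();

    // < read shared memory >
    float v = tile[TILE - 1 - threadIdx.x];

    // < do work >
    v *= 2.0f;

    // < store to global memory >
    out[write_idx] = v;
}

// Hypothetical host helper: runs the kernel on a small array and returns the
// number of mismatching elements (0 on success) or -1 on a CUDA error.
int run_reverse_demo()
{
    const int blocks = 4;
    const int n = blocks * TILE;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }

    cudaError_t err = cudaMemcpy(d_in, h_in, n * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        reverse_tiles<<<blocks, TILE>>>(d_in, d_out);
        err = cudaGetLastError();  // catches launch failures
    }
    if (err == cudaSuccess) {
        // This copy also waits for the kernel to finish.
        err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    // Element i of the input should land, doubled, at position n - 1 - i.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[n - 1 - i] != 2.0f * h_in[i]) ++mismatches;
    return mismatches;
}
```

Calling run_reverse_demo() from main and checking for a return value of 0 confirms the result is the same wherever the address calculation sits. Comparing timings of the two placements on the actual device is the only way to see whether it matters.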


I’m fairly new to CUDA myself, but my understanding is that when it comes to synchronizing threads for accessing shared memory, procrastination is a virtue. After all, unless you have to read from shared memory, there is no need to worry about the state of the shared memory.

See section in the Programming Guide v1.1:

"There is full sequential consistency of shared variables within a thread, however relaxed ordering across threads. Only after the execution of __syncthreads() (Section 4.4.2) are the writes from other threads guaranteed to be visible. Unless the variable is declared as volatile, the compiler is free to optimize the reads and writes to shared memory as long as the previous statement is met."

Procrastinating on syncing threads leaves the compiler free to optimize reads and writes to shared memory up to that point.
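
As a rough illustration of that rule, here is a small sketch in which a thread reads back its own shared memory write without a barrier and only syncs right before it reads a slot written by a different thread. The kernel name neighbor_sum, the BLOCK size, and the host helper run_neighbor_demo are hypothetical names chosen for the example, not anything from the Programming Guide.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64  // threads per block (hypothetical size for this sketch)

// Each thread adds its own element to the element owned by the thread on its
// right, wrapping around within the block.
// Launch with blockDim.x == BLOCK.
__global__ void neighbor_sum(const int *in, int *out)
{
    __shared__ int buf[BLOCK];
    int t = threadIdx.x;
    int g = blockIdx.x * BLOCK + t;

    buf[t] = in[g];

    // Reading back this thread's own write needs no barrier: accesses within
    // a single thread are sequentially consistent.
    int mine = buf[t];

    // Reading a slot written by another thread does need the barrier, so it
    // goes here, immediately before the first cross-thread read.
    __syncthreads();
    int right = buf[(t + 1) % BLOCK];

    out[g] = mine + right;
}

// Hypothetical host helper: returns the number of mismatching elements
// (0 on success) or -1 on a CUDA error.
int run_neighbor_demo()
{
    const int blocks = 3;
    const int n = blocks * BLOCK;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = 3 * i + 1;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }

    cudaError_t err = cudaMemcpy(d_in, h_in, n * sizeof(int),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        neighbor_sum<<<blocks, BLOCK>>>(d_in, d_out);
        err = cudaGetLastError();  // catches launch failures
    }
    if (err == cudaSuccess) {
        // This copy also waits for the kernel to finish.
        err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    // Recompute the expected sums on the host and compare.
    int mismatches = 0;
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < BLOCK; ++t) {
            int g = b * BLOCK + t;
            int expected = h_in[g] + h_in[b * BLOCK + (t + 1) % BLOCK];
            if (h_out[g] != expected) ++mismatches;
        }
    return mismatches;
}
```

Removing the __syncthreads() call makes the read of the neighbor's slot a race, so the result is no longer guaranteed even if it happens to come out right on a given run.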