Memory Consistency and __syncthreads()


I have a couple of questions regarding global memory consistency in CUDA. I looked through the existing thread on memory consistency but I thought it might be a good idea to just ask these questions here again.

  1. Is it guaranteed that all writes to global memory by threads in a thread block prior to a __syncthreads() are visible to all threads in the thread block after the __syncthreads() has executed?

  2. The CUDA 2.0 manual only talks about the visibility of shared memory writes across a thread block in the context of __syncthreads(), while the 3.1 manual also talks about writes to global memory. Was this added to the semantics of __syncthreads() in later versions of CUDA?

  3. Is there a single document that explains the consistency aspects of the CUDA memory model in detail? Such a document would be really helpful!

I would appreciate any help on these.

Thanks in advance!

Yes, this is guaranteed; see Section B.6 of the CUDA C Programming Guide 4.0.
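As a sketch of the guarantee in question (a hypothetical kernel, not taken from the Guide): thread 0 writes to global memory before the barrier, and every thread in the same block can safely read that value after it.

```cuda
__global__ void blockVisibility(int *g_data, int *g_out)
{
    // Thread 0 of each block writes to global memory...
    if (threadIdx.x == 0)
        g_data[blockIdx.x] = 42;

    // ...and after the barrier that write is visible to every
    // thread in the SAME block. Note: other blocks get no such
    // guarantee from __syncthreads() alone.
    __syncthreads();

    g_out[blockIdx.x * blockDim.x + threadIdx.x] = g_data[blockIdx.x];
}
```

Without the __syncthreads(), threads other than thread 0 could read g_data[blockIdx.x] before the write lands.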

I’m not sure there is a good single source for this, but you should take a look at Appendix B.5 in the CUDA C Programming Guide (4.0), which describes the memory fence functions. Those functions add to the default behavior, so reading about them may give you a better sense of the consistency issues.
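For visibility beyond a single block, the fence functions from Appendix B.5 are what you want. A hedged sketch of the usual producer/consumer pattern (the variable names here are made up for illustration): __threadfence() orders a global data write before a flag update, so any thread in another block that observes the flag also observes the data.

```cuda
__device__ int g_result;
__device__ volatile int g_flag = 0;

__global__ void producer(void)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        g_result = 123;    // write the data to global memory
        __threadfence();   // fence: the write to g_result becomes
                           // visible device-wide before any later
                           // write from this thread
        g_flag = 1;        // publish the flag
    }
    // A thread in ANY block that later reads g_flag == 1 is
    // guaranteed to also see g_result == 123. Without the fence,
    // the flag could become visible before the data.
}
```

By contrast, __threadfence_block() only orders writes with respect to threads in the same block, which is the scope __syncthreads() already covers.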