I have a couple of questions regarding global memory consistency in CUDA. I looked through the existing thread on memory consistency, but I thought it might be a good idea to ask these questions here again.
Is it guaranteed that all writes to global memory by threads in a thread block prior to a __syncthreads() are visible to all threads in that block after the __syncthreads() has executed?
The CUDA 2.0 manual only talks about the visibility of shared memory writes across a thread block in the context of __syncthreads(), while the 3.1 manual also talks about writes to global memory. Was this guarantee added to the semantics of __syncthreads() in a later version of CUDA?
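To make the question concrete, here is a minimal sketch (the kernel and variable names are just made up for illustration) of the pattern I'm asking about — a global-memory write before the barrier being read by other threads of the same block after it:

```cuda
// Hypothetical kernel: thread 0 writes a flag to GLOBAL memory, then every
// thread in the block reads it back after the barrier. The question is
// whether __syncthreads() guarantees the read observes the write.
__global__ void block_visibility(int *flag, int *out)
{
    if (threadIdx.x == 0)
        *flag = 42;            // global-memory write before the barrier

    __syncthreads();           // barrier; per the 3.1+ manuals this also
                               // makes prior global writes visible to the
                               // rest of the block

    out[threadIdx.x] = *flag;  // does every thread in the block see 42?
}
```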
Is there a single document that explains the consistency aspects of the CUDA memory model in detail? Such a document would be really helpful!
I’m not sure there is a good single source for this, but you should take a look at Appendix B.5 of the CUDA C Programming Guide (4.0), which describes the memory fence functions. Those functions add to the default behavior, so reading through them may give you a better sense of the consistency issues.