I read a statement that, since CUDA does not support global read-after-write operations between multiprocessors, the values stored in the shared L2 cache and the individual L1 caches do not need to be kept consistent. I'm not quite clear on what this means. Can anyone help me a bit? Why doesn't CUDA support read-after-write operations? And why don't L1 and L2 need to be consistent?
I would replace “does not support” with “doesn’t protect you from the race conditions associated with” global read-after-write operations. Whether or not the caches are coherent, reading a memory location that another multiprocessor is modifying can easily create a race condition unless you use some kind of locking mechanism. Atomic operations have their own built-in locking, and they bypass the L1 cache to avoid the consistency problem. Other locking schemes are possible in CUDA, but they are discouraged because their performance cost can become quite high once you scale them to thousands of threads. Given that, I assume NVIDIA decided to skip the extra logic required to keep all the caches in sync, which would have its own overhead.
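To make the point about atomics concrete, here is a minimal sketch in which every thread on every multiprocessor updates the same global counter with atomicAdd(), so no explicit lock is needed. The kernel name count_threads and the host helper run_atomic_count are made up for this illustration.

```cuda
#include <cuda_runtime.h>

// Every thread in every block increments the same global counter.
// atomicAdd() makes each read-modify-write indivisible, so no update is lost
// even though the threads run on different multiprocessors.
__global__ void count_threads(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Hypothetical host helper (not part of the CUDA API): launches the kernel
// and returns the final counter value, or -1 if any CUDA call fails.
long long run_atomic_count(int blocks, int threads_per_block)
{
    unsigned int *d_counter = nullptr;
    unsigned int h_counter = 0;

    if (cudaMalloc(&d_counter, sizeof(unsigned int)) != cudaSuccess) return -1;

    cudaError_t err = cudaMemset(d_counter, 0, sizeof(unsigned int));
    if (err == cudaSuccess) {
        count_threads<<<blocks, threads_per_block>>>(d_counter);
        err = cudaGetLastError();  // catches launch-configuration errors
    }
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying.
        err = cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_counter);
    return (err == cudaSuccess) ? (long long)h_counter : -1;
}
```

Calling run_atomic_count(64, 128) from main() should return one increment per launched thread. If you replace the atomicAdd() with a plain `*counter += 1`, you get exactly the kind of race described above and the result will generally come up short.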
If you are doing your own locking with something like atomicCAS() (which bypasses the L1), then you need to make sure your global memory changes are visible to other multiprocessors before you release the lock. This is one of the few cases, I think, where the memory fence functions are actually useful. Between the write to global memory and the lock release, you need to call __threadfence(). Although the Programming Guide (section B.5) does not describe it in terms of cache behavior, in order to do what the documentation says, __threadfence() must in effect make sure that your earlier writes have reached the L2 level, where the other multiprocessors can see them, before execution continues.
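Here is a rough sketch of that pattern, assuming a simple spin lock stored in global memory (0 = free, 1 = held). The names add_under_lock and run_lock_demo are made up for this illustration. Only one thread per block takes the lock, because having several threads of the same warp spin on one lock can deadlock on older hardware. The volatile access and the fence right after acquiring the lock are my own conservative additions, not something the documentation spells out for this case.

```cuda
#include <cuda_runtime.h>

// One thread per block takes a global spin lock, updates a shared total,
// publishes the write with __threadfence(), and only then releases the lock.
__global__ void add_under_lock(int *total, int *lock, int value)
{
    if (threadIdx.x == 0) {
        // Acquire: spin until we change the lock from 0 (free) to 1 (held).
        while (atomicCAS(lock, 0, 1) != 0) { }
        __threadfence();  // assumption: conservative fence after acquiring

        // volatile keeps the compiler from holding the value in a register.
        volatile int *t = total;
        *t = *t + value;

        // Make the write visible device-wide BEFORE the lock is released.
        __threadfence();

        // Release the lock.
        atomicExch(lock, 0);
    }
}

// Hypothetical host helper (not part of the CUDA API): every block adds 1
// under the lock. Returns the final total, or -1 if any CUDA call fails.
int run_lock_demo(int blocks)
{
    int *d_buf = nullptr;  // d_buf[0] = total, d_buf[1] = lock
    int h_total = 0;

    if (cudaMalloc(&d_buf, 2 * sizeof(int)) != cudaSuccess) return -1;

    cudaError_t err = cudaMemset(d_buf, 0, 2 * sizeof(int));
    if (err == cudaSuccess) {
        add_under_lock<<<blocks, 32>>>(d_buf, d_buf + 1, 1);
        err = cudaGetLastError();  // catches launch-configuration errors
    }
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying.
        err = cudaMemcpy(&h_total, d_buf, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_buf);
    return (err == cudaSuccess) ? h_total : -1;
}
```

Calling run_lock_demo(64) from main() should return the number of blocks, since each block adds 1 while holding the lock. For a plain sum like this, atomicAdd() would be simpler and faster. A hand-rolled lock only makes sense when the protected update is too complex for a single atomic operation.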