How to keep L1 and L2 cache consistent

bit_mapper · October 26, 2011, 10:07pm

I read a statement that, since CUDA does not support global read-after-write operations between multiprocessors, the values stored in the shared L2 cache and the individual L1 caches do not need to be kept consistent. I’m not quite clear on what that means. Can anyone help me a bit? Why doesn’t CUDA support read-after-write operations? And why don’t L1 and L2 need to be consistent?

Thanks

seibert · October 27, 2011, 10:50am

I would replace “does not support” with “doesn’t protect you from the race conditions associated with” global read-after-write hazards. Regardless of whether you have coherent caches or not, reading a memory location that another multiprocessor is modifying can easily create race conditions without some kind of locking mechanism. Atomic operations have their own built-in locking, and bypass the L1 cache to avoid the consistency problem. Other locking schemes in CUDA are possible, but discouraged since they can easily have a pretty high performance cost when you scale them to thousands of threads. Given that, I assume NVIDIA decided to skip the extra logic required to keep all the caches in sync, which has its own overhead.
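To make the point about atomics concrete, here is a minimal sketch, not a definitive implementation. Every thread increments one global counter with atomicAdd(), so the read-modify-write is done as a single operation rather than as a separate load and store that could race across multiprocessors. The names count_atomic and run_atomic_count are made up for this illustration, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread adds 1 to a single global counter.
// atomicAdd() performs the read-modify-write as one indivisible operation,
// so no increments are lost even when many multiprocessors hit the same word.
__global__ void count_atomic(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Hypothetical host helper: launches the kernel and returns the final count.
unsigned int run_atomic_count(int blocks, int threads)
{
    unsigned int *d_counter = nullptr;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    count_atomic<<<blocks, threads>>>(d_counter);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    unsigned int h_counter = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(h_counter), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}

int main()
{
    // The result should equal blocks * threads.
    printf("count = %u\n", run_atomic_count(64, 128));
    return 0;
}
```

Replacing the atomicAdd() with a plain `*counter += 1;` would typically lose most of the increments, which is the race condition described above.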

If you are doing your own locking with something like atomicCAS() (which bypasses the L1), then you need to make sure your global memory changes are visible to other multiprocessors before you release the lock. This is one of the few cases, I think, where the memory fence functions are actually useful. Between the write to global memory and the lock release, you need to call __threadfence(). Although it is not described in terms of the cache behavior in the Programming Guide (section B.5), in order to do what the documentation says, __threadfence() must force a cache flush from the L1 to the L2 level.
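A minimal sketch of that pattern follows, under a few assumptions of my own. The names add_under_lock and run_lock_demo are hypothetical. Only thread 0 of each block takes the lock, since several threads of one warp spinning on the same lock can deadlock on older GPUs. The protected value is accessed through a volatile pointer so the read is not served from a stale L1 line. Error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per block updates *total under a spin lock.
// *lock is 0 when free and 1 when held.
__global__ void add_under_lock(volatile int *total, int *lock, int value)
{
    if (threadIdx.x == 0) {
        // Acquire: atomicCAS() returns the old value, so 0 means we got the lock.
        while (atomicCAS(lock, 0, 1) != 0) {
            // spin until the lock is free
        }

        // Critical section: plain (non-atomic) read-modify-write.
        // volatile keeps the load from being satisfied by a cached copy.
        *total = *total + value;

        // Make the write above visible to other multiprocessors
        // before the lock is seen as released.
        __threadfence();

        // Release the lock.
        atomicExch(lock, 0);
    }
}

// Hypothetical host helper: runs the kernel and returns the final total.
int run_lock_demo(int blocks)
{
    int *d_total = nullptr;
    int *d_lock = nullptr;
    cudaMalloc(&d_total, sizeof(int));
    cudaMalloc(&d_lock, sizeof(int));
    cudaMemset(d_total, 0, sizeof(int));
    cudaMemset(d_lock, 0, sizeof(int));

    add_under_lock<<<blocks, 32>>>(d_total, d_lock, 1);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    int h_total = -1;
    cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_total);
    cudaFree(d_lock);
    return h_total;
}

int main()
{
    // With the fence in place, the total should equal the number of blocks.
    printf("total = %d\n", run_lock_demo(64));
    return 0;
}
```

For a simple sum like this, atomicAdd() alone would be the better choice; the lock is only there to show where the __threadfence() call goes relative to the write and the release.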
