Fermi L1 Cache coherent?

Radfahrer · May 18, 2010, 12:34pm

Hello everybody,
I have some code in which many threads access the same location in global memory. On Fermi cards I would benefit from the L1 cache, provided I don’t run into false sharing caused by cache coherence. Does anyone know in detail how the cache is implemented on GPUs?

ColinS · May 18, 2010, 9:37pm

I believe the L1 cache system is not coherent across different SMs. Only the L2 cache is coherent across the entire chip. There may be an easy way to explicitly force a specific read or write to bypass L1, but I do not know how. However, I do know that it is possible to disable L1 cache completely at compile time. Depending on your program, you may suffer a performance hit, but you’ll have a fully coherent cache if you choose to go that route.
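For reference, a minimal sketch of both options mentioned above, assuming a 64-bit build and a toolkit that still targets Fermi. The compile-time switch is the ptxas option `-dlcm=cg`, which makes global loads cache in L2 only. A single load can bypass L1 with the PTX `.cg` cache operator via inline assembly. The names `load_bypass_l1`, `copy_kernel`, and `copy_bypassing_l1` are made up for this illustration.

```cuda
// Compile-time variant (all global loads skip L1):
//   nvcc -arch=sm_20 -Xptxas -dlcm=cg example.cu
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: load one int with the .cg cache operator
// (cache in L2 only, bypass L1). The "l" constraint assumes 64-bit pointers.
__device__ __forceinline__ int load_bypass_l1(const int *addr)
{
    int value;
    asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(value) : "l"(addr));
    return value;
}

// Copies in[] to out[], reading every element through the L1-bypassing load.
__global__ void copy_kernel(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = load_bypass_l1(in + i);
}

// Host driver: returns the number of mismatched elements, or -1 on a CUDA error.
int copy_bypassing_l1(int n)
{
    std::vector<int> h_in(n), h_out(n, -1);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    copy_kernel<<<blocks, threads>>>(d_in, d_out, n);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) ++mismatches;
    return mismatches;
}
```

The per-load helper keeps L1 available for the rest of the kernel, whereas the compiler flag trades L1 hits everywhere for uniform behaviour.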

MMB · May 18, 2010, 10:55pm

@ ColinS: I would be interested to know how you disable the L1 cache. If it’s in the manual I apologize - I haven’t taken delivery of my Fermi card yet, so haven’t read the manual.

Thanks

MMB

Radfahrer · May 19, 2010, 11:24am

Okay, so each SM has its own L1 cache that is not coherent across different SMs, but what happens if all threads within one SM access the same location in global memory? On CPUs I would get false sharing, which would cripple performance, so what happens on Fermi GPUs?

seibert · May 19, 2010, 3:11pm

With the exception of the various hacks that add new (and unsupported) concurrency primitives, if you have different threads reading and writing to the same global memory location in the same kernel, then you will have undefined results in general anyway. Atomic operations on global memory have to bypass the L1 cache to have correct operation, so those should be fine.

(Edit: Of course, the case where many threads read, and only read, the same global memory locations is fine, regardless of the L1 cache coherence issue.)

This is another case where CUDA discouraging the use of complex synchronization allows them to make the hardware much simpler. No need for L1 cache snooping.
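To illustrate the point about atomics, here is a small sketch in which every thread updates the same global counter. With `atomicAdd` the final value is well defined; replacing it with a plain `*counter += 1` would be exactly the read/write race described above. The kernel and function names are hypothetical.

```cuda
#include <cuda_runtime.h>

// Every thread increments the same global memory location atomically.
__global__ void count_threads(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Host driver: returns the final counter value (0 if allocation fails).
unsigned int count_with_atomics(int blocks, int threads_per_block)
{
    unsigned int *d_counter = nullptr;
    unsigned int h_counter = 0;
    if (cudaMalloc(&d_counter, sizeof(unsigned int)) != cudaSuccess) return 0;
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    count_threads<<<blocks, threads_per_block>>>(d_counter);

    // cudaMemcpy waits for the kernel to finish before copying the result.
    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}
```

Called as `count_with_atomics(64, 256)`, the counter should end up equal to the total number of threads launched.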

Radfahrer · May 20, 2010, 6:32am

OK, thanks, that helped.

The threads only read from the same position in global memory, so I should be fine and the L1 cache should help a lot.
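A rough sketch of that read-only pattern, assuming the shared value is a single `int` in global memory: every thread reads the same `factor` and writes only its own output element, so no coherence question arises. The call to `cudaFuncSetCacheConfig` with `cudaFuncCachePreferL1` asks for the larger L1 split on Fermi; whether it helps in practice depends on the kernel. `scale_all` and `run_scale_all` are illustrative names only.

```cuda
#include <vector>
#include <cuda_runtime.h>

// All threads read the same global location (*factor); nobody writes to it.
__global__ void scale_all(const int *factor, const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * (*factor);
}

// Host driver: returns the number of wrong results, or -1 on a CUDA error.
int run_scale_all(int n, int factor)
{
    std::vector<int> h_in(n), h_out(n, 0);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_factor = nullptr, *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&d_factor, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) { cudaFree(d_factor); return -1; }
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_factor); cudaFree(d_in); return -1;
    }
    cudaMemcpy(d_factor, &factor, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // Prefer a larger L1 over shared memory for this kernel (a hint only).
    cudaFuncSetCacheConfig(scale_all, cudaFuncCachePreferL1);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_all<<<blocks, threads>>>(d_factor, d_in, d_out, n);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_factor);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i] * factor) ++wrong;
    return wrong;
}
```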
