Fermi L1 Cache coherent?

Hello everybody,
I have some code where a lot of threads access the same location in global memory. On Fermi cards I would benefit from the L1 cache, provided I don't run into false sharing due to cache-coherence issues. Does anyone know in detail how the cache is implemented on GPUs?

I believe the L1 cache system is not coherent across different SMs. Only the L2 cache is coherent across the entire chip. There may be an easy way to explicitly force a specific read or write to bypass L1, but I do not know how. However, I do know that it is possible to disable the L1 cache completely at compile time. Depending on your program, you may suffer a performance hit, but you'll be left with the fully coherent L2 if you choose to go that route.

@ColinS: I would be interested to know how you disable the L1 cache. If it's in the manual I apologize - I haven't taken delivery of my Fermi card yet, so I haven't read the manual.
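In case it helps while you wait for the card: the compile-time switch is a ptxas option passed through nvcc. A sketch, assuming a Fermi-era CUDA toolchain (the file name is just a placeholder):

```shell
# Compile with global loads cached at L2 only, bypassing L1.
# -dlcm = "default load cache modifier"; cg = cache global (L2 only)
nvcc -Xptxas -dlcm=cg -o mykernel mykernel.cu

# The default on Fermi is -dlcm=ca (cache at both L1 and L2):
nvcc -Xptxas -dlcm=ca -o mykernel mykernel.cu
```

Note this only changes how global *loads* are cached; on Fermi, global stores don't stay resident in L1 anyway.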



Okay, so each SM has its own L1 cache that is not coherent across SMs, but what happens if all threads within one SM access the same location in global memory? On CPUs I would get false sharing, which would cripple performance, so what happens on Fermi GPUs?

Setting aside the various hacks that add new (and unsupported) concurrency primitives: if you have different threads reading and writing the same global memory location in the same kernel, you will get undefined results in general anyway. Atomic operations on global memory have to bypass the L1 cache to operate correctly, so those should be fine.
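To illustrate the atomics point, here is a minimal sketch (kernel and variable names are made up) of many threads safely updating a single global counter. The atomic operation is resolved at the L2/memory level, so the non-coherent per-SM L1 caches never get involved:

```cuda
__global__ void count_kernel(int *counter)
{
    // atomicAdd on global memory serializes the updates below L1,
    // so every thread's increment is applied exactly once.
    atomicAdd(counter, 1);
}
```

Launched as e.g. `count_kernel<<<blocks, threads>>>(d_counter)` with `*d_counter` zeroed beforehand, the final value will equal the total number of threads launched.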

(Edit: Of course, the case where many threads read, and only read, the same global memory locations is fine, regardless of the L1 cache coherence issue.)
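For concreteness, a read-only broadcast pattern like the one the original poster describes might look like this (names are illustrative):

```cuda
__global__ void broadcast_read(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[0] * 2.0f;  // every thread reads the same location;
                                 // read-only, so coherence is a non-issue
}
```

Since no thread writes to `src`, the per-SM L1 copies can never become stale, and the repeated reads of `src[0]` should mostly be served from L1.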

This is another case where CUDA's discouraging of complex synchronization allows NVIDIA to make the hardware much simpler: no need for L1 cache snooping.

Ok thanks that helped.

The threads only read from the same position in global memory, so I should be fine and the L1 cache should help a lot.