In Fermi, it seems that the L1 cache and shared memory share the same on-chip storage and that the split is configurable, i.e. given 64K in total, the programmer can specify either 16K as shared memory with the remaining 48K as L1, or 48K as shared memory with the remaining 16K as L1.
My question is:
What is the difference between L1 and shared memory? It sounds like we can manage shared memory but not the L1 cache, but it makes no difference if we just use the whole 64K as cache rather than shared memory. Is that true?
Can we dedicate the whole 64K to L1 only, or to shared memory only?
The main difference between shared memory and the L1 is that the contents of shared memory are managed by your code explicitly, whereas the L1 cache is automatically managed. Shared memory is also a better way to exchange data between threads in a block with predictable timing. My rule of thumb is: unpredictable reads and writes => prefer L1.
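To illustrate what "managed by your code explicitly" means, here is a minimal sketch, not a tuned implementation. The kernel name reverseInBlock, the host helper reverseBlocks and the block size of 256 are all made up for this example, and error checking is left out for brevity. Each thread stages one element in shared memory, the block synchronizes, and then each thread reads an element that a different thread wrote.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 256  // hypothetical block size chosen for this sketch

// Reverses each BLOCK-sized chunk of the input.
// The code decides what goes into shared memory and when it is visible.
__global__ void reverseInBlock(const int *in, int *out)
{
    __shared__ int tile[BLOCK];           // explicitly managed on-chip storage
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK;

    tile[t] = in[base + t];               // each thread stages one element
    __syncthreads();                      // wait until the whole tile is written
    out[base + t] = tile[BLOCK - 1 - t];  // read an element written by another thread
}

// Host helper. Assumes h_in.size() is a multiple of BLOCK; no error checking.
std::vector<int> reverseBlocks(const std::vector<int> &h_in)
{
    size_t bytes = h_in.size() * sizeof(int);
    std::vector<int> h_out(h_in.size());
    int *d_in = nullptr, *d_out = nullptr;

    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    unsigned int blocks = static_cast<unsigned int>(h_in.size() / BLOCK);
    reverseInBlock<<<blocks, BLOCK>>>(d_in, d_out);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With L1 there is nothing comparable to write: an ordinary global load such as out[i] = in[idx[i]] may or may not hit in the cache, and the code has no direct say in what stays resident.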
There is no setting to configure all 64 kB for shared memory or L1 cache.
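For reference, the split is requested per kernel through the CUDA runtime call cudaFuncSetCacheConfig. The sketch below assumes a placeholder kernel called myKernel, and the two wrapper functions are made up for illustration. On Fermi, cudaFuncCachePreferShared asks for 48 kB shared / 16 kB L1 and cudaFuncCachePreferL1 asks for 16 kB shared / 48 kB L1. Note that this is a preference: the runtime may pick a different configuration if the kernel needs more shared memory than the requested split provides.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; any __global__ function works here.
__global__ void myKernel(float *data)
{
    data[threadIdx.x] *= 2.0f;
}

// Ask for the larger shared memory split (48 kB shared / 16 kB L1 on Fermi).
cudaError_t preferSharedForMyKernel()
{
    return cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
}

// Ask for the larger L1 split (16 kB shared / 48 kB L1 on Fermi).
cudaError_t preferL1ForMyKernel()
{
    return cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
}
```

Call one of these before launching the kernel. There is also cudaDeviceSetCacheConfig, which sets a device-wide preference instead of a per-kernel one.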
So you mean that if the reads and writes follow a regular pattern, we can use shared memory, and if not, we can use L1. But there is actually no performance difference between the two cases because they are almost identical. Is that correct, especially when the kernel is doing random reads and writes?
Are “16K L1 + 48K shared” and “48K L1 + 16K shared” the only two ways to divide this on-chip memory? Is no other configuration allowed?
There might be some additional latency when using L1 due to the need to translate the global address you are accessing to a cache location. However, I haven’t run any microbenchmarks to check that.
As to why the 64 kB can’t be assigned to all shared memory or all L1, I’m also not sure. I was genuinely surprised that NVIDIA gave us the option to configure the L1/shared memory ratio at all, given that every feature adds complexity and engineering time.