L1 and L2 Cache Architecture on Kepler

I would like to know how the L1 and L2 caches work in the Kepler architecture. I searched for articles but did not find much information.

L1 cache is on-chip (on multiprocessor) and L2 cache is off-chip.
L1 cache on Kepler is normally used for local memory only (register spills, dynamically indexed local arrays).
L1 and shared memory share the same 64 KB of memory, which can be configured as 16, 32, or 48 KB for each (the two always add up to 64 KB).
L1 cache serves a (local) memory load request at 128-byte granularity (the cache line is 128 bytes).
L2 cache provides up to 1.5 MiB (e.g. on GK110) and serves a memory load request at 32-byte granularity, but a load transaction can consist of one, two or four 32-byte segments.
Stores are not cached, but also work at 32-byte granularity.
That is the reason why you should keep memory accesses coalesced within a warp; otherwise a request gets split into more than one transaction.
These caches are good for memory accesses with spatial locality, but not for temporal locality (due to the massive parallelism, caches cannot keep the data for “long”).
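To make the coalescing argument above concrete, here is a minimal host-side C++ sketch that counts how many 32-byte segments or 128-byte lines a warp touches for a given access stride. It is only an arithmetic model under my own assumptions (32 threads per warp, one word per thread, a hypothetical helper called blocks_touched), not a description of what the hardware does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper (not part of CUDA): number of distinct blocks of
// `granularity` bytes touched when the 32 threads of a warp each load one
// word of `elem_bytes` bytes at address base + tid * stride_elems * elem_bytes.
std::size_t blocks_touched(std::uint64_t base, std::size_t stride_elems,
                           std::size_t elem_bytes, std::size_t granularity) {
    assert(granularity > 0 && elem_bytes > 0);
    std::set<std::uint64_t> blocks;
    for (std::uint64_t tid = 0; tid < 32; ++tid) {
        // First and last byte read by this thread.
        std::uint64_t first = base + tid * stride_elems * elem_bytes;
        std::uint64_t last = first + elem_bytes - 1;
        // Record every block index that the word overlaps.
        for (std::uint64_t b = first / granularity; b <= last / granularity; ++b) {
            blocks.insert(b);
        }
    }
    return blocks.size();
}
```

With 4-byte floats and base address 0, blocks_touched(0, 1, 4, 32) gives 4 segments (one four-segment transaction), while a stride of 32 floats, blocks_touched(0, 32, 4, 32), gives 32 segments, i.e. every thread needs its own segment.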

On some Kepler GPUs you can enable L1 caching for global loads (>= K40?) with the following nvcc option:

-Xptxas -dlcm=ca

It might help when you reuse 128-byte cache lines a lot, but in that case you will likely get even better performance by using shared memory and managing the data reuse yourself.

In other words: if a warp requests a piece of memory, say 32 coalesced floats = 128 bytes, it looks directly in the L2 cache when L1 caching is disabled.
If there is a hit, the 128 bytes are served to the warp in one (four-segment) transaction. Otherwise, on an L2 miss, the request is serviced by device memory as a 128-byte transaction. If the requested memory block is not aligned to 128 bytes, two transactions are needed.

The requested data words need to be aligned to at least 32-byte boundaries (the first address is a multiple of 32 bytes), otherwise the memory request is split into additional transactions again. For cached loads (i.e. with L1 cache enabled for global loads), the load requests have to be aligned to 128-byte boundaries.
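The alignment rules above can be checked with a bit of integer arithmetic. The following host-side C++ sketch counts how many blocks of a given granularity a contiguous request overlaps; the function name transactions_for_range is my own, and it only models the counting, assuming a warp request is one contiguous byte range.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of CUDA): number of blocks of `granularity`
// bytes overlapped by the contiguous byte range [addr, addr + bytes).
// Use granularity = 32 for the uncached (L2 segment) case and
// granularity = 128 for the cached (L1 cache line) case.
std::uint64_t transactions_for_range(std::uint64_t addr, std::uint64_t bytes,
                                     std::uint64_t granularity) {
    assert(bytes > 0 && granularity > 0);
    std::uint64_t first_block = addr / granularity;
    std::uint64_t last_block = (addr + bytes - 1) / granularity;
    return last_block - first_block + 1;
}
```

For a 128-byte warp request, transactions_for_range(0, 128, 128) is 1, but shifting the start by one float, transactions_for_range(4, 128, 128), gives 2 cache lines and transactions_for_range(4, 128, 32) gives 5 segments instead of 4. A start address of 32 is fine for the 32-byte case (4 segments) but still costs 2 lines in the 128-byte case.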

In my experience, caches will not be reset between kernel launches as long as there are no other GPU tasks in the background.

I have not checked, but I guess you will find some explanations in the Kepler Tuning Guide as well.
If you want a book that covers such questions, you can have a look at “Professional CUDA C Programming”.
If you want to read about more low-level things, “Dissecting GPU Memory Hierarchy through Microbenchmarking” (source) might help.
Or try to describe what specific problem brought you here.

Hi tdd11235813,
You are misleading the community!

“L1 cache is on-chip (on multiprocessor) and L2 cache is off-chip.”

It is totally wrong! L2 cache is on-chip!! Please get a clear definition of on-chip and off-chip!!

And the results in the paper are totally wrong!