Cache L1 and L2 Architecture Kepler

danielavellar15 · February 5, 2017, 12:41am

I would like to know how the L1 and L2 caches work in the Kepler architecture. I searched for articles but did not find much information.

tdd11235813 · February 6, 2017, 10:04pm

L1 cache is on-chip (on multiprocessor) and L2 cache is off-chip.
L1 cache on Kepler is normally used for local memory only (register spills, dynamically indexed local arrays).
L1 and shared memory share the same 64 KB of on-chip memory, which can be split as 16/48, 32/32 or 48/16 KB.
L1 cache serves a (local) memory load request at 128-byte granularity (the cache line is 128 bytes).
L2 cache provides 1.5 MiB (on GK110) and serves a memory load request at 32-byte granularity; a load transaction can consist of one, two or four 32-byte segments.
Stores are not cached, but they also work at 32-byte granularity.
That is the reason why you should keep memory accesses coalesced within a warp; otherwise a request is split into more than one transaction.
These caches are good for spatially coherent memory accesses, but not for temporal coherence (due to the massive parallelism, the caches cannot keep the data for “long”).
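
To make the coalescing point concrete, here is a small sketch that counts how many 32-byte segments a single warp-wide load touches. It is only a simplified model of the rule described above, not a description of what the hardware does internally, and the function name `segments_touched` and its parameters are made up for illustration.

```python
SEGMENT_BYTES = 32  # L2 segment size as stated in the post
WARP_SIZE = 32      # threads per warp


def segments_touched(base_addr, stride_bytes, elem_bytes=4,
                     warp_size=WARP_SIZE, segment_bytes=SEGMENT_BYTES):
    """Count the distinct segments touched when each lane of a warp
    loads `elem_bytes` bytes at base_addr + lane * stride_bytes."""
    segments = set()
    for lane in range(warp_size):
        first_byte = base_addr + lane * stride_bytes
        last_byte = first_byte + elem_bytes - 1
        # An element can straddle a segment boundary, so add every segment it covers.
        segments.update(range(first_byte // segment_bytes,
                              last_byte // segment_bytes + 1))
    return len(segments)


print(segments_touched(0, 4))    # coalesced floats: 4 segments (128 bytes)
print(segments_touched(0, 128))  # one float every 128 bytes: 32 segments
print(segments_touched(4, 4))    # coalesced but shifted by 4 bytes: 5 segments
```

In this model a coalesced warp needs four segments, while a stride of 128 bytes needs 32 segments to deliver the same 32 floats, which is why strided access patterns waste so much bandwidth.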

On some Kepler GPUs (K40 and newer, I think) you can enable L1 caching of global loads with the following nvcc option:

-Xptxas -dlcm=ca

It might help when you reuse 128-byte cache lines a lot, but in that case you are likely to get even better performance by using shared memory and managing the data reuse yourself.

In other words, if a warp requests a piece of memory, say 32 coalesced floats = 128 bytes, the request goes directly to the L2 cache when L1 caching is disabled.
On a hit, the 128 bytes are served to the warp in one (four-segment) transaction. On an L2 miss, the request is serviced by device memory as a 128-byte transaction. If the requested memory block is not aligned to 128 bytes, two transactions are needed.

The requested data needs to be aligned to at least a 32-byte boundary (the first address is a multiple of 32 bytes), otherwise the request is again split into additional transactions. For cached loads (i.e. with L1 caching enabled for global loads) the load requests have to be aligned to 128-byte boundaries.
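
The alignment rules above come down to a little arithmetic, sketched below under the assumption that a request is served in whole aligned blocks of either 128 bytes (L1 cache lines) or 32 bytes (L2 segments). The helper name `blocks_needed` is hypothetical and the model ignores everything except alignment.

```python
def blocks_needed(base_addr, nbytes, granularity):
    """Number of aligned blocks of size `granularity` that cover
    the byte range [base_addr, base_addr + nbytes)."""
    first_block = base_addr // granularity
    last_block = (base_addr + nbytes - 1) // granularity
    return last_block - first_block + 1


REQUEST_BYTES = 32 * 4  # one warp loading 32 consecutive floats

for base in (0, 32, 4):
    lines = blocks_needed(base, REQUEST_BYTES, 128)    # cached in L1
    segments = blocks_needed(base, REQUEST_BYTES, 32)  # L2 only
    print(f"base={base:3d}: {lines} x 128-byte lines, {segments} x 32-byte segments")
```

With a base address of 0 this gives one 128-byte line or four 32-byte segments. With a base of 32 the cached path needs two lines (256 bytes moved for 128 bytes used) while the uncached path still needs only four segments. With a base of 4 the request is aligned to neither boundary and needs two lines or five segments.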

In my experience, caches will not be reset between kernel launches as long as there are no other GPU tasks in the background.

I have not checked, but I guess you will find some explanations in the Kepler Tuning Guide as well.
If you want a book that covers such questions, have a look at “Professional CUDA C Programming”.
If you want to read about more low-level details, Dissecting GPU Memory Hierarchy through Microbenchmarking (source) might help.
Or try to describe what specific problem brought you here.

gpu_l2_cache · December 30, 2019, 3:25am

Hi tdd11235813,
You are misleading the community!

“L1 cache is on-chip (on multiprocessor) and L2 cache is off-chip.”

It is totally wrong! L2 cache is on-chip!! Please get a clear definition of on-chip and off-chip!!

And the results in the paper [1509.02308] Dissecting GPU Memory Hierarchy through Microbenchmarking are totally wrong!
