Cache line size of L1 and L2

I read a sentence in the programming guide regarding cache line size and behavior, but I'm still confused about the statement below:

Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.

Does it mean the L1 cache line size is 128 bytes, while the L2 cache line size is only 32 bytes? And what does over-fetch mean?

Yes, I believe so.

That refers to fetching more data than required when a warp reads non-contiguous memory locations. If a warp does a scattered read of floats, with the accessed addresses far apart from each other, the memory controller will need to read more data than required, because each float brings in an entire cache line. For reads going through the L1, over-fetch results in up to 32x more data being read than required (a 128-byte line fetched per 4-byte float); for reads going only through the L2, that number drops to 8x (a 32-byte transaction per float).
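
A minimal sketch of the two access patterns (the kernel and array names are just for illustration):

```cpp
// Coalesced: thread i reads element i, so each warp touches one
// contiguous 128-byte span and needs a single 128-byte transaction.
__global__ void coalescedRead(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Scattered: thread i reads through an index table, so each of the 32
// lanes can land on a different cache line. Through L1 that means
// 32 x 128 bytes fetched for 32 x 4 bytes used (32x over-fetch);
// through L2 only, 32 x 32 bytes for the same 128 useful bytes (8x).
__global__ void scatteredRead(const float* in, const int* idx,
                              float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}
```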

I see! Thank you seibert. But I have two more points of confusion.

  1. Since L2 is shared by all the multiprocessors, say N of them, L2 will be updated frequently by those N multiprocessors, in units of 32 bytes. Is this correct? I’m also wondering how many accesses to L2 can be serviced at the same time.

To be more specific, if all threads of the kernel access global memory in a scattered rather than a coalesced way, then for any warp there will be 32 memory transactions instead of 1 (the profiler note after this list shows one way to confirm the count). My concern is whether these 32 memory transactions happen sequentially or partly in parallel. And with such random access, L1 and especially L2 seem to have no benefit for efficiency.

  2. The programming guide says that a global memory access takes 400-500 cycles, but it doesn’t break those cycles down into L1 cycles and L2 cycles. I’d be interested to know how L1 latency compares to L2 latency, given that they have different cache line sizes.
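
(One way to actually see the transaction count is the profiler, e.g. `nvprof --metrics gld_transactions_per_request ./app`, which reports the average number of load transactions generated per warp-level load request and should approach 32 for a fully scattered pattern. That metric name is from newer CUDA toolkits’ nvprof and may differ by version.)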

I haven’t seen any benchmarks of L2 transaction throughput. A small benchmark might be required here.
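
A minimal sketch of what such a benchmark could look like, timing the scattered read with CUDA events (sizes and the random index pattern are assumptions; the kernel from the earlier sketch is repeated so the file stands alone, and error checking is omitted for brevity):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Scattered-read kernel, repeated from the sketch above.
__global__ void scatteredRead(const float* in, const int* idx,
                              float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}

int main()
{
    const int n = 1 << 24;                     // 16M floats (64 MB)
    float *in, *out;
    int *idx;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMalloc(&idx, n * sizeof(int));

    // Random indices generated on the host force scattered device reads.
    int* hidx = (int*)malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i)
        hidx[i] = rand() % n;
    cudaMemcpy(idx, hidx, n * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    scatteredRead<<<(n + 255) / 256, 256>>>(in, idx, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // Useful bytes: one index read, one float read, one float write per
    // element; over-fetch makes the actual DRAM traffic much larger.
    double bytes = 3.0 * n * sizeof(float);
    printf("effective bandwidth: %.1f GB/s\n", bytes / (ms * 1e6));
    return 0;
}
```

Comparing this number against the same kernel run with idx[i] = i gives a direct measure of how much the scattering costs.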

That time is almost certainly dominated by the memory controller (and memory) latency, and not the cache latency. Again, I haven’t seen any benchmarks of this.
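
If someone wants to measure it, the usual approach is a single-threaded pointer chase, where each load depends on the previous one so the full latency is exposed rather than hidden. A minimal sketch (array size and stride are assumptions to tune per architecture; clock64() needs compute capability 2.0 or later):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each load depends on the previous one, so the measured cycle count
// reflects the full per-load latency instead of pipelined throughput.
__global__ void chase(const unsigned int* next, int iters,
                      unsigned int* sink, long long* cycles)
{
    unsigned int p = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        p = next[p];
    long long stop = clock64();
    *sink = p;                         // keep the loop from being optimized away
    *cycles = (stop - start) / iters;  // average cycles per dependent load
}

int main()
{
    const int n = 1 << 22;   // 16 MB of links, larger than L2, so the
                             // chase mostly sees DRAM latency; shrink the
                             // array below the L2 size to isolate L2.
    const int stride = 32;   // 32 elements = 128 bytes: a new line per hop
    unsigned int* h = (unsigned int*)malloc(n * sizeof(unsigned int));
    for (int i = 0; i < n; ++i)
        h[i] = (i + stride) % n;

    unsigned int *d, *sink;
    long long* cyc;
    cudaMalloc(&d, n * sizeof(unsigned int));
    cudaMalloc(&sink, sizeof(unsigned int));
    cudaMalloc(&cyc, sizeof(long long));
    cudaMemcpy(d, h, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d, 100000, sink, cyc);

    long long result;
    cudaMemcpy(&result, cyc, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("~%lld cycles per dependent load\n", result);
    return 0;
}
```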