Cache line size of L1 and L2

I read a sentence in the programming guide regarding cache line size and behavior, but I'm still confused about the statement below:

Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.

Does it mean the L1 cache line size is 128 bytes, while the L2 cache line size is only 32 bytes? And what does over-fetch mean?

Yes, I believe so.

That refers to fetching more data than required when a warp reads non-contiguous memory locations. If a warp does a scattered read of floats, with the accessed addresses far apart from each other, the memory controller will need to read more data than required, because each float brings in an entire cache line. For reads going through the L1, over-fetch results in up to 32x more data being read than required (a 128-byte line fetched per 4-byte float); for reads going only through the L2, that number drops to 8x (a 32-byte transaction per float).
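
A minimal sketch of the two access patterns (the kernel and array names are just for illustration):

```cpp
// Coalesced: thread i reads element i, so each warp touches one
// contiguous 128-byte span and needs a single 128-byte transaction.
__global__ void coalescedRead(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Scattered: thread i reads through an index table, so each of the 32
// lanes can land on a different cache line. Through L1 that means
// 32 x 128 bytes fetched for 32 x 4 bytes used (32x over-fetch);
// through L2 only, 32 x 32 bytes for the same 128 useful bytes (8x).
__global__ void scatteredRead(const float* in, const int* idx,
                              float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}
```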

I see! Thank you seibert. But I have two more points of confusion.

  1. Since L2 is shared by all the multiprocessors, say N of them, L2 will be updated frequently by those N multiprocessors, in units of 32 bytes. Is this correct? I’m also wondering how many accesses to L2 can be serviced at the same time.

To be more specific, if all threads of the kernel access global memory in a scattered rather than a coalesced way, then for any warp there will be 32 memory transactions instead of 1 (the profiler note after this list shows one way to confirm the count). My concern is whether these 32 memory transactions happen sequentially or partly in parallel. And with such random access, L1 and especially L2 seem to have no benefit for efficiency.

  2. The programming guide says that a global memory access takes 400-500 cycles, but it doesn’t break those cycles down into L1 cycles and L2 cycles. I’d be interested to know how L1 latency compares to L2 latency, given that they have different cache line sizes.
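
(One way to actually see the transaction count is the profiler, e.g. `nvprof --metrics gld_transactions_per_request ./app`, which reports the average number of load transactions generated per warp-level load request and should approach 32 for a fully scattered pattern. That metric name is from newer CUDA toolkits’ nvprof and may differ by version.)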

I haven’t seen any benchmarks of L2 transaction throughput. A small benchmark might be required here.
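
A minimal sketch of what such a benchmark could look like, timing the scattered read with CUDA events (sizes and the random index pattern are assumptions; the kernel from the earlier sketch is repeated so the file stands alone, and error checking is omitted for brevity):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Scattered-read kernel, repeated from the sketch above.
__global__ void scatteredRead(const float* in, const int* idx,
                              float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}

int main()
{
    const int n = 1 << 24;                     // 16M floats (64 MB)
    float *in, *out;
    int *idx;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMalloc(&idx, n * sizeof(int));

    // Random indices generated on the host force scattered device reads.
    int* hidx = (int*)malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i)
        hidx[i] = rand() % n;
    cudaMemcpy(idx, hidx, n * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    scatteredRead<<<(n + 255) / 256, 256>>>(in, idx, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // Useful bytes: one index read, one float read, one float write per
    // element; over-fetch makes the actual DRAM traffic much larger.
    double bytes = 3.0 * n * sizeof(float);
    printf("effective bandwidth: %.1f GB/s\n", bytes / (ms * 1e6));
    return 0;
}
```

Comparing this number against the same kernel run with idx[i] = i gives a direct measure of how much the scattering costs.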

That time is almost certainly dominated by the memory controller (and memory) latency, and not the cache latency. Again, I haven’t seen any benchmarks of this.
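
If someone wants to measure it, the usual approach is a single-threaded pointer chase, where each load depends on the previous one so the full latency is exposed rather than hidden. A minimal sketch (array size and stride are assumptions to tune per architecture; clock64() needs compute capability 2.0 or later):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each load depends on the previous one, so the measured cycle count
// reflects the full per-load latency instead of pipelined throughput.
__global__ void chase(const unsigned int* next, int iters,
                      unsigned int* sink, long long* cycles)
{
    unsigned int p = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        p = next[p];
    long long stop = clock64();
    *sink = p;                         // keep the loop from being optimized away
    *cycles = (stop - start) / iters;  // average cycles per dependent load
}

int main()
{
    const int n = 1 << 22;   // 16 MB of links, larger than L2, so the
                             // chase mostly sees DRAM latency; shrink the
                             // array below the L2 size to isolate L2.
    const int stride = 32;   // 32 elements = 128 bytes: a new line per hop
    unsigned int* h = (unsigned int*)malloc(n * sizeof(unsigned int));
    for (int i = 0; i < n; ++i)
        h[i] = (i + stride) % n;

    unsigned int *d, *sink;
    long long* cyc;
    cudaMalloc(&d, n * sizeof(unsigned int));
    cudaMalloc(&sink, sizeof(unsigned int));
    cudaMalloc(&cyc, sizeof(long long));
    cudaMemcpy(d, h, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d, 100000, sink, cyc);

    long long result;
    cudaMemcpy(&result, cyc, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("~%lld cycles per dependent load\n", result);
    return 0;
}
```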