Continuing the discussion from Understanding L1/TEX Cache Sectors/Req:
I learned from Greg here that
The L1 tag stage can resolve 4 sets x 4 sectors per cycle.
My understanding is that:
- When a warp issues a load instruction that reads memory locations that are all within 1 cache line (e.g., read size = 4 bytes, stride = 4 bytes), the L1 tag stage can resolve it in one cycle.
- When a warp issues a load instruction that reads memory locations that are all within 4 cache lines (e.g., read size = 4 bytes, stride = 16 bytes), the L1 tag stage can still resolve it in one cycle.
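To double-check the footprints in the two cases above, here is a minimal host-side sketch (plain Python, not GPU code). It assumes a 128-byte cache line split into four 32-byte sectors and a base address aligned to a cache line; `warp_footprint` is a hypothetical helper name used only for illustration.

```python
LINE_BYTES = 128    # assumption: L1 cache line size
SECTOR_BYTES = 32   # assumption: sector size, i.e. 4 sectors per line

def warp_footprint(access_bytes, stride_bytes, active_threads=32, base=0):
    """Return (distinct cache lines, distinct sectors) touched by one warp-wide load."""
    lines, sectors = set(), set()
    for t in range(active_threads):
        first = base + t * stride_bytes          # first byte read by thread t
        last = first + access_bytes - 1          # last byte read by thread t
        for s in range(first // SECTOR_BYTES, last // SECTOR_BYTES + 1):
            sectors.add(s)                               # global sector index
            lines.add(s * SECTOR_BYTES // LINE_BYTES)    # line containing that sector
    return len(lines), len(sectors)

print(warp_footprint(4, 4))    # (1, 4): one line, all four of its sectors
print(warp_footprint(4, 16))   # (4, 16): four lines x four sectors each
```

Under these assumptions, the stride-16 case lands exactly on 4 lines x 4 sectors, which is why I expect it to still fit in one cycle.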
However, suppose the warp accesses 16 different cache lines, e.g.:
- thread 0 accesses cache line 0, sector 0
- thread 1 accesses cache line 1, sector 1
- …
- thread 15 accesses cache line 15, sector 15
- threads 16-31 are predicated off
In general, my understanding is that these 16 cache lines would require at most 4 cycles to resolve (each cycle resolves 4 cache lines). But what happens if the 16 cache lines coincidentally map to only 4 cache sets, given that, as I understand it, NVIDIA GPUs use a 4-way set-associative L1 cache? Could these 16 cache-line reads then fit into one cycle, since they are still within the 4 cache set x 4 sector limit?
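To make the two readings of the limit concrete, here is a small sketch of the arithmetic I have in mind. Everything in it is an assumption for illustration, not documented hardware behavior: the function names are hypothetical, `num_sets` is a made-up parameter, and the set index is modeled as a simple modulo of the cache line index.

```python
import math

PER_CYCLE = 4  # the "4" from the quoted "4 sets x 4 sectors per cycle"

def cycles_if_limit_is_lines(line_ids):
    """Reading A: at most 4 distinct cache lines are resolved per cycle."""
    return math.ceil(len(set(line_ids)) / PER_CYCLE)

def cycles_if_limit_is_sets(line_ids, num_sets):
    """Reading B: at most 4 distinct sets per cycle, regardless of lines per set."""
    # Hypothetical set-index function: line index modulo the number of sets.
    sets = {line % num_sets for line in line_ids}
    return math.ceil(len(sets) / PER_CYCLE)

lines = range(16)  # thread i touches cache line i, threads 16-31 predicated off
print(cycles_if_limit_is_lines(lines))      # 4
print(cycles_if_limit_is_sets(lines, 4))    # 1, if the 16 lines fall into 4 sets
```

Reading A gives 4 cycles no matter how the lines map to sets, while reading B gives 1 cycle in the coincidental case. Which of the two matches the hardware is exactly what I am asking.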
I know in practice the probability is negligible, but I am asking this to make sure my understanding of the “4 cache set x 4 sector” limit is accurate. Thanks!