Question about L1 tag stage resolution per cycle

Continuing the discussion from Understanding L1/TEX Cache Sectors/Req:

I learned from Greg here that

The L1 tag stage can resolve 4 sets x 4 sectors per cycle.

My understanding is that:

  • When a warp issues a load instruction whose addresses all fall within 1 cache line (e.g., read size = 4 bytes, stride = 4 bytes), the L1 tag stage can resolve it in one cycle.
  • When a warp issues a load instruction whose addresses all fall within 4 cache lines (e.g., read size = 4 bytes, stride = 16 bytes), the L1 tag stage can still resolve it in one cycle.
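
The two cases above can be checked with a little address arithmetic. This is only a sketch: it assumes 128-byte cache lines split into four 32-byte sectors, a 32-thread warp, and a base address aligned to a cache line. The helper name `lines_and_sectors` is made up for illustration and is not part of any NVIDIA tool.

```python
from collections import defaultdict

LINE_BYTES = 128    # assumed L1 cache line size
SECTOR_BYTES = 32   # assumed sector size (4 sectors per line)
WARP_SIZE = 32      # threads per warp


def lines_and_sectors(base, stride, size=4, active=WARP_SIZE):
    """Map each touched cache line index to the set of sectors touched in it."""
    touched = defaultdict(set)
    for tid in range(active):
        first = base + tid * stride      # first byte read by this thread
        last = first + size - 1          # last byte read by this thread
        # An access may straddle a sector boundary, so walk every sector it covers.
        for sector_index in range(first // SECTOR_BYTES, last // SECTOR_BYTES + 1):
            byte = sector_index * SECTOR_BYTES
            line = byte // LINE_BYTES
            sector_in_line = (byte % LINE_BYTES) // SECTOR_BYTES
            touched[line].add(sector_in_line)
    return dict(touched)


dense = lines_and_sectors(base=0, stride=4)     # 32 threads x 4 B  = 128 B
strided = lines_and_sectors(base=0, stride=16)  # 32 threads x 16 B = 512 B
print(len(dense), len(strided))  # number of distinct cache lines in each case
```

Under these assumptions the first pattern touches 1 line and the second touches 4 lines with all 4 sectors in each, so both stay within the 4 lines x 4 sectors limit. A misaligned base address would spill into one more line.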

However, suppose the warp accesses 16 different cache lines, e.g.:

  • thread 0 accesses cache line 0, sector 0
  • thread 1 accesses cache line 1, sector 1
  • …
  • thread 15 accesses cache line 15, sector 15
  • threads 16-31 are predicated off

In general, my understanding is that these 16 cache lines would require at most 4 cycles to resolve (each cycle resolves 4 cache lines). But what happens if the 16 cache lines happen to map to only 4 cache sets, given that NVIDIA GPUs use a 4-way set-associative L1 cache? Could these 16 cache line reads fit into one cycle, since they would still be within the “4 cache set x 4 sector” limit?

I know in practice the probability is negligible, but I am asking this to make sure my understanding of the “4 cache set x 4 sector” limit is accurate. Thanks!

No. The t-stage can resolve at most 4 cache lines and 4 sectors per cache line. Sector misses are sent to the miss stage on a line/sector basis; hits go to the data stage. The d-stage can access only 1 cache line at a time, so there can be additional serialization at the output of the t-stage.
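
Taken literally, the limits in this reply give a simple lower bound for the 16-line case in the question. The sketch below is only a counting model, not a description of the hardware: it assumes the t-stage takes up to 4 distinct cache lines per pass regardless of which sets they map to, and that the d-stage handles one hit line per pass. The function names are hypothetical.

```python
import math

MAX_LINES_PER_TAG_PASS = 4  # from the reply: at most 4 cache lines per t-stage pass


def min_tag_passes(line_ids):
    """Lower bound on t-stage passes: distinct lines, 4 at a time."""
    return math.ceil(len(set(line_ids)) / MAX_LINES_PER_TAG_PASS)


def min_data_passes(hit_line_ids):
    """Lower bound on d-stage passes: one distinct hit line at a time."""
    return len(set(hit_line_ids))


# The question's pattern: 16 active threads, each touching a different line.
lines = list(range(16))
tag_passes = min_tag_passes(lines)
data_passes = min_data_passes(lines)  # assumes every line hits in L1
print(tag_passes, data_passes)
```

In this model the 16-line access needs at least 4 t-stage passes even if all 16 lines map to only 4 sets, because the limit counts cache lines rather than sets.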