Understanding L1/TEX Cache Sectors/Req

Based on Magnus Strengert’s discussion in his GTC video on Nsight Compute, I thought it was the case that each global load instruction from a warp generates a single request into the L1/TEX unit, and that the minimum number of sectors/req would be 4, since the load memory access granularity is a 128-byte cache line and sectors are 32 bytes.
However, in the Nsight Compute Memory Workload Analysis section (Memory Tables), I frequently see this type of info (see attached image)


which shows Sectors/Req = 1.52. Under what conditions/circumstances can Sectors/Req be < 4, and what is the explanation for how/when this could occur?
Thank you!

If all threads are predicated off then sectors/request = 0.
If 1 thread is active and predicated on then sectors/request = 1.
If 32 threads are active and predicated on and each thread accesses a unique 128 byte cache line then the sectors/request = 32.

The L1 and L2 cache line size is 128 bytes, composed of 4 x 32-byte sectors. This does not have an impact on the sectors/request. The L1 tag stage can resolve 4 sets x 4 sectors per cycle.
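To make those cases concrete, here is a minimal CUDA sketch (kernel names and the 4-byte float loads are my own illustrative assumptions; exact metric values also depend on address alignment and architecture):

```
// Illustrative warp access patterns for 4-byte global loads,
// assuming in/out are 128-byte aligned and sectors are 32 bytes.

__global__ void coalesced(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // 32 consecutive 4-byte loads cover 128 bytes = 4 sectors -> 4 sectors/req
    out[i] = in[i];
}

__global__ void sameAddress(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // every thread in the warp reads in[0]: one 32-byte sector -> 1 sector/req
    out[i] = in[0];
}

__global__ void strided(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // stride of 32 floats (128 bytes): each thread touches a unique
    // 128-byte cache line -> 32 sectors/req
    out[i] = in[i * 32];
}
```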


Thanks Greg! To check my understanding, let me throw out a few more cases and see if they are correct:

Assume
case 1: 8 threads out of 32 (in warp) are predicated on/active (are these equivalent/identical?)
then
case 1a: all 8 active threads request addresses within a single 32-byte sector (within a single cache line) = 1 sector/req
case 1b: all 8 active threads request addresses spread across 4 x 32-byte sectors (within a single cache line) = 4 sectors/req
case 2: 32 threads active, all threads request the same address = 1 sector/req?

Additional clarifications:
a. For a given single warp, the value of sectors/req will always be an integer between 0 and 32?
b. So, to end up with fractional sectors/req (as seen in NCU) requires averaging over multiple warps?
E.g. with only 2 warps, where warp 1 has 1 sec/req and warp 2 has 2 sec/req, we would then see 1.5 sec/req in NCU?
c. Per your comment “The L1 tag stage can resolve 4 sets x 4 sectors per cycle” - is this because of the 4 SP / SM, one set per SP? (BTW to be clear… what is a ‘set’ here?)

Thanks!

Yes, you have a correct understanding. There may be edge cases and slight differences per architecture based upon (a) maximum threads supported by LSU per wavefront (tag stage) and (b) vector data types.

A thread is active if the thread is marked active in the active mask. A thread may be marked exited if it executes EXIT (or was not launched) and it may be marked inactive on a divergent branch.

A thread’s predicated on/off status is per instruction and is defined by the instruction’s predicate guard. Some instructions support a second predicate. Older architectures (e.g. Kepler) support condition codes that can also result in the thread being disabled for the instruction.

At least 1 thread has to be active in a warp for an instruction to be scheduled. If all active threads are predicated off then the instruction has no side effect on the register file or memory; however, the instruction may still have to fully go through the pipeline so that the instruction can resolve dependencies in the correct order.
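As a hypothetical source-level illustration of predication, a short guarded load such as the one below is often compiled to a predicated instruction rather than a branch, although the compiler is free to choose either:

```
__global__ void guardedCopy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // For a tail warp where only some threads satisfy i < n, the load/store
    // may execute with the failing threads predicated off; only the
    // predicated-on threads contribute sectors to the request.
    if (i < n) {
        out[i] = in[i];
    }
}
```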

Additional clarifications:
a. For a given single warp, the value of sectors/req will always be an integer between 0 and 32?

For LSU instructions the range should be 0-32.

b. So, to end up with fractional sectors/req (as seen in NCU) requires averaging over multiple warps?

Fractional sectors/req values require averaging over multiple warp requests (instructions). The instructions can be issued from the same warp or from different warps.

E.g. with only 2 warps, where warp 1 has 1 sec/req and warp 2 has 2 sec/req, we would then see 1.5 sec/req in NCU?

Correct.
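Spelled out (assuming the reported ratio is total sectors divided by total requests): warp 1 contributes 1 request and 1 sector, warp 2 contributes 1 request and 2 sectors, so sectors/req = (1 + 2) / (1 + 1) = 1.5.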

c. Per your comment “The L1 tag stage can resolve 4 sets x 4 sectors per cycle” - is this because of the 4 SP / SM, one set per SP? (BTW to be clear… what is a ‘set’ here?)

No. A cache set refers to a group of cache lines within a cache where a specific memory address can potentially be stored. The ability to resolve 4 sets allows for improved handling of warps with address divergence.
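For intuition only, the usual set-associative mapping looks roughly like the sketch below; the set count and indexing function are hypothetical placeholders, not the actual L1TEX geometry:

```
// Generic set-associative index computation (hypothetical parameters).
__host__ __device__ unsigned setIndex(size_t addr, unsigned numSets) {
    const unsigned lineSize = 128;        // bytes per cache line
    return (addr / lineSize) % numSets;   // line number modulo number of sets
}
```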