Based on Magnus Strengert’s discussion in his GTC video on Nsight Compute, I thought that each global load instruction from a warp generates a single request into the L1/Tex unit, and that the minimum sectors/req would be 4, since the load access granularity is a 128-byte cache line and sectors are 32 bytes.
However, in the Nsight Compute Memory Workload Analysis Memory Tables, I frequently see this type of info (see attached image),
which shows Sectors/Req = 1.52. Under what conditions/circumstances can Sectors/Req be < 4, and what is the explanation for how/when this could occur?
Thank you!
If all threads are predicated off then sectors/request = 0.
If 1 thread is active and predicated on then sectors/request = 1.
If 32 threads are active and predicated on and each thread accesses a unique 128-byte cache line then the sectors/request = 32.
The L1 and L2 cache line size is 128 bytes, composed of 4 x 32B sectors. This does not have an impact on the sectors/request. The L1 tag stage can resolve 4 sets x 4 sectors per cycle.
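For concreteness, here is a minimal CUDA sketch of those cases (kernel and variable names are my own; the stated counts assume 32 active, non-predicated threads, 4-byte loads, a 128-byte-aligned `in` pointer, and that the compiler keeps each load as a separate LDG instruction):

```
// How a warp's addresses map onto 32-byte sectors (illustrative only).
__global__ void sector_patterns(const float* __restrict__ in, float* out)
{
    int lane = threadIdx.x & 31;

    // Pattern A: all 32 threads load the same address.
    // 1 request, all addresses in one 32B sector -> 1 sector/req.
    float a = in[0];

    // Pattern B: consecutive 4-byte elements, 32 x 4B = 128B.
    // 1 request touching one full cache line -> 4 sectors/req.
    float b = in[lane];

    // Pattern C: 128-byte stride, every thread in a different cache line.
    // 1 request touching 32 distinct sectors -> 32 sectors/req.
    // (Requires in[] to hold at least 32 * 32 floats.)
    float c = in[lane * 32];

    out[threadIdx.x] = a + b + c;
}
```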
Thanks Greg! To check my understanding, let me throw out a few more cases and see if they are correct:
Assume
case 1: 8 threads out of 32 in the warp are predicated on/active (are these equivalent/identical?)
then
case 1a: all 8 active threads request addresses within a single 32-byte sector (within a single cache line) = 1 sector/req
case 1b: the 8 active threads request addresses spread across all 4 x 32-byte sectors (within a single cache line) = 4 sectors/req
case 2: 32 threads active, all threads request the same address = 1 sector/req?
Additional clarifications:
a. For a given single warp, will the value of sectors/req always be an integer between 0 and 32?
b. So, ending up with fractional sectors/req (as seen in NCU) requires averaging over multiple warps?
E.g. with only 2 warps, where warp 1 has 1 sec/req and warp 2 has 2 sec/req, we would see 1.5 sec/req in NCU?
c. Per your comment “The L1 tag stage can resolve 4 sets x 4 sectors per cycle” - is this because of the 4 SP / SM, one set per SP? (BTW to be clear… what is a ‘set’ here?)
Yes, you have a correct understanding. There may be edge cases and slight differences per architecture based upon (a) the maximum number of threads supported by the LSU per wavefront (tag stage) and (b) vector data types.
A thread is active if the thread is marked active in the active mask. A thread may be marked exited if it executes EXIT (or was not launched) and it may be marked inactive on a divergent branch.
A thread’s predicated on/off status is per instruction and is defined by the instruction predicate guard. Some instructions support a second predicate. Older architectures (e.g. Kepler) support condition codes that can also disable the instruction for a thread.
At least 1 thread has to be active in a warp for an instruction to be scheduled. If all active threads are predicated off then the instruction has no side effect on the register file or memory; however, the instruction may still have to fully go through the pipeline so that the instruction can resolve dependencies in the correct order.
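To tie this back to case 1a above, a small sketch (mine, not from the thread; whether nvcc uses an instruction predicate or a branch for the guard is a compiler decision and may differ by version):

```
// Only 8 of 32 threads participate in the load. If the guard becomes an
// instruction predicate, the warp still issues one request; only the 8
// predicated-on threads contribute addresses. With 'in' 32B-aligned,
// 8 x 4B = 32B -> 1 sector for that request (case 1a).
__global__ void predicated_subset(const float* __restrict__ in, float* out)
{
    int lane = threadIdx.x & 31;
    float v = 0.0f;
    if (lane < 8)          // may compile to a predicate guard rather than a branch
        v = in[lane];
    out[threadIdx.x] = v;
}
```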
Additional clarifications:
a. For a given single warp, will the value of sectors/req always be an integer between 0 and 32?
For LSU instructions the range should be 0-32.
b. So, ending up with fractional sectors/req (as seen in NCU) requires averaging over multiple warps?
Fractional sectors/req require averaging over multiple warp requests (instructions). The instructions can be issued from the same warp or different warps.
E.g. with only 2 warps, where warp 1 has 1 sec/req and warp 2 has 2 sec/req, we would see 1.5 sec/req in NCU?
Correct.
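A sketch of the same-warp variant mentioned above (my example; the 2.5 figure assumes the compiler emits the two loads as separate LDG instructions and that a single warp is launched):

```
// One warp, two global load instructions:
//   load 1: broadcast of in[0]         -> 1 request, 1 sector
//   load 2: coalesced 128B of in[lane] -> 1 request, 4 sectors
// The global-load line of the Memory Tables would then show
// (1 + 4) / 2 = 2.5 sectors/req, even though only one warp ran.
__global__ void fractional_avg(const float* __restrict__ in, float* out)
{
    int lane = threadIdx.x & 31;
    float a = in[0];
    float b = in[lane];
    out[threadIdx.x] = a + b;
}
```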
c. Per your comment “The L1 tag stage can resolve 4 sets x 4 sectors per cycle” - is this because of the 4 SP / SM, one set per SP? (BTW to be clear… what is a ‘set’ here?)
No. A cache set refers to a group of cache lines within a cache where a specific memory address can potentially be stored. The ability to resolve 4 sets allows for improved handling of warps with address divergence.
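To make “set” concrete, here is a generic set-associative index calculation (the set count is a made-up illustration; NVIDIA does not publish the actual L1 set count or index hash):

```
// Generic set-associative lookup, NOT the actual GPU L1 organization.
__host__ __device__ inline unsigned set_index(unsigned long long addr)
{
    const unsigned line_bytes = 128;  // L1/L2 line size (4 x 32B sectors)
    const unsigned num_sets   = 64;   // hypothetical set count for illustration
    // Lines whose index bits match all compete for the ways of that one set.
    return (unsigned)((addr / line_bytes) % num_sets);
}
// A tag stage that resolves 4 sets per cycle can look up lines falling in up
// to 4 different sets of a divergent warp request in the same cycle.
```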