How to understand "latency hiding"

LATENCY HIDING

  1. Instruction to dependent instruction latency for math operations

A stream of math instructions with data dependencies will stall based upon the dependent-issue latency of the pipeline. This latency can vary from 4 to more than 16 cycles.

EX1. sequence of dependent integer add instructions

# IADD Rdst, Ra, Rb  // Rdst = Ra + Rb
1 IADD R0, R1, R2
2 IADD R4, R0, R3
3 IADD R8, R0, R4
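
For reference, here is a rough CUDA C++ sketch (my own example, not from the original post; names are made up) of source code that the compiler could lower to a dependent IADD chain like the one above:

__global__ void dependent_adds(int* out, int a, int b, int c)
{
    int r0 = a + b;        // ~ IADD R0, R1, R2
    int r4 = r0 + c;       // ~ IADD R4, R0, R3  (must wait for r0)
    int r8 = r0 + r4;      // ~ IADD R8, R0, R4  (must wait for r4)
    out[blockIdx.x * blockDim.x + threadIdx.x] = r8;   // keep the chain live
}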

If the SM architecture has a 4-cycle dependent-issue latency then the warp will be stalled for 3 cycles waiting for the Rdst from the previous instruction to be available as an operand to the next instruction. In this example we will assume the ALU pipe is 16 lanes wide (GV100 - GH100) and the warp scheduler can issue 1 instruction per cycle, but only 0.5 instructions/cycle to the ALU pipe.

Below are two timing diagrams showing how the scheduler selects an active (and not stalled) warp and issues the instruction.

In example 1 there is only 1 warp per sub-partition. In this case there are not enough warps to hide latency, as after every IADD instruction the warp scheduler has no other warp to pick for 3 cycles. This results in an issue active of 25% of the cycles and an ALU pipe active of 50%.

In example 2 there are 2 warps per sub-partition. Let’s assume both are active and executing the same sequence of 3 IADD instructions. The warp scheduler is able to switch between warp 0 and warp 4 whenever the ALU pipe is ready for a new instruction. This results in an issue active of 50% and an ALU pipe active of 100%. If more warps were allocated to the SM sub-partition, the warps were eligible (not stalled), and the instruction type was not for the ALU pipe, then the warp scheduler could likely use the additional issue cycles.

LEGEND
    S = selected
    N = not selected
    W = wait
    T = math pipe throttle - pipe is issuing another instruction
    1 = pipe is issuing instruction
    0 = pipe is ready to issue

EXAMPLE 1 : 1 Warp per SM Sub-partition

        cycles              1 1 1 1 1
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
                         
warp 0  S W W W S W W W S W W W
issue   1 0 0 0 1 0 0 0 1 0 0 0
alu     1 1 0 0 1 1 0 0 1 1 0 0     (1 issue active, 0 issue ready)

alu pipe active     50%
issue active        25%

EXAMPLE 2 : 2 Warps per SM Sub-partition

        cycles              1 1 1 1 1
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
warp 0  S W W W S W W W S W W W
warp 4  N T S W W W S W W W S W W W
issue   1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
alu     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

alu pipe active     100%
issue active        50%
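
As a back-of-envelope check (my own arithmetic, not an official formula), the number of warps needed per sub-partition to keep a pipe busy is roughly the dependent latency multiplied by the issue rate into that pipe:

#include <math.h>

// warps per sub-partition ~= dependent latency (cycles) x issue rate into the pipe
static int warps_to_hide(int dependent_latency_cycles, double instr_per_cycle_to_pipe)
{
    return (int)ceil(dependent_latency_cycles * instr_per_cycle_to_pipe);
}

// Example 1/2 numbers: 4-cycle ALU dependent latency x 0.5 IADD/cycle into the
// 16-lane ALU pipe -> 2 warps per sub-partition, matching Example 2 above.
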
  2. Instruction to dependent instruction latency for memory operations

Memory dependencies are like the IADD example above, but the latency is (a) variable, and (b) significantly longer, which increases the number of warps needed to hide the latency (see the sketch after the list below).

  • <100 cycles for L1 hit
  • 200 cycles for L2 hit
  • 400-800 cycles for L2 miss to device memory
  • greater than 800 cycles for L2 miss to system memory
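
Here is a minimal CUDA sketch (my own example; kernel and parameter names are hypothetical) of a memory-dependent chain, where each load address depends on the data returned by the previous load, so the warp stalls for the full memory latency on every iteration unless other warps are available to issue:

__global__ void dependent_loads(const int* __restrict__ next, int* out, int start, int steps)
{
    int idx = start + threadIdx.x;
    for (int i = 0; i < steps; ++i) {
        idx = next[idx];   // the next load cannot issue until this one returns (100 to 800+ cycles)
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = idx;
}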

REPLY TO QUESTIONS

first,
So I think there is only one block on each SM if the GEMM is implemented as you suggest. Other blocks would have to wait until the previous block finishes before they can be issued, is that right?

More than 1 thread block can be resident on an SM. In example 1 you would need at least 2 warps per SM sub-partition (8 per SM) to saturate the ALU pipe. In your example each thread block requires 32 KiB of shared memory on an SM with 48 KiB of shared memory, resulting in a maximum occupancy of 1 thread block per SM. In this case you would want to increase your thread block size to 256/512/1024 threads to have sufficient warps to hide latency.
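
If it helps, here is a minimal sketch (my own example; the kernel is just a stand-in for your GEMM kernel, and the sizes assume 32 KiB of dynamic shared memory per block as in your question) that asks the CUDA runtime how many of your thread blocks can be resident per SM:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void tile_kernel(float* c)            // stand-in for the GEMM kernel
{
    extern __shared__ float tile[];              // dynamic shared memory, size set at launch
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    c[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    int    block_size    = 128;                  // 4 warps per block
    size_t smem_per_blk  = 32 * 1024;            // 32 KiB per block
    int    blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, tile_kernel,
                                                  block_size, smem_per_blk);
    printf("resident blocks/SM: %d, resident warps/SM: %d\n",
           blocks_per_sm, blocks_per_sm * (block_size / 32));
    return 0;
}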

second,
Can one block be executed by two or more SMs? I’m not sure.

A thread block is launched by the compute work distributor (CWD) to one SM. The thread block will remain resident on the SM for the life of all threads in the thread block. A thread block cannot span multiple SMs.
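
You can see this directly with a small sketch (my own example; names are made up) that records the %smid special register. Every thread of a given block reports the same SM id, consistent with a block being resident on exactly one SM:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void record_smid(unsigned int* block_smid)
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));               // id of the SM running this thread
    if (threadIdx.x == 0) block_smid[blockIdx.x] = smid;   // all threads in the block agree
}

int main()
{
    const int num_blocks = 8;
    unsigned int* d_smid;
    cudaMalloc(&d_smid, num_blocks * sizeof(unsigned int));
    record_smid<<<num_blocks, 128>>>(d_smid);
    unsigned int h_smid[num_blocks];
    cudaMemcpy(h_smid, d_smid, sizeof(h_smid), cudaMemcpyDeviceToHost);
    for (int b = 0; b < num_blocks; ++b)
        printf("block %d ran on SM %u\n", b, h_smid[b]);
    cudaFree(d_smid);
    return 0;
}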

third,
Is “hide latency” warp-level or block-level?
I think the “hide latency” mechanism won’t work if it is block-level in this situation: just 64 cuda_cores per SM and a block size >= 128 threads.
So, warp-level or block-level?

Thread blocks are rasterized into warps of threads. Warps are launched on SM sub-partitions and remain resident on the SM sub-partition for the life of the warp.

The warp scheduler doesn’t really understand thread blocks. The warp scheduler schedules warps. The priority for selection of the next warp when there are multiple eligible (not stalled) warps could consider the thread block.

The warp scheduler’s ability to switch between warps every cycle at zero cost is how NVIDIA SMs hide latency.

If you only launch 1 thread block of 1 warp per SM (for example, because each thread block requires more than 50% of the shared memory) then only 1 of the 4 SM sub-partitions will be active in each SM.

just 64 cuda_cores per SM

CUDA Core means “FP32 execution unit”. I believe the term “core” is confusing you. It is not equivalent to a CPU core. A CPU core’s backend contains a collection of execution units such as ALU, FP32, Branch, SIMD, Load/Store Unit, … (google “Intel Arch Pipeline Diagram”). The CUDA Core is equivalent to an FP32 unit or a SIMD unit with N lanes.

The 64 FP32 execution units are actually divided equally among the 4 SM sub-partitions. In Examples 1 & 2 the warp scheduler “selects” an eligible warp and issues the instruction to the ALU pipe. From the warp scheduler’s point of view this takes 1 cycle. On the next cycle the warp scheduler can select another eligible warp. The ALU pipe is only 16 lanes wide, so a warp takes 2 cycles to issue and the ALU pipe is not available for one more cycle. The same is true for NVIDIA GPUs with 64 CUDA Cores. NVIDIA GA10x, AD10x, and GH100 GPUs have two separate 16-lane-wide FP32 units (fmalite and fmaheavy), so the warp scheduler can issue FP32 instructions every cycle.
