How to understand "latency hiding"

LATENCY HIDING

  1. Instruction to dependent instruction latency for math operations

A stream of math instructions with data dependencies will stall based upon the dependent-issue latency of the pipeline. This latency can vary from 4 to more than 16 cycles.

EX1. sequence of dependent integer add instructions

# IADD Rdst, Ra, Rb  // Rdst = Ra + Rb
1 IADD R0, R1, R2
2 IADD R4, R0, R3
3 IADD R8, R0, R4
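
For reference, here is a rough CUDA C++ sketch (my own example, not from the original post; names are made up) of source code that the compiler could lower to a dependent IADD chain like the one above:

__global__ void dependent_adds(int* out, int a, int b, int c)
{
    int r0 = a + b;        // ~ IADD R0, R1, R2
    int r4 = r0 + c;       // ~ IADD R4, R0, R3  (must wait for r0)
    int r8 = r0 + r4;      // ~ IADD R8, R0, R4  (must wait for r4)
    out[blockIdx.x * blockDim.x + threadIdx.x] = r8;   // keep the chain live
}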

If the SM architecture has a 4-cycle dependent-issue latency then the warp will be stalled for 3 cycles waiting for the Rdst from the previous instruction to be available as an operand to the next instruction. In this example we will assume the ALU pipe is 16 lanes wide (GV100 - GH100) and the warp scheduler can issue 1 instruction per cycle, but only 0.5 instructions/cycle to the ALU pipe.

Below are two timing diagrams showing how the scheduler selects an active (and not stalled) warp and issues the instruction.

In example 1 there is only 1 warp per sub-partition. In this case there are not enough warps to hide latency, as after every IADD instruction the warp scheduler has no other warp to pick for 3 cycles. This results in an issue active of 25% of the cycles and an ALU pipe active of 50%.

In example 2 there are 2 warps per sub-partition. Let’s assume both are active and executing the same sequence of 3 IADD instructions. The warp scheduler is able to switch between warp 0 and warp 4 whenever the ALU pipe is ready for a new instruction. This results in an issue active of 50% and an ALU pipe active of 100%. If more warps were allocated to the SM sub-partition, the warps were eligible (not stalled), and the instruction type was not for the ALU pipe, then the warp scheduler could likely use the additional issue cycles.

LEGEND
    S = selected
    N = not selected
    W = wait
    T = math pipe throttle - pipe is issuing another instruction
    1 = pipe is issuing instruction
    0 = pipe is ready to issue

EXAMPLE 1 : 1 Warp per SM Sub-partition

        cycles              1 1 1 1 1
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
                         
warp 0  S W W W S W W W S W W W
issue   1 0 0 0 1 0 0 0 1 0 0 0
alu     1 1 0 0 1 1 0 0 1 1 0 0     (1 issue active, 0 issue ready)

alu pipe active     50%
issue active        25%

EXAMPLE 2 : 2 Warps per SM Sub-partition

        cycles              1 1 1 1 1
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
warp 0  S W W W S W W W S W W W
warp 4  N T S W W W S W W W S W W W
issue   1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
alu     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

alu pipe active     100%
issue active        50%
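
As a back-of-envelope check (my own arithmetic, not an official formula), the number of warps needed per sub-partition to keep a pipe busy is roughly the dependent latency multiplied by the issue rate into that pipe:

#include <math.h>

// warps per sub-partition ~= dependent latency (cycles) x issue rate into the pipe
static int warps_to_hide(int dependent_latency_cycles, double instr_per_cycle_to_pipe)
{
    return (int)ceil(dependent_latency_cycles * instr_per_cycle_to_pipe);
}

// Example 1/2 numbers: 4-cycle ALU dependent latency x 0.5 IADD/cycle into the
// 16-lane ALU pipe -> 2 warps per sub-partition, matching Example 2 above.
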
  2. Instruction to dependent instruction latency for memory operations

Memory dependencies are like the IADD example above, but the latency is (a) variable, and (b) significantly longer, which increases the number of warps needed to hide the latency (see the sketch after the list below).

  • <100 cycles for L1 hit
  • 200 cycles for L2 hit
  • 400-800 cycles for L2 miss to device memory
  • greater than 800 cycles for L2 miss to system memory
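
Here is a minimal CUDA sketch (my own example; kernel and parameter names are hypothetical) of a memory-dependent chain, where each load address depends on the data returned by the previous load, so the warp stalls for the full memory latency on every iteration unless other warps are available to issue:

__global__ void dependent_loads(const int* __restrict__ next, int* out, int start, int steps)
{
    int idx = start + threadIdx.x;
    for (int i = 0; i < steps; ++i) {
        idx = next[idx];   // the next load cannot issue until this one returns (100 to 800+ cycles)
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = idx;
}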

REPLY TO QUESTIONS

first,
So I think there is only one block on each SM if the GEMM is implemented as you suggest. Other blocks would have to wait until the previous block finishes before they can be issued, is that right?

More than 1 thread block can be resident on an SM. In example 1 you would need at least 2 warps per SM sub-partition (8 per SM) to saturate the ALU pipe. In your example each thread block requires 32 KiB of shared memory on an SM with 48 KiB of shared memory, resulting in a maximum occupancy of 1 thread block per SM. In this case you would want to increase your thread block size to 256/512/1024 threads to have sufficient warps to hide latency.
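
If it helps, here is a minimal sketch (my own example; the kernel is just a stand-in for your GEMM kernel, and the sizes assume 32 KiB of dynamic shared memory per block as in your question) that asks the CUDA runtime how many of your thread blocks can be resident per SM:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void tile_kernel(float* c)            // stand-in for the GEMM kernel
{
    extern __shared__ float tile[];              // dynamic shared memory, size set at launch
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    c[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    int    block_size    = 128;                  // 4 warps per block
    size_t smem_per_blk  = 32 * 1024;            // 32 KiB per block
    int    blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, tile_kernel,
                                                  block_size, smem_per_blk);
    printf("resident blocks/SM: %d, resident warps/SM: %d\n",
           blocks_per_sm, blocks_per_sm * (block_size / 32));
    return 0;
}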

second,
Can one block be executed by two or more SMs? I’m not sure.

A thread block is launched by the compute work distributor (CWD) to one SM. The thread block will remain resident on the SM for the life of all threads in the thread block. A thread block cannot span multiple SMs.
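
You can see this directly with a small sketch (my own example; names are made up) that records the %smid special register. Every thread of a given block reports the same SM id, consistent with a block being resident on exactly one SM:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void record_smid(unsigned int* block_smid)
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));               // id of the SM running this thread
    if (threadIdx.x == 0) block_smid[blockIdx.x] = smid;   // all threads in the block agree
}

int main()
{
    const int num_blocks = 8;
    unsigned int* d_smid;
    cudaMalloc(&d_smid, num_blocks * sizeof(unsigned int));
    record_smid<<<num_blocks, 128>>>(d_smid);
    unsigned int h_smid[num_blocks];
    cudaMemcpy(h_smid, d_smid, sizeof(h_smid), cudaMemcpyDeviceToHost);
    for (int b = 0; b < num_blocks; ++b)
        printf("block %d ran on SM %u\n", b, h_smid[b]);
    cudaFree(d_smid);
    return 0;
}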

third,
Is “hide latency” warp-level or block-level?
I think the “hide latency” mechanism won’t work if it is block-level in this situation: just 64 cuda_cores per SM and a block size >= 128 threads.
So, warp-level or block-level?

Thread blocks are rasterized into warps of threads. Warps are launched on SM sub-partitions and remain resident on the SM sub-partition for the life of the warp.

The warp scheduler doesn’t really understand thread blocks. The warp scheduler schedules warps. The priority for selection of the next warp when there are multiple eligible (not stalled) warps could consider the thread block.

The warp scheduler’s ability to switch between warps every cycle at zero cost is how NVIDIA SMs hide latency.

If you only launch 1 thread block of 1 warp per SM (for example, because each thread block requires more than 50% of the shared memory) then only 1 of the 4 SM sub-partitions will be active in each SM.

just 64 cuda_cores per SM

CUDA Core means “FP32 execution unit”. I believe the term “core” is confusing you. It is not equivalent to a CPU core. A CPU core’s backend contains a collection of execution units such as ALU, FP32, Branch, SIMD, Load/Store Unit, … (google “Intel Arch Pipeline Diagram”). The CUDA Core is equivalent to an FP32 unit or a SIMD unit with N lanes.

The 64 FP32 execution units are actually divided equally among the 4 SM sub-partitions. In Examples 1 & 2 the warp scheduler “selects” an eligible warp and issues the instruction to the ALU pipe. From the warp scheduler’s point of view this takes 1 cycle. On the next cycle the warp scheduler can select another eligible warp. The ALU pipe is only 16 lanes wide, so a warp takes 2 cycles to issue and the ALU pipe is not available for one more cycle. The same is true for NVIDIA GPUs with 64 CUDA Cores. NVIDIA GA10x, AD10x, and GH100 GPUs have two separate 16-lane-wide FP32 units (fmalite and fmaheavy), so the warp scheduler can issue FP32 instructions every cycle.
