How to understand "latency hiding"

"Latency hiding" is an important concept in CUDA's SIMT model, but I don't understand which latencies are worth hiding by CUDA's scheduler.

Example 1: a thread block will stall if a cache miss happens while it is accessing global memory, and another thread block will then be scheduled because the previous one is stalling. But if the old thread block hit a cache miss, I think the probability of a cache miss is even higher for the new thread block.

Example 2: a thread block is stalling, so the CUDA scheduler selects another thread block to run. I think the instruction pipeline of the previous block at least needs to be drained before the new block can be scheduled, so the cost is very expensive if the instruction pipeline needs to be filled and drained. I think this filling and draining cannot be outweighed by the benefit of SIMT.

So, I’m very confused:

  1. What "latency" can be hidden?
  2. How can I identify these latencies in a CUDA profiler (such as NCU)?
  3. How does CUDA reduce the cost of block switching?

This topic is covered in some detail in unit 3 of this online training series.

Your conjecture about the instruction pipeline is not reflected in the actual behavior. The “instruction pipeline” in a GPU is not contained within a single functional unit. Rather, different instructions are issued to different functional units, each of which is pipelined. None of these pipelines need to be “drained” on a context switch (changing execution from one warp to another). There may be some secondary concerns around the effect on the instruction cache; I wouldn’t be able to comment on that. However, I’m quite certain that if a secondary concern like that does exist, it in no way outweighs the benefits of latency hiding.

GPUs implement zero-cost switching between threads. If a thread stalls, a different thread takes its place. Load latency and instruction latency are the most common latencies to be hidden, typically in that order.

This immediately tells us the most important mechanism to hide latency: run lots of threads so there is always another thread ready to run. That is the programmer’s responsibility. The compiler has a supporting role by scheduling instructions such that instructions with (potentially) long latency are scheduled early. In real-life code, this is most easily observed with load instructions in a piece of straight-line code. The compiler will tend to schedule many of the loads early, subject to trade-offs with register pressure.
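A sketch of what this looks like in practice (hypothetical kernel; the exact instruction schedule is up to the compiler and depends on register pressure):

```cuda
// Hypothetical straight-line copy kernel. Because the four loads are
// independent of each other, the compiler is free to issue all four LDG
// instructions early, back to back, and only then consume the results,
// so the four memory latencies overlap instead of being paid serially.
__global__ void copy4(float *dst, const float *src, int n)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i + 3 < n) {
        // Four independent loads: typically scheduled early...
        float a = src[i + 0];
        float b = src[i + 1];
        float c = src[i + 2];
        float d = src[i + 3];
        // ...well before the first use of any of the results.
        dst[i + 0] = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
}
```

Inspecting the SASS (e.g. with `cuobjdump -sass`) would show whether the loads were in fact hoisted ahead of the stores on a given architecture.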

Hello njuffa, many thanks for your clear description.
As you said:

the most important mechanism to hide latency: run lots of threads so there is always another thread ready to run.

I have a counterexample: we implement GEMM at block level, and sometimes the thread count per block is 128/256.
Let's assume our GPU has no Tensor Cores, just CUDA cores, with 128 cores/SM and 48K shared memory/SM, and 32K shared memory is allocated in each block.
My questions:

  1. Is 128 or 256 threads big enough?
  2. The thread block on each SM cannot be switched out while it is working, because each SM can only support 1 block (32K shared memory is allocated). Is that right? How does the CUDA scheduler earn its keep, and how can it hide the various latencies?

You would want to use the CUDA profiler to help you guide your development process. Note that GPUs have a hierarchy of schedulers, and the details of their operation can change between GPU architectures. So you might need to profile on multiple architectures depending on how broad a spectrum of GPU architectures you need to cover. BTW, do you really need to build your own GEMM? To first order, NVIDIA has done the work for you by supplying CUBLAS.

That said, some useful heuristics can be formulated. I usually recommend 128 threads per thread block as a starting point, because finer block granularity increases the chances of maximizing the total active thread count per SM. Other concerns may cause one to deviate from that, towards 64-thread blocks on rare occasions and more commonly towards blocks with between 128 and 256 threads (e.g. 256 threads lends itself to a 16x16 tile). Generally speaking, it is sub-optimal from a latency-hiding perspective to run with just one thread block on each SM; the goal should be to run two or more thread blocks per SM. Ideally the total number of threads in a grid should exceed the number of threads able to run concurrently across all SMs. On older GPU architectures “over-subscription” of up to 20x was often optimal; my understanding is that this number is reduced on more recent architectures.
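One way to sanity-check these heuristics programmatically is the runtime's occupancy API (a sketch; `myKernel` is a hypothetical kernel standing in for your own):

```cuda
#include <cstdio>

// Hypothetical kernel; substitute your own.
__global__ void myKernel(float *data) { }

int main()
{
    int device = 0, numSMs = 0, blocksPerSM = 0;
    const int blockSize = 128;  // the starting-point heuristic

    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

    // How many 128-thread blocks of myKernel can be resident on one SM,
    // given its register and shared memory usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0 /* dyn. smem */);

    printf("%d blocks of %d threads resident per SM (%d SMs)\n",
           blocksPerSM, blockSize, numSMs);

    // A grid several times numSMs * blocksPerSM helps ensure a stalled
    // warp can always be replaced by an eligible one.
    return 0;
}
```

If `blocksPerSM` comes back as 1, that is a hint to shrink per-block shared memory or register usage, or to enlarge the block, before worrying about finer tuning.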

Hi njuffa:
I don't plan to implement GEMM myself; I just want to pin down my problem through the GEMM example.
OK, let me follow your description:

first,

I usually recommend 128 threads per thread block as a starting point

Each SM has 128 or fewer CUDA cores in recent GPUs such as Ampere; there are just 64 CUDA cores per SM in A100.
So I think there would be only one block on each SM if GEMM is implemented as you suggest. Other blocks would have to wait until the previous block finishes before they can be issued, is that right?

second,
Can one block be executed by two or more SMs? I'm not sure.

third,
Is "latency hiding" warp-level or block-level?
I think the latency-hiding mechanism won't work if it is block-level in this situation: just 64 CUDA cores per SM, and the block size is >= 128 threads.
So, warp-level or block-level?

LATENCY HIDING

  1. Instruction to dependent instruction latency for math operations

A stream of math instructions will have a data dependency based upon the latency of the pipeline. This can vary from 4 to >16 cycles.

EX1. sequence of dependent integer add instructions

# IADD Rdst, Ra, Rb  // Rdst = Ra + Rb
1 IADD R0, R1, R2
2 IADD R4, R0, R3
3 IADD R8, R0, R4

If the SM architecture has a 4 cycle dependent latency then the warp will be stalled for 3 cycles waiting for the Rdst from the previous instruction to be available as an operand to the next instruction. In this example we will assume the ALU pipe is 16 lanes wide (GV100 - GH100) and that the warp scheduler can issue 1 instruction per cycle, but only 0.5 instructions/cycle to the ALU.

Below are two timing diagrams showing how the scheduler selects an active (and not stalled warp) and issues the instruction.

In example 1 there is only 1 warp per sub-partition. In this case there are not enough warps to hide latency, as for every IADD instruction the warp scheduler has no other warp to pick for 3 cycles. This results in an issue active of 25% of the cycles and an ALU pipe active of 50%.

In example 2 there are 2 warps per sub-partition. Let's assume both are active and executing the same sequence of 3 IADD instructions. The warp scheduler is able to switch between warp 0 and warp 4 whenever the ALU pipe is ready for a new instruction. This results in an issue active of 50% and an ALU pipe active of 100%. If more warps were allocated to the SM sub-partition, and those warps were eligible (not stalled), and their next instruction was not for the ALU pipe, then the warp scheduler could likely use the additional issue cycles.

LEGEND
    S = selected
    N = not selected
    W = wait
    T = math pipe throttle - pipe is issuing another instruction
    1 = pipe is issuing instruction
    0 = pipe is ready to issue

EXAMPLE 1 : 1 Warp per SM Sub-partition

        cycles              1 1 1 1 1
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
                         
warp 0  S W W W S W W W S W W W
issue   1 0 0 0 1 0 0 0 1 0 0 0
alu     1 1 0 0 1 1 0 0 1 1 0 0     (1 issue active, 0 issue ready)

alu pipe active     50%
issue active        25%

EXAMPLE 2 : 2 Warps per SM Sub-partition

        cycles              1 1 1 1 1
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
warp 0  S W W W S W W W S W W W
warp 4  N T S W W W S W W W S W W W
issue   1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
alu     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

alu pipe active     100%
issue active        50%
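The dependency chain in the diagrams can be provoked from CUDA C++ as well (an illustrative sketch, not a benchmark; the kernel name is made up):

```cuda
// Sketch: a long chain of dependent integer adds. Each add must wait
// the full pipeline latency for its predecessor, so with a single warp
// per sub-partition the issue slots sit empty as in Example 1; adding
// more resident warps fills those slots as in Example 2.
__global__ void dependent_chain(int *out, int a, int b)
{
    int x = a;
    #pragma unroll
    for (int i = 0; i < 64; ++i)
        x = x + b;          // each add depends on the previous result
    out[threadIdx.x] = x;
}
// In a real experiment, check the SASS: the compiler may simplify a
// chain this regular (e.g. fold it to a + 64*b), in which case a less
// reducible operation is needed to keep the dependency chain intact.
```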

  2. Instruction to dependent instruction latency for memory operations

Memory dependencies are like the IADD example above, but the latency is (a) variable and (b) significantly longer, increasing the number of warps needed to hide the latency.

  • <100 cycles for L1 hit
  • 200 cycles for L2 hit
  • 400-800 cycles for L2 miss to device memory
  • greater than 800 cycles for L2 miss to system memory
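A rough way to turn these numbers into warp counts (a Little's-law style estimate, not an exact model; the concrete figures are illustrative):

    independent instructions in flight ≈ latency / issue interval

For the ALU example above, 4 cycles of dependent latency with an ALU issue every 2 cycles gives 4 / 2 = 2 warps per sub-partition, which is why Example 2 saturates the pipe. For a ~400-cycle L2 miss, the same estimate lands on the order of a couple hundred independent loads in flight per scheduler, which in practice is supplied by a combination of more resident warps and more independent loads per warp (instruction-level parallelism).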

REPLY TO QUESTIONS

first,
So, I think there is only one block on each SM if GEMM implement as your suggestion. other block should wait until the previous block finishing, if they want to be issued, is it right?

More than 1 thread block can be resident on a SM. In example 1 you would need at least 2 warps per SM sub-partition (8 per SM) to saturate ALU pipe. In your example each thread block requires 32 KiB of shared memory on an SM with 48 KiB shared memory resulting in a maximum occupancy of 1 thread block per SM. In this case you would want to increase your thread block size to 256/512/1024 threads to have sufficient warps to hide latency.
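To make the occupancy arithmetic concrete with the numbers from the question (illustrative only): 32 KiB of shared memory per block on an SM with 48 KiB means floor(48 / 32) = 1 resident block. A 128-thread block is 128 / 32 = 4 warps, i.e. only 1 warp per SM sub-partition, which is the under-occupied case of Example 1. Raising the block size to 512 threads with the same 32 KiB tile gives 512 / 32 = 16 warps, i.e. 4 warps per sub-partition, without needing a second resident block.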

second,
one block can be executed by two or more SM? I’m not sure.

A thread block is launched by the compute work distributor (CWD) to one SM. The thread block will remain resident on the SM for the life of all threads in the thread block. A thread block cannot span multiple SMs.

third,
“hide latency” is warp-level or block-level?
I think the “hide latency” mechanism won’t work if it is block-level under this situation: just 64 cuda_cores per SM, and the blocksize is >= 128 threads.
So, warp-level or block-level?

Thread blocks are rasterized into warps of threads. Warps are launched on SM sub-partitions and remain resident on the SM sub-partition for the life of the warp.

The warp scheduler doesn’t really understand thread blocks. The warp scheduler schedules warps. The priority for selection of the next warp when there are multiple eligible (not stalled) warps could consider the thread block.

The warp scheduler's ability to switch between warps every cycle at zero cost is how NVIDIA SMs hide latency.

If you only launch 1 thread block per SM (for example, because each block requires more than 50% of the shared memory) then only 1 of the 4 SM sub-partitions will be active.

just 64 cuda_cores per SM

CUDA Core means “FP32 execution unit”. I believe the term “core” is confusing you. This is not equivalent to a CPU core. A CPU core backend contains a collection of execution units such as ALU, FP32, Branch, SIMD, Load/Store Unit, … (google “Intel Arch Pipeline Diagram”). The CUDA Core is equivalent to a FP32 unit or SIMD unit with N lanes.

The 64 FP32 execution units are actually divided equally among the 4 SM sub-partitions. In Examples 1 & 2 the warp scheduler “selects” an eligible warp and issues the instruction to the ALU pipe. From the warp scheduler's point of view this takes 1 cycle. On the next cycle the warp scheduler can select another eligible warp. The ALU pipe is only 16 lanes wide, so a warp takes 2 cycles to issue, and the ALU pipe is not available for one more cycle. The same is true for NVIDIA GPUs with 64 CUDA Cores. NVIDIA GA10x, AD10x, and GH100 GPUs have two separate 16-lane-wide FP32 units (fmalite and fmaheavy), so the warp scheduler can issue FP32 instructions every cycle.

Greg, thanks for such a kind description.
Thank you.

Greg:
I posted a wrong reply and then realized I was mistaken.
I've deleted it, sorry.