What cause dispatch stall? How to avoid it?

SparkHu · November 7, 2022, 8:45am

I wrote a kernel:

template<uint32_t ILP>
__global__
void kernel_IADD3(uint32_t *data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    uint32_t a[ILP] = {}, b[ILP] = {}, c[ILP], d[ILP] = {}, e[ILP] = {}, f[ILP];
    for (int i = 0; i < ILP; i++) {
        c[i] = data[tid + i];
        f[i] = data[tid + i];
    }
    for (int i = 0; i < 200000; i++) {
        for (int j = 0; j < ILP; j++) {
            a[j] = a[j] * b[(j+1)%ILP] + c[(j+2)%ILP];
            d[j] = d[j] + e[(j+1)%ILP] + f[(j+2)%ILP];
        }
        for (int j = 0; j < ILP; j++) {
            b[j] = b[j] * c[(j+1)%ILP] + a[(j+2)%ILP];
            e[j] = e[j] + f[(j+1)%ILP] + d[(j+2)%ILP];
        }
        for (int j = 0; j < ILP; j++) {
            c[j] = c[j] * a[(j+1)%ILP] + b[(j+2)%ILP];
            f[j] = f[j] + d[(j+1)%ILP] + e[(j+2)%ILP];
        }
    }
    data[tid] = a[0] + f[0];
}

launched it on 3080 with ILP = 2 and made sure only one warp scheduled in smsp(4 warps per sm in total).
The performance was not as expect. I checked the SASS code in nsight compute. The main part of the code above was compiled into this form:

      IADD3 R13, R11, R14, R13    // 
      IMAD R9, R9, R4, R7 
      IADD3 R12, R10, R15, R12 
      IMAD R4, R4, R7, R6 
      IADD3 R15, R13, R10, R15 
      IMAD R5, R5, R16, R9 
      IADD3 R14, R12, R11, R14 
      IMAD R16, R16, R9, R4 
      IADD3 R11, R15, R12, R11 
      IMAD R7, R7, R6, R5 
      IADD3 R10, R14, R13, R10 
      IMAD R6, R6, R5, R16

Each IMAD instruction is interleaved with an IADD3 instruction. Each instruction(no matter IMAD or IADD3) is dependent on the 4th instructions before. As both IMAD and IADD3’s dependent issue latency is 4 cycles, I thought utilization of issue slot should be 100%, fma should be 50%, ALU should be 100%. However, the real value is 64.76% for issue slot, 65.09% for ALU, 31.88% for FMA. I found in the source page that nearly half of warp stall sampling(all cycles) of almost all IADD3(except several instructions, perhaps 1 or 2) instructions is dispatch stall, the other half is selected. 100% of warp stall sampling(all cycles) of almost all IMAD(except several instructions, perhaps 1 or 2) instructions is selected.
So there is one cycle before each IADD3 instruction launched. Warp in that cycle is in dispatch stall state.
I didn’t find any detail about what cause dispatch stall in this case.
What causes it? How to avoid it(In my experiment, only one warp is allowed to run in each smsp)?

Attention:
It seems not all IADD3 instructions are delayed by one cycle to be issued .

njuffa · November 7, 2022, 8:56am

I am not clear on how many dispatch stalls you are seeing, but took note of the following in the Nsight documentation:

Dispatch Stall: A pipeline interlock prevented instruction dispatch for a selected warp.

If dispatch stalls are higher than 5%, please file a bug to NVIDIA with reproducible.

SparkHu · November 7, 2022, 9:04am

I find the statement in the Nsight Graphics document. Maybe it’s not suitable to be by applied to CUDA program?

njuffa · November 7, 2022, 9:07am

Don’t know. There is a subforum dedicated to Nsight Compute, where this question probably has a better chance of getting answered:

jmarusarz · November 10, 2022, 10:08pm

Thanks for submitting this question. It is a complex question that we will need some time to look over and figure out the best explanation. I have filed it with our engineering team and I can let you know when we have more information.

SparkHu · November 11, 2022, 2:30am

Thank you very much.
Do you have any official support to solve this kind of question? It’s better for me to contact your engineers directly.

jmarusarz · November 17, 2022, 9:56pm

The official channel for tools support is the forum. We take in the issue and file it with the other bugs and feature requests we get. It’s in the system now, and will be prioritized along with the other engineering work. There isn’t any faster channel via email etc… It all goes through the same system and is handled by the same team of engineers.

SparkHu · December 6, 2022, 7:55am

Any update? : )

jmarusarz · December 8, 2022, 9:48pm

We have this investigation scheduled in our queue along with all the other tools engineering work. It looks like it will likely be scheduled in one of the next couple of sprints based on current capacity. For reference, the internal bug ID Is 3866264.

SparkHu · January 29, 2023, 8:37am

Any update? Or is there any way I can follow the state of the internal bug by myself?
It really help me do further optimization if you could show more details of the hardware(especially ALU and FMA pipeline).

jmarusarz · February 7, 2023, 9:00pm

Thanks for you patience. We’ve done some internal investigation and determined that the dispatch stalls reported for your kernel are caused by exceeding the amount of registers that can be read from and written to the register file. Bandwidth to the register file is limited and the sequence of IMAD/IADD3 instructions can exceed this limit in some cases. Both instructions have a low instruction latency and require 3 input registers and 1 output register. Re-using inputs, reading from constant memory, or mixing in instructions with less register operands and outputs are possible way to avoid dispatch stalls

SparkHu · February 9, 2023, 8:18am

Thanks for your response. It really help.

Topic		Replies	Views
Why there is a dispatch stall when issue almost every IADD3 instruction in my kernel on 3080? CUDA Programming and Performance	11	1039	November 11, 2022
Does the STG.E instruction on Ampere occupy two clock cycles of the FMAHeavy pipeline? CUDA Programming and Performance	11	744	December 10, 2023
Eligible/Stalled warps CUDA Programming and Performance	2	1185	June 8, 2020
"Instruction Fetch" in Nsight Performance Analysis CUDA Programming and Performance	8	2489	January 7, 2016
Is there a document about in which hardware unit(ie. ALU FMU...) an instruction is executed? CUDA Programming and Performance	35	2561	October 5, 2022
Stall reasons summation is not 100% Nsight Compute	7	978	October 12, 2021
Same kernel 3x slower on CUDA than on OpenCL Nsight Compute cuda	7	880	May 5, 2023
How to understand the "hide latency" CUDA Programming and Performance	13	2667	August 8, 2024
Kernel with very low eligible warps despite fully coalesced memory access CUDA Programming and Performance	7	940	July 17, 2023
How to keep the float pipe busy? CUDA Programming and Performance	7	700	April 23, 2019

What cause dispatch stall? How to avoid it?

Related topics