What cause dispatch stall? How to avoid it?

I wrote a kernel:

template<uint32_t ILP>
__global__
void kernel_IADD3(uint32_t *data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    uint32_t a[ILP] = {}, b[ILP] = {}, c[ILP], d[ILP] = {}, e[ILP] = {}, f[ILP];
    for (int i = 0; i < ILP; i++) {
        c[i] = data[tid + i];
        f[i] = data[tid + i];
    }
    for (int i = 0; i < 200000; i++) {
        for (int j = 0; j < ILP; j++) {
            a[j] = a[j] * b[(j+1)%ILP] + c[(j+2)%ILP];
            d[j] = d[j] + e[(j+1)%ILP] + f[(j+2)%ILP];
        }
        for (int j = 0; j < ILP; j++) {
            b[j] = b[j] * c[(j+1)%ILP] + a[(j+2)%ILP];
            e[j] = e[j] + f[(j+1)%ILP] + d[(j+2)%ILP];
        }
        for (int j = 0; j < ILP; j++) {
            c[j] = c[j] * a[(j+1)%ILP] + b[(j+2)%ILP];
            f[j] = f[j] + d[(j+1)%ILP] + e[(j+2)%ILP];
        }
    }
    data[tid] = a[0] + f[0];
}

launched it on 3080 with ILP = 2 and made sure only one warp scheduled in smsp(4 warps per sm in total).
The performance was not as expect. I checked the SASS code in nsight compute. The main part of the code above was compiled into this form:

      IADD3 R13, R11, R14, R13    // 
      IMAD R9, R9, R4, R7 
      IADD3 R12, R10, R15, R12 
      IMAD R4, R4, R7, R6 
      IADD3 R15, R13, R10, R15 
      IMAD R5, R5, R16, R9 
      IADD3 R14, R12, R11, R14 
      IMAD R16, R16, R9, R4 
      IADD3 R11, R15, R12, R11 
      IMAD R7, R7, R6, R5 
      IADD3 R10, R14, R13, R10 
      IMAD R6, R6, R5, R16 

Each IMAD instruction is interleaved with an IADD3 instruction. Each instruction(no matter IMAD or IADD3) is dependent on the 4th instructions before. As both IMAD and IADD3’s dependent issue latency is 4 cycles, I thought utilization of issue slot should be 100%, fma should be 50%, ALU should be 100%. However, the real value is 64.76% for issue slot, 65.09% for ALU, 31.88% for FMA. I found in the source page that nearly half of warp stall sampling(all cycles) of almost all IADD3(except several instructions, perhaps 1 or 2) instructions is dispatch stall, the other half is selected. 100% of warp stall sampling(all cycles) of almost all IMAD(except several instructions, perhaps 1 or 2) instructions is selected.
So there is one cycle before each IADD3 instruction launched. Warp in that cycle is in dispatch stall state.
I didn’t find any detail about what cause dispatch stall in this case.
What causes it? How to avoid it(In my experiment, only one warp is allowed to run in each smsp)?

Attention:
It seems not all IADD3 instructions are delayed by one cycle to be issued .

I am not clear on how many dispatch stalls you are seeing, but took note of the following in the Nsight documentation:

Dispatch Stall: A pipeline interlock prevented instruction dispatch for a selected warp.

  • If dispatch stalls are higher than 5%, please file a bug to NVIDIA with reproducible.

I find the statement in the Nsight Graphics document. Maybe it’s not suitable to be by applied to CUDA program?

Don’t know. There is a subforum dedicated to Nsight Compute, where this question probably has a better chance of getting answered:

Thanks for submitting this question. It is a complex question that we will need some time to look over and figure out the best explanation. I have filed it with our engineering team and I can let you know when we have more information.

Thank you very much.
Do you have any official support to solve this kind of question? It’s better for me to contact your engineers directly.

The official channel for tools support is the forum. We take in the issue and file it with the other bugs and feature requests we get. It’s in the system now, and will be prioritized along with the other engineering work. There isn’t any faster channel via email etc… It all goes through the same system and is handled by the same team of engineers.

Any update? : )

We have this investigation scheduled in our queue along with all the other tools engineering work. It looks like it will likely be scheduled in one of the next couple of sprints based on current capacity. For reference, the internal bug ID Is 3866264.

Any update? Or is there any way I can follow the state of the internal bug by myself?
It really help me do further optimization if you could show more details of the hardware(especially ALU and FMA pipeline).

Thanks for you patience. We’ve done some internal investigation and determined that the dispatch stalls reported for your kernel are caused by exceeding the amount of registers that can be read from and written to the register file. Bandwidth to the register file is limited and the sequence of IMAD/IADD3 instructions can exceed this limit in some cases. Both instructions have a low instruction latency and require 3 input registers and 1 output register. Re-using inputs, reading from constant memory, or mixing in instructions with less register operands and outputs are possible way to avoid dispatch stalls

Thanks for your response. It really help.