Why there is a dispatch stall when issue almost every IADD3 instruction in my kernel on 3080?

I’m sorry I post here again because I can’t move the original post What cause dispatch stall? How to avoid it? into this subforum and it really bothered me for a few days. I found no document to explain what I got in experiment. So i cannot do optimization to get better performance.

My kernel:

__global__
void kernel_IADD3(uint32_t *data) {
    //Initialize a[2], b[2], c[2], d[2], e[2], f[2]

    for (int i = 0; i < 200000; i++) {
        for (int j = 0; j < 2; j++) {
            a[j] = a[j] * b[(j+1)%2] + c[(j+2)%2];
            d[j] = d[j] + e[(j+1)%2] + f[(j+2)%2];
        }
        for (int j = 0; j < 2; j++) {
            b[j] = b[j] * c[(j+1)%2] + a[(j+2)%2];
            e[j] = e[j] + f[(j+1)%2] + d[(j+2)%2];
        }
        for (int j = 0; j < 2; j++) {
            c[j] = c[j] * a[(j+1)%2] + b[(j+2)%2];
            f[j] = f[j] + d[(j+1)%2] + e[(j+2)%2];
        }
    }
   //Write to global memory
}

The out most for loop is compiled into a SASS loop with IADD3 instructions interleaved by IMAD instructions, like this:

      IADD3 R5, R15, R12, R11 
      IMAD R6, R7, R8, R19 
      IADD3 R7, R20, R13, R10 
      IMAD R9, R8, R19, R4 
      IADD3 R8, R5, R20, R13 
      IMAD R11, R17, R18, R6 
      IADD3 R10, R7, R15, R12 

I don’t know why there is a dispatch stall in almost every IADD3 instruction. It takes 2 cycles to issue next IADD3 instruction after each IMAD instruction instead of 1.

Maybe it’s caused by register access pattern or something else. I don’t know.
Any idea?

The profile file of Nsight Compute:
dispatch_stall.ncu-rep (286.4 KB)

I checked the machine code of IMAD instructions and IADD3 instructions.

        /*0230*/                   IADD3 R12, R10, R15, R12 ;               /* 0x0000000f0a0c7210 */
                                                                            /* 0x000fe20007ffe00c */
        /*0240*/                   IMAD R4, R4, R7, R6 ;                    /* 0x0000000704047224 */
                                                                            /* 0x000fe200078e0206 */
        /*0250*/                   IADD3 R15, R13, R10, R15 ;               /* 0x0000000a0d0f7210 */
                                                                            /* 0x000fe20007ffe00f */
        /*0260*/                   IMAD R5, R5, R16, R9 ;                   /* 0x0000001005057224 */
                                                                            /* 0x000fe200078e0209 */

If my understanding is correct, the upper 3 bit of the 6th hexadecimal number in every second line indicates stall count. Maybe the lowest bit of the 5th hexadecimal number is also a part of stall count. But it doesn’t matter in this case.
As you see, the 6th hexadecimal number in every second line is 2. So dispatch unit should issue one instruction in each cycle. But before issue IADD3 instruction, there is a dispatch stall for one cycle. This can be inferred from data in Nsight Compute.

You could try filing a bug. I assume the reason the Nsight manual instructs users to file a bug with NVIDIA if there are too many dispatch stalls is because these stalls are under software control, supplied by the op-steering information associated with each instruction (for which no official documentation is made publicly available). I assume that the pxtas backend of the compiler is responsible for filling in that op-steering data.

So a compiler issue may be to blame. It is also possible that there is a hardware limitation that NVIDIA has not told anybody outside the company about and the compiler is doing the best it can. Filing a bug would likely result in NVIDIA explaining which of these scenarios applies, but without supplying additional architectural details.

I have not seen any forum participants actively reverse engineering NVIDIA GPUs in these forums recently, so I think the chances of resolving this here are slim. If you work for a larger organization that enjoys an elevated level of developer support from NVIDIA, you might want to work with the designated contact persons to get an answer.

Thanks for your advice. I will try.

Perhaps a misunderstanding on my part, but is it due to register reuse?

I understand the dependent-issue latency is 4 cycles and looking at SASS line /0230/, you have an operand in R15. The next IADD3 wants to write to R15, so would it not require a wait of 4 cycles to proceed?

Maybe it’s the reason why the instruction at line 250 must wait one cycle.
But it cannot explain why other IADD3 instructions also needs to wait one cycle.
Screenshot of sampling data in Nsight Compute:


The output of the third IADD3 instruction R14 is read by the last instruction. Its input is written by first instruction. It should take 4 cycles to issue every third IADD3 instruction(including one cycle to issue current IADD3 instruction).
So I don’t think read after write cause almost all IADD3 stalled by one cycle.

Have you considered how register banks may play into this? Or would that show up in a different register bank conflict counter?

I have no idea how register works :(
Maybe that’s why I can’t explain what happened

If you want to know, you would have to reverse engineer the details yourself. Historically, NVIDIA has been consistent for the past 15 years of shipping CUDA: They do not share this level of detail about the microarchitecture of their GPUs.

This means GPUs are not the kind of platform where programmers not working for the hardware vendor can tightly control how software executes on the hardware (except for the rare person who enjoys reverse engineering much of the microarchitecture and ISA, and then willing to write their own tools, e.g. Scott Gray for Maxwell).

In practical terms, by filing a bug you can ensure that the compiler is performing to the best level that NVIDIA is capable of delivering.

It’s really hard for me to reverse engineer. That’s why I come here for help :)

Norbert may have a point re. register conflicts - Scott Gray states a one clock stall for a conflict.

Page 6 here outlines register conflicts as they pertain to Volta and there’s a reasonable chance Ampere is the same.

Later: Maybe a red herring, having just applied that info to the screengrab above - none of the IADD3 instructions have all three operands in the same bank.

Yes, the wait cycles for preventing stalls are the 4 bits at the location you specified. The second-lowest bit of the 5th hexadecimal is the yield flag. When it is 0, the scheduler is recommended to switch warps. As long as there is no yield, a register reuse cache can be used to prevent register bank conflicts. Caching the current register operand is activated by bits 122, 123, 124 and 50 (for the first to a possible fourth source operand). If there is a yield, bits 122-124 have a different meaning (probably more hints to the scheduler).