Why there is a dispatch stall when issue almost every IADD3 instruction in my kernel on 3080?

SparkHu · November 10, 2022, 7:13am

I’m sorry I post here again because I can’t move the original post What cause dispatch stall? How to avoid it? into this subforum and it really bothered me for a few days. I found no document to explain what I got in experiment. So i cannot do optimization to get better performance.

My kernel:

__global__
void kernel_IADD3(uint32_t *data) {
    //Initialize a[2], b[2], c[2], d[2], e[2], f[2]

    for (int i = 0; i < 200000; i++) {
        for (int j = 0; j < 2; j++) {
            a[j] = a[j] * b[(j+1)%2] + c[(j+2)%2];
            d[j] = d[j] + e[(j+1)%2] + f[(j+2)%2];
        }
        for (int j = 0; j < 2; j++) {
            b[j] = b[j] * c[(j+1)%2] + a[(j+2)%2];
            e[j] = e[j] + f[(j+1)%2] + d[(j+2)%2];
        }
        for (int j = 0; j < 2; j++) {
            c[j] = c[j] * a[(j+1)%2] + b[(j+2)%2];
            f[j] = f[j] + d[(j+1)%2] + e[(j+2)%2];
        }
    }
   //Write to global memory
}

The out most for loop is compiled into a SASS loop with IADD3 instructions interleaved by IMAD instructions, like this:

      IADD3 R5, R15, R12, R11 
      IMAD R6, R7, R8, R19 
      IADD3 R7, R20, R13, R10 
      IMAD R9, R8, R19, R4 
      IADD3 R8, R5, R20, R13 
      IMAD R11, R17, R18, R6 
      IADD3 R10, R7, R15, R12

I don’t know why there is a dispatch stall in almost every IADD3 instruction. It takes 2 cycles to issue next IADD3 instruction after each IMAD instruction instead of 1.

Maybe it’s caused by register access pattern or something else. I don’t know.
Any idea?

The profile file of Nsight Compute:
dispatch_stall.ncu-rep (286.4 KB)

SparkHu · November 10, 2022, 7:26am

I checked the machine code of IMAD instructions and IADD3 instructions.

        /*0230*/                   IADD3 R12, R10, R15, R12 ;               /* 0x0000000f0a0c7210 */
                                                                            /* 0x000fe20007ffe00c */
        /*0240*/                   IMAD R4, R4, R7, R6 ;                    /* 0x0000000704047224 */
                                                                            /* 0x000fe200078e0206 */
        /*0250*/                   IADD3 R15, R13, R10, R15 ;               /* 0x0000000a0d0f7210 */
                                                                            /* 0x000fe20007ffe00f */
        /*0260*/                   IMAD R5, R5, R16, R9 ;                   /* 0x0000001005057224 */
                                                                            /* 0x000fe200078e0209 */

If my understanding is correct, the upper 3 bit of the 6th hexadecimal number in every second line indicates stall count. Maybe the lowest bit of the 5th hexadecimal number is also a part of stall count. But it doesn’t matter in this case.
As you see, the 6th hexadecimal number in every second line is 2. So dispatch unit should issue one instruction in each cycle. But before issue IADD3 instruction, there is a dispatch stall for one cycle. This can be inferred from data in Nsight Compute.

njuffa · November 10, 2022, 7:53am

You could try filing a bug. I assume the reason the Nsight manual instructs users to file a bug with NVIDIA if there are too many dispatch stalls is because these stalls are under software control, supplied by the op-steering information associated with each instruction (for which no official documentation is made publicly available). I assume that the pxtas backend of the compiler is responsible for filling in that op-steering data.

So a compiler issue may be to blame. It is also possible that there is a hardware limitation that NVIDIA has not told anybody outside the company about and the compiler is doing the best it can. Filing a bug would likely result in NVIDIA explaining which of these scenarios applies, but without supplying additional architectural details.

I have not seen any forum participants actively reverse engineering NVIDIA GPUs in these forums recently, so I think the chances of resolving this here are slim. If you work for a larger organization that enjoys an elevated level of developer support from NVIDIA, you might want to work with the designated contact persons to get an answer.

SparkHu · November 10, 2022, 8:12am

Thanks for your advice. I will try.

rs277 · November 10, 2022, 8:18am

Perhaps a misunderstanding on my part, but is it due to register reuse?

I understand the dependent-issue latency is 4 cycles and looking at SASS line /0230/, you have an operand in R15. The next IADD3 wants to write to R15, so would it not require a wait of 4 cycles to proceed?

SparkHu · November 10, 2022, 8:43am

Maybe it’s the reason why the instruction at line 250 must wait one cycle.
But it cannot explain why other IADD3 instructions also needs to wait one cycle.
Screenshot of sampling data in Nsight Compute:

The output of the third IADD3 instruction R14 is read by the last instruction. Its input is written by first instruction. It should take 4 cycles to issue every third IADD3 instruction(including one cycle to issue current IADD3 instruction).
So I don’t think read after write cause almost all IADD3 stalled by one cycle.

njuffa · November 10, 2022, 8:47am

Have you considered how register banks may play into this? Or would that show up in a different register bank conflict counter?

SparkHu · November 10, 2022, 8:50am

I have no idea how register works :(
Maybe that’s why I can’t explain what happened

njuffa · November 10, 2022, 9:02am

If you want to know, you would have to reverse engineer the details yourself. Historically, NVIDIA has been consistent for the past 15 years of shipping CUDA: They do not share this level of detail about the microarchitecture of their GPUs.

This means GPUs are not the kind of platform where programmers not working for the hardware vendor can tightly control how software executes on the hardware (except for the rare person who enjoys reverse engineering much of the microarchitecture and ISA, and then willing to write their own tools, e.g. Scott Gray for Maxwell).

In practical terms, by filing a bug you can ensure that the compiler is performing to the best level that NVIDIA is capable of delivering.

SparkHu · November 10, 2022, 9:22am

It’s really hard for me to reverse engineer. That’s why I come here for help :)

rs277 · November 10, 2022, 9:26am

Norbert may have a point re. register conflicts - Scott Gray states a one clock stall for a conflict.

Page 6 here outlines register conflicts as they pertain to Volta and there’s a reasonable chance Ampere is the same.

Later: Maybe a red herring, having just applied that info to the screengrab above - none of the IADD3 instructions have all three operands in the same bank.

Curefab · November 11, 2022, 7:37pm

Yes, the wait cycles for preventing stalls are the 4 bits at the location you specified. The second-lowest bit of the 5th hexadecimal is the yield flag. When it is 0, the scheduler is recommended to switch warps. As long as there is no yield, a register reuse cache can be used to prevent register bank conflicts. Caching the current register operand is activated by bits 122, 123, 124 and 50 (for the first to a possible fourth source operand). If there is a yield, bits 122-124 have a different meaning (probably more hints to the scheduler).

Topic		Replies	Views
What cause dispatch stall? How to avoid it? Nsight Compute cuda	11	1771	February 9, 2023
Performance penalty due to warp divergence CUDA Programming and Performance	9	1647	May 18, 2023
cant understand this performance hit CUDA Programming and Performance	23	4597	May 17, 2010
preventing ptxas from reordering instructions CUDA Programming and Performance	23	6115	December 2, 2022
A strange phenomenon on register allocation. How to reduce register pressure? CUDA Programming and Performance cuda	14	1393	March 25, 2022
Register usage of a device function for vector rotation CUDA Programming and Performance	14	703	June 12, 2022
Lowering register usage CUDA Programming and Performance	14	4528	October 10, 2008
On the register allocation optimization of cuda compiler CUDA Programming and Performance	12	3266	January 20, 2019
Does the STG.E instruction on Ampere occupy two clock cycles of the FMAHeavy pipeline? CUDA Programming and Performance	11	819	December 10, 2023
"no instruction" stalls every 256 bytes of the binary code CUDA Programming and Performance	7	1559	February 14, 2019

Why there is a dispatch stall when issue almost every IADD3 instruction in my kernel on 3080?

Related topics