I’m sorry I post here again because I can’t move the original post What cause dispatch stall? How to avoid it? into this subforum and it really bothered me for a few days. I found no document to explain what I got in experiment. So i cannot do optimization to get better performance.
My kernel:
__global__
void kernel_IADD3(uint32_t *data) {
//Initialize a[2], b[2], c[2], d[2], e[2], f[2]
for (int i = 0; i < 200000; i++) {
for (int j = 0; j < 2; j++) {
a[j] = a[j] * b[(j+1)%2] + c[(j+2)%2];
d[j] = d[j] + e[(j+1)%2] + f[(j+2)%2];
}
for (int j = 0; j < 2; j++) {
b[j] = b[j] * c[(j+1)%2] + a[(j+2)%2];
e[j] = e[j] + f[(j+1)%2] + d[(j+2)%2];
}
for (int j = 0; j < 2; j++) {
c[j] = c[j] * a[(j+1)%2] + b[(j+2)%2];
f[j] = f[j] + d[(j+1)%2] + e[(j+2)%2];
}
}
//Write to global memory
}
The out most for loop is compiled into a SASS loop with IADD3 instructions interleaved by IMAD instructions, like this:
IADD3 R5, R15, R12, R11
IMAD R6, R7, R8, R19
IADD3 R7, R20, R13, R10
IMAD R9, R8, R19, R4
IADD3 R8, R5, R20, R13
IMAD R11, R17, R18, R6
IADD3 R10, R7, R15, R12
I don’t know why there is a dispatch stall in almost every IADD3 instruction. It takes 2 cycles to issue next IADD3 instruction after each IMAD instruction instead of 1.
Maybe it’s caused by register access pattern or something else. I don’t know.
Any idea?
The profile file of Nsight Compute:
dispatch_stall.ncu-rep (286.4 KB)