I wrote a kernel:
template<uint32_t ILP>
__global__
void kernel_IADD3(uint32_t *data) {
int tid = blockDim.x * blockIdx.x + threadIdx.x;
uint32_t a[ILP] = {}, b[ILP] = {}, c[ILP], d[ILP] = {}, e[ILP] = {}, f[ILP];
for (int i = 0; i < ILP; i++) {
c[i] = data[tid + i];
f[i] = data[tid + i];
}
for (int i = 0; i < 200000; i++) {
for (int j = 0; j < ILP; j++) {
a[j] = a[j] * b[(j+1)%ILP] + c[(j+2)%ILP];
d[j] = d[j] + e[(j+1)%ILP] + f[(j+2)%ILP];
}
for (int j = 0; j < ILP; j++) {
b[j] = b[j] * c[(j+1)%ILP] + a[(j+2)%ILP];
e[j] = e[j] + f[(j+1)%ILP] + d[(j+2)%ILP];
}
for (int j = 0; j < ILP; j++) {
c[j] = c[j] * a[(j+1)%ILP] + b[(j+2)%ILP];
f[j] = f[j] + d[(j+1)%ILP] + e[(j+2)%ILP];
}
}
data[tid] = a[0] + f[0];
}
launched it on 3080 with ILP = 2 and made sure only one warp scheduled in smsp(4 warps per sm in total).
The performance was not as expect. I checked the SASS code in nsight compute. The main part of the code above was compiled into this form:
IADD3 R13, R11, R14, R13 //
IMAD R9, R9, R4, R7
IADD3 R12, R10, R15, R12
IMAD R4, R4, R7, R6
IADD3 R15, R13, R10, R15
IMAD R5, R5, R16, R9
IADD3 R14, R12, R11, R14
IMAD R16, R16, R9, R4
IADD3 R11, R15, R12, R11
IMAD R7, R7, R6, R5
IADD3 R10, R14, R13, R10
IMAD R6, R6, R5, R16
Each IMAD instruction is interleaved with an IADD3 instruction. Each instruction(no matter IMAD or IADD3) is dependent on the 4th instructions before. As both IMAD and IADD3’s dependent issue latency is 4 cycles, I thought utilization of issue slot should be 100%, fma should be 50%, ALU should be 100%. However, the real value is 64.76% for issue slot, 65.09% for ALU, 31.88% for FMA. I found in the source page that nearly half of warp stall sampling(all cycles) of almost all IADD3(except several instructions, perhaps 1 or 2) instructions is dispatch stall, the other half is selected. 100% of warp stall sampling(all cycles) of almost all IMAD(except several instructions, perhaps 1 or 2) instructions is selected.
So there is one cycle before each IADD3 instruction launched. Warp in that cycle is in dispatch stall state.
I didn’t find any detail about what cause dispatch stall in this case.
What causes it? How to avoid it(In my experiment, only one warp is allowed to run in each smsp)?
Attention:
It seems not all IADD3 instructions are delayed by one cycle to be issued .