I’m messing around with some unrolling right now and have the code handy, so I thought I’d illustrate njuffa’s first point with some actual code.
Here we have a very simple first-pass reduction loop:
for (int i = tid; i < n; i += THREADS)
{
    sum += __ldg(in);
    in += THREADS;
}
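For context, here’s a minimal sketch of the kind of kernel this fragment might sit in. Everything outside the loop itself (the name reduce_pass1, the per-block segment layout, the shared-memory finish) is my assumption for illustration, not necessarily the code that produced the SASS below:

#define THREADS 512   // matches the 0x800-byte (512-float) spacing between LDGs in the SASS

__global__ void reduce_pass1(const float *in, float *partial, int n)
{
    int tid = threadIdx.x;
    in += blockIdx.x * n + tid;          // hypothetical layout: one n-element segment per block

    float sum = 0.0f;
    for (int i = tid; i < n; i += THREADS)
    {
        sum += __ldg(in);
        in += THREADS;
    }

    // one possible block-level finish: tree reduction in shared memory
    __shared__ float smem[THREADS];
    smem[tid] = sum;
    __syncthreads();
    for (int s = THREADS / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = smem[0];   // one partial sum per block
}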
And here is the SASS that gets generated for the original loop:
TARGET4:
        MOV R12, R4;
        LDG.E.CI R6, [R4];
        IADD32I R12.CC, R12, 0x2000;
        LDG.E.CI R9, [R4+0x800];
        IADD32I R8, R8, 0x800;
        LDG.E.CI R10, [R4+0x1000];
        MOV32I R13, 0xfffffa00;
        LDG.E.CI R11, [R4+0x1800];
        MOV R14, param_3;
        FADD R6, R7, R6;
        MOV R7, R5;
        FADD R5, R6, R9;
        IADD.X R7, RZ, R7;
        VABSDIFF.ACC RZ.CC, R8, R14, R13;
        DEPBAR.LE SB5, 0x1;
        FADD R4, R5, R10;
        MOV R5, R7;
        FADD R6, R4, R11;
        MOV R4, R12;
        MOV R7, R6;
        BRA CC.GT, TARGET4;
        SYNC;
TARGET3:
        ISETP.LT.AND P0, PT, R8, param_n, PT;
        @!P0 SYNC;
TARGET2:
        SSY TARGET5;
TARGET6:
        IADD32I R8, R8, 0x200;
        LDG.E.CI R6, [R4];
        IADD32I R4.CC, R4, 0x800;
        ISETP.LT.AND P0, PT, R8, param_n, PT;
        IADD.X R5, RZ, R5;
        FADD R6, R7, R6;
        MOV R7, R6;
        @P0 BRA TARGET6;
        SYNC;
TARGET5:
        SYNC;
So the simple loop gets broken up into two loops: one that’s unrolled 4 times (TARGET4) and one that’s not unrolled (TARGET6). The big loop handles most of the iterations, and the small one handles the leftover iterations at the end that don’t fit in a multiple of 4.
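At the source level, the transformation is roughly equivalent to hand-writing something like this (a sketch; the real SASS tracks the trip count in bytes rather than a separate element counter):

int i = tid;
for (; i + 3 * THREADS < n; i += 4 * THREADS)    // main loop, unrolled 4x
{
    float a = __ldg(in);
    float b = __ldg(in + THREADS);
    float c = __ldg(in + 2 * THREADS);
    float d = __ldg(in + 3 * THREADS);           // 4 loads in flight before any FADD
    sum += a;                                    // then the dependent FADD chain,
    sum += b;                                    // just like the SASS above
    sum += c;
    sum += d;
    in += 4 * THREADS;
}
for (; i < n; i += THREADS)                      // remainder loop, not unrolled
{
    sum += __ldg(in);
    in += THREADS;
}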
Looking at the big loop, there’s a clear advantage to having it unrolled 4 times: you can queue up all 4 memory loads (the LDG instructions) before the first FADD needs a result. That basically cuts the total loop latency down by close to a factor of 4, since the latencies of the additional 3 loads mostly overlap with the first one’s. I’m not sure why the compiler chooses 4 as the unroll factor, but I’m assuming the CUDA team has benchmarked this kind of code heavily and figured out that 4 gives good performance (or maybe they’re just working from theoretical hardware knowledge).
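If you want to experiment with a different unroll factor rather than rely on the compiler’s heuristic, #pragma unroll takes an explicit count (whether a larger factor actually helps is something you’d have to benchmark on your hardware):

#pragma unroll 8                       // hint: unroll 8x; the compiler still emits remainder handling
for (int i = tid; i < n; i += THREADS)
{
    sum += __ldg(in);
    in += THREADS;
}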
The unrolling does require a few additional registers (one per in-flight load: R6, R9, R10, R11 above), but in this case the trade-off is almost certainly well worth it.
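If you want to see that register cost yourself, compiling with nvcc -arch=sm_35 -Xptxas -v reduce.cu (reduce.cu being whatever your file is called) makes ptxas print the per-kernel register count, so you can compare builds with and without the unrolling.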