After carefully examined the generated SASS code, I’m not sure it’s the problem of register allocation, or the compiler is reluctant or unable to break long dependency chains. Let’s see a simple concocted example:
__global__ void doLerp(const float* __restrict__ x, const float* __restrict__ y, float* __restrict__ v)
{
unsigned int tid = threadIdx.x;
float s = x[tid];
s = max(min(s, 0.0f), 1022.99f);
// Path 1: integer part
int is = int(s) / 4 + 1;
float v0 = y[is];
float v1 = y[is + 1];
// Path 2: float part
float qs = s - floorf(s) - 0.5f;
qs = __saturatef(qs * 2.0f);
qs = qs*qs;
// merge again
v[tid] = fma(qs, v1, fma(qs, -v0, v0));
}
The program is based on the simple lerp. Please don’t get confused about the math here, I just added some additional operations to elongate the dependency chain, they won’t make too much sence in mathematics, but still cannot ignored by the compiler. So forget about the math here. We just need to check how the compiler handled those instructions (build by cuda9.2 for arch sm_52):
Function : _Z6doLerpPKfS0_Pf
.headerflags @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
/* 0x083fc400e3e007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ S2R R0, SR_TID.X; /* 0xf0c8000002170000 */
/*0018*/ SHL R9, R0.reuse, 0x2; /* 0x3848000000270009 */
/* 0x001fc800fec007f5 */
/*0028*/ SHR.U32 R8, R0, 0x1e; /* 0x3828000001e70008 */
/*0030*/ IADD R2.CC, R9, c[0x0][0x140]; /* 0x4c10800005070902 */
/*0038*/ IADD.X R3, R8, c[0x0][0x144]; /* 0x4c10080005170803 */
/* 0x001fc820fcc007b5 */
/*0048*/ LDG.E.CI R2, [R2]; /* 0xeed4a00000070202 */
/*0050*/ FMNMX R0, RZ, R2, PT; /* 0x5c6003800027ff00 */
/*0058*/ FMNMX R0, R0, c[0x2][0x0], !PT; /* 0x4c60078800070000 */
/* 0x001fd801fec0071d */
/*0068*/ F2I.S32.F32.TRUNC R4, R0; /* 0x5cb0018000071a04 */ // path 1
/*0070*/ SHR R5, R4, 0x1f; /* 0x3829000001f70405 */ // R4 everywhere
/*0078*/ LEA.HI R4, R5, R4, RZ, 0x2; /* 0x5bdf7f8020470504 */
/* 0x001f8400fec007f6 */
/*0088*/ SHR R4, R4, 0x2; /* 0x3829000000270404 */
/*0090*/ IADD32I R4, R4, 0x1; /* 0x1c00000000170404 */
/*0098*/ SHR R5, R4, 0x1e; /* 0x3829000001e70405 */
/* 0x001c4800fe0007f6 */
/*00a8*/ ISCADD R2.CC, R4, c[0x0][0x148], 0x2; /* 0x4c18810005270402 */
/*00b0*/ { IADD.X R3, R5, c[0x0][0x14c]; /* 0x4c10080005370503 */
/*00b8*/ F2F.F32.F32.FLOOR R5, R0; } /* 0x5ca8048000070a05 */ // path 2
/* 0x001ec400f62007f0 */
/*00c8*/ { IADD R9.CC, R9, c[0x0][0x150]; /* 0x4c10800005470909 */
/*00d0*/ LDG.E.CI R4, [R2]; } /* 0xeed4a00000070204 */
/*00d8*/ LDG.E.CI R6, [R2+0x4]; /* 0xeed4a00000470206 */
/* 0x001f9800ffa00ff6 */
/*00e8*/ FADD R5, R0, -R5; /* 0x5c58200000570005 */ // R5 everywhere
/*00f0*/ FADD R5, R5, -0.5; /* 0x3958003f00070505 */
/*00f8*/ FADD.SAT R5, R5, R5; /* 0x5c5c000000570505 */
/* 0x141fc400fea007f1 */
/*0108*/ FMUL R7, R5, R5; /* 0x5c68000000570507 */
/*0110*/ IADD.X R5, R8, c[0x0][0x154]; /* 0x4c10080005570805 */
/*0118*/ FFMA R8, R4, -R7.reuse, R4; /* 0x5981020000770408 */
/* 0x001fc400fe4007f5 */
/*0128*/ MOV R4, R9; /* 0x5c98078000970004 */
/*0130*/ FFMA R6, R6, R7, R8; /* 0x5980040000770606 */
/*0138*/ STG.E [R4], R6; /* 0xeedc200000070406 */
/* 0x001f8000ffe007ff */
/*0148*/ EXIT; /* 0xe30000000007000f */
/*0150*/ BRA 0x150; /* 0xe2400fffff87000f */
/*0158*/ NOP; /* 0x50b0000000070f00 */
/* 0x001f8000fc0007e0 */
/*0168*/ NOP; /* 0x50b0000000070f00 */
/*0170*/ NOP; /* 0x50b0000000070f00 */
/*0178*/ NOP; /* 0x50b0000000070f00 */
..................................
The progam contains two independent pathes, diverged from the fraction index, and merged at final lerp.
Every path consists of a long chain of back-to-back dependent instructions. Those two pathes are totally independent and could be mixed or interleaved to alleviate the dependency problem. But the compiler utilized one registers heavily (R4 for path 1, and R5 for path 2) to build almost all the instructions in that path.
I tried to rearrange the source code to help the compiler to interleave those instructions:
__global__ void doLerp_Mix(const float* __restrict__ x, const float* __restrict__ y, float* __restrict__ v)
{
unsigned int tid = threadIdx.x;
float s = x[tid];
s = max(min(s, 0.0f), 1022.99f);
int is = int(s) / 4 + 1;
float qs = s - floorf(s) - 0.5f;
float v0 = y[is];
qs = __saturatef(qs * 2.0f);
float v1 = y[is + 1];
qs = qs*qs;
// merge again
v[tid] = fma(qs, v1, fma(qs, -v0, v0));
}
But the generated code does not seem to change at all:
Function : _Z10doLerp_MixPKfS0_Pf
.headerflags @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
/* 0x083fc400e3e007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ S2R R0, SR_TID.X; /* 0xf0c8000002170000 */
/*0018*/ SHL R9, R0.reuse, 0x2; /* 0x3848000000270009 */
/* 0x001fc800fec007f5 */
/*0028*/ SHR.U32 R8, R0, 0x1e; /* 0x3828000001e70008 */
/*0030*/ IADD R2.CC, R9, c[0x0][0x140]; /* 0x4c10800005070902 */
/*0038*/ IADD.X R3, R8, c[0x0][0x144]; /* 0x4c10080005170803 */
/* 0x001fc820fcc007b5 */
/*0048*/ LDG.E.CI R2, [R2]; /* 0xeed4a00000070202 */
/*0050*/ FMNMX R0, RZ, R2, PT; /* 0x5c6003800027ff00 */
/*0058*/ FMNMX R0, R0, c[0x2][0x0], !PT; /* 0x4c60078800070000 */
/* 0x001fd801fec0071d */
/*0068*/ F2I.S32.F32.TRUNC R4, R0; /* 0x5cb0018000071a04 */
/*0070*/ SHR R5, R4, 0x1f; /* 0x3829000001f70405 */
/*0078*/ LEA.HI R4, R5, R4, RZ, 0x2; /* 0x5bdf7f8020470504 */
/* 0x001f8400fec007f6 */
/*0088*/ SHR R4, R4, 0x2; /* 0x3829000000270404 */
/*0090*/ IADD32I R4, R4, 0x1; /* 0x1c00000000170404 */
/*0098*/ SHR R5, R4, 0x1e; /* 0x3829000001e70405 */
/* 0x001c4800fe0007f6 */
/*00a8*/ ISCADD R2.CC, R4, c[0x0][0x148], 0x2; /* 0x4c18810005270402 */
/*00b0*/ { IADD.X R3, R5, c[0x0][0x14c]; /* 0x4c10080005370503 */
/*00b8*/ F2F.F32.F32.FLOOR R5, R0; } /* 0x5ca8048000070a05 */
/* 0x001ec400f62007f0 */
/*00c8*/ { IADD R9.CC, R9, c[0x0][0x150]; /* 0x4c10800005470909 */
/*00d0*/ LDG.E.CI R4, [R2]; } /* 0xeed4a00000070204 */
/*00d8*/ LDG.E.CI R6, [R2+0x4]; /* 0xeed4a00000470206 */
/* 0x001f9800ffa00ff6 */
/*00e8*/ FADD R5, R0, -R5; /* 0x5c58200000570005 */
/*00f0*/ FADD R5, R5, -0.5; /* 0x3958003f00070505 */
/*00f8*/ FADD.SAT R5, R5, R5; /* 0x5c5c000000570505 */
/* 0x141fc400fea007f1 */
/*0108*/ FMUL R7, R5, R5; /* 0x5c68000000570507 */
/*0110*/ IADD.X R5, R8, c[0x0][0x154]; /* 0x4c10080005570805 */
/*0118*/ FFMA R8, R4, -R7.reuse, R4; /* 0x5981020000770408 */
/* 0x001fc400fe4007f5 */
/*0128*/ MOV R4, R9; /* 0x5c98078000970004 */
/*0130*/ FFMA R6, R6, R7, R8; /* 0x5980040000770606 */
/*0138*/ STG.E [R4], R6; /* 0xeedc200000070406 */
/* 0x001f8000ffe007ff */
/*0148*/ EXIT; /* 0xe30000000007000f */
/*0150*/ BRA 0x150; /* 0xe2400fffff87000f */
/*0158*/ NOP; /* 0x50b0000000070f00 */
/* 0x001f8000fc0007e0 */
/*0168*/ NOP; /* 0x50b0000000070f00 */
/*0170*/ NOP; /* 0x50b0000000070f00 */
/*0178*/ NOP; /* 0x50b0000000070f00 */
.......................................
The registers don’t worth a concern since it took only 10 registers here, far below the minimum avaiable number 32 for sm_52.It seems to me the compiler chose this type of code deliberately other than limited by the order of input sequences.
I don’t quite understand the strategy the compiler used here, but I thought there could be some benefits to interleave those two pathes, providing that several extra registers are available. I have not check it for latetest sm_75, there could be even more benefits since float and integer units can be utilized concurrently in turing. I’ve confirmed the compiler is capable of mixing independent float and integer operations for some occasions, but I don’t know what prevents the compiler doing so for this case. Probably some other performance issues? If this is the way it designed to be, would you please explain why?