Hello,
I tried getting the assembly listing of a very simple kernel function (below). I’m a bit baffled by the register count and usage. Unfortunately, I haven’t found anything resembling a reference so I’m guessing here. On the other hand, the assembly’s function isn’t difficult to guess at, even for a complete beginner like me. More complicated are the declarations.
Now then, my guess is: the code has a total of 7 registers, 3 for 32 bit and 4 for 64 bit (plus perhaps place for the arguments? I can’t discern this but the [font=“Lucida Console”].reg[/font] declarations suggest additional registers). Is this correct? If yes, why? Isn’t [font=“Lucida Console”]ptxas[/font] capable of optimizing this simple code (notice: optimization is set to -O3!). Is it faster not recycle registers? Also, is there really no direct conversion from [font=“Lucida Console”]u16[/font] to [font=“Lucida Console”]u64[/font]?
Here’s the code and the relevant PTX:
__global__ void kernel(int* data) {
data[threadIdx.x] *= 2;
}
.entry __globfunc__Z6kernelPi
{
.reg .u32 %r<5>;
.reg .u64 %rd<6>;
.param .u64 __cudaparm___globfunc__Z6kernelPi_data;
.loc 16 9 0
$LBB1___globfunc__Z6kernelPi:
.loc 16 10 0
ld.param.u64 %rd1, [__cudaparm___globfunc__Z6kernelPi_data];
cvt.u32.u16 %r1, %tid.x;
cvt.u64.u32 %rd2, %r1;
mul.lo.u64 %rd3, %rd2, 4;
add.u64 %rd4, %rd1, %rd3;
ld.global.s32 %r2, [%rd4+0];
mul.lo.s32 %r3, %r2, 2;
st.global.s32 [%rd4+0], %r3;
.loc 16 11 0
exit;
$LDWend___globfunc__Z6kernelPi:
} // __globfunc__Z6kernelPi
Basically, what speaks against the following hypothetical code? I.e. why not recycle registers that are not used any more?
ld.param.u64 %rd1, [__cudaparm___globfunc__Z6kernelPi_data];
cvt.u64.u16 %rd2, %tid.x;
mul.lo.u64 %rd2, %rd2, 4;
add.u46 %rd2, %rd1, %rd2;
ld.global.s32 %r2, [%rd2+0];
mul.lo.s32 %r2, %r2, 2;
st.global.s32 [%rd2+0], %r2;