ptxas optimization

Hello,

I tried getting the assembly listing of a very simple kernel function (below). I’m a bit baffled by the register count and usage. Unfortunately, I haven’t found anything resembling a reference so I’m guessing here. On the other hand, the assembly’s function isn’t difficult to guess at, even for a complete beginner like me. More complicated are the declarations.

Now then, my guess is: the code has a total of 7 registers, 3 for 32 bit and 4 for 64 bit (plus perhaps place for the arguments? I can’t discern this but the [font=“Lucida Console”].reg[/font] declarations suggest additional registers). Is this correct? If yes, why? Isn’t [font=“Lucida Console”]ptxas[/font] capable of optimizing this simple code (notice: optimization is set to -O3!). Is it faster not recycle registers? Also, is there really no direct conversion from [font=“Lucida Console”]u16[/font] to [font=“Lucida Console”]u64[/font]?

Here’s the code and the relevant PTX:

__global__ void kernel(int* data) {

	data[threadIdx.x] *= 2;

}
.entry __globfunc__Z6kernelPi

	{

	.reg .u32 %r<5>;

	.reg .u64 %rd<6>;

	.param .u64 __cudaparm___globfunc__Z6kernelPi_data;

	.loc	16  9   0

$LBB1___globfunc__Z6kernelPi:

	.loc	16  10  0

	ld.param.u64	%rd1, [__cudaparm___globfunc__Z6kernelPi_data];

	cvt.u32.u16	 %r1, %tid.x;

	cvt.u64.u32	 %rd2, %r1;

	mul.lo.u64  %rd3, %rd2, 4;

	add.u64	 %rd4, %rd1, %rd3;

	ld.global.s32   %r2, [%rd4+0];

	mul.lo.s32  %r3, %r2, 2;

	st.global.s32   [%rd4+0], %r3;

	.loc	16  11  0

	exit;

$LDWend___globfunc__Z6kernelPi:

	} // __globfunc__Z6kernelPi

Basically, what speaks against the following hypothetical code? I.e. why not recycle registers that are not used any more?

ld.param.u64   %rd1, [__cudaparm___globfunc__Z6kernelPi_data];

cvt.u64.u16	%rd2, %tid.x;

mul.lo.u64	 %rd2, %rd2, 4;

add.u46		%rd2, %rd1, %rd2;

ld.global.s32   %r2, [%rd2+0];

mul.lo.s32	  %r2, %r2, 2;

st.global.s32   [%rd2+0], %r2;

PTX is an intermediate, low level, representation emitted by nvcc and consumed by ptxas. The PTX produced by nvcc uses static-single assignment form:

http://en.wikipedia.org/wiki/Static_single_assignment_form

Final register allocation is done by ptxas when it produces a cubin.

Ah, thanks. If it’s in SSA then the code makes more sense.

So is there a way of getting the final register distribution short of using the decuda disassembler on the [font=“Lucida Console”].cubin[/font] binary?

If you run ptxas with -v (or nvcc with --ptxas-options=-v) then you can see the number of registers used in the final cubin. If you want to see the actual usage of the registers, you have to use decuda.

Again, thanks for the help. had I used “-v” from the start I would have seen that the actual number of registers is 2, just as predicted.