ptxas optimization

Konrad · January 8, 2009, 12:31pm

Hello,

I tried getting the assembly listing of a very simple kernel function (below). I’m a bit baffled by the register count and usage. Unfortunately, I haven’t found anything resembling a reference so I’m guessing here. On the other hand, the assembly’s function isn’t difficult to guess at, even for a complete beginner like me. More complicated are the declarations.

Now then, my guess is: the code has a total of 7 registers, 3 for 32-bit and 4 for 64-bit values (plus perhaps space for the arguments? I can’t discern this, but the .reg declarations suggest additional registers). Is this correct? If yes, why? Isn’t ptxas capable of optimizing this simple code (note: optimization is set to -O3!)? Is it faster not to recycle registers? Also, is there really no direct conversion from u16 to u64?

Here’s the code and the relevant PTX:

__global__ void kernel(int* data) {
	data[threadIdx.x] *= 2;
}

.entry __globfunc__Z6kernelPi
	{
	.reg .u32 %r<5>;
	.reg .u64 %rd<6>;
	.param .u64 __cudaparm___globfunc__Z6kernelPi_data;
	.loc	16  9   0
$LBB1___globfunc__Z6kernelPi:
	.loc	16  10  0
	ld.param.u64	%rd1, [__cudaparm___globfunc__Z6kernelPi_data];
	cvt.u32.u16	%r1, %tid.x;
	cvt.u64.u32	%rd2, %r1;
	mul.lo.u64	%rd3, %rd2, 4;
	add.u64	%rd4, %rd1, %rd3;
	ld.global.s32	%r2, [%rd4+0];
	mul.lo.s32	%r3, %r2, 2;
	st.global.s32	[%rd4+0], %r3;
	.loc	16  11  0
	exit;
$LDWend___globfunc__Z6kernelPi:
	} // __globfunc__Z6kernelPi

Basically, what speaks against the following hypothetical code? I.e. why not recycle registers that are not used any more?

ld.param.u64	%rd1, [__cudaparm___globfunc__Z6kernelPi_data];
cvt.u64.u16	%rd2, %tid.x;
mul.lo.u64	%rd2, %rd2, 4;
add.u64	%rd2, %rd1, %rd2;
ld.global.s32	%r2, [%rd2+0];
mul.lo.s32	%r2, %r2, 2;
st.global.s32	[%rd2+0], %r2;

seibert · January 8, 2009, 3:26pm

PTX is a low-level intermediate representation emitted by nvcc and consumed by ptxas. The PTX produced by nvcc uses static single-assignment (SSA) form, in which every virtual register is written exactly once:

Static single-assignment form - Wikipedia: http://en.wikipedia.org/wiki/Static_single_assignment_form

Final register allocation is done by ptxas when it produces a cubin.
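To make the single-assignment idea concrete, here is a rough sketch that writes the kernel from the first post a second time, with one const temporary per virtual register of the PTX listing. The names kernel_ssa, KernelFn and run_on_device are made up for this illustration and do not come from nvcc or from the thread, and the host helper assumes the input fits in a single thread block. The point is only that the number of names in such a listing says little about how many hardware registers ptxas ends up assigning.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Original kernel from the first post.
__global__ void kernel(int* data) {
    data[threadIdx.x] *= 2;
}

// Hypothetical hand-written equivalent in single-assignment style: every
// name is assigned exactly once, one name per virtual register in the PTX.
__global__ void kernel_ssa(int* data) {
    const unsigned int       r1  = threadIdx.x;  // cvt.u32.u16 %r1, %tid.x
    const unsigned long long rd2 = r1;           // cvt.u64.u32 %rd2, %r1
    const unsigned long long rd3 = rd2 * 4;      // mul.lo.u64  %rd3, %rd2, 4
    int* const rd4 =                             // add.u64     %rd4, %rd1, %rd3
        reinterpret_cast<int*>(reinterpret_cast<char*>(data) + rd3);
    const int r2 = *rd4;                         // ld.global.s32 %r2, [%rd4+0]
    const int r3 = r2 * 2;                       // mul.lo.s32  %r3, %r2, 2
    *rd4 = r3;                                   // st.global.s32 [%rd4+0], %r3
}

// Hypothetical host helper: copies the values to the device, runs one block
// with one thread per element, and returns the result (empty on any error).
// Assumption: values.size() does not exceed the per-block thread limit.
typedef void (*KernelFn)(int*);

std::vector<int> run_on_device(KernelFn fn, std::vector<int> values) {
    if (values.empty()) return values;
    const size_t bytes = values.size() * sizeof(int);
    int* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return std::vector<int>();
    bool ok = cudaMemcpy(dev, values.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        fn<<<1, static_cast<unsigned int>(values.size())>>>(dev);
        // This copy also waits for the kernel and reports launch failures.
        ok = cudaMemcpy(values.data(), dev, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok ? values : std::vector<int>();
}
```

Calling run_on_device(kernel, v) and run_on_device(kernel_ssa, v) on the same input should give the same doubled values; the seven temporaries in kernel_ssa are just names for intermediate values, which the register allocator is free to overlap once a value is no longer needed.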

Konrad · January 8, 2009, 3:45pm

Ah, thanks. If it’s in SSA then the code makes more sense.

So is there a way of getting the final register distribution short of using the decuda disassembler on the .cubin binary?

seibert · January 8, 2009, 4:56pm

If you run ptxas with -v (or nvcc with --ptxas-options=-v) then you can see the number of registers used in the final cubin. If you want to see the actual usage of the registers, you have to use decuda.
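As a sketch of how this looks in practice, assume the kernel is saved in a file called kernel.cu (a hypothetical name). The comments show the two command lines mentioned above. The function below is an addition that is not from this thread: CUDA runtime versions that provide cudaFuncGetAttributes can also report the per-thread register count from inside a program. The helper name kernel_register_count is made up, and the number it returns depends on the target architecture and compiler version.

```cuda
// Compile-time report, as described above:
//   nvcc --ptxas-options=-v -c kernel.cu
// or, on a PTX file that already exists:
//   ptxas -v kernel.ptx
#include <cstdio>
#include <cuda_runtime.h>

// Kernel from the first post.
__global__ void kernel(int* data) {
    data[threadIdx.x] *= 2;
}

// Hypothetical helper: asks the runtime for the number of registers each
// thread of `kernel` uses in the compiled binary. Returns -1 on error.
int kernel_register_count() {
    cudaFuncAttributes attr;
    const cudaError_t err = cudaFuncGetAttributes(&attr, kernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    return attr.numRegs;
}
```

A main function that prints kernel_register_count() would give a runtime cross-check of the number reported by -v. Neither method shows which register holds which value; for that a disassembler is still needed.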

Konrad · January 9, 2009, 10:11am

Again, thanks for the help. Had I used “-v” from the start, I would have seen that the actual number of registers is 2, just as predicted.
