Finding .local variables from PTX: is there a way? Please help

Is there a way, by looking at the PTX, to tell which variable has been placed in local memory?

After inspecting the PTX, I understood that .local is how a local-memory variable is declared, but it doesn't tell me which source variable it corresponds to. Is there any way I can find this out?

Thanks very much, all.

  • Adding to the above: is there any way to restrict local-memory usage by making the compiler use more registers? My code uses 56 bytes of local memory and 31 registers. I don't mind it using more registers, just to see the performance benefit.

Can anyone please help… :unsure: ?

Thank you all…

Do you declare an array in your kernel?

If so, it is usually placed in local memory (especially when it is indexed with values not known at compile time).
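To illustrate, here is a minimal hypothetical kernel (not the poster's code) of the kind that typically forces a per-thread array into .local memory; compiling with `nvcc --ptxas-options=-v` will report the spilled bytes as "lmem" usage:

```cuda
// Illustrative sketch only -- names and sizes are made up.
__global__ void local_array_example(const float *in, float *out, int k)
{
    // A per-thread array indexed by a value unknown at compile time
    // usually cannot be kept entirely in registers, so the compiler
    // places it in .local memory.
    float buf[8];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < 8; ++i)
        buf[i] = in[tid * 8 + i];
    out[tid] = buf[k & 7];   // runtime index -> local memory
}
```

If the indexing were fully known at compile time (e.g. the loop fully unrolled with constant indices), the compiler could often promote the array elements to registers instead.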

Yup, I'm aware of pitfalls like that, but I'm not declaring any arrays: just two float4 structs, and the rest floats and ints. I want to find out which variables are going to local memory. Even though it's just 56 bytes, it's still significant, so I was wondering: is there a way to find, in the PTX, which variable corresponds to which register or .local slot?

Thanks for the help…

PTX does support .local declarations. However, these are not typically used for register spills, because PTX assumes an infinite register set and represents what a program looks like before register allocation. You would have to disassemble the .cubin file, or use a tool that actually performs register allocation at the PTX level. I am not sure whether such a tool exists…

EDIT: Actually, if anyone could use such a tool, I have been considering writing one for a little while now. It would not produce exactly the same results as nvcc, but it would probably be pretty close.

EDIT2: After re-reading your post, I don't think that knowing exactly which variable is being spilled would help very much in reducing the number of spilled variables. Most compilers try to assign every variable to a register, but there typically are not enough registers to hold all variables in the program, so the compiler tries to reuse registers intelligently. It can safely assign the same register to two or more variables as long as those variables are not live (holding a value that may eventually be used) at the same time. When too many variables are live simultaneously, the compiler runs out of registers, and it will then spill some of them to memory, choosing which so as to minimize the expected number of spill operations during execution. The point is that which variable gets spilled depends not only on what happens to that variable, but also on what the program is doing with every other variable.
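To make the live-range point concrete, here is a tiny hypothetical kernel (again, not the poster's code) in which two variables can share a single register:

```cuda
// Illustrative sketch only.
__global__ void live_range_example(float *out)
{
    float a = out[0] * 2.0f;  // 'a' becomes live here...
    out[1] = a;               // ...and dies here (its last use)
    float b = out[2] + 1.0f;  // 'b' can reuse a's register,
    out[3] = b;               // since 'a' and 'b' are never live together
}
```

If the store of `a` were moved after the computation of `b`, both would be live at once and would need separate registers; with enough such overlapping variables, some would spill to .local memory.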

For the .cubin disassembly case, you can use:

http://wiki.github.com/laanwj/decuda

Sorry to resurrect a dead post, but I recently went back and wrote a tool that performs register allocation on PTX. It is included as part of Ocelot http://code.google.com/p/gpuocelot/source/checkout (the tool is PTXOptimizer).

Here is an example of a PTX program before register allocation:

.entry _Z17k_simple_sequencePi (
		.param .u32 __cudaparm__Z17k_simple_sequencePi_A)
	{
	.reg .u16 %rh<4>;
	.reg .u32 %r<10>;
	.loc	15	12	0
$LBB1__Z17k_simple_sequencePi:
	.loc	15	14	0
	mov.u16 	%rh1, %ctaid.x;
	mov.u16 	%rh2, %ntid.x;
	mul.wide.u16 	%r1, %rh1, %rh2;
	cvt.u32.u16 	%r2, %tid.x;
	add.u32 	%r3, %r2, %r1;
	mul.lo.s32 	%r4, %r3, 2;
	add.s32 	%r5, %r4, 1;
	ld.param.u32 	%r6, [__cudaparm__Z17k_simple_sequencePi_A];
	mul.lo.u32 	%r7, %r3, 4;
	add.u32 	%r8, %r6, %r7;
	st.global.s32 	[%r8+0], %r5;
	.loc	15	15	0
	exit;
$LDWend__Z17k_simple_sequencePi:
	} // _Z17k_simple_sequencePi

and after register allocation:

.entry _Z17k_simple_sequencePi(.param  .u32 __cudaparm__Z17k_simple_sequencePi_A)
{
	.local .u8 _Zocelot_linear_scan_register_allocation_stack[20];
	.reg .u16 %r1;
	.reg .u16 %r0;
	.reg .u32 %r4;
	.reg .u32 %r2;
	.reg .u32 %r3;
	$BB_1_1:				/* $LBB1__Z17k_simple_sequencePi */
		mov.u16 %r1, %ctaid.x;
		mov.u16 %r0, %ntid.x;
		mul.wide.u16 %r4, %r1, %r0;
		st.local.u32 [_Zocelot_linear_scan_register_allocation_stack], %r4;
		cvt.u32.u16 %r2, %tid.x;
		st.local.u32 [_Zocelot_linear_scan_register_allocation_stack + 4], %r2;
		ld.reg.s32 %r2, [_Zocelot_linear_scan_register_allocation_stack + 4];
		ld.reg.s32 %r3, [_Zocelot_linear_scan_register_allocation_stack];
		add.u32 %r0, %r2, %r3;
		mul.lo.s32 %r1, %r0, 2;
		add.s32 %r3, %r1, 1;
		st.local.u32 [_Zocelot_linear_scan_register_allocation_stack + 8], %r3;
		ld.param.u32 %r2, [__cudaparm__Z17k_simple_sequencePi_A];
		st.local.u32 [_Zocelot_linear_scan_register_allocation_stack + 12], %r2;
		mul.lo.u32 %r1, %r0, 4;
		ld.reg.s32 %r2, [_Zocelot_linear_scan_register_allocation_stack + 12];
		add.u32 %r4, %r2, %r1;
		st.local.u32 [_Zocelot_linear_scan_register_allocation_stack + 16], %r4;
		ld.reg.s32 %r2, [_Zocelot_linear_scan_register_allocation_stack + 8];
		ld.reg.s32 %r3, [_Zocelot_linear_scan_register_allocation_stack + 16];
		st.global.s32 [%r3 + 0], %r2;
		exit;
}

This assumes that the maximum register count is set to 5 registers.

Woohoo :) … looks good… I will try to use this and get back. Thanks…