Hi,
I have a general problem: a kernel that uses too many registers (>20) to achieve high occupancy.
To get some grip on register allocation/usage, I'm looking at the .ptx file for a very simple kernel:
[codebox]__global__ void CudaSqrtKernel(const float *p_CudaArrayIn,
                               float *p_CudaArrayOut,
                               const int p_Size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < p_Size)
    {
        p_CudaArrayOut[idx] = sqrt(p_CudaArrayIn[idx]);
    }
}
[/codebox]
The .cubin file (and the profiler) shows that only 3 registers are used, which makes sense.
But… when I look at the .ptx code, it seems many more are used, %r1 through %r9 for instance…
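(For reference, I'm reading the register count from ptxas' verbose output; the source file name and -arch value below are just examples for my setup:)

[codebox]# Ask ptxas to report per-kernel resource usage
# (registers, shared/constant memory) at compile time.
nvcc -arch=sm_13 --ptxas-options=-v -c CudaSqrtKernel.cu
# ptxas reports something like: "Used 3 registers, ..." for this kernel.[/codebox]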
[codebox] .entry _Z14CudaSqrtKernelPKfPfi (
.param .u32 __cudaparm__Z14CudaSqrtKernelPKfPfi_p_CudaArrayIn,
.param .u32 __cudaparm__Z14CudaSqrtKernelPKfPfi_p_CudaArrayOut,
.param .s32 __cudaparm__Z14CudaSqrtKernelPKfPfi_p_Size)
{
.reg .u16 %rh<4>;
.reg .u32 %r<11>;
.reg .f32 %f<4>;
.reg .pred %p<3>;
.loc 14 7 0
$LBB1__Z14CudaSqrtKernelPKfPfi:
mov.u16 %rh1, %ctaid.x;
mov.u16 %rh2, %ntid.x;
mul.wide.u16 %r1, %rh1, %rh2;
cvt.u32.u16 %r2, %tid.x;
add.u32 %r3, %r2, %r1;
ld.param.s32 %r4, [__cudaparm__Z14CudaSqrtKernelPKfPfi_p_Size];
setp.le.s32 %p1, %r4, %r3;
@%p1 bra $Lt_0_1026;
.loc 14 12 0
mul.lo.u32 %r5, %r3, 4;
ld.param.u32 %r6, [__cudaparm__Z14CudaSqrtKernelPKfPfi_p_CudaArrayIn];
add.u32 %r7, %r6, %r5;
ld.global.f32 %f1, [%r7+0];
sqrt.approx.f32 %f2, %f1;
ld.param.u32 %r8, [__cudaparm__Z14CudaSqrtKernelPKfPfi_p_CudaArrayOut];
add.u32 %r9, %r8, %r5;
st.global.f32 [%r9+0], %f2;
$Lt_0_1026:
.loc 14 14 0
exit;
$LDWend__Z14CudaSqrtKernelPKfPfi:
} // _Z14CudaSqrtKernelPKfPfi
[/codebox]
It is clear that %r1 through %r4 are used for the first guard (the range check on the array). What I don't understand is why these registers are not reused for the calculation of the sqrt(); instead %r5 through %r9 are used. Or does this .ptx file show an intermediate step that will be optimized further later on?
Any help welcome,
Kind Regards,
Daniel Dekkers