Register usage according to the PTX file

Hi,

I have a general problem: a kernel that uses too many registers (>20) to achieve high occupancy.
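
(As a sidenote: the register count can be forced down with nvcc's --maxrregcount flag, at the cost of possible spills to local memory; a minimal sketch, with a placeholder source file name and a placeholder limit of 20:)

[codebox]# Placeholder file name and limit; caps every kernel in the file at 20 registers per thread.
# Anything that no longer fits in registers is spilled to local memory, so it is worth timing both versions.
nvcc --maxrregcount=20 --ptxas-options=-v -c kernel.cu[/codebox]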

I'm trying to get some grip on the register allocation/usage, so I'm looking at the .ptx file for a very simple kernel:

[codebox]__global__ void CudaSqrtKernel(const float *p_CudaArrayIn,
                               float       *p_CudaArrayOut,
                               const int    p_Size)
{
	// one thread per element, guarded against running past the end of the array
	int idx = (blockIdx.x * blockDim.x + threadIdx.x);

	if (idx < p_Size)
	{
		p_CudaArrayOut[idx] = sqrt(p_CudaArrayIn[idx]);
	}
}
[/codebox]

The .cubin file (and the profiler) shows that only 3 registers are used (which makes sense).
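
(For reference, the same per-kernel count is reported at compile time when ptxas runs in verbose mode; the file name is a placeholder and the output lines are only illustrative:)

[codebox]nvcc --ptxas-options=-v -c kernel.cu

# prints something like:
# ptxas info    : Compiling entry function '_Z14CudaSqrtKernelPKfPfi'
# ptxas info    : Used 3 registers, 24+16 bytes smem[/codebox]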

But… when I look at the .ptx code, it seems many more are used, for instance %r1 through %r9…

[codebox].entry _Z14CudaSqrtKernelPKfPfi (
	.param .u32 __cudaparm__Z14CudaSqrtKernelPKfPfi_p_CudaArrayIn,
	.param .u32 __cudaparm__Z14CudaSqrtKernelPKfPfi_p_CudaArrayOut,
	.param .s32 __cudaparm__Z14CudaSqrtKernelPKfPfi_p_Size)
{
	.reg .u16 %rh<4>;
	.reg .u32 %r<11>;
	.reg .f32 %f<4>;
	.reg .pred %p<3>;
	.loc	14	7	0
$LBB1__Z14CudaSqrtKernelPKfPfi:
	mov.u16 	%rh1, %ctaid.x;
	mov.u16 	%rh2, %ntid.x;
	mul.wide.u16 	%r1, %rh1, %rh2;
	cvt.u32.u16 	%r2, %tid.x;
	add.u32 	%r3, %r2, %r1;
	ld.param.s32 	%r4, [__cudaparm__Z14CudaSqrtKernelPKfPfi_p_Size];
	setp.le.s32 	%p1, %r4, %r3;
	@%p1 bra 	$Lt_0_1026;
	.loc	14	12	0
	mul.lo.u32 	%r5, %r3, 4;
	ld.param.u32 	%r6, [__cudaparm__Z14CudaSqrtKernelPKfPfi_p_CudaArrayIn];
	add.u32 	%r7, %r6, %r5;
	ld.global.f32 	%f1, [%r7+0];
	sqrt.approx.f32 	%f2, %f1;
	ld.param.u32 	%r8, [__cudaparm__Z14CudaSqrtKernelPKfPfi_p_CudaArrayOut];
	add.u32 	%r9, %r8, %r5;
	st.global.f32 	[%r9+0], %f2;
$Lt_0_1026:
	.loc	14	14	0
	exit;
$LDWend__Z14CudaSqrtKernelPKfPfi:
} // _Z14CudaSqrtKernelPKfPfi
[/codebox]

It is clear that %r1 through %r4 are used for the first guard (checking the range of the array). What I don't understand is why these registers are not re-used for the calculation of the sqrt(); instead, %r5 through %r9 are used. Or does this .ptx file show an intermediate step that will later be further optimized?

Any help welcome,

Kind Regards,

Daniel Dekkers

You just answered your own question.

… right …

And is this optimization visible somewhere in semi-human-readable form?

Use decuda on the cubin.

http://www.cs.rug.nl/~wladimir/decuda/
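
(In case it helps anyone finding this later: decuda operates on the .cubin that nvcc emits, so something along these lines, where the file names are placeholders and the exact decuda invocation may differ per version:)

[codebox]# Compile to a cubin, then disassemble the machine code inside it:
nvcc -cubin kernel.cu
decuda.py kernel.cubin[/codebox]

The disassembly shows the hardware registers that ptxas actually allocated, which is where the count of 3 (rather than the virtual %r1…%r9 in the .ptx) comes from.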