Hi,
I have a general problem: a kernel that uses too many registers (>20) to achieve high occupancy.
To get some grip on register allocation/usage, I'm looking at the .ptx file for a very simple kernel:
[codebox]__global__ void CudaSqrtKernel(const float *p_CudaArrayIn,
                               float *p_CudaArrayOut,
                               const int p_Size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < p_Size)
    {
        p_CudaArrayOut[idx] = sqrt(p_CudaArrayIn[idx]);
    }
}
[/codebox]
The .cubin file (and the profiler) shows that only 3 registers are used, which makes sense.
But… when I look at the .ptx code, it seems many more are used, %r1 through %r9 for instance…
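(For reference, I'm reading the register count from ptxas' verbose output; the source file name and -arch value below are just examples for my setup:)

[codebox]# Ask ptxas to report per-kernel resource usage
# (registers, shared/constant memory) at compile time.
nvcc -arch=sm_13 --ptxas-options=-v -c CudaSqrtKernel.cu
# ptxas reports something like: "Used 3 registers, ..." for this kernel.[/codebox]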
[codebox] .entry _Z14CudaSqrtKernelPKfPfi (
.param .u32 __cudaparm__Z14CudaSqrtKernelPKfPfi_p_CudaArrayIn,
.param .u32 __cudaparm__Z14CudaSqrtKernelPKfPfi_p_CudaArrayOut,
.param .s32 __cudaparm__Z14CudaSqrtKernelPKfPfi_p_Size)
{
.reg .u16 %rh<4>;
.reg .u32 %r<11>;
.reg .f32 %f<4>;
.reg .pred %p<3>;
.loc 14 7 0
$LBB1__Z14CudaSqrtKernelPKfPfi:
mov.u16 %rh1, %ctaid.x;
mov.u16 %rh2, %ntid.x;
mul.wide.u16 %r1, %rh1, %rh2;
cvt.u32.u16 %r2, %tid.x;
add.u32 %r3, %r2, %r1;
ld.param.s32 %r4, [__cudaparm__Z14CudaSqrtKernelPKfPfi_p_Size];
setp.le.s32 %p1, %r4, %r3;
@%p1 bra $Lt_0_1026;
.loc 14 12 0
mul.lo.u32 %r5, %r3, 4;
ld.param.u32 %r6, [__cudaparm__Z14CudaSqrtKernelPKfPfi_p_CudaArrayIn];
add.u32 %r7, %r6, %r5;
ld.global.f32 %f1, [%r7+0];
sqrt.approx.f32 %f2, %f1;
ld.param.u32 %r8, [__cudaparm__Z14CudaSqrtKernelPKfPfi_p_CudaArrayOut];
add.u32 %r9, %r8, %r5;
st.global.f32 [%r9+0], %f2;
$Lt_0_1026:
.loc 14 14 0
exit;
$LDWend__Z14CudaSqrtKernelPKfPfi:
} // _Z14CudaSqrtKernelPKfPfi
[/codebox]
It is clear that %r1 through %r4 are used for the first guard (the range check on the array). What I don't understand is why these registers are not reused for the calculation of the sqrt(); instead %r5 through %r9 are used. Or does this .ptx file show an intermediate step that will be optimized further later on?
Any help welcome,
Kind Regards,
Daniel Dekkers