minimal example, high register usage

Consider these 3 trivial, minimal kernels. Their register usage is much higher than I expect. Why?

A:

__global__ void Kernel_A()
{  
//empty
}

corresponding ptxas output:

ptxas info    : Compiling entry function '_Z8Kernel_Av' for 'sm_20'
ptxas info    : Function properties for _Z8Kernel_Av
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 2 registers, 32 bytes cmem[0]

B:

template<uchar effective_bank_width>
__global__ void  Kernel_B()
{
//empty
}

template
__global__ void  Kernel_B<1>();

corresponding ptxas output:

ptxas info    : Compiling entry function '_Z8Kernel_BILh1EEvv' for 'sm_20'
ptxas info    : Function properties for _Z8Kernel_BILh1EEvv
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 2 registers, 32 bytes cmem[0]

C:

template<uchar my_val>
__global__ void  Kernel_C
        (uchar *const   device_prt_in, 
        uchar *const    device_prt_out)
{ 
//empty
}

corresponding ptxas output:

ptxas info    : Compiling entry function '_Z35Kernel_CILh1EEvPhS0_' for 'sm_20'
ptxas info    : Function properties for _Z35Kernel_CILh1EEvPhS0_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 48 bytes cmem[0]

Question:

Why do the empty kernels A and B use 2 registers? CUDA always uses one implicit register, but why are 2 additional explicit registers used?

Kernel C is even more frustrating. 10 registers? There are only 2 pointers, and with 64-bit addressing each 8-byte pointer occupies two 32-bit registers, giving 2*2 = 4 registers for the pointers. Even if there are additionally 2 mysterious registers (suggested by Kernel A and Kernel B), that would give 6 total. Still much less than 10!
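
As a side note, the compiled register count can also be confirmed at run time through the runtime API. A minimal sketch (error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void Kernel_A()
{
//empty
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, Kernel_A);   // query the compiled kernel
    printf("registers: %d, local mem: %zu bytes\n",
           attr.numRegs, attr.localSizeBytes);
    return 0;
}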


In case you are interested, here is the ptx code for Kernel A. The ptx code for Kernel B is exactly the same, modulo the integer values and variable names.

.visible .entry _Z8Kernel_Av(    
)
{           
        .loc 5 19 1
func_begin0:
        .loc    5 19 0

        .loc 5 19 1

func_exec_begin0:
        .loc    5 22 2
        ret;
tmp0:
func_end0:
}

And for Kernel C…

.weak .entry _Z35Kernel_CILh1EEvPhS0_(
        .param .u64 _Z35Kernel_CILh1EEvPhS0__param_0,
        .param .u64 _Z35Kernel_CILh1EEvPhS0__param_1
)
{
        .local .align 8 .b8     __local_depot2[16];
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .s64       %rd<3>;


        .loc 5 38 1
func_begin2:
        .loc    5 38 0

        .loc 5 38 1

        mov.u64         %SPL, __local_depot2;
        cvta.local.u64  %SP, %SPL;
        ld.param.u64    %rd1, [_Z35Kernel_CILh1EEvPhS0__param_0];
        ld.param.u64    %rd2, [_Z35Kernel_CILh1EEvPhS0__param_1];
        st.u64  [%SP+0], %rd1;
        st.u64  [%SP+8], %rd2;
func_exec_begin2:
        .loc    5 836 2
tmp2:
        ret;
tmp3:
func_end2:
}
  1. Why does it first declare a local-memory variable (.local)?
  2. Why are the two pointers (given as function arguments) stored in registers? Isn't there a special param space for them?
  3. Perhaps the two function argument pointers belong in registers - that explains the two .reg .b64 lines. But what is the .reg .s64 line? Why is it there?

It gets worse still:

D:

template<uchar my_val>
__global__ void  Kernel_D
        (uchar *        device_prt_in,
        uchar *const    device_prt_out)
{
    device_prt_in = device_prt_in + blockIdx.x*blockDim.x + threadIdx.x;
}

gives

ptxas info    : Used 6 registers, 48 bytes cmem[0]

So manipulating the argument (a pointer) decreases from 10 to 6 registers?

The extra 2 registers are for the ABI. There is scant detail on the ABI implementation, but I assume it’s just a basic call stack (you can roughly see that in the PTX).

You can switch off the ABI (for now) using the “-abi=no” option:

nvcc -m 32 -arch sm_35 -Xptxas=-v,-abi=no -cubin empty.cu 
ptxas : warning : 'option -abi=no' might get deprecated in future
ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function '_Z8Kernel_Av' for 'sm_35'
ptxas : info : Used 0 registers, 320 bytes cmem[0]

Also, it appears you’re using 64-bit addressing, which means your pointers are 8 bytes each.
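
For example, the per-pointer register cost follows the -m flag (assuming the same empty.cu):

nvcc -m64 -arch sm_20 -Xptxas=-v -cubin empty.cu    # 8-byte pointers: 2 registers each
nvcc -m32 -arch sm_20 -Xptxas=-v -cubin empty.cu    # 4-byte pointers: 1 register each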

What effect does “-abi=no” have on the application? Why is it there? Why bother?

(and what is ABI?)

Fermi and Kepler support indirect calling, and therefore a standard call/return stack is possible. An ABI describes a calling convention, rules for interfacing with libraries, etc.
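
To illustrate, an indirect call is one whose target is only known at run time, e.g. a call through a device function pointer (a minimal sketch with made-up names; requires sm_20 or later):

__device__ int square(int x) { return x * x; }
__device__ int cube(int x)   { return x * x * x; }

__global__ void indirect_demo(int sel, int x, int *out)
{
    // The target is chosen at run time, so the compiler must emit a real
    // (indirect) call instead of inlining or flattening it away.
    int (*f)(int) = (sel == 0) ? square : cube;
    *out = f(x);
}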

Earlier architectures only support direct function calls. Imagine having only a “goto”: a kernel and all of its functions are flattened into one instruction sequence. It sounds primitive, but many (most?) kernels (still?) fit this programming model.

The sm_35 architecture (GK110) is one step beyond sm_20 (Fermi) and sm_30 (first-generation Kepler). Now kernels can invoke other kernels (dynamic parallelism), so there is effectively a call stack outside your kernel as well as inside it.
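
A minimal dynamic-parallelism sketch (kernel names made up; requires sm_35 and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true demo.cu -lcudadevrt):

#include <cstdio>

__global__ void child()
{
    printf("child thread %d\n", threadIdx.x);
}

__global__ void parent()
{
    // A kernel launching another kernel: the call stack now extends
    // outside the parent kernel as well as inside it.
    child<<<1, 4>>>();
}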

It’s your choice (modulo what device you’re running on) what features you want to exploit.

You’re right. So -abi=no gets rid of 2 registers for my kernel.

So if I am bound by register pressure, then -abi=no is a good option for me. Why might they deprecate this in the future? Would there be a way around this?

I’ve found that the only reason to use -abi=no is if I have a fully “forceinlined” and flattened kernel that’s extremely optimized and meant to fit into a 63-register limit. Two registers isn’t much, though.
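
If the goal is just staying under a register cap, there are also the documented alternatives: the -maxrregcount=N nvcc flag, or a per-kernel __launch_bounds__ qualifier. A sketch (ptxas will spill to local memory rather than exceed the implied bound):

__global__ void
__launch_bounds__(256, 4)   // at most 256 threads/block, at least 4 blocks/SM
capped_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes grid covers data exactly
    data[i] *= 2.0f;
}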

One gotcha when debugging a kernel compiled with -abi=no is that you can’t link in printf() support. It’s good to remember that.
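
That is, even something this simple will fail to link once -abi=no is in effect, because device-side printf() relies on the ABI:

#include <cstdio>

__global__ void debug_kernel()
{
    printf("thread %d reporting\n", threadIdx.x);   // links only with the ABI enabled
}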

I hope it isn’t removed, as it’s still useful in certain extreme optimization efforts.

If it’s removed, a workaround would be to write your own compiler. :)

The ABI on the GPU serves the same purpose that ABIs serve on CPUs. Software features such as separate compilation, linking, libraries, dynamic parallelism, function calls through pointers, and numerous C++ features all require the interoperability afforded by use of the ABI.

In general, most ABIs work by setting aside some registers for particular purposes and employing a stack frame. How heavily they rely on one or the other is usually a function of how many registers are available (e.g. 32-bit x86 relies heavily on the stack frame and reserves two of the eight available registers, i.e. 25% of them). ABIs are what make modern software work, and programmers typically cannot opt out of them. The only reason the CUDA toolchain offers an option to turn off the ABI is historical.
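
Separate compilation is a concrete example: it only works because every translation unit is compiled against the same calling convention (file names made up):

nvcc -arch=sm_35 -dc kernels.cu helpers.cu     # compile to relocatable device code
nvcc -arch=sm_35 kernels.o helpers.o -o app    # nvcc device-links the objects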

My recommendation is not to turn off the ABI, in particular not for any new development.

Can adding the -abi=no option improve performance when most of the functions in my code are inlined?

MK