How to make nvcc use more registers

On an A100, I allocate 100 KB of shared memory per SM for my kernels. Here is the compile info:

ptxas info    : Compiling entry function 'gemm_reg_mma_m16n8k16_pts_256' for 'sm_80'
ptxas info    : Function properties for gemm_reg_mma_m16n8k16_pts_256
    256 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 60 registers, 24800 bytes smem, 404 bytes cmem[0]

And here is my compile command:

nvcc -o gemm.so --shared -Xcompiler -fPIC -O3 -arch=sm_80 -maxrregcount=128 --ptxas-options=-v gemm.cu

Here is the place I declare registers in my kernel:

        __half2 tmp_kernel[4][4];
        __half2 in_feats[2];
        __half2 out_feats[64];
        __half2 src;
        __half2 dst;
        __half2 result[2];

It seems that all of out_feats is placed on the stack (64 __half2 values × 4 bytes = 256 bytes, which matches the stack frame). How can I move it into registers?

I think I should have at least 68 registers left, because the kernel uses 60 and the register limit is 128. Why does the compiler put these 256 bytes on the stack instead of in registers?

In the kernel I use many intrinsics to manipulate these __half2 values, such as __halves2half2, __high2half, and __low2half, and I also do a lot of address calculation. But I still can't understand how the compiler handles my data.
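
One possible cause, which I have not confirmed for this kernel: -maxrregcount only sets an upper bound on register use, and the compiler generally places a thread-local array in local memory (the stack frame) when it cannot prove that every index into it is a compile-time constant. Below is a minimal sketch of the indexing pattern that usually lets such an array stay in registers. The kernel name unrolled_accumulate, the constant kNumOut, and the float input/output buffers are hypothetical and are not taken from gemm.cu.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>

// Hypothetical size, chosen to match out_feats[64] in the question.
constexpr int kNumOut = 64;

// Hypothetical demo kernel (not the original GEMM). Every access to
// out_feats sits in a fully unrolled loop with a compile-time trip count,
// so each index is a constant after unrolling.
__global__ void unrolled_accumulate(const float* in, float* out, int n_iters)
{
    __half2 out_feats[kNumOut];

    #pragma unroll
    for (int i = 0; i < kNumOut; ++i)
        out_feats[i] = __float2half2_rn(0.0f);

    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int n_threads = gridDim.x * blockDim.x;

    // The outer loop has a runtime trip count, but it never indexes out_feats.
    for (int it = 0; it < n_iters; ++it) {
        const __half2 v = __float2half2_rn(in[it * n_threads + tid]);
        #pragma unroll
        for (int i = 0; i < kNumOut; ++i)
            out_feats[i] = __hfma2(v, __float2half2_rn(float(i)), out_feats[i]);
    }

    #pragma unroll
    for (int i = 0; i < kNumOut; ++i)
        out[tid * kNumOut + i] = __low2float(out_feats[i]);
}

int main()
{
    const int threads = 32, blocks = 1, n_iters = 2;
    const int n_threads = threads * blocks;

    std::vector<float> h_in(n_threads * n_iters, 1.0f);
    std::vector<float> h_out(n_threads * kNumOut, 0.0f);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    unrolled_accumulate<<<blocks, threads>>>(d_in, d_out, n_iters);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);

    // With all inputs equal to 1, slot i accumulates 1 * i once per iteration.
    bool ok = true;
    for (size_t k = 0; k < h_out.size(); ++k)
        if (h_out[k] != float(n_iters) * float(k % kNumOut)) ok = false;
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

Compiling this sketch with nvcc -arch=sm_80 --ptxas-options=-v and comparing the reported stack frame against a variant that indexes out_feats with a runtime value (for example out_feats[threadIdx.x % kNumOut]) should show whether indexing is what forces the array onto the stack. If it is, the same check can be applied to each place where gemm.cu indexes out_feats.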
