On an A100, I allocate 100 KB of shared memory per SM for my kernels. Here is the ptxas output from compilation:
ptxas info : Compiling entry function 'gemm_reg_mma_m16n8k16_pts_256' for 'sm_80'
ptxas info : Function properties for gemm_reg_mma_m16n8k16_pts_256
256 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 60 registers, 24800 bytes smem, 404 bytes cmem[0]
And here is my compile command:
nvcc -o gemm.so --shared -Xcompiler -fPIC -O3 -arch=sm_80 -maxrregcount=128 --ptxas-options=-v gemm.cu
Here is where I declare the per-thread local variables in my kernel, which I expect to be held in registers:
__half2 tmp_kernel[4][4];
__half2 in_feats[2];
__half2 out_feats[64];
__half2 src;
__half2 dst;
__half2 result[2];
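It seems like all of out_feats is being placed on the stack: the 256-byte stack frame is exactly 64 × sizeof(__half2) = 64 × 4 bytes. How can I get the compiler to keep it in registers instead?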
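As far as I understand, a per-thread array can only be mapped to registers when every index into it is known at compile time, since registers are not addressable. Below is a minimal sketch of the access pattern I believe satisfies that. The kernel name, its parameters, and the loop bodies are hypothetical and are not my actual kernel; only the out_feats declaration is taken from it.

```cuda
#include <cuda_fp16.h>

// Hypothetical kernel for illustration only; it is NOT the real
// gemm_reg_mma_m16n8k16_pts_256. It assumes `in` and `out` each hold
// 64 __half2 values per launched thread.
__global__ void out_feats_register_sketch(const __half2* in, __half2* out)
{
    __half2 out_feats[64];
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // The trip count is a compile-time constant, so after full unrolling
    // every out_feats[i] is a fixed index rather than a runtime one.
    #pragma unroll
    for (int i = 0; i < 64; ++i) {
        out_feats[i] = __float2half2_rn(0.0f);
    }

    // Same idea for the accumulation: no index depends on a runtime value.
    #pragma unroll
    for (int i = 0; i < 64; ++i) {
        out_feats[i] = __hadd2(out_feats[i], in[tid * 64 + i]);
    }

    // Write-back, again with constant indices into out_feats.
    #pragma unroll
    for (int i = 0; i < 64; ++i) {
        out[tid * 64 + i] = out_feats[i];
    }
}
```

If a sketch like this is compiled with the same --ptxas-options=-v flag, the "bytes stack frame" line should show whether the array still ends up in local memory. I would expect any index that depends on a runtime value (for example out_feats[threadIdx.x % 64], or a loop the compiler cannot fully unroll) to push the array back onto the stack, but I have not confirmed which access in my kernel is responsible.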