Why doesn't the compiler use registers to store my data?

I wrote a kernel for the A100 and planned to use fewer than 128 registers per thread (since I allocate 100 KB of smem per SM and use 25 KB per thread block, each with 4 warps). However, the compiler uses a stack frame instead of registers to store my data. Here are the declarations I expected to live in registers:

        __half2 tmp_kernel[4][4];
        __half2 in_feats[2];
        __half2 out_feats[64];
        __half2 src;
        __half2 dst;
        __half2 result[2];

Here is the compile info:

ptxas info    : Compiling entry function 'gemm_reg_mma_m16n8k16_pts_256' for 'sm_80'
ptxas info    : Function properties for gemm_reg_mma_m16n8k16_pts_256
    256 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 60 registers, 24800 bytes smem, 404 bytes cmem[0]

and here is my compile command:

nvcc -o gemm.so --shared -Xcompiler -fPIC -O3 -arch=sm_80 -maxrregcount=128 --ptxas-options=-v gemm.cu

I find that out_feats is allocated on the stack.

Is there any way to keep this data in registers?

Can you find out whether the reason is the limit on the number of registers (even though 68 are free) or, if not that, your access patterns?

Are any of the indices not compile-time constants, or different between the threads? E.g., do they depend on threadIdx or a for-loop iteration variable? In the latter case, do you unroll the loops? Or do you take pointers to array elements?

Sometimes the reason is also parameter passing to device functions; in that case I would use __forceinline__.
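
For example (a rough sketch with made-up names, not your code): the non-inlined callee below receives the array through a plain pointer, so the caller may have to keep an addressable copy of the array on the stack frame, while the __forceinline__ version keeps all indices as compile-time constants after inlining and unrolling:

    #include <cuda_fp16.h>

    // non-inlined device function: the array decays to a pointer, so the caller
    // may need an addressable (local memory) copy of it
    __device__ void scale(__half2* v, int n) {
        for (int i = 0; i < n; ++i)
            v[i] = __hmul2(v[i], __float2half2_rn(2.0f));
    }

    // inlined device function: after inlining and unrolling every index is constant,
    // so the array can stay in registers
    __device__ __forceinline__ void scale_inl(__half2 (&v)[4]) {
        #pragma unroll
        for (int i = 0; i < 4; ++i)
            v[i] = __hmul2(v[i], __float2half2_rn(2.0f));
    }

    __global__ void demo(__half2* out) {
        __half2 a[4] = {};
        scale(a, 4);                        // pattern 1: non-inlined call with a pointer

        __half2 b[4] = {};
        __half2* p = &b[threadIdx.x & 3];   // pattern 2: pointer to a dynamically chosen
        *p = __float2half2_rn(1.0f);        // element; b typically goes to local memory

        __half2 c[4] = {};
        scale_inl(c);                       // inlined + unrolled: c can stay in registers

        out[threadIdx.x] = __hadd2(__hadd2(a[0], b[0]), c[0]);
    }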

I've tried setting the limit to 224, but I still get the same info. I haven't unrolled any loops.

But I have many variables that depend on threadIdx, like below:

  __half2 out_feats[64];
  int transfer_pal = (threadIdx.x % 2 == 1) ? (threadIdx.x - 1) : (threadIdx.x + 1);
  ......
  src = __halves2half2(((pal_idx_2 % 2 == 0) ? __high2half(out_feats[pal_idx_2 / 2]) : __low2half(out_feats[pal_idx_2 / 2])),
                       ((pal_idx_1 % 2 == 0) ? __high2half(out_feats[pal_idx_1 / 2]) : __low2half(out_feats[pal_idx_1 / 2])));
  dst = __halves2half2(((own_idx_2 % 2 == 0) ? __high2half(out_feats[own_idx_2 / 2]) : __low2half(out_feats[own_idx_2 / 2])),
                       ((own_idx_1 % 2 == 0) ? __high2half(out_feats[own_idx_1 / 2]) : __low2half(out_feats[own_idx_1 / 2])));
  
  result[0] = (threadIdx.x % 2 == 0) ? (dst) : (__shfl_sync(0xffffffff, src, transfer_pal));
  result[1] = (threadIdx.x % 2 == 1) ? (dst) : (__shfl_sync(0xffffffff, src, transfer_pal));
  ......

pal_idx_1, pal_idx_2, own_idx_1, and own_idx_2 are all related to threadIdx; they need to be calculated from data in smem and from threadIdx.

I don't understand what "Or do you take pointers to array elements?" means. Could you please give me an example? I don't understand "parameter passing to device functions" either.
When I commented out all the lines related to out_feats, the stack frame usage dropped to 9 bytes.

If pal_idx_1/2 or own_idx_1/2 depend on threadIdx, you cannot use registers. The register indices have to be known at compile time and have to be the same for all threads; they are directly encoded into the instruction. Also, there are 2 (?) register banks, and ptxas has to take care that they are not accessed at the same time (it could also be that concurrent access is possible, but slows things down).

Either rewrite the algorithm so the indices no longer depend on threadIdx,
or use __shfl_sync to load by index (using thread indices 0…31),
or use shared memory (even if you are already limited),
or use global or local memory and rely on the L1 cache (which is difficult to do in a coalesced way if all threads have different indices),
or try to rework the algorithm to use the tensor cores for resorting the data.
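
To illustrate the point above about compile-time register indices, here is a made-up sketch (not your kernel):

    __global__ void idx_demo(const int* in, int* out) {
        int a[4] = {1, 2, 3, 4};

        int x = a[2];                    // constant index: a[2] is simply one register

        int i = in[threadIdx.x] & 3;     // index only known at run time and per thread:
        int y = a[i];                    // there is no way to encode "register number i"
                                         // in the instruction, so the whole array is
                                         // typically placed in local memory instead
        out[threadIdx.x] = x + y;
    }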

I don't understand the suggestion about shfl.sync() yet. I used to believe that it can work without local memory as long as the identifier passed to the function is determined, so I moved the data into the determined identifier src.
In my opinion, src is a determined register; it just has different values at different times.
Do you mean that the index must be determined, not just src?

You do not have to use __shfl_sync. It is a tool to get data by dynamic indices:

You have 32 src values, one for each of the 32 threads of a warp.
With __shfl_sync, each thread can dynamically select one of those values (by choosing which thread to read from).
This is possible without local memory.
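
A small sketch (made-up names) of that dynamic selection:

    __global__ void shfl_demo(const int* lane_of, float* out) {
        float src = threadIdx.x * 1.0f;        // one src value per lane, held in a register
        int from = lane_of[threadIdx.x] & 31;  // dynamically computed lane id 0..31
        float picked = __shfl_sync(0xffffffffu, src, from);
        out[threadIdx.x] = picked;             // no local memory involved
    }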

I don't get it. What confuses me most is why, if pal_idx / 2 or own_idx_1 / 2 depends on threadIdx, the compiler chooses to use local memory.
I mentioned shfl.sync() just because I guessed the dependency might conflict with the use of this function.
Do you mean that shfl.sync() will use local memory anyway?
"The register indices have to be known at compile-time and have to be the same for all threads. They are directly encoded into the instruction." I'm sorry, I still don't understand which function has this requirement.

The compiler has to be able to convert an array into individual variables.

I.e. out_feats[idx] into
half2 out_feats0, out_feats1, out_feats2, ...

If this is not possible, it has to use local memory.
The selected variable also has to be the same for each thread.

If you use a calculated index for idx and the compiler cannot reduce it to a constant number, it cannot point to a specific variable.

Whereas with shfl.sync() you can load data from any of the 32 threads (your own thread or the 31 others), and you can use a dynamically calculated number to specify which thread to read from.

I get it. So it is hard to use registers to store the mma operation results, as I need to know the absolute index of the register I need each time, which can be hard, especially in sparse conditions.

Thank you so much! I will try to change my algorithm.

I still have one question about shfl.sync(). If many threads request data from the same thread, will it cost much more time than if each thread requests data from a different thread?

If your input matrix is sparse, you can know the non-zero elements beforehand or re-sort rows/columns to only use non-zero elements.

If your output matrix is sparse, I would just save everything or use an algorithm with shared memory to detect empty elements.
If your output matrix is not fully needed and you know the elements beforehand, you can try to create a hard-coded algorithm, but it will be difficult (without shared memory), as each thread has different matrix elements after mma.

No, shfl.sync() always has the same speed, regardless of whether you use it for broadcasting, each value is read exactly once, or something in between.

Thanks!

I have another question: will the following code use local memory? (Assume that there are still idle registers in the SM.)

__device__ void f(int* data) { // idx is calculated dynamically
  int A[32];
  int dst;
  // ...
  for (int i = 0; i < 16; i++) {
      if (data[i] % 2 == 0) {
          dst = __shfl_sync(0xffffffff, A[i], idx);
      } else {
          dst = __shfl_sync(0xffffffff, A[i + 16], idx);
      }
  }
}

I don't know if this can be seen as using registers directly.

Probably not, as long as you use #pragma unroll directly in front of the for loop.
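
I.e., roughly (a sketch based on your snippet, with idx passed in just for illustration):

    __device__ void f(const int* data, int idx) {
        int A[32];
        int dst = 0;
        // ... fill A ...
        #pragma unroll                     // after unrolling, A[i] and A[i + 16] become
        for (int i = 0; i < 16; i++) {     // A[0], A[16], A[1], A[17], ...: constant
            if (data[i] % 2 == 0) {        // register indices, so A can stay in registers
                dst = __shfl_sync(0xffffffff, A[i], idx);
            } else {
                dst = __shfl_sync(0xffffffff, A[i + 16], idx);
            }
        }
        // ... use dst ...
    }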

I want to write code in this way because I still want to use registers to store the values of the points' different channels. I need to store 128 points with 16 channels per warp, so I need to store 4 points with 16 channels per thread. Then, before each mma operation, I can use shfl.sync to get the value I need from a particular thread.
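
Roughly like this (just a sketch of the idea, assuming cuda_fp16.h; owner and the __hadd2 accumulation are placeholders):

    __global__ void sketch(const int* owner, __half2* out) {
        // 16 half channels = 8 __half2 registers per point; 4 points per thread,
        // 32 threads per warp -> 128 points per warp
        __half2 chans[4][8] = {};
        int src_lane = owner[threadIdx.x] & 31;   // dynamic lane, e.g. derived from smem
        __half2 acc = __float2half2_rn(0.0f);
        #pragma unroll
        for (int p = 0; p < 4; p++) {
            #pragma unroll
            for (int c = 0; c < 8; c++) {
                // p and c are compile-time constants after unrolling; only the lane
                // we read from is dynamic, so chans can stay in registers
                __half2 v = __shfl_sync(0xffffffff, chans[p][c], src_lane);
                acc = __hadd2(acc, v);            // stand-in for feeding the mma
            }
        }
        out[threadIdx.x] = acc;
    }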