Any way to reduce stack frame usage?

I ran into a stack frame issue in my code. I wrote a simple test to see whether the stack frame usage could be reduced by using a smaller block size, but the usage stays the same regardless of block size. Could anyone advise how to reduce it? Here is my test code.

#include <iostream>

using namespace std;

__global__ void kernel(int *res){
  int val[32];
  for (int i = 0; i < 32; i++)
    val[i] = i;
#pragma unroll
//  for (int i = 0; i < 32; i++)
//    val[i] += __shfl_xor_sync(0xffffffff, val[i], 16);
  for (int i = 0; i < 16; i++)
    val[i] += val[i+16];
  res[threadIdx.x] = val[threadIdx.x];
}
int main(){
  int *h_res = new int[32];
  int *d_res;
  cudaMalloc((void **)&d_res, 32*sizeof(int));
  kernel<<<1,32>>>(d_res);
  cudaMemcpy(h_res, d_res, sizeof(int)*32, cudaMemcpyDeviceToHost);
  cudaDeviceSynchronize();
  cudaFree(d_res);
  delete [] h_res;
  return 0;
}

I compile the code with the following command:
nvcc -maxrregcount=64 -arch=sm_70 red.cu -Xptxas -v
The output of the compilation is
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPi' for 'sm_70'
ptxas info    : Function properties for _Z6kernelPi
    128 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 36 registers, 360 bytes cmem[0]

Why is the static array int val[32] allocated on the stack? How can I force it into registers?

Thank you.

It is not clear why you are concerned about the relatively minor stack usage. For a small thread-local array to be mapped to registers:

(1) the array must be sufficiently small (as determined by a compiler heuristic; register pressure is often a performance concern)
(2) all indexing into the array must be compile-time constant (registers are not addressable in the manner of memory)

These are necessary, but not necessarily sufficient conditions, i.e. other restrictions may apply.

Since you have manually specified a limit of 64 registers per thread, the code already uses 36 registers per thread without optimizing val into registers, and val would require an additional 32 registers, it seems evident that the compiler may choose not to optimize val into registers: 36 + 32 = 68 > 64.

Stack frame usage (i.e. the memory allocation required to support the stack frame at kernel launch time) can be driven by the maximum active thread capacity of the GPU (2048 * number of SMs), as opposed to the dimensions of your grid or threadblock. This can be inferred from njuffa's statement here:

https://devtalk.nvidia.com/default/topic/642743/what-is-the-maximum-cuda-stack-frame-size-per-kerenl-/

and if you think about it carefully, you will see why it must be the case. For this reason, stack frame usage may not change as you change grid dimensions.

suggestions:

  1. don’t put line numbers in your code; if you need to refer to a specific line, put an inline comment there
  2. use the code formatting button </> in the tool bar at the top of the text entry window, to format your code. Just select all the code then click that button.

Here I just showed a test code. nvprof reports many local memory transactions in my original code, while nvcc does not report any spills. I guess the local memory transactions must come from the stack frame. That is why I want to put the array in registers.

Thank you for your reply. Even if I set the maxrregcount option to 100, I still see stack frame usage. Here is the report from the new compilation, run with

nvcc -maxrregcount=100 -arch=sm_70 red.cu -Xptxas -v -O3
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPi' for 'sm_70'
ptxas info    : Function properties for _Z6kernelPi
    128 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 36 registers, 360 bytes cmem[0]

You mentioned in your reply that a total of 68 registers would be used if the val array were allocated in registers. I checked the cubin file of the test code: the code only uses registers up to R33. Why does nvcc report 36 registers? And why does the test code use so many registers when the static array val is allocated on the stack? Thanks again for your help.

As njuffa indicated, one of the requirements for “registerization” (is that a word?) of a stack-based array is compile-time computability/discoverability of indices. I’m reasonably sure that this statement:

res[threadIdx.x] = val[threadIdx.x];

is preventing registerization of your val array. threadIdx.x is not known at compile time, and the compiler has no way to provide an array value based on that index, except by ordinary indexing into memory.

The following modification of your code, for example, shows “registerization”:

$ cat t344.cu
#include <iostream>

using namespace std;

__global__ void kernel(int *res){
  int val[32];
#pragma unroll 32
  for (int i = 0; i < 32; i++)
    val[i] = i;
#pragma unroll 16
  for (int i = 0; i < 16; i++)
    val[i] += val[i+16];
#pragma unroll 32
  for (int i = 0; i < 32; i++)
    res[threadIdx.x+i] = val[i];
}
int main(){
  int *h_res = new int[64];
  int *d_res;
  // 64 ints: with 32 threads, res[threadIdx.x+i] reaches index 62
  cudaMalloc((void **)&d_res, 64*sizeof(int));
  kernel<<<1,32>>>(d_res);
  cudaMemcpy(h_res, d_res, sizeof(int)*64, cudaMemcpyDeviceToHost);
  cudaDeviceSynchronize();
  cudaFree(d_res);
  delete [] h_res;
  return 0;
}
$ nvcc -arch=sm_70 -Xptxas=-v -c t344.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPi' for 'sm_70'
ptxas info    : Function properties for _Z6kernelPi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 19 registers, 360 bytes cmem[0]
$

The fact that the register usage is not even 32 indicates that there is more than simple “registerization” happening here: the compiler is presumably reordering code and reusing registers to get the register count down to 19. You could probably confirm this with SASS analysis and the register “liveness” view provided by nvdisasm:

https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html

In fact, there may be wholesale collapsing of code functionality: the output of the above code is easily precomputed at compile time. The following slight modification to the initialization of the data brings the register count up to 40:

$ cat t344.cu
#include <iostream>

using namespace std;

__global__ void kernel(int *res, int *off){
  int val[32];
#pragma unroll 32
  for (int i = 0; i < 32; i++)
    val[i] = i + off[i];
#pragma unroll 16
  for (int i = 0; i < 16; i++)
    val[i] += val[i+16];
#pragma unroll 32
  for (int i = 0; i < 32; i++)
    res[threadIdx.x+i] = val[i];
}
int main(){
  int *h_res = new int[64];
  int *d_res;
  // 64 ints: with 32 threads, res[threadIdx.x+i] reaches index 62
  cudaMalloc((void **)&d_res, 64*sizeof(int));
  kernel<<<1,32>>>(d_res, d_res);
  cudaMemcpy(h_res, d_res, sizeof(int)*64, cudaMemcpyDeviceToHost);
  cudaDeviceSynchronize();
  cudaFree(d_res);
  delete [] h_res;
  return 0;
}
$ nvcc -arch=sm_70 -Xptxas=-v -c t344.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPiS_' for 'sm_70'
ptxas info    : Function properties for _Z6kernelPiS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 40 registers, 368 bytes cmem[0]
$

I suppose one possibility is that the compiler (ptxas) may have a register allocation granularity, which may be arch-dependent.

Thank you, Robert, for your insightful explanation.