Any way to reduce stack frame usage?

I ran into a stack frame issue in my code. I wrote a simple test to see whether the stack frame usage could be reduced by using a smaller block size, but the usage stays the same regardless of block size. Could anyone advise how to reduce it? Here is my test code.

#include <iostream>

using namespace std;

__global__ void kernel(int *res){
  int val[32];
  for (int i = 0; i < 32; i++)
    val[i] = i;
#pragma unroll
//  for (int i = 0; i < 32; i++)
//    val[i] += __shfl_xor_sync(0xffffffff, val[i], 16);
  for (int i = 0; i < 16; i++)
    val[i] += val[i+16];
  res[threadIdx.x] = val[threadIdx.x];
}
int main(){
  int *h_res = new int[32];
  int *d_res;
  cudaMalloc((void **)&d_res, 32*sizeof(int));
  kernel<<<1,32>>>(d_res);
  cudaMemcpy(h_res, d_res, sizeof(int)*32, cudaMemcpyDeviceToHost);
  cudaDeviceSynchronize();
  cudaFree(d_res);
  delete [] h_res;
  return 0;
}

I compile the code with the following command:
nvcc -maxrregcount=64 -arch=sm_70 red.cu -Xptxas -v
The output of the compilation is
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPi' for 'sm_70'
ptxas info    : Function properties for _Z6kernelPi
    128 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 36 registers, 360 bytes cmem[0]

Why is the static array int val[32] allocated on the stack? How can I force it into registers?

Thank you.

It is not clear why you are concerned about the relatively minor stack usage. For a small thread-local array to be mapped to registers:

(1) the array must be sufficiently small (as determined by a compiler heuristic; register pressure is often a performance concern)
(2) all indexing into the array must be compile-time constant (registers are not addressable in the manner of memory)

These are necessary, but not necessarily sufficient conditions, i.e. other restrictions may apply.

Since you have manually specified a limit of 64 registers per thread, the code already uses 36 registers per thread without optimizing val into registers, and val would require an additional 32 registers, it seems evident that the compiler may choose not to optimize val into registers: 36 + 32 = 68 > 64.

Stack frame usage (i.e. the memory allocation required to support the stack frame at kernel launch time) can be driven by the maximum active thread capacity of the GPU (2048 * number of SMs), as opposed to the dimensions of your grid or threadblock. This can be inferred from njuffa's statement here:

https://devtalk.nvidia.com/default/topic/642743/what-is-the-maximum-cuda-stack-frame-size-per-kerenl-/

and if you think about it carefully, you will see why it must be the case. For this reason, stack frame usage may not change as you change grid dimensions.

suggestions:

  1. don’t put line numbers in your code; if you need to refer to a specific line, put an inline comment there
  2. use the code formatting button </> in the tool bar at the top of the text entry window, to format your code. Just select all the code then click that button.

Here I just showed a test code. nvprof reports many local memory transactions in my original code, while nvcc does not report any spills. I guess the local memory transactions must come from the stack frame. That is why I want to put the array in registers.

Thank you for your reply. Even if I set the maxrregcount option to 100, I still see stack frame usage. Here is the report from the new compilation, run with

nvcc -maxrregcount=100 -arch=sm_70 red.cu -Xptxas -v -O3
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPi' for 'sm_70'
ptxas info    : Function properties for _Z6kernelPi
    128 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 36 registers, 360 bytes cmem[0]

You mentioned in your reply that a total of 68 registers would be used if the val array were allocated in registers. I checked the cubin file of the test code: the code only uses registers up to R33. Why does nvcc report 36 registers? And why does the test code use so many registers when the static array val is allocated on the stack? Thanks again for your help.

As njuffa indicated, one of the requirements for “registerization” (is that a word?) of a stack-based array is compile-time computability/discoverability of indices. I’m reasonably sure that this statement:

res[threadIdx.x] = val[threadIdx.x];

is preventing registerization of your val array. threadIdx.x is not known at compile time, and the compiler has no way to provide an array value based on that index, except by ordinary indexing into memory.

The following modification of your code, for example, shows “registerization”:

$ cat t344.cu
#include <iostream>

using namespace std;

__global__ void kernel(int *res){
  int val[32];
#pragma unroll 32
  for (int i = 0; i < 32; i++)
    val[i] = i;
#pragma unroll 16
  for (int i = 0; i < 16; i++)
    val[i] += val[i+16];
#pragma unroll 32
  for (int i = 0; i < 32; i++)
    res[threadIdx.x+i] = val[i];
}
int main(){
  int *h_res = new int[64];
  int *d_res;
  // 64 ints: with 32 threads, res[threadIdx.x+i] reaches index 62
  cudaMalloc((void **)&d_res, 64*sizeof(int));
  kernel<<<1,32>>>(d_res);
  cudaMemcpy(h_res, d_res, sizeof(int)*64, cudaMemcpyDeviceToHost);
  cudaDeviceSynchronize();
  cudaFree(d_res);
  delete [] h_res;
  return 0;
}
$ nvcc -arch=sm_70 -Xptxas=-v -c t344.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPi' for 'sm_70'
ptxas info    : Function properties for _Z6kernelPi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 19 registers, 360 bytes cmem[0]
$

The fact that the register usage is not even 32 indicates that there is more than simple “registerization” happening here: the compiler is presumably reordering code and reusing registers to get the register count down to 19. You could probably confirm this with SASS analysis and the register “liveness” view provided by nvdisasm:

https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html

In fact, there may be wholesale collapsing of code functionality: the output of the above code is easily precomputed at compile time. The following slight modification to the initialization of the data brings the register count up to 40:

$ cat t344.cu
#include <iostream>

using namespace std;

__global__ void kernel(int *res, int *off){
  int val[32];
#pragma unroll 32
  for (int i = 0; i < 32; i++)
    val[i] = i + off[i];
#pragma unroll 16
  for (int i = 0; i < 16; i++)
    val[i] += val[i+16];
#pragma unroll 32
  for (int i = 0; i < 32; i++)
    res[threadIdx.x+i] = val[i];
}
int main(){
  int *h_res = new int[64];
  int *d_res;
  // 64 ints: with 32 threads, res[threadIdx.x+i] reaches index 62
  cudaMalloc((void **)&d_res, 64*sizeof(int));
  kernel<<<1,32>>>(d_res, d_res);
  cudaMemcpy(h_res, d_res, sizeof(int)*64, cudaMemcpyDeviceToHost);
  cudaDeviceSynchronize();
  cudaFree(d_res);
  delete [] h_res;
  return 0;
}
$ nvcc -arch=sm_70 -Xptxas=-v -c t344.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPiS_' for 'sm_70'
ptxas info    : Function properties for _Z6kernelPiS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 40 registers, 368 bytes cmem[0]
$

I suppose one possibility is that the compiler (ptxas) may have a register allocation granularity, which may be arch-dependent.

Thank you, Robert, for your insightful explanation.