I have run into a stack frame issue in my code. I wrote a simple test program to see whether the stack frame usage could be reduced by using a smaller block size, but the usage stays the same with a smaller block size. Could anyone advise how to reduce it? Here is my test code:
#include <iostream>
using namespace std;

__global__ void kernel(int *res){
    int val[32];
    for(int i = 0; i < 32; i++)
        val[i] = i;
#pragma unroll
    // for (int i = 0; i < 32; i++)
    //     val[i] += __shfl_xor_sync(0xffffffff, val[i], 16);
    for (int i = 0; i < 16; i++)
        val[i] += val[i+16];
    res[threadIdx.x] = val[threadIdx.x];
}

int main(){
    int *h_res = new int[32];
    int *d_res;
    cudaMalloc((void **)&d_res, 32*sizeof(int));
    kernel<<<1,32>>>(d_res);
    cudaMemcpy(h_res, d_res, sizeof(int)*32, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    cudaFree(d_res);
    delete [] h_res;
    return 0;
}
I compile the code with the following command:
nvcc -arch=sm_70 -maxrregcount=64 red.cu -Xptxas -v
The output of the compilation is:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6kernelPi' for 'sm_70'
ptxas info : Function properties for _Z6kernelPi
    128 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 36 registers, 360 bytes cmem[0]
Why is the fixed-size local array int val[32] (32 x 4 bytes = 128 bytes, which matches the reported stack frame) allocated on the stack? How can I force it to be placed in registers?
Thank you.