Recently I ran into an out-of-resources error when launching the kernel that initializes a 2D array of random number generator states. The kernel is simple and straightforward:
#include <curand_kernel.h>

__global__ void
kernel_setup_randstates_2d(curandState *states,     // output: the array of all the random states
                           int w, int h,            // input: the width and height of the 2D field
                           unsigned long long seed  // input: the seed, by value (matches the trailing 'y' in the mangled name)
                           )
{
    int i = blockDim.y * blockIdx.y + threadIdx.y;  // row index
    int j = blockDim.x * blockIdx.x + threadIdx.x;  // column index
    if (i < h && j < w)
    {
        int id = i * w + j; // the ID of the element
        // Each thread gets the same seed, a different sequence number, no offset
        curand_init(seed, id, 0, &states[id]);
    }
}
The ptxas output for the debug and release builds is:
debug (maxrregcount = 0):
1> ptxas info : Compiling entry function '_Z26kernel_setup_randstates_2dP17curandStateXORWOWiiy' for 'sm_61'
1> ptxas info : Function properties for _Z26kernel_setup_randstates_2dP17curandStateXORWOWiiy
1> 6488 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Used 105 registers, 6488 bytes cumulative stack size, 344 bytes cmem[0]
release (maxrregcount = 0):
1> ptxas info : Compiling entry function '_Z26kernel_setup_randstates_2dP17curandStateXORWOWiiy' for 'sm_61'
1> ptxas info : Function properties for _Z26kernel_setup_randstates_2dP17curandStateXORWOWiiy
1> 6440 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Used 32 registers, 344 bytes cmem[0], 4 bytes cmem[2]
My OS is 64-bit Windows 10 and my GPU is a desktop GTX 1050 Ti.
The thread block I use has 1024 threads, dim(32, 32, 1). I can understand why the out-of-resources error occurs in the debug build: the limit is 64K registers per block, so with 1024 threads per block each thread can use at most 64 registers, and the debug kernel uses 105.
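The register arithmetic above can be written out as a small check. This is only a sketch: `max_regs_per_thread` and `print_register_budget` are hypothetical names, and it assumes that `cudaDeviceProp::regsPerBlock` is the per-block limit in question. It ignores allocation granularity and any separate per-thread cap.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Registers each thread may use if a block of `threads_per_block` threads
// has to fit into a per-block budget of `regs_per_block` registers.
constexpr int max_regs_per_thread(int regs_per_block, int threads_per_block)
{
    return regs_per_block / threads_per_block;
}

// The case from the post: 64K registers shared by a 32 x 32 block.
static_assert(max_regs_per_thread(64 * 1024, 32 * 32) == 64,
              "a 1024-thread block leaves 64 registers per thread");

// Hypothetical helper: prints the budget for device `dev` and returns it,
// or returns -1 if the device properties cannot be read.
int print_register_budget(int dev, int threads_per_block)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    int budget = max_regs_per_thread(prop.regsPerBlock, threads_per_block);
    std::printf("%s: %d registers per block, %d threads per block -> %d registers per thread\n",
                prop.name, prop.regsPerBlock, threads_per_block, budget);
    return budget;
}
```

Calling `print_register_budget(0, 32 * 32)` from host code prints the budget for the first device, which can then be compared with the "Used N registers" line from ptxas.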
What confuses me is why the out-of-resources error still occurs in the release build, where the kernel uses only 32 registers.
The other thing is the large stack frame: why does a simple function like this need roughly 6.4 KB of stack?
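One way to cross-check the ptxas numbers against what the runtime actually loaded is to query the kernel's attributes. The sketch below is not a diagnosis. `report_kernel_resources` is a hypothetical helper, the kernel is repeated only so the block is self-contained (with `seed` taken by value, as the mangled name suggests), and it assumes the stack frame shows up in `localSizeBytes`, since the stack lives in local memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void kernel_setup_randstates_2d(curandState *states, int w, int h,
                                           unsigned long long seed)
{
    int i = blockDim.y * blockIdx.y + threadIdx.y;
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < h && j < w) {
        int id = i * w + j;
        curand_init(seed, id, 0, &states[id]);
    }
}

// Hypothetical helper: prints the per-thread resources the runtime reports
// for the kernel, plus the device's current per-thread stack limit.
cudaError_t report_kernel_resources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, kernel_setup_randstates_2d);
    if (err != cudaSuccess) return err;

    size_t stack_limit = 0;
    err = cudaDeviceGetLimit(&stack_limit, cudaLimitStackSize);
    if (err != cudaSuccess) return err;

    std::printf("registers per thread    : %d\n", attr.numRegs);
    std::printf("local memory per thread : %zu bytes\n", attr.localSizeBytes);
    // Largest block size the runtime will accept for this kernel on this device.
    std::printf("max threads per block   : %d\n", attr.maxThreadsPerBlock);
    std::printf("stack limit per thread  : %zu bytes\n", stack_limit);
    return cudaSuccess;
}
```

If `maxThreadsPerBlock` comes back below 1024, a 32 x 32 launch of this kernel would be expected to fail regardless of what the register count alone suggests.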
I have already found a workaround: setting -maxrregcount to a nonzero value makes the error go away. I would still like to know why.
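For comparison, a per-kernel alternative to the global -maxrregcount flag is the `__launch_bounds__` qualifier, which tells the compiler the largest block size the kernel will be launched with so it can budget registers accordingly. The sketch below is an assumption-laden illustration, not a confirmed fix for this error: `kMaxThreadsPerBlock` and `setup_randstates_2d` are hypothetical names, and `seed` is taken by value.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Block shape from the post: 32 x 32 x 1 threads.
constexpr int kMaxThreadsPerBlock = 32 * 32;

__global__ void __launch_bounds__(kMaxThreadsPerBlock)
kernel_setup_randstates_2d(curandState *states, int w, int h,
                           unsigned long long seed)
{
    int i = blockDim.y * blockIdx.y + threadIdx.y;
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < h && j < w) {
        int id = i * w + j;
        curand_init(seed, id, 0, &states[id]);
    }
}

// Hypothetical host wrapper: launches the kernel over a w x h field and
// returns the first error seen, so a failed launch is not silently ignored.
cudaError_t setup_randstates_2d(curandState *d_states, int w, int h,
                                unsigned long long seed)
{
    dim3 block(32, 32, 1);
    dim3 grid((w + block.x - 1) / block.x,   // enough blocks to cover the width
              (h + block.y - 1) / block.y,   // enough blocks to cover the height
              1);
    kernel_setup_randstates_2d<<<grid, block>>>(d_states, w, h, seed);

    // Launch-configuration problems are reported here, before the kernel runs.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Errors raised while the kernel executes are reported here.
    return cudaDeviceSynchronize();
}
```

Checking `cudaGetLastError()` right after the launch separates a rejected launch configuration from a failure inside the kernel, which helps when comparing the debug and release builds.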