kernel launch error: 'too many resources requested for launch'

Recently, I ran into this out of resource error when I try to call my kernel for initializing a 2D random number generators. The kernel function is quite simple and straightforward, it is shown below:

__global__ void 
kernel_setup_randstates_2d(curandState * states,	// output:the array of all the random states
						   int w, int h,			// input: the width, height of the 2D field
						   const unsigned long long & seed	// input: the seed
						   )
{
	int i = blockDim.y*blockIdx.y + threadIdx.y;
	int j = blockDim.x*blockIdx.x + threadIdx.x;

	if (i < h && j < w)
	{
		int id = i*w + j; // the ID of the element

		// Each thread gets same seed, a different sequence number, no offset
		curand_init(seed, id, 0, &states[id]);
	}
}

The ptxas info for both the debug and release build are:

debug (maxrregcount = 0):
1> ptxas info : Compiling entry function ‘_Z26kernel_setup_randstates_2dP17curandStateXORWOWiiy’ for ‘sm_61’
1> ptxas info : Function properties for _Z26kernel_setup_randstates_2dP17curandStateXORWOWiiy
1> 6488 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Used 105 registers, 6488 bytes cumulative stack size, 344 bytes cmem[0]

release (maxrregcount = 0):
1> ptxas info : Compiling entry function ‘_Z26kernel_setup_randstates_2dP17curandStateXORWOWiiy’ for ‘sm_61’
1> ptxas info : Function properties for _Z26kernel_setup_randstates_2dP17curandStateXORWOWiiy
1> 6440 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Used 32 registers, 344 bytes cmem[0], 4 bytes cmem[2]

And my OS is windows 10 64-bit, my GPU is a desktop GTX 1050Ti.

The thread block size I used is 1024 with dim(32, 32, 1), I can totally understand why the out of resource error occur in debug build, since the maximum registers per block limit is 64K, which means I can apply at most 64 registers in the kernel given the block size is 1024, and the kernel actually applied 105 registers instead.

Which confuses me is how come the out of resource error still occur in release build, even if the register usage is only 32?

The other thing is the high usage of stack frame, why can a simple function like this take up so many bytes of stack frame?

Actually I have already found the cure to this error, is by setting -maxrregcount to a nonzero number, it works, but I just want to know why.

Can somebody help me with this problem? Please

Can somebody explain these problems to me? Please

I’ve built a simple app around the code you have shown, and have no trouble launching threadblocks of 1024 threads in release mode. So I can’t really reproduce your observations/claims and am skeptical of them.

The best guess I can come up with is that your “release” build is still indicating -G for debug compilation.

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
$ cat t136.cu
#include <curand_kernel.h>
#define CNT 1024
__global__ void
kernel_setup_randstates_2d(curandState * states,        // output:the array of all the random states
                                                   int w, int h,                        // input: the width, height of the 2D field
                                                   const unsigned long long  seed       // input: the seed
                                                   )
{
        int i = blockDim.y*blockIdx.y + threadIdx.y;
        int j = blockDim.x*blockIdx.x + threadIdx.x;

        if (i < h && j < w)
        {
                int id = i*w + j; // the ID of the element

                // Each thread gets same seed, a different sequence number, no offset
                curand_init(seed, id, 0, &states[id]);
        }
}

int main(){

  curandState *d_states;
  cudaMalloc(&d_states, CNT * sizeof(curandState));
  kernel_setup_randstates_2d<<<1,CNT>>>(d_states, 1,1, 1);
  cudaDeviceSynchronize();
}
$ nvcc -Xptxas -v -arch=sm_61 -o t136 t136.cu -lcurand
ptxas info    : 77696 bytes gmem, 72 bytes cmem[3]
ptxas info    : Compiling entry function '_Z26kernel_setup_randstates_2dP17curandStateXORWOWiiy' for 'sm_61'
ptxas info    : Function properties for _Z26kernel_setup_randstates_2dP17curandStateXORWOWiiy
    6440 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 344 bytes cmem[0], 4 bytes cmem[2]
$ cuda-memcheck ./t136
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvprof --print-gpu-trace ./t136
==26152== NVPROF is profiling process 26152, command: ./t136
==26152== Profiling application: ./t136
==26152== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
547.96ms  2.9440us              (1 1 1)      (1024 1 1)        32        0B        0B         -           -  TITAN X (Pascal         1         7  kernel_setup_randstates_2d(curandStateXORWOW*, int, int, unsigned __int64) [394]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$

By the way, unless you are using managed memory, reference parameters are illegal in a kernel prototype.

https://devtalk.nvidia.com/default/topic/546279/cuda-programming-and-performance/is-passing-by-reference-allowed-/post/3830017/#3830017

So I made that one change to your kernel code.

Thank you very much, txbob, it turns out that the kernel can actually run in release build, the reason why the result is still wrong is that I passed argument by reference, a completely different error, just like you reminded. And now I finally get the right result in release build.