Hey there,
I’m apparently having some kind of resource problem. My ancient GPU (8600M, compute capability 1.1) refuses to launch a kernel.
First, let me mention an example where everything is fine…
[b]
number of threads: 128 * 128
blockDim[8, 16, 1]
gridDim[16, 8, 1]
compile info: Used 59 registers, 668+512 bytes lmem, 84+80 bytes smem, 184 bytes cmem[1][/b]
The kernel is quite big and each thread uses a stack to traverse a tree. That is the reason for the high number of bytes in local memory. The number of registers is embarrassingly huge, but I haven’t proceeded to the optimization stage yet. The important fact is that this configuration works just fine.
The problem comes up when I increase the total number of threads to 256 * 256.
[b]
number of threads: 256 * 256
blockDim[8, 16, 1]
gridDim[32, 16, 1]
compile info: Used 59 registers, 668+512 bytes lmem, 84+80 bytes smem, 184 bytes cmem[1][/b]
The kernel won’t run and it’s giving me this:
cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <file.cu>, line 55 : too many resources requested for launch.
I did not change anything but the total number of threads (and the gridDim, of course). Could somebody explain where the bottleneck is?
Here is one other interesting observation. If I force the compiler to push the number of used registers to 16 (using --maxrregcount=16), it suddenly works fine.
[b]
number of threads: 256 * 256
blockDim[8, 16, 1]
gridDim[32, 16, 1]
compile info: Used 16 registers, 1152+512 bytes lmem, 84+80 bytes smem, 184 bytes cmem[1][/b]
This is beyond my understanding of CUDA. I know that I have 8192 registers per multiprocessor and that they are shared by the active blocks on one multiprocessor. But I thought the total number of blocks doesn’t directly affect the register requirements. What it does affect is the total size of local memory (more threads → more stacks → more lmem), but this should not prevent the kernel from launching either, since the total size of lmem can be at most ~44 MB (668 * 256 * 256). In the --maxrregcount=16 case it would be even more, and that one runs!
I also tried to run the kernel with 512 * 512 threads in total, but I haven’t found any working configuration yet. It keeps saying this:
cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <file.cu>, line 55 : invalid configuration argument.
So I’m a bit confused. Can anybody help?
Thank you
–jan
I’m using CUDA 2.1 and I run two simple kernels before this one. Overall device memory consumption is 60/256 MB, i.e. roughly 23%.
Btw, what does the second number (after the +) in the lmem and smem output mean?