“too many resources requested for launch” — what exactly does it mean?

Hey there,

Apparently, I’m having sort of a problem with resources. My ancient GPU (8600M, capability 1.1) refuses to launch a kernel.

First, let me mention an example where everything is fine…

[b]

number of threads: 128 * 128

blockDim[8, 16, 1]

gridDim[16, 8, 1]

compile info: Used 59 registers, 668+512 bytes lmem, 84+80 bytes smem, 184 bytes cmem[1][/b]

The kernel is quite big and each thread uses a stack to traverse a tree. That is the reason for the high number of bytes in local memory. The number of registers is embarrassingly huge, but I still haven’t proceeded to the optimization stage. The important fact is that this configuration works just fine.
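(For context, the per-thread stack pattern looks roughly like this. This is not the actual kernel — the tree layout, the `Node` type, and `STACK_DEPTH` are all made up for illustration — but it shows why each thread carries a fixed-size array, which the compiler places in local memory on the GPU:)

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type -- the real tree layout is not shown in the thread. */
typedef struct Node {
    int value;
    struct Node *left, *right;
} Node;

/* Fixed-size per-thread stack: in a CUDA kernel this array is what
 * ends up in local memory (lmem), hence the 668 bytes per thread. */
#define STACK_DEPTH 64

/* Iterative pre-order traversal with an explicit stack instead of recursion
 * (compute 1.x hardware has no call stack for recursion). */
int sum_tree(const Node *root)
{
    const Node *stack[STACK_DEPTH];
    int top = 0, sum = 0;
    if (root) stack[top++] = root;
    while (top > 0) {
        const Node *n = stack[--top];
        sum += n->value;
        if (n->right) stack[top++] = n->right;
        if (n->left)  stack[top++] = n->left;
    }
    return sum;
}
```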

The problem comes up when I increase the total number of threads to 256 * 256.

[b]

number of threads: 256 * 256

blockDim[8, 16, 1]

gridDim[32, 16, 1]

compile info: Used 59 registers, 668+512 bytes lmem, 84+80 bytes smem, 184 bytes cmem[1][/b]

The kernel won’t run and it’s giving me this:

cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <file.cu>, line 55 : too many resources requested for launch.

I did not change anything but the total number of threads (and the gridDim, of course). Could somebody explain where the bottleneck is?

Here is one other interesting observation. If I force the compiler to push the number of used registers to 16 (using --maxrregcount=16), it suddenly works fine.

[b]

number of threads: 256 * 256

blockDim[8, 16, 1]

gridDim[32, 16, 1]

compile info: Used 16 registers, 1152+512 bytes lmem, 84+80 bytes smem, 184 bytes cmem[1][/b]

This is beyond my understanding of CUDA. I know that I have 8192 registers per multiprocessor and that they are shared by the active blocks on one multiprocessor. But I thought that the total number of blocks doesn’t directly affect the requirements on the register resources. What it does affect is the total size of local memory (more threads → more stacks → more lmem), but this should not prevent me from executing the kernel either, since the total size of lmem can be at most ~44MB (668 * 256 * 256). In the case with --maxrregcount=16 it would be even more, and it runs!

I also tried to run the kernel with 512 * 512 threads in total, but I didn’t find any working configuration yet. It keeps saying this:

cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <file.cu>, line 55 : invalid configuration argument.

So I’m a bit confused. Can anybody help?

Thank you

–jan

I’m using CUDA 2.1, I run two simple kernels before this one. The overall consumption of device memory is 60/256MB ~ 20%.

Btw, what does the second number (after the +) in the lmem and smem output mean?

umm… no. The kernel fails to launch if the number of registers required for a block to launch exceeds the number of registers available. It is there in the documentation. I don’t know if there are upper limits on local memory.

EDIT: Ok, I am not sure why the kernel fails to launch in the second case. The requirement on the number of registers still stands, but in your case you are not violating it — only one block will be active on a multiprocessor at any time. My reply is redundant.

No idea about this one

If I understand you correctly, you have 8 * 16 = 128 threads per block, each using 59 registers for a total of 7,552 registers per block.

Do you have a typo here? I assume from your comments that you mean to suggest you have 256 threads per block (e.g., blockDim[16, 16, 1]), each using 59 registers. That would be 15,104 registers which exceeds the 8K limit and hence will not launch.

Finally, I assume you again have a typo in the blockDim statement, but you mean that you have 256 threads per block, each using 16 registers. That makes a total of 4096, easily within the limits.
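(The arithmetic above can be checked mechanically. A tiny C sketch — assuming the 8192-register file of compute capability 1.0/1.1, as stated earlier in the thread; the check the hardware applies is per block, not per grid:)

```c
#include <assert.h>

#define REGS_PER_SM 8192  /* register file size, compute capability 1.0/1.1 */

/* Returns 1 if a single block of the given size fits into the register
 * file at all; grid size does not enter this check. */
int block_fits(int threads_per_block, int regs_per_thread)
{
    return threads_per_block * regs_per_thread <= REGS_PER_SM;
}
```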

Well… I did not have a typo in there (I was copying the values from the program output), but the problem was right in the next step.

I called the kernel with blockDim and gridDim swapped:

kernel<<<blockDim, gridDim>>>(…);

instead of

kernel<<<gridDim, blockDim>>>(…);

That’s the reason for all that strange behaviour.

grrrrr… In those kernels I ran before, everything is correct — just this one. That’s the punishment for not going the fast “copy&paste” way. I just typed the command because I thought I had already learned the syntax, and I made the worst typo I could ever come up with :)

Thanks a lot, CudaSpeak, and apologies to my GPU, which I have cursed so many times.