ERROR: too many resources requested for launch.

Hey there,

I got the following error:

Kernel execution failed!!! in file <gpuforward .cu>, line 101 : too many resources requested for launch.

I did a Google search and found that this may be caused by the kernel requiring too many registers or too much shared memory. I believe registers are the cause, because the error goes away when I remove some of them. However, I do not understand why. I’m using a Tesla C1060, and deviceQuery shows that it has 16384 registers per multiprocessor. I set each block to 512 threads, so if each multiprocessor runs two blocks, each thread should be able to use 16384/1024 = 16 registers. But I use only 14 registers in my kernel, and it seems I can actually use only 10 of them.

Moreover, I can declare the 11th register without using it. Once I use it, I get the same error as above.

Did I misunderstand the limits on registers I can use?

Thank you very much.

Even if you used more than 16 registers/thread, the block scheduler would only run one block at a time per multiprocessor and you would still be fine.
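The register arithmetic in this exchange can be checked with a small host-side sketch. It assumes registers are the only constraint and ignores the hardware's allocation granularity; the helper name `resident_blocks_by_registers` is made up for illustration.

```cpp
// Registers per multiprocessor on the Tesla C1060, as reported by deviceQuery
// in the original post.
constexpr int kRegistersPerSM = 16384;

// Hypothetical helper: number of blocks that fit on one multiprocessor if
// registers were the only limit. Simplified model, no allocation granularity.
constexpr int resident_blocks_by_registers(int threads_per_block,
                                           int regs_per_thread) {
    return kRegistersPerSM / (threads_per_block * regs_per_thread);
}

// Compile-time checks for 512-thread blocks.
static_assert(resident_blocks_by_registers(512, 14) == 2, "two blocks fit");
static_assert(resident_blocks_by_registers(512, 17) == 1, "one block fits");
static_assert(resident_blocks_by_registers(512, 33) == 0, "no block fits");
```

In this model, going past 16 registers per thread only drops the multiprocessor from two resident blocks to one. A 512-thread block stops fitting at all only above 32 registers per thread.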

The three possible culprits for a resource error are:

  • Too large a block or grid size.
  • Too many registers.
  • Too much shared memory.

You’ve mentioned the registers, which don’t seem like a problem. And you’ve mentioned that the block size is 512 (I assume that is 1D). How big is your grid, and how much shared memory do you use?

Thank you for your reply.

Yes, the block size is 1D. The grid size is also 1D, with 360 blocks per grid.

I use 4096 bytes of shared memory. The Tesla C1060 has 16384 bytes of shared memory per multiprocessor, so I think that’s OK. I tried reducing the shared memory usage, but the error persists, so I don’t think shared memory is the problem.

When I reduce the number of threads in each block, it works fine, but it’s slower.

Are there any restrictions on the grid size?

Yes, but the grid restriction is very large (65535x65535), so you are nowhere near that either.

I’m stumped now. Your grid size, block size, register usage and shared memory usage should be well within the limits of a C1060 (and even within the range of an older C870). I have no idea what other resources CUDA could be lacking…
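As a sanity check on that conclusion, here is a minimal sketch that tests the posted configuration against the limits quoted in this thread. The struct and function names are hypothetical. The model is deliberately simplified: it ignores register allocation granularity and the shared memory taken by kernel arguments, and the 512-thread block maximum is the documented limit for this device generation rather than a number from the posts.

```cpp
// Limits for a Tesla C1060 class device (illustrative struct, not a CUDA type).
struct DeviceLimits {
    int max_threads_per_block;  // 512 on this device generation
    int max_grid_dim;           // 65535 per grid dimension
    int registers_per_sm;       // 16384
    int shared_bytes_per_sm;    // 16384
};

// Hypothetical description of one kernel launch.
struct LaunchConfig {
    int threads_per_block;
    int grid_x, grid_y;
    int regs_per_thread;
    int shared_bytes_per_block;
};

constexpr DeviceLimits kC1060 = {512, 65535, 16384, 16384};

// True if a single block of this launch fits within every limit above.
constexpr bool fits(const LaunchConfig& c, const DeviceLimits& d) {
    return c.threads_per_block <= d.max_threads_per_block
        && c.grid_x <= d.max_grid_dim
        && c.grid_y <= d.max_grid_dim
        && c.threads_per_block * c.regs_per_thread <= d.registers_per_sm
        && c.shared_bytes_per_block <= d.shared_bytes_per_sm;
}

// Numbers from the posts: 512 threads, 360 blocks, 14 registers, 4096 bytes.
constexpr LaunchConfig kPosted = {512, 360, 1, 14, 4096};
static_assert(fits(kPosted, kC1060), "posted configuration is within limits");
```

By this simplified model the posted numbers fit. That only holds if the compiled kernel really uses 14 registers per thread, which the ptxas output would confirm.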

How many arguments do you have to the failing kernel?

I have a kernel with 32 arguments that fails to launch. Reducing arguments to 30 makes it launch successfully.

I limit the number of registers to 64, shared memory is within the limits, and so are the grid and block sizes. The only strange thing is the use of cmem[13]?! Here’s the ptxas output:

ptxas info : Used 64 registers, 320+0 bytes lmem, 240+16 bytes smem, 4084 bytes cmem[0], 140 bytes cmem[1], 20 bytes cmem[13]

The exact error is:

Cuda call at test.cu:1032 fails with: 7: too many resources requested for launch

Which resource is that?

The limit on the arguments is 256 bytes (from the programming guide: “__global__ function parameters are currently passed via shared memory to the device and limited to 256 bytes”), 16 of which might be the execution configuration (in some packed form?!), so we are left with 240 bytes for the actual arguments. Assuming each argument takes 8 bytes (e.g. sizeof(int*) is 8 on a 64-bit machine), it follows that we can pass only 30 arguments when most of them are pointers. This is just educated speculation.

My target device is a Tesla C1060. CUDA toolkit version 2.3, driver version 190.16, running on Fedora 9, Linux kernel version 2.6.27.25-78.2.56.fc9.x86_64, 64-bit.

Edin Hodzic

According to the decuda output, the formal parameters of a kernel function start at offset 0x0010 in shared memory.

Offsets 0x0000 to 0x000f (16 bytes) contain 8 objects of 2 bytes each:

blockDim.x, blockDim.y, blockDim.z

gridDim.x, gridDim.y, gridDim.z

blockIdx.x, blockIdx.y
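The layout described here can be pictured as a 16-byte header followed by the kernel parameters. The struct below is purely illustrative: its name, field names, and field order are assumptions, and only the count and size of the entries come from the decuda observation. The 256-byte total is the figure quoted from the programming guide earlier in the thread.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical picture of shared memory offsets 0x0000..0x000f.
// Field order is assumed; the post only lists which values are present.
struct LaunchHeader {
    std::uint16_t block_dim[3];  // blockDim.x, blockDim.y, blockDim.z
    std::uint16_t grid_dim[3];   // gridDim.x, gridDim.y, gridDim.z
    std::uint16_t block_idx[2];  // blockIdx.x, blockIdx.y
};

static_assert(sizeof(LaunchHeader) == 16, "eight 2-byte entries");

// Parameters start right after the header, at offset 0x0010.
constexpr std::size_t kFirstParamOffset = sizeof(LaunchHeader);

// If the header counts against the 256-byte limit, as speculated earlier
// in the thread, this much space is left for arguments.
constexpr std::size_t kParamBytesLeft = 256 - kFirstParamOffset;
```

Under that reading, 240 bytes remain for the arguments, which matches the earlier estimate.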

Very good.

It is also interesting that our kernel with 32 arguments works on 32-bit XP but fails to launch on 64-bit Linux. The reason is perhaps that sizeof(void*) and __alignof(void*) are 4 on the former machine, whereas on the latter they are 8. The 32-bit machine can pack 240/4 = 60 arguments, while the 64-bit machine can pack only 240/8 = 30 (pointer) arguments to a kernel.
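The packing argument can be written out as a short sketch. It takes the 240-byte budget from this thread's speculation as given and assumes each argument is placed at the next offset that satisfies its alignment. The helper names are made up for illustration.

```cpp
#include <cstddef>

// The thread's estimate of the space left for arguments: 256 - 16 bytes.
constexpr std::size_t kArgBudget = 240;

// Round offset up to a multiple of align (align must be a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t align) {
    return (offset + align - 1) & ~(align - 1);
}

// Offset just past an argument of the given size and alignment,
// when it is appended at the current offset.
constexpr std::size_t append_arg(std::size_t offset, std::size_t size,
                                 std::size_t align) {
    return align_up(offset, align) + size;
}

// Largest number of pointer-only arguments that fit in the budget.
constexpr std::size_t max_pointer_args(std::size_t pointer_size) {
    return kArgBudget / pointer_size;
}

static_assert(max_pointer_args(4) == 60, "32-bit pointers");
static_assert(max_pointer_args(8) == 30, "64-bit pointers");

// A 4-byte int followed by a pointer: padding appears only when the
// pointer needs 8-byte alignment.
static_assert(append_arg(append_arg(0, 4, 4), 4, 4) == 8, "32-bit layout");
static_assert(append_arg(append_arg(0, 4, 4), 8, 8) == 16, "64-bit layout");
```

With 8-byte pointers, 32 pointer arguments need 256 bytes and overflow the assumed budget, while 30 need exactly 240. Alignment padding between mixed-size arguments can reduce the usable count further.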

Edin Hodzic

I guess the PTX code is further optimized before it’s executed, which may be increasing your register usage by 1 or 2. Try using the maxrregcount option to restrict register usage to 11. Do you still get the error?
