too many resources requested for launch

My kernel is returning the “too many resources requested for launch” error, but my resources seem to be within limits. In the Occupancy Calculator, I entered:

1.) Compute Capability: 2.0

1.b) Shared Memory Size Config: 49152

2.) Threads Per Block: 672

Registers Per Thread: 48

Shared Memory Per Block: 32272 (which is dynamically allocated)

The Occupancy Calculator displays 32256 registers and 32384 bytes of shared memory, which are OK for my device, but under the section “Maximum Thread Blocks Per Multiprocessor”, two lines are highlighted in red:

Limited by Registers per Multiprocessor: 1

Limited by Shared Memory per Multiprocessor: 1

Sorry if this is a dumb question, but isn’t 1 block per MP OK? If so, the only other reason that the kernel is failing (that I can think of) is that the device is somehow selecting the smaller 16384 byte shared memory configuration. Thanks for any suggestions.

is your #block > #SM?

If I understand you, the kernel runs with 8 blocks, and my GTX 480 has 15 streaming multiprocessors. Why is that important?

Well, then that’s not the problem. But if two blocks were squeezed onto the same SM, the kernel might not launch.

Have you tried using cudaThreadSetCacheConfig or cudaFuncSetCacheConfig to explicitly set the cache config?
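Something along these lines (sketch only — the kernel body, grid size, and buffer size are placeholders; the dynamic shared memory amount matches your numbers), with the return codes checked so a failure shows up immediately:

```cuda
#include <cstdio>

__global__ void myKernel(float *data) {  // placeholder kernel body
    data[threadIdx.x] += 1.0f;
}

int main() {
    // Request the 48 KB shared / 16 KB L1 split for this kernel.
    cudaError_t err = cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    if (err != cudaSuccess)
        printf("set cache config: %s\n", cudaGetErrorString(err));

    float *d_data;
    cudaMalloc(&d_data, 672 * sizeof(float));

    // Dynamic shared memory goes in as the third launch-config argument.
    myKernel<<<8, 672, 32272>>>(d_data);

    // "too many resources requested for launch" is reported here.
    err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```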

OK, I’ll try that. I thought that the device/driver selected the shared memory/cache configuration automatically. Is there a reason that it wouldn’t do that?

cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared) didn’t make any difference – I’m still getting the insufficient resources error.

This is never a problem. CUDA blocks from the same kernel are not required to all run concurrently. Blocks can remain queued up until other blocks finish, at which point they are launched on SMs as those SMs become available.

Incidentally, this feature, in combination with the current hardware requirement that active blocks run uninterrupted to completion, is one reason why CUDA does not offer a global thread synchronization construct. Such a thing would only make sense for kernels where [# of blocks] <= [# of SMs].

Seibert, thanks for clearing that up. Are there resources other than the number of registers and the amount of shared memory that could be causing my launch failure?

Are you compiling with the -arch=sm_20 flag? (Not sure if that is required, but it could be, given the register needs of your kernel.)
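You can also have ptxas report the actual per-kernel resource usage, which would confirm whether 48 registers/thread is what the compiler really produced (the file and output names here are placeholders):

```shell
# Target compute capability 2.0 explicitly and report resource usage.
# ptxas prints lines like "Used 48 registers" and the smem byte count.
nvcc -arch=sm_20 --ptxas-options=-v -o app kernel.cu
```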
