too many resources requested for launch

My kernel is returning the “too many resources requested for launch” error, but my resources seem to be within limits. In the Occupancy Calculator, I entered:

1.) Compute Capability: 2.0

1.b) Shared Memory Size Config: 49152

2.) Threads Per Block: 672

Registers Per Thread: 48

Shared Memory Per Block: 32272 (which is dynamically allocated)

The Occupancy Calculator displays 32256 registers and 32384 bytes of shared memory, which are OK for my device, but under the section “Maximum Thread Blocks Per Multiprocessor”, two lines are highlighted in red:

Limited by Registers per Multiprocessor: 1

Limited by Shared Memory per Multiprocessor: 1

Sorry if this is a dumb question, but isn’t 1 block per MP OK? If so, the only other reason that the kernel is failing (that I can think of) is that the device is somehow selecting the smaller 16384 byte shared memory configuration. Thanks for any suggestions.

is your #block > #SM?

If I understand you, the kernel runs with 8 blocks, and my GTX 480 has 15 streaming multiprocessors. Why is that important?

Well, then that’s not the problem. But if two blocks were squeezed onto the same SM, the kernel might not launch.

Have you tried using cudaThreadSetCacheConfig or cudaFuncSetCacheConfig to explicitly set the cache config?
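Something along these lines (sketch only — the kernel body, grid size, and buffer size are placeholders; the dynamic shared memory amount matches your numbers), with the return codes checked so a failure shows up immediately:

```cuda
#include <cstdio>

__global__ void myKernel(float *data) {  // placeholder kernel body
    data[threadIdx.x] += 1.0f;
}

int main() {
    // Request the 48 KB shared / 16 KB L1 split for this kernel.
    cudaError_t err = cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    if (err != cudaSuccess)
        printf("set cache config: %s\n", cudaGetErrorString(err));

    float *d_data;
    cudaMalloc(&d_data, 672 * sizeof(float));

    // Dynamic shared memory goes in as the third launch-config argument.
    myKernel<<<8, 672, 32272>>>(d_data);

    // "too many resources requested for launch" is reported here.
    err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```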

OK, I’ll try that. I thought that the device/driver selected the shared memory/cache configuration automatically. Is there a reason that it wouldn’t do that?

cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared) didn’t make any difference – I’m still getting the insufficient resources error.

This is never a problem. CUDA blocks from the same kernel are not required to all run concurrently. Blocks can remain queued up until other blocks finish, at which point they are launched on SMs as those SMs become available.

Incidentally, this feature, in combination with the current hardware requirement that active blocks run uninterrupted to completion, is one reason why CUDA does not offer a global thread synchronization construct. Such a thing would only make sense for kernels where [# of blocks] <= [# of SMs].

Seibert, thanks for clearing that up. Are there resources other than the number of registers and the amount of shared memory that could be causing my launch failure?

Are you compiling with the -arch=sm_20 flag? (Not sure if that is required, but it could be, given the register needs of your kernel.)
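You can also have ptxas report the actual per-kernel resource usage, which would confirm whether 48 registers/thread is what the compiler really produced (the file and output names here are placeholders):

```shell
# Target compute capability 2.0 explicitly and report resource usage.
# ptxas prints lines like "Used 48 registers" and the smem byte count.
nvcc -arch=sm_20 --ptxas-options=-v -o app kernel.cu
```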
