My kernel is returning the “too many resources requested for launch” error, but my resources seem to be within limits. In the Occupancy Calculator, I entered:
Compute Capability: 2.0
Shared Memory Size Config: 49152
Threads Per Block: 672
Registers Per Thread: 48
Shared Memory Per Block: 32272 (dynamically allocated)
The Occupancy Calculator displays 32256 registers and 32384 bytes of shared memory, which are OK for my device, but under the section “Maximum Thread Blocks Per Multiprocessor”, two lines are highlighted in red:
Limited by Registers per Multiprocessor: 1
Limited by Shared Memory per Multiprocessor: 1
Sorry if this is a dumb question, but isn’t 1 block per MP OK? If so, the only other reason that the kernel is failing (that I can think of) is that the device is somehow selecting the smaller 16384 byte shared memory configuration. Thanks for any suggestions.
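In case it helps anyone hitting the same thing: on compute capability 2.0 the shared memory/L1 split can be requested per kernel with cudaFuncSetCacheConfig, so you don't have to rely on whatever configuration the driver picks. A minimal sketch, assuming a kernel named myKernel (hypothetical, standing in for the real one) and the 672-thread / 32272-byte launch from the post:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the one in the question.
__global__ void myKernel(float *data)
{
    extern __shared__ float smem[];  // dynamically sized shared memory
    if (threadIdx.x == 0) smem[0] = 0.0f;
}

int main()
{
    // Request the 48 KB shared / 16 KB L1 configuration for this kernel
    // before launching it, rather than relying on the driver's choice.
    cudaError_t err = cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    if (err != cudaSuccess)
        printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));

    float *devPtr = 0;
    cudaMalloc(&devPtr, 672 * sizeof(float));

    // 672 threads per block, 32272 bytes of dynamic shared memory,
    // matching the Occupancy Calculator entries above.
    myKernel<<<1, 672, 32272>>>(devPtr);

    // Launch-configuration errors are reported by the next error check,
    // not thrown at the launch statement itself.
    err = cudaGetLastError();
    printf("launch: %s\n", cudaGetErrorString(err));

    cudaFree(devPtr);
    return 0;
}
```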
OK, I’ll try that. I thought that the device/driver selected the shared memory/cache configuration automatically. Is there a reason that it wouldn’t do that?
This is never a problem. CUDA blocks from the same kernel are not required to all run concurrently. Blocks can remain queued until other blocks finish, at which point they are launched on SMs as they become available.
Incidentally, this feature, in combination with the current hardware requirement that active blocks run uninterrupted to completion, is one reason why CUDA does not offer a global thread synchronization construct. Such a thing would only make sense for kernels where [# of blocks] <= [# of SMs].
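Since kernel launch boundaries do act as global synchronization points, the usual workaround when you need a global barrier is to split the work into separate launches. A sketch of that pattern (the kernel names here are illustrative, not from the original post):

```cuda
#include <cuda_runtime.h>

// Illustrative two-phase computation: phase2 must see every block's
// output from phase1, so the phases go in separate launches.
__global__ void phase1(float *data) { /* produce partial results */ }
__global__ void phase2(float *data) { /* consume all partial results */ }

void run(float *devData, int numBlocks, int threadsPerBlock)
{
    phase1<<<numBlocks, threadsPerBlock>>>(devData);
    // No explicit sync is needed between launches on the same stream:
    // kernels on one stream execute in order, so phase2 starts only
    // after every block of phase1 has finished.
    phase2<<<numBlocks, threadsPerBlock>>>(devData);
}
```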
Seibert, thanks for clearing that up. Are there resources other than the number of registers and the amount of shared memory that could be causing my launch failure?
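One thing worth checking: the compiled kernel's actual resource usage can differ from what you typed into the calculator (the compiler may use more registers, spill to local memory, or cap the block size). The runtime API cudaFuncGetAttributes reports the as-compiled numbers; a sketch, with myKernel again standing in for the real kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the one in the question.
__global__ void myKernel(float *data) { }

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Registers and static shared memory as actually compiled, plus two
    // limits that can also fail a launch: local memory per thread and
    // the maximum threads per block for this particular kernel.
    printf("numRegs            = %d\n", attr.numRegs);
    printf("sharedSizeBytes    = %zu\n", attr.sharedSizeBytes);
    printf("localSizeBytes     = %zu\n", attr.localSizeBytes);
    printf("maxThreadsPerBlock = %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```

If maxThreadsPerBlock comes back smaller than 672, that by itself explains a "too many resources requested for launch" error.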