regsPerBlock

I have a kernel that uses 21 registers per thread; when I try to launch a block of 385 threads to execute this kernel, cudaErrorLaunchOutOfResources is reported. On my card, the number of registers per block is 8192, so this configuration should run, as 385 * 21 = 8085 < 8192 (shared memory usage is not an issue, and this number of threads is obviously less than the maximum allowed number of threads per block). Any idea why that is? While searching the forum, I found only one reference to a similar issue (see post #3 there), but without much further insight (for me, at least)… For shared memory, I know that 16 kB is not actually fully available to threads in a block, because 256 bytes are reserved for passing kernel arguments. Is there any similar hidden usage of registers? Is there any way to actually query these numbers?

Thanks.

The registers have to be allocated for an entire warp (see section 5.2 in the Programming Guide), even if you are only actively using one thread in the warp. Since you are requesting 385 threads per block, that is 12 full warps plus 1 additional warp with only 1 active thread. However, that additional warp still needs 32 * 21 = 672 registers, bringing the total register requirement for the block up to 13 * 32 * 21 = 8736, which exceeds 8192.
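For illustration, here is that arithmetic as a tiny standalone program (just a sketch using the numbers from your post, not tied to any particular kernel):

```c
#include <stdio.h>

int main(void)
{
    /* Values assumed from the thread: 32 threads per warp, 21 registers
       per thread, 8192 registers per block on this card. */
    const int warpSize         = 32;
    const int regsPerThread    = 21;
    const int regsPerBlock     = 8192;
    const int requestedThreads = 385;

    /* Registers are allocated per warp, so round the thread count up
       to a whole number of warps before multiplying. */
    int warps      = (requestedThreads + warpSize - 1) / warpSize;  /* 13 warps */
    int regsNeeded = warps * warpSize * regsPerThread;              /* 8736 registers */

    printf("warps = %d, registers needed = %d (limit %d)\n",
           warps, regsNeeded, regsPerBlock);
    return 0;
}
```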

Can you run with 384 threads instead? That is exactly 12 warps (12 * 32 * 21 = 8064 registers), so it should fit.

Thanks for the clarification - I was actually writing a small piece of code to calculate the maximum number of threads per block at run time; now I can see how to properly incorporate the register-related limit.
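Something along these lines (a minimal sketch, not my exact code; `myKernel` is just a placeholder name, and this ignores any allocation granularity):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder kernel */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);

    // Registers are allocated per warp, so the register-limited thread
    // count must be rounded down to a whole number of warps.
    int warpLimit  = prop.regsPerBlock / (attr.numRegs * prop.warpSize);
    int maxThreads = warpLimit * prop.warpSize;
    if (maxThreads > prop.maxThreadsPerBlock)
        maxThreads = prop.maxThreadsPerBlock;

    printf("regs/thread = %d, regs/block = %d, max threads/block (register limit) = %d\n",
           attr.numRegs, prop.regsPerBlock, maxThreads);
    return 0;
}
```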

Note that you can copy the formulas from the occupancy calculator.

Thanks, good tip. I tried it immediately, but unfortunately I cannot follow the Excel formulas at all…