regsPerBlock

I have a kernel that uses 21 registers per thread; when I try to launch a block of 385 threads to execute this kernel, cudaErrorLaunchOutOfResources is reported. On my card, the number of registers per block is 8192, so this configuration should run, as 385 * 21 = 8085 < 8192 (shared memory usage is not an issue, and this number of threads is obviously less than the maximum allowed number of threads per block). Any idea why that is? While searching the forum, I found only one reference to a similar issue (see post #3 there), but without much further insight (for me, at least)… For shared memory, I know that 16 kB is not actually fully available to threads in a block, because 256 bytes are reserved for passing kernel arguments. Is there any similar hidden usage of registers? Is there any way to actually query these numbers?

Thanks.

The registers have to be allocated for an entire warp (see section 5.2 in the Programming Guide), even if you are only actively using one thread in the warp. Since you are requesting 385 threads per block, that is 12 full warps plus 1 additional warp with only 1 active thread. However, that additional warp still needs 32 * 21 = 672 registers, bringing the total register requirement for the block up to 13 * 32 * 21 = 8736, which exceeds 8192.
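For illustration, here is that arithmetic as a tiny standalone program (just a sketch using the numbers from your post, not tied to any particular kernel):

```c
#include <stdio.h>

int main(void)
{
    /* Values assumed from the thread: 32 threads per warp, 21 registers
       per thread, 8192 registers per block on this card. */
    const int warpSize         = 32;
    const int regsPerThread    = 21;
    const int regsPerBlock     = 8192;
    const int requestedThreads = 385;

    /* Registers are allocated per warp, so round the thread count up
       to a whole number of warps before multiplying. */
    int warps      = (requestedThreads + warpSize - 1) / warpSize;  /* 13 warps */
    int regsNeeded = warps * warpSize * regsPerThread;              /* 8736 registers */

    printf("warps = %d, registers needed = %d (limit %d)\n",
           warps, regsNeeded, regsPerBlock);
    return 0;
}
```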

Can you run with 384 threads instead? That is exactly 12 warps (12 * 32 * 21 = 8064 registers), so it should fit.

Thanks for the clarification - I was actually writing a small piece of code to calculate the maximum number of threads per block at run time; now I can see how to properly incorporate the register-related limit.
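Something along these lines (a minimal sketch, not my exact code; `myKernel` is just a placeholder name, and this ignores any allocation granularity):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder kernel */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);

    // Registers are allocated per warp, so the register-limited thread
    // count must be rounded down to a whole number of warps.
    int warpLimit  = prop.regsPerBlock / (attr.numRegs * prop.warpSize);
    int maxThreads = warpLimit * prop.warpSize;
    if (maxThreads > prop.maxThreadsPerBlock)
        maxThreads = prop.maxThreadsPerBlock;

    printf("regs/thread = %d, regs/block = %d, max threads/block (register limit) = %d\n",
           attr.numRegs, prop.regsPerBlock, maxThreads);
    return 0;
}
```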

Note that you can copy the formulas from the occupancy calculator.

Thanks, good tip. I tried it immediately, but unfortunately I cannot follow the Excel formulas at all…