Threads per block equation

Page 62 of the CUDA manual states that the number of registers available per thread is equal to:

R / (B * ceil(T, 32))

If we assume B to be the minimum required for a successful execution, 1.0, and take R = 8192, then for a 384 thread execution the maximum number of registers per thread would be: 8192 / (1.0 * ceil(384, 32)) = 8192 / 416 = 19.

However, I have a kernel that uses 20 registers (according to the .cubin) which executes fine with 384 threads. Why is this? I have heard mention of rounding the number of registers up to a multiple of four, but I’m not sure how this fits in.

(I’m trying to construct a heuristic to determine the maximum number of threads per block for a given kernel.)

Mmm, are you sure that ceil(384,32) is 416? I would say it is 384, since 384 is already a multiple of 32 (12 * 32). That gives a register max. of 8192 / 384 = 21 (rounded down), in accordance with both what the occupancy calculator says and your experience :D
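For reference, here is a minimal sketch of that arithmetic in C. It assumes ceil(T, 32) means "T rounded up to the next multiple of 32", and it uses the 8192 registers and B = 1 from the first post. The function names are made up for illustration and do not come from the manual.

```c
#include <assert.h>

/* Round x up to the next multiple of m (assumed meaning of ceil(x, m)). */
int round_up_to_multiple(int x, int m)
{
    assert(x >= 0 && m > 0);
    return ((x + m - 1) / m) * m;
}

/* Registers available per thread, R / (B * ceil(T, 32)).
   Integer division rounds the result down. */
int max_regs_per_thread(int total_regs, int blocks, int threads_per_block)
{
    assert(total_regs >= 0 && blocks > 0 && threads_per_block > 0);
    return total_regs / (blocks * round_up_to_multiple(threads_per_block, 32));
}
```

With these, round_up_to_multiple(384, 32) gives 384 and max_regs_per_thread(8192, 1, 384) gives 21. A value of 416 would only come out for T between 385 and 416.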

You can find the exact formula in the occupancy calculator. Registers are needed for an even number of threads.

This was a point of confusion. However, ceil(384,32) = 384 was even less consistent with my set of other results. >.<

Ah, thanks. I didn’t know it was in there, I’ll dig it up!

I said it wrong, I think. As far as I remember, registers are needed for an even number of warps, with a minimum of 4 warps.

Excellent, digging through the spreadsheet and rearranging a bit has given me a formula.

R = registers per thread
T = threads per block

T = min(512, floor(8192 / (16 * R), 4) * 16)
Or computationally: T = min(512, ((8192 / (16 * R)) & ~3) * 16)

And it works. :D
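The computational form above can be sketched as a small C function. This is only an illustration under the thread's assumptions (8192 registers per multiprocessor, a 512-thread block limit, and floor(x, 4) meaning "x rounded down to a multiple of 4"). The function name is hypothetical.

```c
#include <assert.h>

/* Heuristic maximum threads per block for a kernel that uses
   regs_per_thread registers (R in the formula above). */
int max_threads_per_block(int regs_per_thread)
{
    assert(regs_per_thread > 0);
    int units = 8192 / (16 * regs_per_thread); /* integer division rounds down */
    units &= ~3;                               /* round down to a multiple of 4 */
    int threads = units * 16;
    return threads < 512 ? threads : 512;      /* cap at 512 threads per block */
}
```

For the 20-register kernel from the first post this returns 384, and for 22 registers it drops to 320. Below the 512 cap the result is always a multiple of 64 threads, that is, an even number of warps.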