Page 62 of the CUDA manual states that the number of registers available per thread is equal to:

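R / (B * ceil(T, 32))

where R is the total number of registers per multiprocessor, B is the number of active blocks per multiprocessor, and T is the number of threads per block.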

If we assume B to be 1, the minimum required for the kernel to run at all, and R to be 8192, then for a block of 384 threads the maximum number of registers per thread would be: 8192 / (1 * ceil(384, 32)) = 8192 / 416 = 19 (rounded down).

However, I have a kernel that uses 20 registers (according to the .cubin) which executes fine with 384 threads. Why is this? I have heard mention of rounding the number of registers up to a multiple of four, but I’m not sure how this fits in.
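
One way to sanity-check the arithmetic above is to evaluate the formula in a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes that ceil(T, 32) means "T rounded up to the nearest multiple of 32". Under that reading ceil(384, 32) is 384 rather than 416, because 384 is already 12 * 32, and the limit would come out at 21 registers per thread, which a 20-register kernel fits within.

```python
def round_up(value, multiple):
    """Round value up to the nearest multiple of `multiple`.

    Assumed reading of the manual's ceil(T, 32); not taken from any CUDA API.
    """
    return ((value + multiple - 1) // multiple) * multiple


def max_registers_per_thread(total_registers, active_blocks, threads_per_block):
    """Evaluate R / (B * ceil(T, 32)), truncated to a whole number of registers."""
    return total_registers // (active_blocks * round_up(threads_per_block, 32))


if __name__ == "__main__":
    # 384 is already a multiple of 32, so it is left unchanged.
    print(round_up(384, 32))
    # Limit under the "round up to a multiple of 32" reading.
    print(max_registers_per_thread(8192, 1, 384))
    # Limit obtained when 416 is used as the divisor instead.
    print(8192 // 416)
```

The three printed values let the two readings be compared side by side. Whether register counts are additionally rounded up to a multiple of four is not modelled here.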

(I’m trying to construct a heuristic to determine the maximum number of threads per block for a given kernel.)
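
A heuristic of that kind could be sketched by inverting the formula: find the largest multiple of 32 threads for which the per-thread register limit is still at least the kernel's register count. The Python below is a minimal sketch under stated assumptions, not a definitive implementation: the function name and parameters are hypothetical, the defaults of 8192 registers per multiprocessor and 512 threads per block are assumed hardware limits, and shared memory usage and any register allocation granularity (such as rounding up to a multiple of four) are ignored.

```python
def max_threads_per_block(regs_per_thread, total_registers=8192,
                          active_blocks=1, hw_thread_limit=512, warp_size=32):
    """Largest multiple of warp_size T with regs_per_thread * B * T <= R.

    All default values are assumptions about the target hardware; the result
    is capped at hw_thread_limit.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    # Threads that fit in the register file, before rounding to whole warps.
    by_registers = total_registers // (regs_per_thread * active_blocks)
    # Round down to a whole number of warps.
    threads = (by_registers // warp_size) * warp_size
    return min(threads, hw_thread_limit)


if __name__ == "__main__":
    for regs in (10, 16, 20, 22, 32):
        print(regs, max_threads_per_block(regs))
```

With these assumed defaults, a kernel using 20 registers per thread is allowed 384 threads per block. The register count would be read from the .cubin, and the result should be treated as an upper bound to be checked against other resource limits.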