Page 62 of the CUDA manual states that the number of registers available per thread is equal to:
R / (B * ceil(T, 32))
If we take B to be 1.0, the minimum required for a successful execution, then for a 384-thread block the maximum number of registers per thread would be: 8192 / (1.0 * ceil(384, 32)) = 8192 / 416 = 19.
However, I have a kernel that uses 20 registers (according to the .cubin) which executes fine with 384 threads. Why is this? I have heard mention of rounding the number of registers up to a multiple of four, but I’m not sure how this fits in.
(I’m trying to construct a heuristic to determine the maximum number of threads per block for a given kernel.)
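Hmm, are you sure that ceil(384, 32) is 416? ceil(T, 32) rounds T up to the nearest multiple of 32, and 384 is already a multiple of 32 (12 * 32), so I would say it is 384. That gives 8192 / 384 = 21.33, i.e. a maximum of 21 registers per thread, which agrees with both the occupancy calculator and your experience :D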
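For the heuristic you mention, here is a rough Python sketch of the arithmetic above. The function names are made up for illustration. It assumes R = 8192 registers, a warp size of 32, and a hard cap of 512 threads per block (check your device's limits). It does not model any rounding of the register count to a multiple of four, so treat the result as an estimate rather than a guarantee.

```python
def ceil_to_multiple(value, multiple):
    """Round value up to the nearest multiple of `multiple` (the manual's ceil(T, 32))."""
    return ((value + multiple - 1) // multiple) * multiple


def max_registers_per_thread(total_registers, blocks, threads_per_block, warp_size=32):
    """R / (B * ceil(T, 32)), rounded down to a whole register count."""
    return total_registers // (blocks * ceil_to_multiple(threads_per_block, warp_size))


def max_threads_per_block(total_registers, registers_per_thread, blocks=1,
                          warp_size=32, hard_limit=512):
    """Largest warp-multiple T with registers_per_thread * blocks * T <= total_registers.

    hard_limit is an assumed device cap on threads per block.
    """
    threads = total_registers // (blocks * registers_per_thread)
    threads = (threads // warp_size) * warp_size  # round down to whole warps
    return min(threads, hard_limit)


if __name__ == "__main__":
    print(ceil_to_multiple(384, 32))                # 384, not 416
    print(max_registers_per_thread(8192, 1, 384))   # 21
    print(max_threads_per_block(8192, 20))          # 384
```

With these assumptions, a 20-register kernel fits in a 384-thread block, which matches what you observed.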