Why does the profiler report more register usage than the theoretical calculation?

For example, I use the -maxrregcount option to limit each thread to 42 registers. By my calculation, two blocks of 256 threads each should consume 2 × 42 × 256 = 21504 registers, but the profiler reports 22528. Why is that?

The only reason I can think of is that the profiler uses some registers for its own counting. Is that right?

Section 4.2 of the CUDA Programming Guide gives the formula for the number of registers used by a block, which is a little more complex than simple multiplication because of the granularity of register allocation. Nevertheless, I still get the same number of registers (21504) as you computed. Could you also pass nvcc the option --ptxas-options=-v so that it prints the actual number of registers used after compilation?
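
To make the granularity point concrete, here is a minimal sketch of that kind of calculation. It assumes the thread count is rounded up to a multiple of the warp size (32) and the block's total is rounded up to a multiple of an allocation unit. The unit of 512 registers and the helper names are assumptions for illustration; the real unit depends on the device's compute capability, so check the guide for your hardware.

```python
def round_up(value, multiple):
    """Round value up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def registers_per_block(regs_per_thread, threads_per_block,
                        alloc_unit=512, warp_size=32):
    """Estimate the registers allocated to one block.

    alloc_unit is an assumed allocation granularity, not a value
    taken from the profiler or from a specific device.
    """
    threads = round_up(threads_per_block, warp_size)
    return round_up(regs_per_thread * threads, alloc_unit)


per_block = registers_per_block(42, 256)
print(per_block, 2 * per_block)  # 10752 21504
```

With 42 registers per thread and 256 threads, the product is already a multiple of 512, so rounding changes nothing and two blocks still come to 21504. For comparison, the profiler's 22528 equals 2 × 256 × 44, which is one reason the register count printed by ptxas is worth checking.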