For example, I use the -maxrregcount option to limit each thread to 42 registers. By my calculation, two blocks of 256 threads each should consume 2 × 42 × 256 = 21504 registers, but the profiler reports 22528. Why is that?
The only reason I can think of is that the profiler uses the extra registers to do its own counting. Is that right?
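
To make the arithmetic concrete, here is a small Python sketch (plain arithmetic, not CUDA code; the variable names are only for illustration). It recomputes the total I expected and divides the profiler's figure back out per thread:

```python
# Values taken from the question above; the names are illustrative only.
regs_per_thread = 42      # limit set with -maxrregcount
threads_per_block = 256
blocks = 2
profiler_total = 22528    # register count reported by the profiler

total_threads = blocks * threads_per_block
expected_total = total_threads * regs_per_thread

# How far off is the profiler, in total and per thread?
extra = profiler_total - expected_total
reported_per_thread, remainder = divmod(profiler_total, total_threads)

print(expected_total)        # 21504
print(extra)                 # 1024
print(reported_per_thread)   # 44
print(remainder)             # 0
```

So the profiler's number divides evenly into 44 registers per thread, which is 2 more per thread than the 42 I set. The sketch only shows the size of the gap; it does not say where the extra registers come from.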