Understanding the Registers/Block entry in the profiler

I have a kernel which doesn't use any shared memory. Each thread requires 162 registers, and I have set the execution configuration as follows:


[Using 1D only]

Theoretically the Registers/Block should be

BlockDim x NumOfRegistersPerThread
256*162 = 41472

But the kernel latency section of the profiler reports the following:

Registers/Block = 43008

Here is the screenshot from the profiler


What are the other 43008 - 41472 = 1536 registers used for?

Registers are often allocated with some granularity rather than one at a time. Here the granularity might be 8 registers per thread.

162/8 = 20.25, which is not a whole number. If we round that up to the next whole number, 21, then there are actually 21*8 = 168 registers allocated per thread. The compiler may indicate that 162 are actually needed/used, but when they get allocated at runtime, they will be allocated according to some granularity, not individually.

168*256 = 43008
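
For anyone who wants to check other register counts, the arithmetic above can be sketched as a small host-side helper. This is only an illustration: the names `round_up` and `registers_per_block` are made up for this post, and the default granularity of 8 is the guess from this thread, not a documented constant for any particular GPU.

```cpp
#include <cstdint>

// Round `value` up to the next multiple of `granularity`.
// Assumes granularity > 0.
constexpr std::uint32_t round_up(std::uint32_t value,
                                 std::uint32_t granularity) {
    return (value + granularity - 1) / granularity * granularity;
}

// Estimate of the registers reserved for one block, assuming the per-thread
// count is rounded up to `granularity` before being multiplied by the block
// size. The default of 8 is an assumption taken from the discussion above.
constexpr std::uint32_t registers_per_block(std::uint32_t regs_per_thread,
                                            std::uint32_t threads_per_block,
                                            std::uint32_t granularity = 8) {
    return round_up(regs_per_thread, granularity) * threads_per_block;
}

// The numbers from this thread, checked at compile time.
static_assert(round_up(162, 8) == 168, "162 rounds up to 168");
static_assert(registers_per_block(162, 256) == 43008, "matches the profiler");
static_assert(registers_per_block(162, 256, 1) == 41472, "no rounding");
```

With a granularity of 1 (no rounding) the helper gives the 41472 from the question, and with a granularity of 8 it gives the 43008 the profiler shows, so the 1536 extra registers are just 6 padding registers for each of the 256 threads.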

Thank you for reminding me :)