I have a kernel that doesn't use any shared memory. Each thread requires 162 registers, and I have set the execution configuration as follows:
BlockDim.x:256
GridDim.x:3
[Using 1D only]
Theoretically, Registers/Block should be
BlockDim.x * NumOfRegistersPerThread
256 * 162 = 41472
But the Kernel Latency section of the profiler reports the following:
Registers/Block = 43008
Here is the screenshot from the profiler:
https://www.dropbox.com/s/lo0t857l9p1ot8h/profiler.png?dl=0
What are the other 1536 registers (43008 - 41472) used for?
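
For reference, the arithmetic behind these numbers can be reproduced with a short Python sketch. The constant and function names are made up for illustration, and the values are copied from this post rather than queried from the device. The last figure, the per-thread count implied by the profiler, is only a derived number: reading it as the per-thread register count being rounded up to some allocation granularity is an assumption, not something the profiler output confirms.

```python
# Figures copied from the post; nothing here queries the GPU.
THREADS_PER_BLOCK = 256               # BlockDim.x
REGISTERS_PER_THREAD = 162            # registers per thread for this kernel
PROFILER_REGISTERS_PER_BLOCK = 43008  # "Registers/Block" shown by the profiler


def expected_registers_per_block(threads, regs_per_thread):
    """Naive estimate: each thread gets exactly regs_per_thread registers."""
    return threads * regs_per_thread


expected = expected_registers_per_block(THREADS_PER_BLOCK, REGISTERS_PER_THREAD)

# Registers the profiler reports beyond the naive estimate.
surplus = PROFILER_REGISTERS_PER_BLOCK - expected

# Per-thread count implied by the profiler figure (assumption: the
# block total is spread evenly over all threads in the block).
implied_per_thread = PROFILER_REGISTERS_PER_BLOCK // THREADS_PER_BLOCK

print(expected, surplus, implied_per_thread)  # prints: 41472 1536 168
```

So the profiler figure works out to 168 registers per thread instead of 162, i.e. 6 extra registers per thread across all 256 threads.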