I have a kernel that doesn't use any shared memory. Each thread requires 162 registers, and I have set the execution configuration as follows:
[Using 1D only]
Theoretically, the Registers/Block should be
BlockDim x NumOfRegistersPerThread = 256 x 162 = 41472
However, the kernel latency section of the profiler reports the following:
Registers/Block = 43008
Here is the screenshot from the profiler
What are the other 43008 - 41472 = 1536 registers used for?