I just noticed something odd about register usage when passing the thread index into a device function. I have a kernel, call it main_kernel, whose register usage I am trying to reduce. Inside main_kernel I call a device function, call it sub_kernel. Originally I passed the thread index to sub_kernel as an argument, and the kernel used around 246 registers per thread. When I instead compute the thread index inside sub_kernel rather than passing it in, register usage drops to 240.

I don't understand what is really happening here. I'd be very grateful if anyone could explain it.



Just a guess, but either the compiler front end or possibly ptxas (which does the register allocation) may be able to apply more optimizations when the thread index is computed locally inside the device function than when it is passed in as an argument.
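If you want to compare the two variants directly, here is a minimal sketch. The kernel bodies are placeholders I made up (the real main_kernel and sub_kernel were not posted), and the names `sub_kernel_arg`, `sub_kernel_local`, and `regs_per_thread` are hypothetical. It uses `cudaFuncGetAttributes` to read back the register count that ptxas assigned to each kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Variant A (hypothetical body): the caller computes the thread index
// and passes it in as an argument.
__device__ float sub_kernel_arg(const float* in, int tid) {
    return in[tid] * 2.0f;
}

// Variant B (hypothetical body): the device function recomputes the
// thread index from the built-in variables.
__device__ float sub_kernel_local(const float* in) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    return in[tid] * 2.0f;
}

__global__ void main_kernel_arg(const float* in, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = sub_kernel_arg(in, tid);
}

__global__ void main_kernel_local(const float* in, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = sub_kernel_local(in);
}

// Returns the number of registers per thread used by the given kernel,
// or -1 if the query fails (for example, when no CUDA device is present).
int regs_per_thread(void (*kernel)(const float*, float*, int)) {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, kernel) != cudaSuccess) return -1;
    return attr.numRegs;
}

// Prints the register count of both variants side by side.
void report_register_usage() {
    printf("tid passed as argument: %d registers/thread\n",
           regs_per_thread(main_kernel_arg));
    printf("tid computed locally:   %d registers/thread\n",
           regs_per_thread(main_kernel_local));
}
```

Call `report_register_usage()` from `main`, or compile with `nvcc -Xptxas -v` to have ptxas print the per-kernel register counts at build time. In a toy like this the device functions will most likely be inlined and both variants may well report the same number; a difference such as 246 versus 240 is more likely to show up only in a large kernel, where small changes in the code handed to ptxas can shift its allocation decisions.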


