No. The register file and shared memory are separate areas of silicon, and their function is different. All explained in the programming guide.
A local variable ends up in shared memory only if you declared it with the ‘__shared__’ qualifier. Also explained in the programming guide.
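A minimal sketch of the difference, using a made-up kernel (the names are mine, not from this thread):

```cuda
__global__ void scale(float *out, const float *in)
{
    float x = in[threadIdx.x];   // ordinary local variable: register or local memory
    __shared__ float buf[256];   // lives in shared memory, visible to the whole block

    buf[threadIdx.x] = x * 2.0f;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}
```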
By definition, there are no explicit arguments in an inline expanded function - it works much more like macro substitution. Local variables go to local memory or registers, just like in any other piece of kernel code.
The compiler never stores kernel local variables in global memory. Everything goes either into local memory or registers.
How do I determine whether local variables are stored in registers?
When compiling your code with ‘nvcc’, just pass the option ‘-Xptxas -v’. This will print how many registers the kernel uses, the local memory usage (if any), and the shared and constant memory usage.
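For example (the kernel name and numbers below are made up, and the exact output format varies with the toolkit version):

```shell
nvcc -Xptxas -v -c mykernel.cu
# ptxas info : Compiling entry function '_Z8mykernelPf'
# ptxas info : Used 12 registers, 1024+16 bytes smem, 4 bytes cmem[1]
```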
I have a kernel function which has ~100 local variables (of type float) and so, compiles to use 58 registers. The occupancy is low and so, in an attempt to remedy this, I replaced a whole lot of those variables with #defines - leaving me with ~15. However, the register usage is still 52!?
I can’t understand the PTX output.
Can anyone point me in the right direction? Is there a standard approach?
What do you mean you replaced variables with #defines? Keep in mind that the number of variables you have is only loosely correlated with the number of registers your kernel uses. Compilers are smart enough to figure out when a variable is no longer needed, and reuse the register assigned to it. (Conversely, a complex expression may require additional registers for intermediate calculations.)
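A toy example of that reuse (whether it actually happens is up to ptxas, so take this as a sketch):

```cuda
__global__ void reuse(float *out, const float *in)
{
    float a = in[threadIdx.x] * 2.0f;   // 'a' is dead after the next line...
    out[threadIdx.x] = a;
    float b = in[threadIdx.x] + 1.0f;   // ...so 'b' can be assigned the same register
    out[threadIdx.x + blockDim.x] = b;
}
```

Two named variables, but their live ranges never overlap, so a single hardware register can serve both.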
Also, you should realize that register assignment is done after the PTX generation stage. If you look at the PTX code, you’ll see a huge number of registers used, because the compiler emits PTX in static single assignment form. When ptxas converts the PTX to the cubin GPU machine code format, it maps the registers used in the PTX code to actual hardware registers.
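You can dump the PTX with ‘nvcc -ptx mykernel.cu’. A fragment might look like this (illustrative only; the %f/%rd names are virtual registers, each written exactly once):

```ptx
ld.global.f32  %f1, [%rd1];
mul.f32        %f2, %f1, 0f40000000;  // 2.0f; the result gets a fresh virtual register
st.global.f32  [%rd2], %f2;
```

It is ptxas that later maps %f1 and %f2 onto the real register file.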
Reducing register usage is tricky, and I don’t have any good heuristics. You can try forcing the compiler to spill registers to local memory with the --maxrregcount option to nvcc. Because local memory (basically global memory assigned to each thread) is much slower, this can make things worse, but it is worth a try.
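For the record, the invocation I mean (the limit of 32 is just an example value):

```shell
nvcc --maxrregcount=32 -Xptxas -v -c mykernel.cu
```

The ‘-Xptxas -v’ part lets you confirm the cap was honoured and see how much ended up in local memory.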
Variables were only being set once and then referred to multiple times. So, as you suggest, the compiler could probably be smart about this. [But I know nothing of how compilers work and as such don’t like to trust them!]
This increases compilation time drastically! I set --maxrregcount to 30, to test, and my program still had not compiled an hour later… Milder limits (e.g. --maxrregcount 50) produced code which ran slower.
Good to know it’s not just me! But, yes, I was really hoping for some heuristics.
There is a limit of 8192 registers per MP for compute 1.0/1.1 and 16384 for compute 1.2/1.3. If you are trying to launch 416 threads per block with 47 registers per thread, that is 19552 registers in total, which exceeds the register limit of any currently available version of the hardware. That is why you are getting the ‘too many resources requested for launch’ error.