What form does local memory take? How many registers per SP?

I’m having a bit of trouble understanding a couple of things.

I’m led to believe that when I launch a kernel with a number of variables declared, those variables use local memory by default. Where does this local memory live, and what form does it take (registers or something much slower)? I’ve read that accessing global memory takes ~400-600 cycles, so I want to avoid that at all costs. I understand how shared memory works, but when using ordinary variables unique to each thread in the kernel, what memory do these variables use, and is it fast? Do they use registers? How many registers can I use per SP (streaming processor) core? How many 32-bit floating-point variables can I use per thread before they spill over into global memory?

Thanks for the help

Variables use registers by default. If you do not limit the maximum register count, each thread can use up to 63 (or 62) registers. Fermi has 32*1024 32-bit registers per Streaming Multiprocessor.

Thanks, but where does the 32*1024 number come from? I know that in Fermi you can have 1024 threads per SM, so is that 32 coming from 32 registers per thread?

That number comes from Appendix F of the CUDA C Programming Guide. For these devices, each multiprocessor has 32768 registers (32*1024), each 32 bits in size. The 32*1024 is just that total written as a product; it is not a per-thread limit.
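
To make the arithmetic concrete, here is a rough host-side sketch in plain C++ (no CUDA calls) of how the register file divides among resident threads. It uses only the figures quoted in this thread: 32*1024 registers per multiprocessor and a cap of 63 registers per thread. The names registersPerThread, kRegistersPerSM and kMaxRegistersPerThread are made up for illustration, and the sketch ignores whatever allocation granularity the hardware applies, so treat the results as upper bounds rather than exact numbers.

```cpp
// Figures quoted in this thread for Fermi-class devices.
constexpr int kRegistersPerSM = 32 * 1024;  // 32768 32-bit registers per multiprocessor
constexpr int kMaxRegistersPerThread = 63;  // per-thread cap mentioned in the reply above

// Upper bound on the registers each thread can hold when `residentThreads`
// threads are resident on one multiprocessor at the same time.
// Assumption: registers are divided evenly, with no allocation granularity.
constexpr int registersPerThread(int residentThreads) {
    return (kRegistersPerSM / residentThreads < kMaxRegistersPerThread)
               ? kRegistersPerSM / residentThreads
               : kMaxRegistersPerThread;
}
```

With 1024 resident threads this gives 32768 / 1024 = 32 registers per thread, which is where the "32 per thread" reading happens to line up. With 512 resident threads the even share would be 64, so the per-thread cap of 63 applies instead.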