I’m having a bit of trouble understanding a couple of things.
I’m led to believe that when I launch a kernel with a number of variables declared, those variables use local memory by default. Where does this local memory live, and what form does it take (registers, or something much slower)? I’ve read that accessing global memory takes ~400-600 cycles, so I want to avoid that at all costs. I understand how shared memory works, but when I use ordinary variables that are unique to each thread in the kernel, what memory do they use, and is it fast? Do they use registers? How many registers can I use per SP (streaming processor) core? How many 32-bit floating-point variables can I use per thread before they spill over into global memory?
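To make the question concrete, here is a minimal sketch of the kind of kernel I mean. The kernel name `scaleAdd` and the variables `a`, `b`, and `c` are made up for illustration. As far as I can tell, `cudaFuncGetAttributes` reports how many registers and how many bytes of local memory the compiled kernel uses per thread, so it seems like one way to check where such variables end up on a given device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a, b and c are ordinary per-thread variables.
__global__ void scaleAdd(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = in[i];     // one read from global memory
        float b = a * 2.0f;  // per-thread temporaries
        float c = b + 1.0f;
        out[i] = c;          // one write to global memory
    }
}

int main()
{
    // Ask the runtime what the compiled kernel uses per thread.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scaleAdd);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);

    // Register budget of device 0 (assumes at least one CUDA device).
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per block:     %d\n", prop.regsPerBlock);
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` should also print the register count and any spill loads/stores for each kernel at build time.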
Thanks for the help