The CUDA Programming Guide elaborates on shared, global, and register memory, but says very little about the properties of local memory. So I have a few questions about it:
What are the access costs of local memory (.local instructions)?
How much local memory is available? (Per thread, or per multiprocessor?)
Can I force variables to be in local memory to save registers?
The compiler seems very reluctant to put things in local memory, which gives me the impression that it must be very slow (like global memory) or limited in some other way.
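For context, here is a sketch of the kind of case I mean. As far as I understand, a per-thread array indexed with a runtime value usually ends up in local memory, because registers are not addressable; the kernel below is just an illustration, not code from a real project:

```cuda
__global__ void kernel(float *out)
{
    // A per-thread array indexed dynamically cannot live in registers,
    // so the compiler typically places it in .local space.
    float buf[64];
    for (int i = 0; i < 64; ++i)
        buf[i] = i * 2.0f;
    // Dynamic index forces addressable (local) storage for buf.
    out[threadIdx.x] = buf[threadIdx.x % 64];
}
```

What I would like is the reverse: to mark an ordinary scalar so it goes to local memory instead of consuming a register, if that is possible at all.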