Figure 2-2 of the CUDA programming guide indicates that each thread has access to a ‘local memory’, but nowhere else in the guide explains the size or location of this local memory. What is it, anyway?
When we define a local variable in a __global__ or __device__ function, is it placed in a register, or in the so-called ‘local memory’? There are only 8K 32-bit registers per multiprocessor. If I run 768 threads, the maximum per multiprocessor, each thread gets only 8192/768 ≈ 10 registers, i.e. about 10 floats. That seems very little. If a thread needs more local variables than there are registers available, will it automatically spill to device memory? If so, it sounds as though performance would degrade significantly.
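To make the question concrete, here is the kind of kernel I mean. This is just a sketch, and the array size (32 floats) is a made-up example chosen to exceed the ~10 registers per thread computed above:

```cuda
// Hypothetical kernel: 'vals' is a per-thread local array.
// Whether the compiler keeps it in registers or pushes it out
// to (off-chip) local memory is exactly what I am asking about.
__global__ void accumulate(const float *in, float *out)
{
    float vals[32];   // 32 floats per thread -- more than the
                      // 8192/768 ~= 10 registers available per thread
                      // at full occupancy
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 32; ++i)
        vals[i] = in[tid * 32 + i];

    float sum = 0.0f;
    for (int i = 0; i < 32; ++i)
        sum += vals[i];
    out[tid] = sum;
}
```

Compiling with `nvcc --ptxas-options=-v` prints the per-thread register count and any `lmem` (local memory) usage, which seems to be the way to check where `vals` actually ends up.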
The guide says each 8800 card has 16 multiprocessors, and each multiprocessor can run up to 8 blocks. If I define more than 8 blocks in a grid, will the blocks be distributed evenly across all multiprocessors automatically? Or will they fill one multiprocessor first and then move on to the next? Or does it not matter at all?
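For concreteness, the kind of launch I have in mind looks like this. The kernel `step` and the grid/block sizes here are made up for illustration:

```cuda
#include <cuda_runtime.h>

// Trivial hypothetical kernel; its body is irrelevant --
// the question is how its blocks get scheduled.
__global__ void step(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] += 1.0f;
}

int main()
{
    float *d_data;
    cudaMalloc((void **)&d_data, 64 * 256 * sizeof(float));

    // 64 blocks of 256 threads: more blocks than 16 multiprocessors,
    // and more than the 8-blocks-per-multiprocessor limit. Does the
    // hardware spread these 4-per-multiprocessor, or fill one first?
    step<<<64, 256>>>(d_data);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```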