questions on registers, local memory and blocks

Figure 2-2 of the CUDA programming guide indicates that each thread has access to a ‘local memory’, but nowhere else in the guide explains the size or location of this local memory. What is it, anyway?

When we define a local variable in a global or device function, is it placed in a register, or in this so-called ‘local memory’? There are only 8K 32-bit registers per multiprocessor. If I run 768 threads, the maximum per multiprocessor, then each thread gets only 8192/768 ≈ 10 registers. That seems very few. If a thread needs more local variables than there are registers available, will it automatically use device memory? If so, it sounds like that would degrade performance significantly.

The guide says each 8800 card has 16 multiprocessors, and every multiprocessor can run up to 8 blocks. If I define more than 8 blocks in a grid, will the blocks automatically be distributed evenly across all multiprocessors? Or will they fill one multiprocessor first, and then the next? Or does it not matter at all?

Thank you,

Yes, if ptxas runs out of registers it will automatically spill to local memory. Local memory is global memory, but allocated so that each thread has an area of its own.
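For what it’s worth, you can watch this happen at compile time: pass --ptxas-options=-v to nvcc and ptxas will print the registers and any lmem bytes each kernel uses. A minimal sketch (kernel name made up) of a pattern that typically ends up in lmem:

    // A per-thread array indexed with a runtime value usually cannot be
    // kept in registers, so ptxas places it in local memory.
    __global__ void spillDemo(float *out)
    {
        float scratch[64];                      // likely lmem candidate
        for (int i = 0; i < 64; ++i)
            scratch[i] = i * 0.5f;
        int j = threadIdx.x & 63;               // runtime index forces the spill
        out[blockIdx.x * blockDim.x + threadIdx.x] = scratch[j];
    }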

How can local memory be equal to global memory? I have done performance benchmarks and seen local memory outperform global memory.

How much difference are you seeing? Can you post sample code?

Local memory is thread-specific global memory; this is also explained in the programming guide. I don’t know why you get different timings. Make sure your global-memory accesses in the benchmark are coalesced, as local memory is probably laid out so that its reads and writes are coalesced automatically.

Well, keep in mind the occupancy of the multiprocessors. If the CUDA code uses 36-40 registers per thread, then limiting the total to 32 (via -maxrregcount=32) will spill some of the data to lmem, which is darn slow since this is global memory. But 32 registers per thread yields higher occupancy than 36, and the former hides memory latency better than the latter; therefore the net result is slightly higher performance, even though lmem is in use.

However, if the CUDA code uses far more registers per thread (say 50-60) and one limits this to 32, performance will degrade quite noticeably.
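For anyone who wants to test this trade-off, the two builds can be compared directly (file name made up; both flags are standard nvcc options):

    nvcc --ptxas-options=-v mykernel.cu                    (ptxas picks the register count)
    nvcc --ptxas-options=-v -maxrregcount=32 mykernel.cu   (cap at 32; watch for lmem in the report)

Time both versions; whether the extra occupancy pays for the spill depends on how memory-bound the kernel is.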

The problem of reducing the register count can also be attacked by clever use of shared memory, in which one can store intermediate results, as access to shared memory is as fast as access to registers (absent bank conflicts), if I am not mistaken. The actual details of the implementation depend on the problem at hand; a rough sketch follows below.
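Something like this, as a rough illustration (names made up; assumes a 256-thread block):

    __global__ void stageInSmem(const float *in, float *out)
    {
        __shared__ float temp[256];         // one slot per thread; assumes blockDim.x == 256
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        temp[tid] = in[gid] * in[gid];      // intermediate kept in smem, not a register
        out[gid] = temp[tid] + 1.0f;        // reused in later arithmetic
    }

Each thread touches only its own slot, so there are no bank conflicts and no __syncthreads() is needed.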