Section G.1 of the CUDA Programming Guide 3.0 gives the “local memory per thread” as 16 KB.
My question is: why is local memory specified per thread?
How can the amount of local memory depend on the number of threads?
The amount of local memory must be fixed, so I would expect it to be specified per SM or per SP.
Register memory is given as "number of 32-bit registers per SM", which is 16 K.
Since register memory is local to each SP, 2 K registers are available per SP.
So when a thread executes on an SP, it has 2 K registers available to it. Is that correct?
My other question is whether there is one local memory per SM, or one local memory per SP like the register file.
Registers are not local to each SP. Each active thread on an SM is statically allocated the registers it requires (nvcc --ptxas-options=-v will show you how many registers per thread your kernel uses).
Since you should be running far more threads per block than there are SPs, the register storage available to each thread is much less than 2 K registers.
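To put rough numbers on this, here is a minimal C++ sketch of the arithmetic, assuming the 16 K registers per SM quoted from the guide and the 8 SPs per SM implied by the 2 K figure in the question. The function names and the example values of 512 resident threads and 20 registers per thread are made up for illustration. It also assumes the register file is split evenly and ignores allocation granularity and any other per-SM limits, so treat the results as upper bounds rather than what the hardware will actually give you.

```cpp
#include <cstdint>

// Figures taken from this thread (16 K 32-bit registers per SM, 8 SPs per SM).
// Other devices have different values.
constexpr std::uint32_t kRegistersPerSM = 16 * 1024;
constexpr std::uint32_t kSPsPerSM = 8;

// Registers each thread could get if the SM's register file were divided
// evenly among `residentThreads` active threads (must be non-zero).
constexpr std::uint32_t registersPerThread(std::uint32_t registersPerSM,
                                           std::uint32_t residentThreads) {
    return registersPerSM / residentThreads;
}

// Upper bound on resident threads per SM from register use alone, given the
// per-thread register count reported by the compiler (must be non-zero).
constexpr std::uint32_t maxThreadsByRegisters(std::uint32_t registersPerSM,
                                              std::uint32_t regsPerThread) {
    return registersPerSM / regsPerThread;
}

// Only 8 threads per SM (one per SP) would give the 2 K figure from the question.
static_assert(registersPerThread(kRegistersPerSM, kSPsPerSM) == 2048, "2 K case");

// With a hypothetical 512 resident threads, each gets 32 registers (128 bytes).
static_assert(registersPerThread(kRegistersPerSM, 512) == 32, "512-thread case");

// A hypothetical kernel using 20 registers per thread fits at most 819 threads.
static_assert(maxThreadsByRegisters(kRegistersPerSM, 20) == 819, "20-register case");
```

The checks run at compile time, so building the file with any C++11 compiler is enough to confirm the numbers.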
No. There is at least instruction memory as well and, if the card is used for video, the video buffer. Memory might be used for other purposes too (uploadable firmware, etc.). Only NVIDIA knows, I guess.