In CUDA, each multiprocessor has 16 KB of on-chip shared memory.
Is this 16 KB split among all the thread blocks scheduled to execute on that multiprocessor? For example, a Tesla C1060 has 30 multiprocessors.
If my kernel launches 120 blocks, then 4 blocks are assigned to each multiprocessor. Does that mean I can allocate at most 4 KB of shared memory per block,
even though these 4 blocks won't be executing instructions at the same time on one multiprocessor?
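For concreteness, here is a sketch of the situation I mean. The kernel name and sizes are just for illustration; each block statically declares 4 KB of shared memory:

```cuda
// Hypothetical kernel: each block declares a 4 KB shared-memory tile
// (1024 floats * 4 bytes). With 4 blocks per multiprocessor, that would
// be 4 * 4 KB = 16 KB, exactly the shared memory of one multiprocessor.
__global__ void myKernel(const float *in, float *out)
{
    __shared__ float tile[1024];   // 4 KB of static shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();

    out[i] = tile[threadIdx.x] * 2.0f;
}
```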
Another possibility is that one block can occupy the full 16 KB of shared memory while it is computing on that multiprocessor. But in that case, if the block stalls in the
middle of its computation (say, waiting on a load from global memory) and is kicked off the multiprocessor
so that another block can be activated to execute instructions, will the shared-memory contents of the first block be backed up?
That is, will its values be copied out before the new block overwrites shared memory, and copied back
when the first block is activated again?
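In code, the second possibility would look something like this, using dynamic shared memory sized at launch time (again, names and sizes are hypothetical; on these devices the usable amount may be slightly less than 16 KB, since I believe kernel arguments also occupy shared memory):

```cuda
// Hypothetical kernel using dynamic shared memory, sized by the third
// launch-configuration parameter. If each block is given (close to) the
// full 16 KB, only one block could fit per multiprocessor at a time.
__global__ void bigSmemKernel(const float *in, float *out)
{
    extern __shared__ float buf[];   // size set at launch time

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];
    __syncthreads();

    // ... computation using buf ...
    out[i] = buf[threadIdx.x];
}

// Launch: 120 blocks, 256 threads, ~16 KB of dynamic shared memory per block
// bigSmemKernel<<<120, 256, 16 * 1024>>>(d_in, d_out);
```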
I didn't find any clear explanation of this in the Programming Guide. One sentence in Section 4.1 says, "If there are not enough registers or shared memory available per
multiprocessor to process at least one block, the kernel will fail to launch." Does this imply that the kernel can launch as long as shared memory is sufficient for a single block,
as in the second possibility I mentioned?
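If it helps, this is how I would probe that sentence experimentally: request more dynamic shared memory per block than one multiprocessor has, and check for a launch error afterwards (the kernel and sizes are just a test sketch):

```cuda
#include <cstdio>

// Trivial kernel whose dynamic shared-memory size is set at launch time.
__global__ void probe(float *data)
{
    extern __shared__ float buf[];
    buf[threadIdx.x] = data[threadIdx.x];
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 256 * sizeof(float));

    // Ask for 32 KB of dynamic shared memory per block -- more than the
    // 16 KB a multiprocessor has -- so by the quoted sentence the kernel
    // should fail to launch.
    probe<<<120, 256, 32 * 1024>>>(d_data);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```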