Kernel Execution issues related to Shared Memory

Hi, all,

In CUDA, each multiprocessor has 16 KB of on-chip shared memory.

Is this 16 KB split among all the thread blocks scheduled to execute on that multiprocessor? E.g., on a Tesla C1060 I have 30 multiprocessors.
My kernel has 120 blocks, so 4 blocks are assigned to each multiprocessor. Does that mean I can allocate at most 4 KB of shared memory per block,
even though these 4 blocks won’t be executing instructions at the same time on one multiprocessor?
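
For concreteness, here is a minimal sketch of the kind of kernel I mean (the kernel name, array size, and launch configuration are just illustrative):

// Each block statically allocates 4 KB of shared memory:
// 1024 floats * 4 bytes = 4096 bytes per block.
__global__ void myKernel(float *out)
{
    __shared__ float buf[1024];   // 4 KB per block

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = tid;       // each thread fills one slot
    __syncthreads();
    out[tid] = buf[threadIdx.x];
}

// 120 blocks of 128 threads each, on a 30-SM Tesla C1060:
// myKernel<<<120, 128>>>(d_out);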

Another possibility is that one block can occupy all 16 KB of shared memory while it is computing on that multiprocessor. But if so, when this block stalls in the
middle of its computation (maybe waiting for a load from global memory) and is kicked off the multiprocessor so that
another block can be activated to execute instructions, will the current shared-memory values of the previous block be backed up?
That is, will these values be copied out first, the new block then overwrite all the values in shared memory, and the old values be copied back
when the previous block is activated again?

I didn’t find a clear explanation of this issue in the programming guide. One sentence in Section 4.1 says, “If there are not enough registers or shared memory available per
multiprocessor to process at least one block, the kernel will fail to launch.” Does this imply that as long as the shared memory can satisfy one block, the kernel can be launched,
like the second possibility I mentioned?
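
To test my reading of that sentence, I imagine a launch like this one would fail on a 16 KB device (the kernel name and sizes are illustrative; I haven’t actually verified this):

__global__ void bigSmemKernel(float *out)
{
    extern __shared__ float buf[];   // sized at launch time
    buf[threadIdx.x] = threadIdx.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x];
}

// Requesting 20 KB of dynamic shared memory per block, more than the
// 16 KB one multiprocessor has, so not even one block can fit:
// bigSmemKernel<<<120, 128, 20 * 1024>>>(d_out);
// cudaError_t err = cudaGetLastError();
// // expect a launch failure such as cudaErrorInvalidConfiguration
// // (the exact error code may vary by toolkit version)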

Thank you,

Not all blocks have to be resident at once. If, in your first example, each block uses only 4 KB of shared memory and less than 1/4 of the other per-SM resources (registers, threads), all 120 blocks could be resident at once and run in parallel (though that’s not guaranteed). If you use more than that, fewer blocks will run concurrently, and as a block finishes another block launches to take its spot.
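
Later CUDA toolkits (6.5 and up, so newer than the C1060 era) expose this calculation directly through the occupancy API; a minimal sketch, assuming a kernel named myKernel that uses 4 KB of dynamic shared memory and 128 threads per block:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *out)
{
    extern __shared__ float buf[];   // 4 KB, sized at launch
    buf[threadIdx.x] = out[blockIdx.x * blockDim.x + threadIdx.x];
}

int main()
{
    int blocksPerSM = 0;
    // Ask the runtime how many blocks of this kernel can be resident
    // on one multiprocessor at once, given 128 threads and 4 KB of
    // dynamic shared memory per block.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, 128, 4 * 1024);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}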

The second case is impossible; once a block has been assigned to an SM, it does not leave the SM until it’s completed.

That is what happens in your first case: the 16 KB is partitioned among the blocks that are resident on the multiprocessor at the same time.

That isn’t what happens. Blocks aren’t paged in and out of running multiprocessors. They are scheduled and run until completion. Scheduling, memory access, instruction pipelining, and execution all happen at the warp or half-warp level, rather than at the block level. It is warps that are context-switched in and out of execution on any given multiprocessor, not whole blocks.
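
To make the warp granularity concrete, here is a trivial sketch in which each thread records which warp it belongs to (the kernel name is just illustrative; warpSize is 32 on all current hardware):

__global__ void warpIds(int *warpOf)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // A block's threads are grouped into warps of warpSize consecutive
    // threads. The hardware issues instructions warp by warp, and a
    // warp stalled on a global-memory load is switched out while other
    // resident warps keep executing.
    warpOf[tid] = threadIdx.x / warpSize;
}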

Thank you for the reply. That clears up my confusion.

Thank you for your clear explanation. I thought scheduling could happen at the block level.

From your reply, I gather that blocks are mapped to multiprocessors at the beginning of the kernel launch, as shown in Figure 4-1 of the programming guide:

if I have 30 multiprocessors, then blocks 0, 30, 60, … will be assigned to SM0 for execution; blocks 1, 31, 61, … will be on SM1.

Will this mapping stay fixed during kernel execution? If the blocks on SM1 finish first, will the hardware move the blocks on SM0 onto SM1?

Thank you,

Blocks are never ever moved for any reason. Once they’re assigned, that’s it, the end.

The ordering is not guaranteed, either. Blocks are handed out to multiprocessors as resources free up, not in a fixed round-robin pattern, so you can’t assume blocks 0, 30, 60, … will land on SM0.
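
If you want to see where your blocks actually land, you can read the %smid special register via inline PTX; a sketch (%smid is documented in the PTX ISA, but it’s intended only as a diagnostic, and the mapping can differ from run to run):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void whichSM(int *smOf)
{
    unsigned int smid;
    // %smid is a PTX special register holding the index of the
    // multiprocessor this thread is currently executing on.
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        smOf[blockIdx.x] = smid;
}

int main()
{
    const int nBlocks = 120;
    int *d_smOf, h_smOf[nBlocks];
    cudaMalloc(&d_smOf, nBlocks * sizeof(int));
    whichSM<<<nBlocks, 128>>>(d_smOf);
    cudaMemcpy(h_smOf, d_smOf, sizeof(h_smOf), cudaMemcpyDeviceToHost);
    for (int b = 0; b < nBlocks; ++b)
        printf("block %3d ran on SM %d\n", b, h_smOf[b]);
    cudaFree(d_smOf);
    return 0;
}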